A fully on-device autonomous Android agent powered by the RunAnywhere SDK. Navigates your phone’s UI to accomplish tasks using local LLM inference via llama.cpp – no cloud dependency required.
X post demo — Qwen3 4B, 6 steps, 3 real LLM inferences, goal-aware element filtering, ~4 min end-to-end.
Full write-up and run trace: X_POST.md
The agent follows an observe-reason-act loop. Each step captures the current screen state, reasons about the next action using an on-device LLM, and executes that action via the Android Accessibility API.
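Stripped of Android specifics, that loop reduces to a few lines. The sketch below is illustrative only: the `Agent` interface and `runLoop` function are invented names for this README, not the actual `AgentKernel` API.

```kotlin
// Illustrative sketch of the observe-reason-act loop; the Agent interface and
// runLoop are invented names, not the real AgentKernel API.
data class Action(val tool: String, val args: Map<String, String>)

interface Agent {
    fun observe(): String   // accessibility tree -> compact indexed element list
    fun reason(goal: String, screen: String,
               history: List<Action>, lastResult: String): Action
    fun act(action: Action): String   // execute one gesture, return its result
}

fun runLoop(agent: Agent, goal: String, maxSteps: Int = 30): List<Action> {
    val history = mutableListOf<Action>()
    var lastResult = ""
    for (step in 1..maxSteps) {
        val screen = agent.observe()
        val action = agent.reason(goal, screen, history, lastResult)
        history += action
        if (action.tool == "ui_done") break   // model signals task completion
        lastResult = agent.act(action)        // fed back into the next prompt
    }
    return history
}
```

Note the single-action-per-turn contract: each inference produces exactly one tool call, and its result is carried into the next prompt.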
+---------------------+
| User Goal |
| "Play lofi on YT" |
+----------+----------+
|
v
+---------------------+
| Pre-Launch |
| Intent-based app |
| opening (YouTube, |
| Settings, etc.) |
+----------+----------+
|
+===============+===============+
| AGENT LOOP |
| (max 30 steps / 10 min) |
+===============+===============+
|
+---------------v---------------+
| Self-Detection Guard |
| If foreground == agent app, |
| switch back to target app |
+---------------+---------------+
|
+---------------v---------------+
| Accessibility Tree Parsing |
| (ScreenParser) |
| Extracts interactive elements |
| as compact indexed list: |
| "0: Search (EditText) [edit]" |
| "1: Home (Button) [tap]" |
+---------------+---------------+
|
+---------------v---------------+
| Screenshot Capture |
| Base64 JPEG via Accessibility |
| API (half-res, 60% quality) |
+---------------+---------------+
|
+---------------v---------------+
| Optional VLM Analysis |
| LFM2-VL 450M analyzes the |
| screenshot for visual context |
+---------------+---------------+
|
+---------------v---------------+
| Prompt Construction |
| GOAL + SCREEN_ELEMENTS + |
| PREVIOUS_ACTIONS + LAST_RESULT |
| + VISION_HINT (if VLM active) |
+---------------+---------------+
|
+---------------v---------------+
| Samsung Foreground Boost |
| Bring agent to foreground |
| during inference to avoid |
| efficiency-core throttling |
+---------------+---------------+
|
+---------------v---------------+
| On-Device LLM Inference |
| RunAnywhere SDK + llama.cpp |
| Produces <tool_call> XML or |
| ui_func(args) function call |
+---------------+---------------+
|
+---------------v---------------+
| Tool Call Parsing |
| (ToolCallParser) |
| Parses: <tool_call>{...} |
| </tool_call>, ui_tap(index=5), |
| or JSON decision objects |
+---------------+---------------+
|
+---------------v---------------+
| Action Execution |
| (ActionExecutor) |
| Tap, type, swipe, back, home |
| via Accessibility gestures |
+---------------+---------------+
|
+---------------v---------------+
| Loop Detection + Recovery |
| Detects repeated actions, |
| dismisses blocking dialogs, |
| scrolls to reveal elements |
+-------+-----------+-----------+
| |
(continue)| (done) |
| v
| +-------+-------+
| | Task Complete |
| +---------------+
|
+----> (back to top of loop)
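The ScreenParser stage in the diagram flattens each interactive node into one indexed line to keep the prompt compact. A simplified sketch of that formatting, using an invented `UiElement` stand-in (the real parser walks live `AccessibilityNodeInfo` trees, and the `[view]` fallback tag here is an assumption):

```kotlin
// Simplified sketch of ScreenParser-style element formatting; UiElement is an
// invented stand-in for real accessibility nodes, and the "[view]" fallback
// tag is an assumption.
data class UiElement(
    val text: String,
    val className: String,   // e.g. "android.widget.EditText"
    val editable: Boolean,
    val clickable: Boolean,
)

fun formatElements(elements: List<UiElement>): List<String> =
    elements.mapIndexed { i, e ->
        val action = when {
            e.editable -> "edit"
            e.clickable -> "tap"
            else -> "view"
        }
        // Shorten "android.widget.Button" -> "Button" to save prompt tokens
        "$i: ${e.text} (${e.className.substringAfterLast('.')}) [$action]"
    }
```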
The ToolCallParser accepts four output formats: `<tool_call>` XML tags, `ui_tap(index=5)` function-call style, inline `{"tool_call": {...}}` JSON, and legacy JSON decision objects.

AgentForegroundService keeps the CPU awake at full speed even when the agent navigates away from its own UI to control other apps.

All benchmarks were run on a Samsung Galaxy S24 (Snapdragon 8 Gen 3, 8 GB RAM) with foreground boost active.
| Model | Size | Speed (per step) | Element Selection | Agent Compatible |
|---|---|---|---|---|
| Qwen3-4B (/no_think) | 2.5 GB | 72-92s | Correct | Yes (Recommended) |
| LFM2.5-1.2B | 731 MB | 8-10s | Always index 0-2 | No |
| LFM2-8B-A1B MoE | 5.04 GB | 31-41s | Partially correct | No (multi-action plans) |
| DS-R1-Qwen3-8B | 5.03 GB | ~267s | Smart but format issues | No (too slow) |
Note: Qwen3-4B is the only tested model that reliably selects correct UI elements and follows the single-action-per-turn contract. The `/no_think` suffix disables chain-of-thought so the model does not consume the entire token budget on reasoning. For full benchmarking details, see ASSESSMENT.md.
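Accepting several output formats makes the loop forgiving of model quirks. As a rough illustration, the sketch below covers only the XML-wrapped and function-call styles; the real ToolCallParser also handles the JSON variants and is considerably more robust:

```kotlin
// Rough sketch of multi-format tool-call parsing. Covers only two of the four
// formats; the real ToolCallParser also handles the inline/legacy JSON forms.
data class ToolCall(val name: String, val args: Map<String, String>)

fun parseToolCall(output: String): ToolCall? {
    // Format 1: <tool_call>{"name": "ui_tap", "arguments": {...}}</tool_call>
    Regex("""<tool_call>[\s\S]*?"name"\s*:\s*"(\w+)""").find(output)?.let {
        return ToolCall(it.groupValues[1], emptyMap())  // argument parsing elided
    }
    // Format 2: function-call style, e.g. ui_tap(index=5)
    Regex("""\b(ui_\w+)\(([^)]*)\)""").find(output)?.let { m ->
        val args = m.groupValues[2].split(',')
            .mapNotNull { p ->
                val kv = p.split('=', limit = 2)
                if (kv.size == 2) kv[0].trim() to kv[1].trim().trim('"') else null
            }.toMap()
        return ToolCall(m.groupValues[1], args)
    }
    return null
}
```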
app/src/main/java/com/runanywhere/agent/
|-- AgentApplication.kt # SDK init, model registry (LLM, STT, VLM)
|-- AgentForegroundService.kt # Foreground service + PARTIAL_WAKE_LOCK
|-- AgentViewModel.kt # UI state, voice mode, STT/TTS coordination
|-- MainActivity.kt # Entry point
|-- accessibility/
| +-- AgentAccessibilityService.kt # Screen reading, screenshot capture, gesture execution
|-- actions/
| +-- AppActions.kt # Intent-based app launching (YouTube, Spotify, etc.)
|-- kernel/
| |-- ActionExecutor.kt # Executes tap/type/swipe/etc. via accessibility
| |-- ActionHistory.kt # Tracks actions for loop detection
| |-- AgentKernel.kt # Main agent loop, LLM orchestration, foreground boost
| |-- GPTClient.kt # OpenAI API client (text + vision + tool calling)
| |-- ScreenParser.kt # Parses accessibility tree into indexed element list
| +-- SystemPrompts.kt # All LLM prompts (compact, tool-calling, vision)
|-- providers/
| |-- AgentProviders.kt # Provider mode enum (LOCAL, CLOUD_FALLBACK)
| |-- OnDeviceLLMProvider.kt # On-device LLM provider wrapper
| +-- VisionProvider.kt # VLM provider interface and implementation
|-- toolcalling/
| |-- BuiltInTools.kt # Utility tools (time, weather, calc, etc.)
| |-- SimpleExpressionEvaluator.kt # Math expression evaluator for calculator tool
| |-- ToolCallingTypes.kt # ToolCall, ToolResult, LLMResponse sealed class
| |-- ToolCallParser.kt # Parses <tool_call> XML, function-call, and JSON formats
| |-- ToolPromptFormatter.kt # Converts tools to OpenAI format or compact local prompt
| |-- ToolRegistry.kt # Tool registration and execution
| |-- UIActionContext.kt # Shared mutable screen coordinates per step
| |-- UIActionTools.kt # 14 UI action tools (tap, type, swipe, etc.)
| +-- UnitConverter.kt # Unit conversion utility
|-- tools/
| +-- UtilityTools.kt # Additional utility tool definitions
|-- tts/
| +-- TTSManager.kt # Android TTS wrapper
+-- ui/
    |-- AgentScreen.kt             # Main Compose UI (text + voice modes)
    +-- components/
        |-- ModelSelector.kt       # Dropdown for model selection
        |-- ProviderBadge.kt       # Shows LOCAL / CLOUD indicator
        +-- StatusBadge.kt         # Shows IDLE / RUNNING / DONE status
All UI actions are registered as tools that the LLM can invoke:
| Tool | Description |
|---|---|
| `ui_tap(index)` | Tap a UI element by its index from the element list |
| `ui_type(text)` | Type text into the focused/editable field |
| `ui_enter()` | Press Enter to submit a search query or form |
| `ui_swipe(direction)` | Scroll up/down/left/right |
| `ui_back()` | Press the Back button |
| `ui_home()` | Press the Home button |
| `ui_open_app(app_name)` | Launch an app by name via intent |
| `ui_long_press(index)` | Long-press an element by index |
| `ui_open_url(url)` | Open a URL in the browser |
| `ui_web_search(query)` | Search Google |
| `ui_open_notifications()` | Open the notification shade |
| `ui_open_quick_settings()` | Open quick settings |
| `ui_wait()` | Wait for the screen to load |
| `ui_done(reason)` | Signal that the task is complete |
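Conceptually, each tool is a name, a description for the prompt, and an executor closure. The sketch below is a hypothetical minimal registry; the actual `ToolRegistry` / `UIActionTools` signatures in this repo may differ.

```kotlin
// Hypothetical minimal tool registry; the real ToolRegistry / UIActionTools
// signatures in this repo may differ.
data class Tool(
    val name: String,
    val description: String,
    val execute: (Map<String, String>) -> String,
)

class ToolRegistry {
    private val tools = LinkedHashMap<String, Tool>()

    fun register(tool: Tool) { tools[tool.name] = tool }

    // Dispatch a parsed tool call; unknown names become an error string the
    // agent can feed back to the model as LAST_RESULT on the next turn.
    fun execute(name: String, args: Map<String, String>): String =
        tools[name]?.execute?.invoke(args) ?: "error: unknown tool '$name'"
}
```

A `ui_tap` registration would then wrap the accessibility gesture, e.g. `register(Tool("ui_tap", "Tap element by index") { args -> /* dispatch tap */ "tapped ${args["index"]}" })`.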
1. Place the RunAnywhere SDK AARs in `libs/`.
2. (Optional) For cloud fallback, add your key to `gradle.properties`: `GPT52_API_KEY=sk-your-key-here`
3. Build: `./gradlew assembleDebug`
4. Install: `adb install -r app/build/outputs/apk/debug/app-debug.apk`
5. Enable the accessibility service: Settings > Accessibility > Android Use Agent.
6. Open the app, select a model (Qwen3-4B recommended), and enter a goal (e.g., “Open YouTube and search for lofi music”). Tap “Start Agent”.
7. (Optional) Load the VLM model from the UI for enhanced visual understanding.
The agent loop ends when the model calls `ui_done`, the maximum step count (30) is reached, or the 10-minute timeout expires.

This project was inspired by android-action-kernel by Action State Labs.
Built by the RunAnywhere team. For questions, reach out to san@runanywhere.ai