Task: Post “Hi from RunAnywhere Android agent” on X (Twitter) using on-device LLM inference, navigating from the home feed with no cloud dependency.
Result: ✅ PASS — Qwen3 4B, 6 steps, 3 real LLM inferences, goal-aware element filtering, ~4 min total.
| Spec | Detail |
|---|---|
| Device | Samsung Galaxy S24 (SM-S921U1) |
| SoC | Snapdragon 8 Gen 3 — 1× Cortex-X4 @ 3.39 GHz + 5× A720 + 2× A520 |
| RAM | 8 GB LPDDR5X |
| OS | Android 16 (One UI 7) |
| Backend | llama.cpp (GGUF Q4_K_M) via RunAnywhere SDK |
Critical hardware quirk — Samsung background throttling: One UI pins background processes to the efficiency cores (A520 @ 2.27 GHz), capping inference at 0.19 tok/s. Bringing the app to the foreground during inference restores full CPU access: 2.4–25 tok/s (15–17× improvement). This foreground boost is mandatory for any on-device LLM on Samsung.
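A minimal sketch of the workaround, assuming the agent owns an Activity it can raise before each inference step (AgentActivity and runInference are illustrative names, not the RunAnywhere SDK API):

```kotlin
import android.app.Activity
import android.content.Context
import android.content.Intent

class AgentActivity : Activity() // placeholder for the agent's real Activity

// Illustrative llama.cpp wrapper — not the RunAnywhere SDK API.
fun runInference(prompt: String): String = TODO("llama.cpp binding")

fun runBoostedInference(context: Context, prompt: String): String {
    // Bring the agent to the foreground so One UI lifts the efficiency-core pin.
    val intent = Intent(context, AgentActivity::class.java).apply {
        flags = Intent.FLAG_ACTIVITY_NEW_TASK or Intent.FLAG_ACTIVITY_SINGLE_TOP
    }
    context.startActivity(intent)
    return runInference(prompt) // decodes on the big cores: 2.4–25 tok/s vs 0.19
}
```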
| Model | Size | Speed (fg) | Step Latency | Tool Format | UC1 Result |
|---|---|---|---|---|---|
| LFM2-350M Base | 229 MB | ~20 tok/s | 7–12s | ❌ Narrates instead of calling tools | FAIL |
| LFM2.5-1.2B Instruct | 731 MB | ~2.8 tok/s | 8–14s | ✅ Valid | FAIL — always picks index 0–2 |
| Qwen3-4B (/no_think) | 2.5 GB | ~4 tok/s | 67–85s | ✅ Valid | PASS |
| LFM2-8B-A1B MoE | 5 GB | ~5 tok/s | 29–43s | ⚠️ Emits multi-action plans | FAIL — only 1st action runs |
| DS-R1-Qwen3-8B | 5 GB | ~1.1 tok/s | ~197s | ❌ Hallucinates an inner agent loop | FAIL |
Key finding: there is a hard capability threshold around 4B parameters. Sub-2B models either can’t follow the tool-call format (350M) or can’t reason about element selection (1.2B). The 8B models reason better but fail on output format or speed. Qwen3-4B with /no_think is the only viable on-device model for this task.
/no_think matters: with chain-of-thought enabled, Qwen3-4B spends 95%+ of its 512-token budget inside `<think>` and runs out of space before emitting the tool call. /no_think forces a direct 18-token output.
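Qwen3 honors /no_think as a soft switch inside the prompt itself. A minimal sketch of the prompt construction under that assumption (ChatML template; the completion call is whatever llama.cpp binding you use, and the screen/goal wiring is illustrative):

```kotlin
// Sketch: Qwen3 ChatML prompt with the /no_think soft switch appended, so the
// reply opens with an empty <think></think> block instead of burning the
// 512-token budget on chain-of-thought.
fun buildNoThinkPrompt(goal: String, screen: String): String = """
    <|im_start|>user
    Goal: $goal
    Screen elements:
    $screen
    Respond with exactly one tool call. /no_think<|im_end|>
    <|im_start|>assistant
""".trimIndent()
```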
| # | Strategy | Model | LLM Calls | Outcome |
|---|---|---|---|---|
| 1 | Pure LLM — no assists | LFM2.5 1.2B | All | ❌ FAIL — FAB at raw index 13, model always taps 0 |
| 2 | Keyword FAB tap, LLM handles compose | LFM2.5 1.2B | Partial | ❌ FAIL — ComposerActivity destroyed when agent steals foreground for inference |
| 3 | Deep link to compose + SINGLE_TOP + quick POST | LFM2.5 1.2B | 0 | ✅ PASS (~20s) — but skips home feed entirely, looks scripted |
| 4 | Full programmatic flow (FAB→compose→type→POST) | LFM2.5 1.2B | 0 | ✅ PASS (~27s) — visible navigation, zero AI reasoning |
| 5 | Goal-aware filter + SINGLE_TOP + targeted guards | Qwen3 4B | 3 | ✅ PASS (6 steps, ~4 min) — real LLM navigation decisions |
Approach 2 failure detail: When the agent steals the foreground for inference, getLaunchIntentForPackage() on return triggers X’s singleTask launch mode — this clears the back stack and destroys any open ComposerActivity. The composed tweet is lost before the model can post it. Fix: use FLAG_ACTIVITY_SINGLE_TOP when ComposerActivity is detected as open.
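A sketch of that fix, assuming the agent tracks an isXComposeOpen flag as in the run log below (com.twitter.android is X’s package name; the rest of the wiring is illustrative):

```kotlin
import android.content.Context
import android.content.Intent

// Return focus to X without destroying an open ComposerActivity.
fun returnToX(context: Context, isXComposeOpen: Boolean) {
    val launch = context.packageManager
        .getLaunchIntentForPackage("com.twitter.android") ?: return
    if (isXComposeOpen) {
        // SINGLE_TOP delivers the intent to the existing task without the
        // back-stack clear that X's singleTask launch mode otherwise triggers.
        launch.flags = Intent.FLAG_ACTIVITY_NEW_TASK or Intent.FLAG_ACTIVITY_SINGLE_TOP
    }
    context.startActivity(launch)
}
```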
Approach 5 is the only one with genuine on-device reasoning. The model made 3 real navigation decisions; the three guards only cover deterministic failure modes that LLM inference cannot solve reliably.
Both LFM2.5 1.2B and Qwen3 4B consistently output ui_tap(0) or ui_tap(1). On X’s home feed, “New post” is at raw accessibility index 13 — unreachable for a model that always picks low indices.
Solution — filterScreenForGoal(compactText, goal): before every inference step, score and re-rank all interactive elements against the goal, keep the top 5, and re-index them 0–4. The model sees a five-element screen where the most relevant action is always at index 0 (see the sketch after the examples below).
Scoring examples (keyword relevance to the goal, plus an editBonus of 10 for editable fields):
Home feed example (goal = “post saying Hi from RunAnywhere Android agent”):
Raw index 13: New post (ImageButton) [tap] → score 8 → filtered index 0 ← model taps this
Raw index 0: Show navigation drawer [tap] → score 1 → filtered index 1
Raw index 1: Timeline settings [tap] → score 1 → filtered index 2
… 16 other elements hidden
ComposerActivity example (same goal):
Raw index 2: What's happening? (EditText) [tap,edit] → score 11 (editBonus) → filtered index 0
Raw index 3: Changes who can reply [tap] → score 2 → filtered index 1
… 10 other elements hidden
Model outputs ui_tap(0) both times — and both times it is the correct action.
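A minimal sketch of the filter, assuming keyword-overlap scoring plus the editBonus of 10 visible in the logs; the weights here are illustrative and will not reproduce the exact scores above:

```kotlin
// One interactive element parsed out of compactText.
data class UiElement(val origIndex: Int, val text: String, val editable: Boolean)

fun filterScreenForGoal(elements: List<UiElement>, goal: String): List<Pair<Int, UiElement>> {
    val goalWords = goal.lowercase().split(Regex("\\W+")).filter { it.length > 2 }.toSet()
    fun score(el: UiElement): Int {
        val overlap = el.text.lowercase().split(Regex("\\W+")).count { it in goalWords }
        return overlap + if (el.editable) 10 else 0 // editBonus, per the logs above
    }
    return elements
        .sortedByDescending { score(it) }
        .take(5)                                       // the model only ever sees 5 elements
        .mapIndexed { filtered, el -> filtered to el } // filtered index → element (origIndex kept)
}
```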
Three guards cover specific failure modes that cannot be solved by LLM inference alone:
| Guard | Trigger | Why LLM can’t handle it | Action |
|---|---|---|---|
| X-NAV | FAB overlay expanded (“Go Live” + “Post Photos” visible) | Overlay collapses when the agent steals the foreground for inference, wasting the ~70s inference window | Tap “New post” immediately, no inference |
| Recovery Strategy 0 | isXComposeOpen, tweet text not typed, loop detected | Model calls ui_tap to focus the EditText but never follows with ui_type | Type xComposeMessage directly via accessibility |
| X-GUARD | Tweet text in compose field + POST button visible + step ≥ 3 | Prevents model from navigating away from a fully-composed tweet | Tap POST directly |
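A sketch of the guard pass that runs before each inference step; the names (Decision, compactText, xComposeMessage) mirror the run log below, but the wiring is illustrative, not the shipped code:

```kotlin
data class Decision(val action: String, val label: String? = null, val text: String? = null)

fun preInferenceGuards(
    compactText: String,
    step: Int,
    isXComposeOpen: Boolean,
    xComposeMessage: String,
    loopDetected: Boolean,
): Decision? = when {
    // X-NAV: the expanded FAB overlay collapses if we steal the foreground.
    "Go Live" in compactText && "Post Photos" in compactText ->
        Decision("tap", label = "New post")
    // Recovery Strategy 0: model loops on ui_tap without ever typing.
    isXComposeOpen && xComposeMessage !in compactText && loopDetected ->
        Decision("type", text = xComposeMessage)
    // X-GUARD: tweet fully composed and POST visible — post it now.
    step >= 3 && xComposeMessage in compactText && "POST" in compactText ->
        Decision("tap", label = "POST")
    else -> null // no guard fired: run a real LLM inference step
}
```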
Model: Qwen3 4B Q4_K_M · Device: Samsung Galaxy S24 · Date: 2026-02-20
PRE-LAUNCH
extractTweetText("...post saying Hi from RunAnywhere Android agent")
→ xComposeMessage = "Hi from RunAnywhere Android agent"
twitter://timeline → X opens on clean home feed
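A sketch of extractTweetText, assuming the goal states the message after “saying” or in quotes; the real parser may be more elaborate:

```kotlin
// Pull the literal tweet text out of the natural-language goal.
fun extractTweetText(goal: String): String? =
    Regex("\"([^\"]+)\"").find(goal)?.groupValues?.get(1)        // quoted message
        ?: Regex("saying (.+)$").find(goal)?.groupValues?.get(1) // "... saying <message>"
```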
STEP 1 [LLM inference — 65.3s — 0.3 tok/s]
Screen: X home feed, 19 elements
FILTER 5/19 → New post (ImageButton) at filtered index 0 [orig=13]
Foreground: Agent app (inference boost active), X in background
Model output: ui_tap({index: 0}) ✓ correct
Executor: filtered=0 → orig=13, tapped FAB
→ X foreground, agent background
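The executor’s remap from filtered index back to the raw accessibility node might look like this, reusing the UiElement sketch above (the node-list wiring is illustrative):

```kotlin
import android.view.accessibility.AccessibilityNodeInfo

// Resolve the model's filtered index back to the raw node and tap it.
fun executeTap(
    filteredIndex: Int,
    ranked: List<Pair<Int, UiElement>>, // output of filterScreenForGoal
    nodes: List<AccessibilityNodeInfo>, // screen nodes in raw order
): Boolean {
    val orig = ranked.getOrNull(filteredIndex)?.second?.origIndex ?: return false
    return nodes[orig].performAction(AccessibilityNodeInfo.ACTION_CLICK)
}
```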
STEP 2 [X-NAV guard — <1s — no inference]
Screen: FAB overlay expanded, 22 elements ("Go Live", "Post Photos" visible)
FILTER 5/22 → New post at filtered index 0 [orig=16]
Guard: overlay would collapse on foreground steal → tap immediately
→ ComposerActivity opens, isXComposeOpen = true
STEP 3 [LLM inference — 72.4s — 0.3 tok/s]
Screen: ComposerActivity, 12 elements, compose field empty
FILTER 5/12 → What's happening? (EditText) at filtered index 0 [orig=2, editBonus=10]
Foreground: Agent app (SINGLE_TOP keeps ComposerActivity alive in background)
Model output: ui_tap({index: 0}) ✓ focuses compose field
→ X ComposerActivity foreground (SINGLE_TOP, compose preserved)
STEP 4 [LLM inference — 78.3s — 0.3 tok/s]
Screen: identical (field focused but empty)
FILTER: same, EditText still at index 0
Model output: ui_tap({index: 0}) — taps EditText again (loop begins)
STEP 5 [Recovery Strategy 0 — <1s — no inference]
Loop detected: steps 3+4 both tapped filtered=0
isXComposeOpen=true, tweet text not in compactText
→ actionExecutor.execute(Decision("type", text="Hi from RunAnywhere Android agent"))
Log: "[RECOVERY] Compose field empty — typing tweet text directly"
STEP 6 [X-GUARD — <1s — no inference]
Screen: 13 elements, POST (Button) at raw index 1
"Hi from RunAnywhere Android agent" present in compactText → textTyped=true
→ tapped POST (orig=1)
Log: "[X-GUARD] Tweet ready — tapping POST at index 1"
Log: "Tweet posted successfully!"
Log: "Goal achieved: tweet posted"
Status: DONE ✅
SUMMARY
Steps: 6/30
LLM inferences: 3 (steps 1, 3, 4 — ~70s each)
Guard actions: 3 (X-NAV, Recovery, X-GUARD — <1s each)
Total time: ~4 min 10s (216s inference + ~34s UI transitions)
Inference speed: 0.3 tok/s (Qwen3 4B, foreground-boosted)
PIPELINE TIMELINE (elapsed from agent start)
T+0:00–0:02 PRE-LAUNCH App init, goal parsed, twitter://timeline fires ~2s
T+0:02–1:07 STEP 1 Agent app → foreground, LLM inference, FAB tapped 65s
T+1:07–1:08 STEP 2 X-NAV guard fires, overlay → "New post" → Composer <1s
T+1:08–1:20 (transition) Composer opens, agent detects ComposerActivity ~12s
T+1:20–2:32 STEP 3 Agent app → foreground, LLM inference, tap EditText 72s
T+2:32–2:42 (transition) X returns to foreground (SINGLE_TOP), field focused ~10s
T+2:42–4:00 STEP 4 Agent app → foreground, LLM inference, tap EditText 78s
T+4:00–4:01 STEP 5 Recovery Strategy 0 → tweet text typed directly <1s
T+4:01–4:02 STEP 6 X-GUARD → POST tapped → tweet live <1s
─────────────────────────────────────────────────────────────────────
Total 3 inferences (216s) + guards (<3s) + transitions ~4:10
| Step | Who decided | Outcome |
|---|---|---|
| Open X home feed | Hardcoded (twitter://timeline) | Clean entry point |
| Step 1: tap FAB | LLM — filter put FAB at index 0 | ✅ Correct |
| Step 2: tap “New post” in overlay | Guard (X-NAV) | ✅ Correct — LLM would have collapsed the overlay |
| Step 3: focus compose field | LLM — filter put EditText at index 0 | ✅ Correct |
| Step 4: focus compose field | LLM — same decision | ✅ Valid (loop trigger) |
| Step 5: type tweet text | Guard (Recovery) — loop detected | ✅ Correct — model can’t chain ui_tap → ui_type |
| Step 6: tap POST | Guard (X-GUARD) | ✅ Correct — safety net |
The model made 3 real navigation decisions. All 3 were correct — not coincidentally, but because goal-aware filtering placed the right element at index 0 before the model saw the screen.
Tweet posted live during screen recording. Agent status: DONE (green), logs: “Tweet posted successfully! / Goal achieved: tweet posted”.
Hi from RunAnywhere Android agent — @RunAnywhereAI · Feb 20, 2026
Built by the RunAnywhere team · san@runanywhere.ai