runanywhere-sdks

Android Use Agent — X Post Demo Report

Task: Post “Hi from RunAnywhere Android agent” on X (Twitter) using on-device LLM inference, navigating from the home feed with no cloud dependency.

Result: ✅ PASS — Qwen3 4B, 6 steps, 3 real LLM inferences, goal-aware element filtering, ~4 min total.


Device

Spec Detail
Device Samsung Galaxy S24 (SM-S931U1)
SoC Snapdragon 8 Gen 3 — 1× Cortex-X4 @ 3.39 GHz + 3× A720 + 4× A520
RAM 8 GB LPDDR5X
OS Android 16 (One UI 7)
Backend llama.cpp (GGUF Q4_K_M) via RunAnywhere SDK

Critical hardware quirk — Samsung background throttling: OneUI pins background processes to efficiency cores (A520 @ 2.27 GHz), capping inference at 0.19 tok/s. Bringing the app to the foreground during inference restores full CPU access: 2.4–25 tok/s (15–17× improvement). This foreground boost is mandatory for any on-device LLM on Samsung.


Models Benchmarked (UC1: “Open X and tap the post button”)

Model Size Speed (fg) Step Latency Tool Format UC1 Result
LFM2-350M Base 229 MB ~20 tok/s 7–12s ❌ Narrates instead of calls tools FAIL
LFM2.5-1.2B Instruct 731 MB ~2.8 tok/s 8–14s ✅ Valid FAIL — always picks index 0–2
Qwen3-4B (/no_think) 2.5 GB ~4 tok/s 67–85s ✅ Valid PASS
LFM2-8B-A1B MoE 5 GB ~5 tok/s 29–43s ⚠️ Emits multi-action plans FAIL — only 1st action runs
DS-R1-Qwen3-8B 5 GB ~1.1 tok/s ~197s ❌ Hallucinated inner agent loop FAIL

Key finding: There is a hard capability threshold around 4B parameters. Sub-2B models either can’t follow tool-call format (350M) or can’t reason about element selection (1.2B). 8B models have better reasoning but wrong output format or wrong speed. Qwen3-4B with /no_think is the only viable on-device model for this task.

/no_think matters: with chain-of-thought enabled, Qwen3-4B spends 95%+ of its 512-token budget on <think> and runs out of space before the tool call. /no_think forces a direct 18-token output.


Approaches Tried

# Strategy Model LLM Calls Outcome
1 Pure LLM — no assists LFM2.5 1.2B All ❌ FAIL — FAB at raw index 13, model always taps 0
2 Keyword FAB tap, LLM handles compose LFM2.5 1.2B Partial ❌ FAIL — ComposerActivity destroyed when agent steals foreground for inference
3 Deep link to compose + SINGLE_TOP + quick POST LFM2.5 1.2B 0 ✅ PASS (~20s) — but skips home feed entirely, looks scripted
4 Full programmatic flow (FAB→compose→type→POST) LFM2.5 1.2B 0 ✅ PASS (~27s) — visible navigation, zero AI reasoning
5 Goal-aware filter + SINGLE_TOP + targeted guards Qwen3 4B 3 PASS (6 steps, ~4 min) — real LLM navigation decisions

Approach 2 failure detail: When the agent steals the foreground for inference, getLaunchIntentForPackage() on return triggers X’s singleTask launch mode — this clears the back stack and destroys any open ComposerActivity. The composed tweet is lost before the model can post it. Fix: use FLAG_ACTIVITY_SINGLE_TOP when ComposerActivity is detected as open.

Approach 5 is the only one with genuine on-device reasoning. The model made 3 real navigation decisions; guards only cover two deterministic failure modes that LLM inference cannot solve reliably.


Goal-Aware Element Filtering

Both LFM2.5 1.2B and Qwen3 4B consistently output ui_tap(0) or ui_tap(1). On X’s home feed, “New post” is at raw accessibility index 13 — unreachable for a model that always picks low indices.

Solution — filterScreenForGoal(compactText, goal): Before every inference step, score and re-rank all interactive elements against the goal, take the top 5, and re-index them 0–4. The model sees a 5-element screen where the most relevant action is always at index 0.

Scoring:

Home feed example (goal = “post saying Hi from RunAnywhere Android agent”):

Raw index 13: New post (ImageButton) [tap]  → score 8  → filtered index 0  ← model taps this
Raw index  0: Show navigation drawer [tap]  → score 1  → filtered index 1
Raw index  1: Timeline settings [tap]       → score 1  → filtered index 2
… 16 other elements hidden

ComposerActivity example (same goal):

Raw index 2: What's happening? (EditText) [tap,edit]  → score 11 (editBonus)  → filtered index 0
Raw index 3: Changes who can reply [tap]               → score 2               → filtered index 1
… 10 other elements hidden

Model outputs ui_tap(0) both times — and both times it is the correct action.


Guards (Minimal Hardcoded Assists)

Three guards cover specific failure modes that cannot be solved by LLM inference alone:

Guard Trigger Why LLM can’t handle it Action
X-NAV FAB overlay expanded (“Go Live” + “Post Photos” visible) Overlay collapses when agent steals foreground for inference — a ~70s window lost Tap “New post” immediately, no inference
Recovery Strategy 0 isXComposeOpen, tweet text not typed, loop detected Model calls ui_tap to focus EditText but never follows with ui_type Type xComposeMessage directly via accessibility
X-GUARD Tweet text in compose field + POST button visible + step ≥ 3 Prevents model from navigating away from a fully-composed tweet Tap POST directly

Live Run Trace

Model: Qwen3 4B Q4_K_M · Device: Samsung Galaxy S24 · Date: 2026-02-20

PRE-LAUNCH
  extractTweetText("...post saying Hi from RunAnywhere Android agent")
    → xComposeMessage = "Hi from RunAnywhere Android agent"
  twitter://timeline → X opens on clean home feed

STEP 1  [LLM inference — 65.3s — 0.3 tok/s]
  Screen: X home feed, 19 elements
  FILTER 5/19 → New post (ImageButton) at filtered index 0  [orig=13]
  Foreground: Agent app (inference boost active), X in background
  Model output: ui_tap({index: 0})  ✓ correct
  Executor: filtered=0 → orig=13, tapped FAB
  → X foreground, agent background

STEP 2  [X-NAV guard — <1s — no inference]
  Screen: FAB overlay expanded, 22 elements ("Go Live", "Post Photos" visible)
  FILTER 5/22 → New post at filtered index 0  [orig=16]
  Guard: overlay would collapse on foreground steal → tap immediately
  → ComposerActivity opens, isXComposeOpen = true

STEP 3  [LLM inference — 72.4s — 0.3 tok/s]
  Screen: ComposerActivity, 12 elements, compose field empty
  FILTER 5/12 → What's happening? (EditText) at filtered index 0  [orig=2, editBonus=10]
  Foreground: Agent app (SINGLE_TOP keeps ComposerActivity alive in background)
  Model output: ui_tap({index: 0})  ✓ focuses compose field
  → X ComposerActivity foreground (SINGLE_TOP, compose preserved)

STEP 4  [LLM inference — 78.3s — 0.3 tok/s]
  Screen: identical (field focused but empty)
  FILTER: same, EditText still at index 0
  Model output: ui_tap({index: 0})  — taps EditText again (loop begins)

STEP 5  [Recovery Strategy 0 — <1s — no inference]
  Loop detected: steps 3+4 both tapped filtered=0
  isXComposeOpen=true, tweet text not in compactText
  → actionExecutor.execute(Decision("type", text="Hi from RunAnywhere Android agent"))
  Log: "[RECOVERY] Compose field empty — typing tweet text directly"

STEP 6  [X-GUARD — <1s — no inference]
  Screen: 13 elements, POST (Button) at raw index 1
  "Hi from RunAnywhere Android agent" present in compactText → textTyped=true
  → tapped POST (orig=1)
  Log: "[X-GUARD] Tweet ready — tapping POST at index 1"
  Log: "Tweet posted successfully!"
  Log: "Goal achieved: tweet posted"
  Status: DONE ✅

SUMMARY
  Steps:          6/30
  LLM inferences: 3  (steps 1, 3, 4 — ~70s each)
  Guard actions:  3  (X-NAV, Recovery, X-GUARD — <1s each)
  Total time:     ~4 min 10s  (216s inference + ~34s UI transitions)
  Inference speed: 0.3 tok/s (Qwen3 4B, foreground-boosted)

PIPELINE TIMELINE (elapsed from agent start)
  T+0:00–0:02   PRE-LAUNCH     App init, goal parsed, twitter://timeline fires    ~2s
  T+0:02–1:07   STEP 1         Agent app → foreground, LLM inference, FAB tapped  65s
  T+1:07–1:08   STEP 2         X-NAV guard fires, overlay → "New post" → Composer  <1s
  T+1:08–1:20   (transition)   Composer opens, agent detects ComposerActivity      ~12s
  T+1:20–2:32   STEP 3         Agent app → foreground, LLM inference, tap EditText 72s
  T+2:32–2:42   (transition)   X returns to foreground (SINGLE_TOP), field focused ~10s
  T+2:42–4:00   STEP 4         Agent app → foreground, LLM inference, tap EditText 78s
  T+4:00–4:01   STEP 5         Recovery Strategy 0 → tweet text typed directly     <1s
  T+4:01–4:02   STEP 6         X-GUARD → POST tapped → tweet live                 <1s
                ─────────────────────────────────────────────────────────────────────
  Total                        3 inferences (216s) + guards (<3s) + transitions    ~4:10

What Was LLM vs. Guard

Step Who decided Outcome
Open X home feed Hardcoded (twitter://timeline) Clean entry point
Step 1: tap FAB LLM — filter put FAB at index 0 ✅ Correct
Step 2: tap “New post” in overlay Guard (X-NAV) ✅ Correct — LLM would have collapsed the overlay
Step 3: focus compose field LLM — filter put EditText at index 0 ✅ Correct
Step 4: focus compose field LLM — same decision ✅ Valid (loop trigger)
Step 5: type tweet text Guard (Recovery) — loop detected ✅ Correct — model can’t chain ui_tapui_type
Step 6: tap POST Guard (X-GUARD) ✅ Correct — safety net

The model made 3 real navigation decisions. All 3 were correct — not coincidentally, but because goal-aware filtering placed the right element at index 0 before the model saw the screen.


Proof

Tweet posted live during screen recording. Agent status: DONE (green), logs: “Tweet posted successfully! / Goal achieved: tweet posted”.

Hi from RunAnywhere Android agent — @RunAnywhereAI · Feb 20, 2026


Built by the RunAnywhere team · san@runanywhere.ai