Agent Guide
You're an AI agent that just connected to a real iOS device. This page is for you.
You now have eyes and hands on a running app. You can see every pixel, tap any button, type into any field, scroll through any list, and read the console output — including print() statements you wrote yourself. This is the full feedback loop: write code, run it, see it, interact with it, debug it, fix it.
This guide teaches you how to be effective.
What you have
You have 31 MCP tools. You don't need to memorize them — but you need to understand the shapes.
| Shape | What it does | Key tools |
|---|---|---|
| See | Look at the screen, read the UI structure | screenshot, accessibility_tree |
| Touch | Tap, swipe, scroll, drag, long press | tap, swipe, scroll, drag, long_press |
| Type | Enter and clear text in focused fields | type_text, clear_text |
| Navigate | Move between screens, dismiss things | scroll, tap (back buttons, tabs) |
| Debug | Read runtime output from the app | get_console_output |
| Remember | Store and recall knowledge across sessions | store_note, recall_notes, consolidate_notes |
Every gesture you perform automatically returns a screenshot of the result. You always see what happened. You never need to call screenshot after a gesture.
Why this matters
Before Haptix, you wrote SwiftUI code and hoped it looked right. You asked the developer to describe what they saw. You couldn't verify layout, test interactions, or see runtime behavior.
Now you can:
- Build → run → see — Verify your changes actually look correct on a real device
- Tap through flows — Test navigation, form submissions, edge cases
- Read the console — See print() output, errors, warnings, crash reasons
- Write your own debug prints — Add print("DEBUG: \(someValue)") to the code, build, trigger the code path, and read the output back
- Catch bugs visually — Misaligned views, missing data, broken layouts — you can see them yourself
- Prove your work — The developer trusts you more because you can show that your changes work
You're not guessing anymore. You have the full picture.
The core loop
Everything you do follows one pattern:
Look → Act → Look
- Take a screenshot or read the accessibility tree to understand the current state
- Perform a gesture (tap, scroll, type)
- The response includes a new screenshot — check it
If something unexpected happens, read the console. Errors often explain why.
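The loop above can be sketched as code. The tool functions below are hypothetical stubs standing in for real MCP calls; a real screenshot returns an image, not a state label.

```python
# Minimal sketch of the Look → Act → Look loop.
# screenshot/tap are hypothetical stand-ins for the real MCP tools.

def screenshot():
    # Stand-in: a real call returns an image; here, a state label.
    return "login screen"

def tap(label):
    # Stand-in: every gesture response includes a fresh screenshot.
    return {"hit": label, "screenshot": "home screen"}

def core_loop():
    before = screenshot()            # 1. Look: understand current state
    result = tap(label="Sign In")    # 2. Act: perform a gesture
    after = result["screenshot"]     # 3. Look: check the returned screenshot
    return before, result["hit"], after

print(core_loop())
```

The key point the sketch encodes: step 3 reads the screenshot already in the gesture response, rather than making a second screenshot call.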
How to see the screen
Screenshots
screenshot is your primary tool. Use the modes:
- Default — captures the app content
- annotated: true — overlays bounding boxes with labels, identifiers, and coordinates on every UI element. Green boxes are interactive, blue are static
- filter: "interactive" — only shows tappable elements. Massive noise reduction on complex screens
- highlight: "elementId" — spotlights a single element with a yellow box and crosshair
When reviewing a screen for the first time, use annotated with filter: "interactive" to see what you can tap. Use highlight when you know the identifier but need to see where it sits.
Accessibility tree
accessibility_tree(mode: "compact") returns a flat list of meaningful UI elements with labels, identifiers, values, traits, and frame coordinates. It costs ~2,000 tokens.
accessibility_tree(mode: "full") returns the complete nested view hierarchy. It costs ~20,000 tokens. Only use this when you need to understand parent-child relationships.
Prefer compact. It's 10x cheaper and usually gives you everything you need.
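Because the compact tree is a flat list, scanning it is a simple linear pass. A sketch, assuming an illustrative element shape (the real Haptix wire format may differ):

```python
# Sketch: scanning a compact (flat) accessibility tree for a target
# identifier. The element dicts below are illustrative, not the exact
# format the tool returns.

compact_tree = [
    {"label": "Inbox", "identifier": "tab_inbox", "traits": ["button"]},
    {"label": "Submit", "identifier": "submitButton", "traits": ["button"]},
    {"label": "Welcome", "identifier": None, "traits": ["staticText"]},
]

def find_element(tree, identifier):
    # Flat list: a linear scan suffices, no recursion into children
    # as the full nested hierarchy would require.
    for element in tree:
        if element["identifier"] == identifier:
            return element
    return None

print(find_element(compact_tree, "submitButton")["label"])
```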
How to tap things
Always prefer identifier over label over coordinates.
tap(identifier: "submitButton") ← best: survives layout changes
tap(label: "Submit") ← good: works if the label is unique
tap(x: 210, y: 500) ← last resort: breaks when layout changes
Identifiers resolve to the exact semantic element via accessibility APIs. Coordinates often land on generic container views like _UIMoreListTableView or UpdateCoalescingCollectionView — the tap technically "hits" but nothing happens.
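The preference order can be encoded as a small helper. Everything here is a hypothetical sketch: tap is a stub, and tap_best assumes the element record carries identifier, label, and a center point.

```python
# Sketch of the targeting priority: identifier, then label, then
# coordinates. tap() is a hypothetical stub for the real MCP call.

def tap(identifier=None, label=None, x=None, y=None):
    if identifier:
        return f"tapped by identifier: {identifier}"
    if label:
        return f"tapped by label: {label}"
    return f"tapped at ({x}, {y})"

def tap_best(element):
    # Prefer the most layout-resilient handle the element offers.
    if element.get("identifier"):
        return tap(identifier=element["identifier"])
    if element.get("label"):
        return tap(label=element["label"])
    cx, cy = element["center"]        # last resort: raw coordinates
    return tap(x=cx, y=cy)

print(tap_best({"identifier": "submitButton", "label": "Submit", "center": (210, 500)}))
```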
Tab bars are especially problematic with coordinates. Always use tap(label: "Inbox"), tap(label: "Settings"), never coordinates. Tab bar buttons are not reliably hittable by position.
Read the hit feedback
Every tap response includes hit feedback:
[Hit] "Submit" [button] identifier: "submitButton" -- base layer
Read it. If it says something unexpected, you tapped the wrong thing. Re-assess before tapping again.
Parameter types
All coordinate parameters (x, y, startX, startY, endX, endY) must be numbers, not strings.
tap(x: 210, y: 500) ← correct
tap(x: "210", y: "500") ← fails
How to scroll
The scroll tool
For page navigation, use scroll. No coordinates needed.
scroll(direction: "down", amount: "medium")
Amounts: small (25%), medium (50%), large (75%), full_page (100%).
Scroll in medium steps. Check what's visible after each scroll. Scroll again if needed. Don't try to jump to the bottom in one go — you'll overshoot and miss things.
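The scroll-check-repeat pattern looks like this in sketch form, with a simulated 30-item list standing in for the real screen and a cap so the loop cannot run forever:

```python
# Sketch: scroll in medium steps until a target appears. The list and
# scroll stub simulate a screen showing 10 of 30 items at a time.

items = [f"Item {i}" for i in range(30)]
visible_start = 0                     # index of the first visible item

def visible():
    return items[visible_start:visible_start + 10]

def scroll_down_medium():
    # A medium scroll advances about half a page in this simulation.
    global visible_start
    visible_start = min(visible_start + 5, len(items) - 10)

def scroll_until_visible(target, max_scrolls=8):
    for _ in range(max_scrolls):
        if target in visible():       # check after every step
            return True
        scroll_down_medium()
    return target in visible()

print(scroll_until_visible("Item 22"))
```

Medium steps with a check after each one are what keep you from overshooting; the max_scrolls cap is what keeps a missing item from looping forever.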
For nested scroll views (a list inside a tab, a form inside a sheet), use the identifier parameter to target the right one:
scroll(direction: "down", amount: "medium", identifier: "messageList")
When to use swipe instead
swipe takes start and end coordinates. Use it for:
- Back navigation — swipe from the left edge: swipe(startX: 0, startY: 400, endX: 300, endY: 400)
- Swipe-to-delete — horizontal swipe on a list row
- Picker wheels — small vertical swipes directly on the wheel column
- Custom gesture controls — anything that needs precise start/end positions
When to use drag
drag is like swipe but with a dwell phase — a brief pause at the start point before moving. This tells iOS "I want to move this thing, not scroll past it." Use it for:
- Dismissing keyboards — drag down from the content area past the bottom of the screen (more on this below)
- Slider manipulation
- Reordering items (when it works — currently has limitations)
How to dismiss the keyboard
The keyboard doesn't go away on its own after type_text. Here are your options, in order of reliability:
1. Drag it away (most reliable)
Use drag to touch the scrollable content area above the keyboard and drag downward past the screen bottom. This triggers iOS scroll-to-dismiss behavior.
drag(startX: 200, startY: 400, endX: 200, endY: 900)
This works on views that use Form, List, or a ScrollView with .scrollDismissesKeyboard(.interactively) — which covers most well-built apps.
2. Tap "Done" or "Return"
If the keyboard toolbar has a "Done" button, or the return key type is set to submit, tap it.
3. Tap the next field
If you're moving to another text field, just tap it. The keyboard stays but focus moves. No need to dismiss between fields.
What doesn't work
scroll does not dismiss the keyboard. It scrolls the content behind the keyboard but doesn't trigger the dismiss gesture. Don't waste time trying it.
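The fallback order can be sketched as a helper that tries each option in turn. The drag and tap stubs are hypothetical; the drag_works flag simulates views that lack the scroll-to-dismiss modifier.

```python
# Sketch of keyboard dismissal in reliability order: drag first, then
# a Done button. All tool calls are hypothetical stubs.

def drag(startX, startY, endX, endY, works=True):
    # Simulate scroll-to-dismiss; it fails on views without the
    # .scrollDismissesKeyboard(.interactively) behavior.
    return {"keyboard_visible": not works}

def tap(label):
    return {"keyboard_visible": False}

def dismiss_keyboard(drag_works=True, has_done_button=False):
    # 1. Most reliable: drag the content area down past the screen bottom.
    result = drag(200, 400, 200, 900, works=drag_works)
    if not result["keyboard_visible"]:
        return "dismissed by drag"
    # 2. Fall back to a Done button if the toolbar has one.
    if has_done_button:
        tap(label="Done")
        return "dismissed by Done"
    return "still visible"

print(dismiss_keyboard())
```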
How to navigate
Going back
Tap the back button — usually top-left. Find it via the accessibility tree if you're not sure what it's labeled.
Or swipe from the left edge of the screen:
swipe(startX: 0, startY: 400, endX: 300, endY: 400)
Tab bars
Always tap by label:
tap(label: "Inbox")
tap(label: "Settings")
tap(label: "Profile")
Never use coordinates for tab bar buttons.
Dismissing sheets and modals
Presented layers (sheets, alerts, popovers, confirmation dialogs) sit on top of everything. Once one appears, it intercepts all taps — even the tab bar underneath.
Always dismiss the current layer before trying to interact with content behind it.
- Look for close buttons: "X", "Done", "Cancel", "Dismiss"
- Swipe down on the sheet handle (if the sheet is dismissible)
- For alerts: tap the action button ("OK", "Allow", "Delete")
If an alert or sheet gets stuck and you can't dismiss it, the entire app becomes untestable. Check the console for clues.
How to type
- Tap the field — always first. type_text does nothing without a focused field
- type_text — appends to existing text
- clear_text then type_text — replaces all text
For multi-field forms, tap and type each field in sequence. The keyboard persists between fields — no need to dismiss and re-summon it.
Secure text fields (passwords) work the same way. type_text works even though the display shows dots.
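The tap-then-type sequence for a multi-field form can be sketched as a loop. The stubs below just record the call order; the point is that every type_text is preceded by a tap, with no dismissal in between.

```python
# Sketch: fill a multi-field form by tapping, then typing, each field.
# The keyboard persists between fields, so nothing is dismissed mid-form.
# tap/type_text are hypothetical stubs recording the call sequence.

calls = []

def tap(label):
    calls.append(("tap", label))      # focus the field first

def type_text(text):
    calls.append(("type", text))      # types into the focused field

def fill_form(fields):
    for label, value in fields:
        tap(label)                    # always tap before typing
        type_text(value)

fill_form([("Name", "Jane Doe"), ("Email", "jane@example.com")])
print(calls)
```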
How to debug
You have full access to the app's console output: print(), NSLog(), os_log(), errors, warnings, and crash messages.
Console output is already in your responses
Every gesture response auto-includes any console output captured during the action. You're already seeing it — just read the response.
Writing your own debug prints
This is your superpower. You can instrument the app yourself:
- Add print("DEBUG: count = \(items.count)") to suspicious code
- Build and run
- Tap through the app to trigger that code path
- Read the output in the gesture response or via get_console_output
- Find the bug, fix it, remove the print
You can also dump(someObject) for full structure inspection — every property, nested value, and type.
Filtering console output
When you need to cut through noise:
get_console_output(contains: "ERROR") — keyword search
get_console_output(level: "error") — only errors and faults
get_console_output(since: "2026-02-22T10:00:00Z") — only recent output
get_console_output(source: "stderr") — framework warnings
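To make the filter semantics concrete, here is a local sketch of how those parameters narrow a captured log. The log lines and line shape are illustrative, not the tool's actual output format.

```python
# Sketch: applying get_console_output-style filters locally to a
# captured log. The log entries below are illustrative.

log = [
    {"level": "info",  "source": "stdout", "text": "DEBUG: count = 3"},
    {"level": "error", "source": "stderr", "text": "ERROR: decode failed"},
    {"level": "fault", "source": "stderr", "text": "Fatal error: unwrap nil"},
]

def filter_console(lines, contains=None, level=None, source=None):
    out = lines
    if contains:                      # keyword search
        out = [l for l in out if contains in l["text"]]
    if level:                         # severity filter
        out = [l for l in out if l["level"] == level]
    if source:                        # stdout vs stderr
        out = [l for l in out if l["source"] == source]
    return out

print(len(filter_console(log, contains="ERROR")))  # prints 1
```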
If the app crashes
The console often captures the crash reason before the connection drops. After a crash, check the console output from your last interaction — the fatal error or assertion failure message is usually there.
The build-install-verify cycle
When you're iterating on code — writing a fix, building, checking on device — rebuilding the app kills the SDK connection. The Haptix session dies because the app process is replaced.
Follow this pattern:
- end_session before building (clean teardown)
- Build and install the updated app
- Wait 2–3 seconds for the SDK to boot and connect over USB
- start_session (new session — device auto-connects)
- Take a screenshot to verify the new state
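The cycle is strictly ordered, which a sketch makes explicit. The session tools are hypothetical stubs recording call order; build_and_install and the shortened sleep are placeholders for the real build step and the 2–3 second SDK boot wait.

```python
# Sketch of the rebuild cycle as an ordered sequence of calls.
# All functions are stubs; the order is what matters.

import time

steps = []

def end_session():
    steps.append("end_session")       # clean teardown before the build

def build_and_install():
    steps.append("build")             # placeholder for the real build step

def start_session():
    steps.append("start_session")     # new session; device auto-connects

def screenshot():
    steps.append("screenshot")        # verify the new state

def rebuild_cycle():
    end_session()
    build_and_install()
    time.sleep(0.01)                  # stand-in for the 2-3 s SDK boot wait
    start_session()
    screenshot()
    return steps

print(rebuild_cycle())
```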
Do not ask the developer to reconnect MCP when this happens. The MCP connection (your agent to the Haptix Mac app) is fine. It's the Haptix session (Mac app to the device) that needs restarting. These are different failures.
Session errors and what they mean
| Error | What happened | What to do |
|---|---|---|
| "Session already active" | Previous session wasn't ended | Call end_session, then start_session |
| "No matching session" | MCP transport died | The developer needs to reconnect MCP in their agent |
| "Device not found" | App was reinstalled, device identity changed | Call select_device to rebind |
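The table above can be turned into a simple dispatch. Matching on message substrings is an assumption for the sketch; read the actual error text you receive.

```python
# Sketch: mapping session error messages to recovery actions,
# mirroring the table above. Substring matching is an assumption.

def recovery_action(error_message):
    if "already active" in error_message:
        return ["end_session", "start_session"]
    if "No matching session" in error_message:
        return ["ask developer to reconnect MCP"]
    if "Device not found" in error_message:
        return ["select_device"]
    return ["read console, re-assess"]      # unknown error: investigate

print(recovery_action("Session already active"))
```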
How to remember
Notes persist across sessions, tied to specific apps. They're your operational memory — things you figured out that you'll need again next time.
Recall first
Call recall_notes at the start of every session. If you've worked with this app before, your past notes tell you what you already know: which identifiers work, where things are, what's broken, what workarounds you found.
Write terse notes
Notes are not prose. They're short, factual, scannable. Write them like you'd write a comment in code — the minimum needed to jog your memory.
Good notes:
store_note(content: "Settings tab label: 'Preferences'", scope: "app")
store_note(content: "Login button identifier: 'auth_submit'", scope: "app")
store_note(content: "Keyboard dismiss: drag y:400→y:900 on Form", scope: "universal")
store_note(content: "Profile image tap opens sheet, not navigation push", scope: "app")
Bad notes:
store_note(content: "I discovered that the Settings tab is actually labeled Preferences, which was surprising because most apps call it Settings. I found this by using the accessibility tree.", scope: "app")
- app scope — tied to the current app's bundle ID. Facts about this specific app.
- universal scope — applies to all apps. General iOS patterns and workarounds.
Consolidate at the end
Call consolidate_notes at the end of a session. This replaces scattered observations with one clean summary. Keep it tight — future you will thank present you.
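The whole memory pattern fits in one sketch: recall first, store terse scoped facts during the session, consolidate at the end. The in-memory dict is a stand-in for the real note store, and the join-based consolidation is an illustrative simplification.

```python
# Sketch of the session memory pattern. The dict is a hypothetical
# stand-in for the real persistent note store.

notes = {"app": [], "universal": []}

def recall_notes():
    # Call at session start: surfaces everything already known.
    return notes["app"] + notes["universal"]

def store_note(content, scope):
    notes[scope].append(content)

def consolidate_notes():
    # Replace scattered observations with one clean summary
    # (illustrative: a real consolidation rewrites, not just joins).
    merged = "; ".join(notes["app"])
    notes["app"] = [merged]
    return merged

store_note("Settings tab label: 'Preferences'", scope="app")
store_note("Login button identifier: 'auth_submit'", scope="app")
print(consolidate_notes())
```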
What doesn't work yet
Don't waste time retrying these — they're known platform limitations, not your mistakes.
| What | Status | Detail |
|---|---|---|
| Context menu items | Broken | Long press opens the menu, but tapping items inside does nothing |
| Menu-style pickers | Broken | Default SwiftUI .menu picker style — can't select options |
| Share sheets | Broken | System-presented, rejects synthetic touches |
| Drag and drop | Broken | Missing dwell phase for long-press-then-drag patterns |
| Pinch / rotate | Broken | Gesture recognizers reject synthetic multi-touch events |
| Alerts / action sheets | Partial | Sometimes respond to tap(label:); behavior is inconsistent |
See Compatibility for the full matrix and root cause details.
Efficient token usage
- Don't screenshot after gestures — every gesture already returns one
- Use compact accessibility tree (~2K tokens) not full (~20K) unless you need hierarchy
- Filter annotations — filter: "interactive" cuts out static labels and decorative elements
- Use highlight to spotlight one element instead of annotating everything
- Use recall_notes — don't re-discover what you already know from previous sessions
- Use scroll over repeated swipe — one scroll(direction: "down") replaces a calculated swipe(startX:startY:endX:endY:)
Common recipes
Verify a UI change
Make the code change → build → end_session / start_session → screenshot → confirm it looks right.
Fill a form
tap(label: "Name") → type_text("Jane Doe")
tap(label: "Email") → type_text("jane@example.com")
tap(label: "Password") → type_text("secretpass")
drag(startX: 200, startY: 400, endX: 200, endY: 900) ← dismiss keyboard
tap(label: "Sign Up")
Find an item in a long list
screenshot() → not visible
scroll(direction: "down", amount: "medium") → screenshot shows items 10-20, not here
scroll(direction: "down", amount: "medium") → screenshot shows items 20-30, found it
tap(label: "Target Item")
Test a multi-screen flow
Navigate step by step. Screenshot at each stage. Read the console for errors between steps. If something breaks, the last screenshot + console output tells you where.
Debug a visual bug
screenshot(annotated: true) → identify the misaligned element → read its accessibility properties → check the code → fix → rebuild → verify.