Merged
25 changes: 25 additions & 0 deletions docs/issues/cua-driver-v014-sync/plan.md
@@ -0,0 +1,25 @@
# CUA Driver v0.1.4 Sync Plan

## Approach

Apply the upstream `libs/cua-driver` delta from `cua-driver-v0.0.15` to
`cua-driver-v0.1.4`, then re-apply DeepChat fork patches where conflicts overlap.
Keep the DeepChat plugin MCP server configured as the normal `cua-driver` server.

## Compatibility

- Drop the tools removed upstream from the plugin policy and manifest:
`get_accessibility_tree`, `type_text_chars`.
- Keep `type_text` approved for text input and document its `delay_ms` fallback
path in local skills.
- Preserve DeepChat TCC ownership and app bundle identity throughout the driver,
helper build scripts, diagnostics, and permission flows.
- Preserve the DeepChat-owned update path; the bundled helper must not install
standalone upstream releases.

## Validation

- Build the Swift package from the vendored driver source.
- Run focused plugin presenter and packaging/signing tests.
- Run repository formatting, i18n, and lint checks after implementation.
- Inspect protected files for accidental skill or fork-patch regressions.
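
The last check above can be scripted. The following is a minimal sketch, not code from this PR: `find_missing_markers` is a hypothetical helper, and `PROTECTED_MARKERS` only lists the fork strings named in the sync spec. Which files to read is left to the caller.

```python
from __future__ import annotations

from typing import Iterable

# Fork-specific strings named in the sync spec; extend the tuple as needed.
PROTECTED_MARKERS = (
    "DeepChatPermissionProbeCommand",
    "DeepChat Computer Use.app",
    "com.wefonk.deepchat.computeruse",
)


def find_missing_markers(
    sources: Iterable[str],
    markers: Iterable[str] = PROTECTED_MARKERS,
) -> list[str]:
    """Return the markers that appear in none of the given source texts."""
    texts = list(sources)
    return [m for m in markers if not any(m in text for text in texts)]
```

An empty result means every marker survived the sync; any returned name points at a fork patch that may need re-applying.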
29 changes: 29 additions & 0 deletions docs/issues/cua-driver-v014-sync/spec.md
@@ -0,0 +1,29 @@
# CUA Driver v0.1.4 Sync Spec

## Goal

Sync the vendored DeepChat CUA driver fork with upstream `trycua/cua`
`cua-driver-v0.1.4` while preserving DeepChat-specific runtime behavior and local
skill customizations.

## Acceptance Criteria

- Vendored upstream metadata points to tag `cua-driver-v0.1.4` and commit
`d422294b848afec99b979ac1229446c83fa44807`.
- Removed upstream tools `get_accessibility_tree` and `type_text_chars` are no
longer advertised by the DeepChat plugin manifest or policy.
- `type_text` remains the text-entry tool and supports upstream automatic CGEvent
fallback behavior.
- DeepChat fork patches remain intact:
`DeepChatPermissionProbeCommand`, `DeepChat Computer Use.app`,
`com.wefonk.deepchat.computeruse`, non-blocking MCP/daemon startup,
DeepChat-managed update messaging, and window-scoped zoom contexts.
- DeepChat-owned plugin skill files receive only minimal compatibility edits for
removed tool names.
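
The second criterion lends itself to a mechanical check. A minimal sketch, assuming the policy is available as a flat mapping of tool name to decision, as in the `plugin.json` and `tool-policy.json` hunks. `stale_tools` is a hypothetical helper, and extracting that mapping from the enclosing JSON is left to the caller.

```python
from __future__ import annotations

# Tools removed upstream between cua-driver-v0.0.15 and cua-driver-v0.1.4.
REMOVED_TOOLS = frozenset({"get_accessibility_tree", "type_text_chars"})


def stale_tools(policy: dict[str, str]) -> list[str]:
    """Return removed tool names that a tool -> decision mapping still lists."""
    return sorted(name for name in policy if name in REMOVED_TOOLS)
```

Running it over both the manifest and the policy file should yield an empty list once the sync is complete.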

## Non-Goals

- Do not expose a second Claude Code compatibility MCP server in the DeepChat plugin.
- Do not rewrite or replace local DeepChat skill content with upstream skill text.
- Do not change user-facing plugin UX beyond the upstream driver compatibility
surface required for this sync.
9 changes: 9 additions & 0 deletions docs/issues/cua-driver-v014-sync/tasks.md
@@ -0,0 +1,9 @@
# CUA Driver v0.1.4 Sync Tasks

- [x] Apply upstream driver source changes through `cua-driver-v0.1.4`.
- [x] Resolve `CuaDriverCommand.swift` conflicts while keeping DeepChat commands
and startup behavior.
- [x] Update upstream metadata, plugin manifest, and tool policy.
- [x] Patch local DeepChat skill files only for removed tool compatibility.
- [x] Verify protected fork strings and commands remain present.
- [x] Run Swift build, focused Vitest suites, format, i18n, and lint.
2 changes: 0 additions & 2 deletions plugins/cua/plugin.json
@@ -77,7 +77,6 @@
"list_windows": "allow",
"get_screen_size": "allow",
"get_window_state": "allow",
-"get_accessibility_tree": "allow",
"get_cursor_position": "allow",
"get_config": "allow",
"get_recording_state": "allow",
@@ -91,7 +90,6 @@
"scroll": "ask",
"move_cursor": "ask",
"type_text": "ask",
-"type_text_chars": "ask",
"press_key": "ask",
"hotkey": "ask",
"set_value": "ask",
2 changes: 0 additions & 2 deletions plugins/cua/policies/tool-policy.json
@@ -6,7 +6,6 @@
"list_windows": "allow",
"get_screen_size": "allow",
"get_window_state": "allow",
-"get_accessibility_tree": "allow",
"get_cursor_position": "allow",
"get_config": "allow",
"get_recording_state": "allow",
@@ -20,7 +19,6 @@
"scroll": "ask",
"move_cursor": "ask",
"type_text": "ask",
-"type_text_chars": "ask",
"press_key": "ask",
"hotkey": "ask",
"set_value": "ask",
2 changes: 1 addition & 1 deletion plugins/cua/skills/cua-driver/SKILL.md
@@ -25,7 +25,7 @@ Use DeepChat's CUA plugin MCP tools for macOS app automation. Treat the tools ex
2. Start or reuse the target with `launch_app({ bundle_id })`. Use the returned `pid` when available.
3. Inspect windows with `list_windows({ pid })` when the launch result lacks a usable window.
4. Snapshot before every UI action with `get_window_state({ pid, window_id })`.
-5. Act with the matching MCP tool: `click`, `right_click`, `double_click`, `drag`, `scroll`, `type_text`, `type_text_chars`, `press_key`, `hotkey`, `set_value`, `page`, or `launch_app` with `urls`.
+5. Act with the matching MCP tool: `click`, `right_click`, `double_click`, `drag`, `scroll`, `type_text`, `press_key`, `hotkey`, `set_value`, `page`, or `launch_app` with `urls`. For web inputs that reject AX text insertion, call `type_text({ pid, text, delay_ms })` and let the driver use its CGEvent fallback.
6. Snapshot again after each action and verify visible evidence: selected state, changed text, playback progress, new panels, highlighted rows, or updated window content.

Element indices come from the latest `get_window_state` result for the same `pid` and `window_id`. Re-snapshot when an index is missing, stale, or from another window.
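
The snapshot, act, verify loop in steps 4 to 6 can be sketched as follows. This is illustrative only: `call_tool` stands in for whatever MCP client function the host provides (an assumption, not a DeepChat API), and judging whether the second snapshot shows the expected evidence is left to the caller.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical client signature: call_tool(tool_name, arguments) -> result.
ToolCall = Callable[[str, Dict[str, Any]], Any]


def act_and_verify(
    call_tool: ToolCall,
    pid: int,
    window_id: int,
    action: str,
    args: Dict[str, Any],
) -> Tuple[Any, Any, Any]:
    """Snapshot, run one action, then snapshot again so the caller can verify."""
    target = {"pid": pid, "window_id": window_id}
    before = call_tool("get_window_state", target)    # step 4: fresh snapshot
    result = call_tool(action, {"pid": pid, **args})  # step 5: one action
    after = call_tool("get_window_state", target)     # step 6: evidence to check
    return before, result, after
```

Returning both snapshots keeps the comparison explicit, and any `element_index` in `args` should come from `before`, never from an older snapshot.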
2 changes: 1 addition & 1 deletion plugins/cua/vendor/cua-driver/source/.bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 0.0.15
+current_version = 0.1.4
commit = True
tag = True
tag_name = cua-driver-v{new_version}
18 changes: 18 additions & 0 deletions plugins/cua/vendor/cua-driver/source/README.md
@@ -3,3 +3,21 @@
Background computer-use driver for any agents. Speaks MCP over stdio; drives native macOS apps without stealing focus.

**[Documentation](https://cua.ai/docs/cua-driver)** - Installation, guides, and API reference.

## Claude Code computer-use compatibility

Standard Claude Code MCP registration:

```bash
claude mcp add --transport stdio cua-driver -- cua-driver mcp
```

If you want Claude Code's vision/computer-use-style flow to ground on CuaDriver window screenshots, register the compatibility mode:

```bash
claude mcp add --transport stdio cua-computer-use -- cua-driver mcp --claude-code-computer-use-compat
```

This keeps CuaDriver's normal MCP tools and changes only `screenshot`, which requires `pid` and `window_id` and captures that window only.

Use MCP for this Claude Code vision/computer-use-style path. CLI screenshots still work as CuaDriver calls, but they do not expose the `mcp__cua-computer-use__screenshot` tool name that Claude Code appears to use as the image-grounding cue.
Original file line number Diff line number Diff line change
@@ -6,7 +6,7 @@ user explicitly asks to record — the skill does not auto-enable this.

`set_recording` turns on a session-scoped trajectory recorder. While
enabled, every action-tool call (`click`, `right_click`, `scroll`,
-`type_text`, `type_text_chars`, `press_key`, `hotkey`, `set_value`)
+`type_text`, `press_key`, `hotkey`, `set_value`)
writes a numbered turn folder under a caller-chosen output
directory. Read-only tools (`get_window_state`, `list_windows`,
`screenshot`, `list_apps`, permission probes, agent-cursor getters /
@@ -101,7 +101,7 @@ keyed on `(pid, window_id)`, so a recorded
resolve today — the pid is usually different, the window_id always
is. The call returns `Invalid element_index` or `No cached AX
state`. Pixel clicks (`click({pid, x, y})`) and keyboard tools
-(`press_key`, `type_text_chars`, `hotkey`, `type_text` without
+(`press_key`, `hotkey`, `type_text` without
element_index) replay cleanly; element-indexed actions require a
live snapshot that replay doesn't currently re-emit (read-only tools
like `get_window_state` aren't recorded). For a reliable replay, either
Original file line number Diff line number Diff line change
@@ -477,8 +477,8 @@ you're interacting with a long-lived process). In the default
get_window_state → reason over PNG → pixel click`. When you need
`element_index` dispatch (AX-addressable elements, backgrounded
clicks), flip to `som` first: `cua-driver set_config '{"key":
-"capture_mode", "value": "som"}'`, or call `get_accessibility_tree`
-directly. The rest of this section walks through `som` mode, which
+"capture_mode", "value": "som"}'`, then call `get_window_state`
+again. The rest of this section walks through `som` mode, which
is what you want once you've decided element-indexed addressing is
required.

@@ -573,7 +573,7 @@ anchor the conversion against a specific window):
| Focus + send key | `press_key({pid, key, window_id, element_index, modifiers})` | element_index sets AXFocused, then posts key |
| Send key to pid | `press_key({pid, key, modifiers})` | no focus change; key goes to pid's current focus |
| Modifier combo | `hotkey({pid, keys})` | e.g. `["cmd","c"]`; posted per-pid, not HID tap |
-| Unicode keystrokes | `type_text_chars({pid, text, delay_ms})` | CGEvent-to-pid; reaches Chromium/Electron inputs |
+| Unicode keystrokes | `type_text({pid, text, delay_ms})` | AX insertion with CGEvent-to-pid fallback; reaches Chromium/Electron inputs |
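
For callers still holding `type_text_chars` invocations (recorded trajectories, older prompts), the table suggests a direct rename, since both tools take `{pid, text, delay_ms}`. A minimal sketch, assuming calls are stored as a tool name plus an argument mapping; `migrate_call` is a hypothetical helper, and the 30 ms default is only borrowed from the `delay_ms: 30` value used in the skill examples.

```python
from __future__ import annotations

from typing import Any

# Assumed default pacing; taken from the skill examples, not from the driver.
DEFAULT_DELAY_MS = 30


def migrate_call(name: str, args: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Map a removed tool call onto its v0.1.4 replacement when one is known."""
    migrated = dict(args)  # never mutate the caller's mapping
    if name == "type_text_chars":
        # Same argument shape; keep or add delay_ms for the keystroke fallback.
        migrated.setdefault("delay_ms", DEFAULT_DELAY_MS)
        return "type_text", migrated
    return name, migrated
```

`get_accessibility_tree` is deliberately left out: its replacement is a `set_config` change followed by `get_window_state`, which is not a one-to-one rename.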

**All keyboard/text primitives require `pid`.** There is no
frontmost-routed variant — every key goes to the named target via
Original file line number Diff line number Diff line change
@@ -116,11 +116,11 @@ These are where CuaDriver's AX-activation trio matters:
### 8. Slack — sparse Electron, hotkey fallback
**Prompt:** `In Slack, jump to the #pr-reviews channel using the quick switcher (⌘K).`

-**Exercises:** "retry snapshot once, then `hotkey` + `type_text_chars`" fallback path.
+**Exercises:** "retry snapshot once, then `hotkey` + `type_text` with `delay_ms`" fallback path.

**Success:**
- Re-snapshot shows Slack's channel title area (AX role `AXStaticText` or similar) contains `pr-reviews`.
-- Claude uses `hotkey` and `type_text_chars` — NOT `simulate_click` / pixel coords.
+- Claude uses `hotkey` and `type_text` with `delay_ms` — NOT `simulate_click` / pixel coords.

**Fail signals:** Claude drops to `simulate_click` (guardrail violation — pixel fallback is not allowed on sparse AX trees), wrong channel joined, typed text echoes into the message composer instead of the switcher.

@@ -168,7 +168,7 @@ These are where CuaDriver's AX-activation trio matters:
### 12. Chrome proper — omnibox
**Prompt:** `In Google Chrome, open a new tab and navigate to https://trycua.com.`

-**Exercises:** `hotkey(["cmd","t"])` + `type_text_chars` + Return. Chrome's omnibox typically isn't AX-exposed even after activation.
+**Exercises:** `hotkey(["cmd","t"])` + `type_text({pid, text, delay_ms})` + Return. Chrome's omnibox typically isn't AX-exposed even after activation.

**Success:**
- A new Chrome tab whose AX title contains `Cua` or `trycua` is present.
Original file line number Diff line number Diff line change
@@ -46,10 +46,9 @@ pixels:
`hotkey({pid, keys: ["cmd", "enter"]})`, `hotkey({pid, keys:
["cmd", "k"]})`, etc. Posted via `CGEvent.postToPid`, reaches the
target regardless of AX state, no activation required.
-3. For typing into web inputs where `type_text` silently drops
-(input doesn't implement `AXSelectedText`), use `type_text_chars`
-— pure CGEvent keystrokes reach any focused keyboard receiver,
-including Unicode / emoji.
+3. For typing into web inputs where AX insertion is rejected or silently
+dropped, use `type_text` with a small `delay_ms` so the driver can use
+its CGEvent fallback, including Unicode / emoji.
4. If none of the above reaches the target, tell the user this
interaction isn't reachable from the driver today and ask for
guidance.
@@ -86,7 +85,7 @@ documented only as historical context:
```
# DON'T DO THIS — ⌘L steals focus. Use launch_app above.
hotkey({pid, keys: ["cmd", "l"]})
-type_text_chars({pid, text: "https://cua.ai", delay_ms: 30})
+type_text({pid, text: "https://cua.ai", delay_ms: 30})
get_window_state({pid, window_id})
click({pid, window_id, element_index: <suggestion>})
```
@@ -157,7 +156,7 @@ When the target window is **minimized** (genie'd into the Dock):
`AXFocused=true` on a minimized window's descendants doesn't
propagate to real keyboard focus). Symptom: macOS system-alert
beep, or silent no-op. Example: `hotkey cmd+L` +
-`type_text_chars URL` + `press_key return` on minimized Chrome —
+`type_text({pid, text: URL, delay_ms: 30})` + `press_key return` on minimized Chrome —
the URL lands in the omnibox AX value but Return doesn't commit
the navigation.
- **Primary workaround — use `set_value` to commit directly**: For
@@ -464,7 +463,7 @@ Playwright / a WebExtension with Native Messaging for Firefox.
type_text({pid, window_id, element_index: <input_field>, text: "…"})
```

-If it silently drops (some web inputs don't implement
+If AX insertion silently drops (some web inputs don't implement
`AXSelectedText`), click the field first, then use
-`type_text_chars({pid, text})` — pure CGEvent keystrokes delivered
-to the pid, reaching any focused keyboard receiver.
+`type_text({pid, text, delay_ms})` so the driver falls back to
+pid-targeted CGEvent keystrokes.