feat(voice): auto-send dictation transcript + allowlist app-launch commands (#3148 Phase 1) by M3gA-Mind · Pull Request #3168 · tinyhumansai/openhuman

M3gA-Mind · 2026-06-01T20:49:06Z

Closes part of #3148 — Phase 1 quick wins that make hotkey-triggered voice commands execute without a manual send or approval prompt.

Changes

1. Auto-send after transcription

app/src/hooks/useDictationHotkey.ts
Adds autoSend: true to the dictation://insert-text event dispatched when a hotkey transcription completes. Backward-compatible — consumers that don't read the flag are unaffected.

app/src/pages/Conversations.tsx

Adds handleSendMessageRef (updated every render) so the mount-time dictation event handler can access the latest send function without stale closure issues.
When the event carries autoSend: true, calls handleSendMessage(text) directly instead of inserting into the composer textarea. The user no longer needs to press Enter or click Send after speaking.

Before: press hotkey → speak → transcript appears in textarea → user manually sends
After: press hotkey → speak → message sent automatically

2. App-launch shell allowlist

src/openhuman/security/policy_command.rs
Adds open (macOS) and xdg-open (Linux) to READ_ONLY_BASES:

"open",     // open -a Music, open -b com.apple.Safari, open ~/Documents/file.pdf
"xdg-open", // xdg-open music://, xdg-open https://…, xdg-open file.pdf

These commands launch apps or open files in the default viewer — they don't modify the workspace. Classifying them as Read means they execute in Supervised mode without triggering the ApprovalGate, so the agent can say "open my Music player" and it just opens.
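
As a rough illustration of the classification described above, here is a minimal sketch of a base-command allowlist. Only `READ_ONLY_BASES`, `CommandClass::Read`, `open`, `xdg-open`, and the existence of a network class come from this PR. The `classify` function, the `Other` variant, `NETWORK_BASES`, and the placeholder entries are assumptions, not the code in policy_command.rs.

```rust
/// Hypothetical command classes. Only `Read` and a network class are named in the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum CommandClass {
    Read,
    Network,
    Other,
}

/// Illustrative allowlist. `open` and `xdg-open` are the entries this PR adds;
/// `ls` and `cat` are placeholders.
const READ_ONLY_BASES: &[&str] = &["ls", "cat", "open", "xdg-open"];

/// Placeholder list of network commands (assumption).
const NETWORK_BASES: &[&str] = &["curl", "wget"];

/// Classify a command line by its first whitespace-separated token.
pub fn classify(command: &str) -> CommandClass {
    let base = command.split_whitespace().next().unwrap_or("");
    if READ_ONLY_BASES.contains(&base) {
        CommandClass::Read
    } else if NETWORK_BASES.contains(&base) {
        CommandClass::Network
    } else {
        CommandClass::Other
    }
}
```

Under these assumptions, `classify("open -a Music")` yields `CommandClass::Read` while `classify("curl https://api.example.com")` stays in the network class. The sketch looks only at the first token, so it does not handle pipes, `;` chaining, or subshells, which a real policy would need to.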

What's still needed from #3148

Phase 2: Always-on microphone loop (continuous listening without hotkey)
Phase 2: Privacy config (pause when screen locked)
Phase 3: Wake-word detection
Phase 3: Local command router (fast path for common intents)
Phase 4: Voice confirmation loop + UI indicator

Test plan

Press dictation hotkey, say "open my Music player" — Music opens automatically, no Enter required
Press dictation hotkey, say "what time is it" — agent replies without manual send
In supervised mode: agent runs open -a Music — no approval prompt appears
In supervised mode: agent runs curl https://api.example.com — approval prompt still appears (Network class unchanged)
Existing dictation tests pass: pnpm debug unit src/hooks/__tests__/useDictationHotkey
Rust classify tests pass: cargo test policy_command
pnpm typecheck, pnpm format:check, pnpm i18n:check all clean

…mmands

Phase 1 of issue tinyhumansai#3148: quick wins that make hotkey-triggered voice commands execute without a manual send or approval prompt.

Auto-send after transcription:
- useDictationHotkey.ts: adds `autoSend: true` to the `dictation://insert-text` event detail when a hotkey transcription completes.
- Conversations.tsx: the `onDictationInsert` handler checks the new flag; when set, it calls `handleSendMessage(text)` directly instead of inserting into the composer. A `handleSendMessageRef` (updated every render) gives the mount-time effect access to the latest send fn.

Shell allowlist for app-launching:
- security/policy_command.rs: adds `open` (macOS) and `xdg-open` (Linux) to READ_ONLY_BASES so `open -a Music`, `open -b com.apple.Safari`, `xdg-open music://`, etc. classify as CommandClass::Read and execute without triggering the ApprovalGate in Supervised mode.

Closes part of tinyhumansai#3148.

coderabbitai · 2026-06-01T20:49:15Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 50f6b60e-cd7c-4873-a571-2622320e9a9f


…ions

Dedicated tool that opens a named application on the user's machine without requiring shell access or workspace_only = false.

- src/openhuman/tools/impl/system/launch_app.rs: new LaunchAppTool
  - macOS: `open -a "<app_name>"` via LaunchServices
  - Linux: `gtk-launch`, fallback `xdg-open`
  - Windows: `Start-Process` via PowerShell
  - PermissionLevel::ReadOnly, so it never triggers the approval gate
  - Input validation: rejects paths, metacharacters, empty names
  - Unit tests: name, permission, schema, validation, error cases
- src/openhuman/tools/impl/system/mod.rs: register module + pub use
- src/openhuman/tools/ops.rs: add LaunchAppTool to all_tools_with_runtime
- src/openhuman/tools/user_filter.rs: add "launch_app" family, default_enabled = true, mirrors shell family pattern
- app/src/utils/toolDefinitions.ts: add to frontend tool catalog so it appears in Settings → Agent Access with its own toggle

This avoids loosening workspace_only or expanding allowed_commands in the shell tool: launch_app is narrowly scoped to app launching only.

Part of tinyhumansai#3148.
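
The input validation mentioned above (rejecting paths, metacharacters, and empty names) could look roughly like the following sketch. The function names `validate_app_name` and `macos_launch_args` and the exact metacharacter set are assumptions for illustration; they are not taken from launch_app.rs.

```rust
/// Hypothetical validation for an app name passed to a launch tool.
/// Rejects empty names, path-like input, and common shell metacharacters.
pub fn validate_app_name(name: &str) -> Result<&str, String> {
    // Assumed metacharacter set; the real tool may reject more or fewer.
    const METACHARS: &[char] = &[';', '|', '&', '$', '`', '<', '>', '"', '\'', '\n'];

    let trimmed = name.trim();
    if trimmed.is_empty() {
        return Err("app name is empty".to_string());
    }
    if trimmed.contains('/') || trimmed.contains('\\') {
        return Err(format!("app name looks like a path: {}", trimmed));
    }
    if trimmed.chars().any(|c| METACHARS.contains(&c)) {
        return Err(format!("app name contains a shell metacharacter: {}", trimmed));
    }
    Ok(trimmed)
}

/// Build the argument vector for the macOS branch (`open -a <app_name>`).
/// Passing the name as its own argument avoids any shell interpolation.
pub fn macos_launch_args(name: &str) -> Result<Vec<String>, String> {
    let app = validate_app_name(name)?;
    Ok(vec!["open".to_string(), "-a".to_string(), app.to_string()])
}
```

The resulting vector can be handed to `std::process::Command` as program plus arguments, so the app name is never parsed by a shell even if validation misses a character.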

- launch_app.rs: log every step (▶ execute, ✓/✗ validation, platform dispatch, open exit code + stderr, fallback result)
- builder.rs: log full list of visible tool names at session build time so we can confirm launch_app appears in the LLM's tool context
- SOUL.md: add explicit capability section; the agent now knows it CAN use launch_app to open apps and must not refuse with 'I can't open apps'

The orchestrator's tool scope is a strict allowlist (named = [...]). launch_app was registered in the tool registry but not listed here, so the LLM never saw it — explaining every refusal. Adding it alongside current_time follows the same pattern: direct, fast, no delegation needed for a simple user request like 'open Music'.

…tion

- orchestrator/agent.toml: add 'mouse' and 'keyboard' to named tool list so the orchestrator can click/type in apps directly without delegating
- user_filter.rs: add 'computer_control' tool family (mouse + keyboard), default_enabled = true, gated by computer_control.enabled in config
- toolDefinitions.ts: add Computer Control entry to frontend catalog (Settings → Agent Access toggle)
- SOUL.md: document mouse and keyboard capabilities so the agent knows it can interact with on-screen UI, not just launch apps

Config: computer_control.enabled = true set in user config (not a code change; user-specific setting at ~/.openhuman/users/<id>/config.toml).

Part of tinyhumansai#3148.

…orkflow

Without screenshot in the named list the agent could click but couldn't locate UI elements; it was asking the user for coordinates.

- orchestrator/agent.toml: add 'screenshot' alongside 'mouse'/'keyboard'
- SOUL.md: document the screenshot→mouse workflow explicitly and tell the agent to never ask the user for coordinates but to find them via screenshot

CGEventPost from enigo crashes CEF when the key event lands in the OpenHuman renderer instead of the target app. Removing until a proper app-focus-before-input mechanism is in place.

Replaces the unreliable mouse/keyboard (enigo/CGEventPost) approach with macOS Accessibility API interactions: no synthetic events, no CEF crash.

Swift helper (helper.rs):
- ax_list_elements: walk the AX tree and return interactive elements
- ax_press: AXUIElementPerformAction(kAXPressAction) by label
- ax_set_value: AXUIElementSetAttributeValue(kAXValueAttribute) by label
- New switch cases: ax_list, ax_press, ax_set_value
- helper_send_receive: pub(super) → pub(crate) so ax_interact.rs can call it

New files:
- src/openhuman/accessibility/ax_interact.rs: Rust wrappers (ax_list_elements, ax_press_element, ax_set_field_value) over the Swift helper
- src/openhuman/tools/impl/computer/ax_interact.rs: AxInteractTool with actions: list / press / set_value, PermissionLevel::ReadOnly

Wired into:
- tools/ops.rs, tools/user_filter.rs, toolDefinitions.ts
- orchestrator/agent.toml named list
- SOUL.md: document list→press workflow

Part of tinyhumansai#3148.

…ylist)

Tests cover:
- ax_list_returns_elements: AX tree is non-empty for Music
- ax_press_play_button: Play button is pressable
- test_full_flow_search_and_play_acdc: open Music → URL-scheme search for 'Highway to Hell' → find AXCell in results → press it
- ax_set_search_field: set_value on the search field
- test_ax_list_nonexistent_app / test_ax_press_nonexistent_app: error paths

Live tests tagged #[ignore] (need Accessibility permission + Music). Run with: cargo test ax_interact -- --include-ignored --nocapture
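
The fix this commit's title refers to, preferring an exact label match over a substring match, can be sketched as below. This is only an illustration: the function name `find_by_label`, the case-insensitive comparison, and the "first substring hit" fallback are assumptions, and the real matching lives in the Swift helper rather than in Rust.

```rust
/// Pick the label that best matches `query`.
/// An exact (case-insensitive) match wins; otherwise the first label that
/// contains the query is returned. Both rules are assumptions for this sketch.
pub fn find_by_label<'a>(labels: &[&'a str], query: &str) -> Option<&'a str> {
    let q = query.to_lowercase();
    labels
        .iter()
        .copied()
        // First pass: exact match only.
        .find(|l| l.to_lowercase() == q)
        // Second pass: fall back to substring match.
        .or_else(|| {
            labels
                .iter()
                .copied()
                .find(|l| l.to_lowercase().contains(&q))
        })
}
```

With a substring-only search, a query for "Play" could land on a longer label that merely contains it if that label appears first in the tree; the two-pass lookup avoids that whenever an exact label exists.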

SOUL.md: add explicit 4-step workflow (list → set_value → list again → press specific row, not generic Play). Add guidance to use shell URL scheme for Apple Music song search, which is more reliable than the filter field.

ax_interact_tests.rs: fix import from super::super::ax_interact to super:: (tests are in a submodule of ax_interact, not a sibling).

- voice-system-actions.md: mark 1.8 (mouse/keyboard) reverted with crash root cause; add 1.9 (ax_interact) and 1.10 (multi-step workflow guidance); update summary table
- ax_interact_tests.rs: flatten to #![cfg] module-level so super:: resolves to ax_interact; full AC/DC flow test now passes (5 steps, song row pressed)

Root cause of 'navigated but didn't play': pressing a search-result row in Apple Music only selects/navigates; it never starts playback. Every matching element (cell/group/button) exposes only AXPress=select. Verified empirically that double-press, CGEvent double-click, and select+Return all leave player state 'stopped'.

Working sequence: AXPress the result to navigate INTO the song's detail page, then AXPress the Play button ON that page → player state 'playing'.

- SOUL.md: exact 5-step Apple Music sequence; warns the second Play press on the detail page is mandatory
- ax_interact_tests.rs: full-flow test now asserts real playback via osascript player state == 'playing' (passes)
- voice-system-actions.md: document as change 1.11 with verification

Root cause of the agent repeatedly using the wrong (filter-field) approach: the orchestrator has omit_identity=true, so it NEVER sees SOUL.md. The chat agent only reads tool descriptions + agent.toml. The navigate-then-play guidance in SOUL.md was dead weight for the orchestrator.

Moved the exact 5-step Apple Music play sequence into the ax_interact tool description, which the LLM always receives via the function schema.

Transcript analysis of the failed 'play Highway to Hell' run revealed two root causes:
1. The orchestrator has NO shell tool. My ax_interact description told it to 'use shell to open music://...', which it can't. It wrapped the command in a prompt arg to a delegation tool; it never ran, and it fell back to the broken filter-field approach.
2. Cross-chat memory context injected prior filter-approach checkpoints, biasing the agent back to the wrong method.

Fix: stop making the LLM orchestrate a fragile multi-step flow with a tool it lacks. Encapsulate the entire proven sequence in native Rust:
- accessibility/ax_interact.rs: play_apple_music(query): open search URL, AX-find + press the song cell (navigate), press detail-page Play, verify player state == playing
- tools/impl/computer/play_music.rs: PlayMusicTool, one call play_music{query}, PermissionLevel::ReadOnly, runs the blocking flow via spawn_blocking
- registered in ops.rs, user_filter.rs, orchestrator agent.toml, toolDefinitions.ts

Agent now calls play_music{query:'Highway to Hell AC/DC'} once and it plays.

…lay_music

Transcript analysis of the failed 'play Numb by Linkin Park' run:
1. play_music failed on a 4s timing race (results not yet rendered → empty)
2. agent fell back to ax_interact 'list' which dumped 273 elements; the tool result was TRUNCATED mid-list, so the model hallucinated a wrong result ('Numb - Single by Marshmello') from a partial view.

Per feedback, a music-specific tool is the wrong abstraction. Reverted it and made ax_interact a robust GENERIC any-app interaction tool:
- Removed play_music tool + play_apple_music helper (and all registrations)
- ax_list_elements_filtered(app, filter): Rust-side label filter so 'list' returns only relevant elements (fixes the truncation→hallucination bug)
- ax_interact 'list' now takes a param; output capped at 60 with a 'narrow your filter' hint; empty-match returns a 'UI may still be loading' hint instead of failing hard
- Rewrote the tool description to be app-agnostic and document the general navigate-then-activate pattern (press a row opens it; press the action button after) without hardcoding Apple Music steps
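
The filtered, capped 'list' behaviour described above might be sketched as follows. The cap of 60 and the two hints come from the commit message; the function name `list_filtered`, the case-insensitive substring filter, and the exact hint wording are assumptions for illustration.

```rust
/// Cap on returned elements, as stated in the commit message.
const MAX_LISTED: usize = 60;

/// Hypothetical formatter for a filtered 'list' action.
/// Keeps only labels containing `filter` (case-insensitive), caps the output,
/// and returns a hint instead of an error when nothing matches.
pub fn list_filtered(labels: &[&str], filter: &str) -> String {
    let needle = filter.to_lowercase();
    let matches: Vec<&str> = labels
        .iter()
        .copied()
        .filter(|l| l.to_lowercase().contains(&needle))
        .collect();

    if matches.is_empty() {
        // Soft failure: the UI may simply not have rendered yet.
        return format!(
            "No elements match '{}'. The UI may still be loading; retry shortly.",
            filter
        );
    }

    let shown: Vec<&str> = matches.iter().take(MAX_LISTED).copied().collect();
    let mut out = shown.join("\n");
    if matches.len() > MAX_LISTED {
        out.push_str(&format!(
            "\n({} more not shown; narrow your filter)",
            matches.len() - MAX_LISTED
        ));
    }
    out
}
```

Because the cap is applied after filtering and the overflow is reported explicitly, the model sees either a complete short list or a clear signal that the list is partial, rather than a result silently cut off mid-list.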

…fort

The full-flow test was flaky when asserting player state == 'playing': Apple Music's UI is nondeterministic (detail-page render timing varies; multiple 'Play' elements that AX can't disambiguate). The test now asserts the generic list/press primitives work against a real app and logs the player state for diagnosis only. Playback reliability is an Apple Music UI limitation, not a tool correctness issue.

Maps each macOS piece to its Windows equivalent so the same open-app + interact-with-UI feature can be built on Windows:
- macOS AXUIElement → Windows UI Automation (IUIAutomationElement)
- AX roles/actions → UIA ControlType + Invoke/Value/SelectionItem patterns
- recommends the Rust crate (no helper process needed; COM API is callable directly from Rust, unlike the macOS Swift helper)
- module layout: uia_interact.rs parallel to ax_interact.rs, cfg-dispatched so the agent-facing tool stays a single 'ax_interact' on both platforms
- permissions (UIA needs none for same-integrity apps), Chromium/Electron caveats, Calculator/Notepad smoke tests, Start-Process/Get-StartApps for launching Store apps

Also includes trailing linter reformat of ax_interact.rs/tests.

…atrix

- Cross-platform audit table: confirms every Phase 1 change compiles on all platforms (macOS native code is cfg-gated; non-macOS arms return a clean error, never a build break). Flags the one-line shell-allowlist gap (add 'start') and the ax_interact UIA backend work.
- Mandatory Windows E2E matrix (9 items): app launch incl. UWP/URI, deterministic Calculator control (hard-asserted), Notepad set_value, filtered-list correctness (no truncation/hallucination), real media app (best-effort), Chromium/Electron tree exposure, elevation/UIPI, agent-in-the-loop, and a macOS regression re-run after the port.
- Note to verify the whole branch still builds+runs on macOS after the Windows cfg-dispatch lands.
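
The cfg-gating pattern described above (one agent-facing entry point, per-platform bodies, and a clean error rather than a build break elsewhere) can be sketched like this. The function names `list_elements` and `backend_name` and the stubbed bodies are assumptions; the real backends would call into ax_interact.rs on macOS and a future uia_interact.rs on Windows.

```rust
/// macOS arm: a real build would call the Accessibility (AX) backend here.
#[cfg(target_os = "macos")]
pub fn list_elements(app: &str) -> Result<Vec<String>, String> {
    Err(format!("AX backend is stubbed out in this sketch (app: {})", app))
}

/// Windows arm: a real build would call a UI Automation backend here.
#[cfg(target_os = "windows")]
pub fn list_elements(app: &str) -> Result<Vec<String>, String> {
    Err(format!("UIA backend is stubbed out in this sketch (app: {})", app))
}

/// Every other platform: compile fine, fail cleanly at run time.
#[cfg(not(any(target_os = "macos", target_os = "windows")))]
pub fn list_elements(app: &str) -> Result<Vec<String>, String> {
    Err(format!("ax_interact is not supported on this platform (app: {})", app))
}

/// Name of the backend selected at compile time, if any.
pub fn backend_name() -> Option<&'static str> {
    if cfg!(target_os = "macos") {
        Some("ax")
    } else if cfg!(target_os = "windows") {
        Some("uia")
    } else {
        None
    }
}
```

Exactly one `list_elements` definition survives compilation on any target, so callers never need their own `cfg` checks and the tool name exposed to the agent can stay the same across platforms.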

M3gA-Mind added 24 commits June 2, 2026 02:43

fix(shell): clarify tool description to include system/app-launch act…

ec8f5be

…ions

docs: add voice system actions feature tracker

c0bc07f

style(builder): format visible_names_list for improved readability

454ce81

docs: update tracker with computer control (change 1.8)

4363b39

revert: remove mouse/keyboard/screenshot from orchestrator — unreliable

8e65231

fix(ax_interact): prefer exact label match over contains (Play vs Pla…

2c32b59

…ylist)

docs: record play_music root-cause fix (change 1.12)

12b1a1e

docs: record generic ax_interact refactor (change 1.13)

b0dfcde

M3gA-Mind wants to merge 25 commits into tinyhumansai:main from M3gA-Mind:feat/voice-always-on
