An agent-native voice-and-vision framework. Turn any audio/visual device -- earbuds, smart glasses, pendants, phones -- into an always-on AI companion that can perceive, understand, and act on your behalf.
Built by Intentlabs.
Supported platforms: iOS (iPhone) and Android
Today's voice AI apps (ChatGPT Voice, Gemini Live, Sesame) are conversational but not agentic. They can talk to you, but they cannot act for you. When they try to do complex tasks (search, multi-step workflows, API calls), they go silent for 10-30 seconds -- broken UX.
Meanwhile, agent frameworks (OpenClaw, Manus, Claude Code) can execute complex tasks but have no real-time voice interface.
No consumer product today combines real-time voice conversation with general-purpose agent execution. Matcha fills this gap.
Matcha separates real-time voice interaction from asynchronous task execution, allowing both to run simultaneously without blocking each other.
+-----------------------------+
| MATCHA CORE |
| |
User ---- Audio ------> | +---------------------+ |
Device Stream | | VOICE AGENT | |
(glasses, | | (synchronous) | |
earbuds, | | | |
pendant, | | Real-time voice | |
phone) | | conversation. | |
<-- Audio --- | | Always responsive. | |
Response | | Never blocked. | |
| +----------+-----------+ |
| | |
| delegates tasks |
| | |
| +----------v-----------+ |
| | ACTION AGENT | |
| | (asynchronous) | |
User ---- Video ------> | | | |
Device Frames | | Web search, API | |
(camera (~1fps) | | calls, messaging, | |
on | | smart home, etc. | |
glasses, | | | |
phone) | | Reports results | |
| | back to Voice | |
| | Agent when ready. | |
| +----------------------+ |
| |
+-----------------------------+
Voice Agent -- maintains real-time bidirectional audio with the user. Sub-second latency. Never blocked by tasks. Powered by Gemini Live API or OpenAI Realtime API.
Action Agent -- receives task delegations from Voice Agent. Executes complex, multi-step tasks in the background via OpenClaw (56+ skills: web search, messaging, smart home, notes, reminders, etc.). Reports results back to Voice Agent when ready.
Example flow:
- User: "Find me the best ramen places in SF that are open late"
- Voice Agent: "Sure, let me search for late-night ramen spots."
- Action Agent begins web search in background
- User: "Oh also, I want somewhere with vegetarian options"
- Voice Agent: "Got it, I'll filter for vegetarian-friendly places too."
- Action Agent returns results
- Voice Agent speaks the answer conversationally
The user is never left in silence. The agent is never limited to shallow answers.
Matcha is device-agnostic. It connects to any audio I/O device:
| Device | Audio In | Audio Out | Video In | Status |
|---|---|---|---|---|
| Phone (built-in) | Mic | Speaker | Camera | Working |
| AirPods / earbuds | Mic | Speaker | -- | Working |
| Meta Ray-Ban glasses | Mic | Speaker | Camera (via DAT SDK) | Working |
| Any Bluetooth audio | Mic | Speaker | -- | Working |
| Sesame glasses | Mic | Speaker | Camera | Planned |
| Apple glasses | Mic | Speaker | Camera | Planned |
| Pendant devices | Mic | Speaker | Camera | Planned |
Matcha is model-agnostic:
| Provider | Model | Status |
|---|---|---|
| Gemini 2.0 Flash (Live API) | Working | |
| OpenAI | GPT-4o Realtime API | Planned |
git clone https://github.com/Intent-Lab/matcha.git
cd matcha/samples/CameraAccess
open CameraAccess.xcodeprojcp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swiftEdit Secrets.swift with your Gemini API key (required) and optional OpenClaw/WebRTC config.
Select your iPhone as the target device and hit Run (Cmd+R).
Without glasses (iPhone mode):
- Tap "Start on iPhone" -- uses your iPhone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your iPhone camera and execute tasks
With Meta Ray-Ban glasses:
First, enable Developer Mode in the Meta AI app:
- Open the Meta AI app on your iPhone
- Go to Settings (gear icon, bottom left)
- Tap App Info
- Tap the App version number 5 times -- this unlocks Developer Mode
- Go back to Settings -- you'll now see a Developer Mode toggle. Turn it on.
Then in the app:
- Tap "Start Streaming"
- Tap the AI button for voice + vision conversation
git clone https://github.com/Intent-Lab/matcha.gitOpen samples/CameraAccessAndroid/ in Android Studio.
The Meta DAT Android SDK is distributed via GitHub Packages. You need a GitHub Personal Access Token with read:packages scope.
- Go to GitHub > Settings > Developer Settings > Personal Access Tokens and create a classic token with
read:packagesscope - In
samples/CameraAccessAndroid/local.properties, add:
github_token=YOUR_GITHUB_TOKENcd samples/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/
cp Secrets.kt.example Secrets.ktEdit Secrets.kt with your Gemini API key (required) and optional OpenClaw/WebRTC config.
- Let Gradle sync in Android Studio
- Select your Android phone as the target device
- Click Run (Shift+F10)
Without glasses (Phone mode):
- Tap "Start on Phone" -- uses your phone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your phone camera and execute tasks
With Meta Ray-Ban glasses:
Enable Developer Mode in the Meta AI app (same steps as iOS above), then:
- Tap "Start Streaming" in the app
- Tap the AI button for voice + vision conversation
OpenClaw gives Matcha the ability to take real-world actions: send messages, search the web, manage lists, control smart home devices, and more. Without it, the AI is voice + vision only (no task execution).
Follow the OpenClaw setup guide. Make sure the gateway is enabled:
In ~/.openclaw/openclaw.json:
{
"gateway": {
"port": 18789,
"bind": "lan",
"auth": {
"mode": "token",
"token": "your-gateway-token-here"
},
"http": {
"endpoints": {
"chatCompletions": { "enabled": true }
}
}
}
}iOS -- In Secrets.swift:
static let openClawHost = "http://Your-Mac.local"
static let openClawPort = 18789
static let openClawGatewayToken = "your-gateway-token-here"Android -- In Secrets.kt:
const val openClawHost = "http://Your-Mac.local"
const val openClawPort = 18789
const val openClawGatewayToken = "your-gateway-token-here"Both iOS and Android also have an in-app Settings screen where you can change these values at runtime.
openclaw gateway restartsamples/CameraAccess/CameraAccess/
Core/ # Dual-agent framework
Protocols/
VoiceModelProvider.swift # Abstract voice model interface
AgentProtocol.swift # AgentTask, AgentResult types
Models/
GeminiLiveProvider.swift # Gemini Live API adapter
Agents/
VoiceAgent.swift # Real-time voice session manager
ActionAgent.swift # Async task executor (OpenClaw)
AgentCoordinator.swift # Dual-agent orchestrator
Gemini/ # Voice model infrastructure
GeminiLiveService.swift # WebSocket client for Gemini Live API
AudioManager.swift # Mic capture (PCM 16kHz) + playback (PCM 24kHz)
GeminiSessionViewModel.swift # Session lifecycle (delegates to AgentCoordinator)
GeminiConfig.swift # API keys, model config, system prompt
OpenClaw/ # Task execution
OpenClawBridge.swift # HTTP client for OpenClaw gateway
ToolCallRouter.swift # Tool call routing
ToolCallModels.swift # Tool declarations, data types
iPhone/ # Phone camera fallback
IPhoneCameraManager.swift
WebRTC/ # Live streaming (glasses POV to browser)
WebRTCClient.swift
SignalingClient.swift
Settings/
SettingsManager.swift
SettingsView.swift
- Input: Mic -> AudioManager (PCM Int16, 16kHz mono, 100ms chunks) -> Voice Model WebSocket
- Output: Voice Model WebSocket -> AudioManager playback queue -> Speaker
- Echo cancellation: Aggressive AEC (
voiceChat) when speaker is on phone; mild AEC (videoChat) when using glasses - Mic muting: Automatically mutes mic while AI speaks when speaker + mic are co-located
- User says "Add eggs to my shopping list"
- Voice Agent acknowledges: "Sure, adding that now"
- Voice Agent delegates
AgentTaskto Action Agent viaAgentCoordinator - Action Agent sends HTTP POST to OpenClaw gateway
- OpenClaw executes the task
- Action Agent returns
AgentResultto coordinator - Coordinator delivers result back to Voice Agent
- Voice Agent speaks the confirmation
The Voice Agent remains responsive throughout -- the user can continue talking while tasks execute.
- Dual-agent architecture (Voice Agent + Action Agent)
- VoiceModelProvider protocol (model-agnostic)
- Gemini Live provider
- OpenClaw integration for task execution
- iOS and Android apps
- OpenAI Realtime provider
- Device provider abstraction
- Camera-based intent inference
- Proactive assistance (auto-translate foreign text, surface contextual info)
- Cross-frame memory ("What was that sign I saw 2 minutes ago?")
- Gaze-based intent prediction (with eye-tracking hardware)
- iOS 17.0+
- Xcode 15.0+
- Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use iPhone mode for testing)
- OpenClaw on your Mac (optional -- for task execution)
- Android 14+ (API 34+)
- Android Studio Ladybug or newer
- GitHub account with
read:packagestoken (for DAT SDK) - Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use Phone mode for testing)
- OpenClaw on your Mac (optional -- for task execution)
AI doesn't hear me -- Check that microphone permission is granted. Speak clearly and at normal volume.
OpenClaw connection timeout -- Make sure your phone and Mac are on the same Wi-Fi network, the gateway is running (openclaw gateway restart), and the hostname matches your Mac's Bonjour name.
"Gemini API key not configured" -- Add your API key in Secrets.swift/Secrets.kt or in the in-app Settings.
Echo/feedback in iPhone mode -- The app mutes the mic while the AI is speaking. If you still hear echo, try turning down the volume.
Android: Gradle sync fails with 401 -- Your GitHub token is missing or doesn't have read:packages scope. Check local.properties. Generate a new token at github.com/settings/tokens.
For DAT SDK issues, see the developer documentation or the discussions forum.
See CONTRIBUTING.md.
This source code is licensed under the license found in the LICENSE file in the root directory of this source tree.