| name | runanywhere-ai |
|---|---|
| description | Integrate RunAnywhere on-device AI (LLMs, STT, TTS, voice agents, VLM) into applications. Use when implementing offline AI features, local LLM inference, on-device speech processing, privacy-first AI, vision language models, or when mentions of "RunanywhereAI", "on-device AI", "local inference", "offline AI", "GGUF models", "llama.cpp", "VLM", or "on-device vision" appear. Supports Swift (iOS/macOS), Kotlin (Android), Web (WebAssembly), React Native, and Flutter platforms. |
Comprehensive guide for integrating RunAnywhere on-device AI into your applications across all supported platforms.
RunAnywhere enables privacy-first, on-device AI inference for:
- LLM Text Generation - Run LFM2, Llama, Mistral, Qwen, SmolLM locally via llama.cpp
- Vision Language Models (VLM) - On-device visual understanding with camera/image input (iOS/Web)
- Speech-to-Text - Whisper-based transcription
- Text-to-Speech - Neural voice synthesis via Piper
- Voice Agent Pipeline - Complete VAD → STT → LLM → TTS orchestration
- Tool Calling & Structured Output - Function calling and JSON schema-guided generation
All processing happens locally—no cloud calls, no network latency, and no data leaves the device.
Select the platform-specific guide:
- Swift (iOS/macOS) → Read swift.md
- Kotlin (Android) → Read kotlin.md
- Web (Browser) → Read web.md
- React Native → Read react-native.md
- Flutter → Read flutter.md
Each guide contains complete installation, setup, and usage instructions.
Choose appropriate models for your use case:
Read models.md for:
- Model size vs quality tradeoffs
- Device-specific recommendations
- Quantization level guidance
- Download URLs
Quick recommendations:
- Lightweight (< 1GB RAM): LFM2 350M, SmolLM2 360M, Qwen 0.5B
- Balanced (1-4GB RAM): LFM2 1.2B Tool, Llama 3.2 1B
- High Quality (4GB+ RAM): Llama 3.2 3B, Mistral 7B
- Vision (VLM): LFM2-VL 450M (iOS/Web only)
All platforms follow the same three-step pattern:
1. Initialize SDK
↓
2. Download & Load Model
↓
3. Generate / Transcribe / Synthesize
Platform-specific implementation details are in each reference guide.
- Read swift.md
- Add RunAnywhere via Swift Package Manager
- Initialize SDK and register LlamaCPP module
- Download and load a model (e.g., smollm2-360m)
- Use `RunAnywhere.chat()` or `RunAnywhere.generate()` for inference
- Optional: Use streaming for better UX
- Read react-native.md
- Install `@runanywhere/core` and `@runanywhere/onnx`
- Register ONNX module and add Whisper model
- Download and load STT model
- Use `RunAnywhere.transcribe()` for audio transcription
- Read kotlin.md
- Add RunAnywhere Kotlin SDK dependencies
- Register LlamaCPP and ONNX modules
- Download LLM, STT, and TTS models
- Configure and start VoiceAgent with `RunAnywhere.startVoiceSession()`
- Handle voice session events (listening, transcribed, responded, speaking)
- Read web.md
- Install `@runanywhere/web`, `@runanywhere/web-llamacpp`, and `@runanywhere/web-onnx` via npm
- Configure bundler (Vite/Webpack) for WASM files
- Set Cross-Origin headers for SharedArrayBuffer
- Register `LlamaCPP` and `ONNX` backends
- Register models with `RunAnywhere.registerModels()`
- Generate with streaming via `TextGeneration.generateStream()`
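The "Cross-Origin headers" step above is needed because the WASM backends use `SharedArrayBuffer`, which browsers only enable on cross-origin-isolated pages. The COOP/COEP header values below are the standard ones; the `server.headers` shape is Vite's dev-server config (adapt it for Webpack or your hosting setup):

```typescript
// vite.config.ts — minimal sketch enabling cross-origin isolation
// so SharedArrayBuffer (and therefore WASM threads) is available.
const config = {
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
};

export default config;
```

You can verify isolation at runtime by checking `crossOriginIsolated === true` in the browser console.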
- Read web.md
- Install the three-package Web SDK (`@runanywhere/web`, `@runanywhere/web-llamacpp`, `@runanywhere/web-onnx`)
- Create a VLM Web Worker with `startVLMWorkerRuntime()`
- Wire `VLMWorkerBridge` to `RunAnywhere.setVLMLoader()`
- Use `VideoCapture` to capture camera frames
- Process frames with `VLMWorkerBridge.shared.process(rgbPixels, width, height, prompt)`
Models are compressed using quantization:
- Q4_0: Smallest size, fastest, lower quality (~4.5 bits/weight)
- Q5_K_M: Balanced size and quality (~5.7 bits/weight)
- Q8_0: Largest size, best quality, slower (~8.5 bits/weight)
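The bits-per-weight figures give a quick way to estimate a model's on-disk size before downloading. A sketch (the helper name is hypothetical, and the bits-per-weight values are approximate averages; real GGUF files add metadata and mixed-precision layers):

```typescript
// Approximate average bits per weight for common llama.cpp quantizations.
const BITS_PER_WEIGHT: Record<string, number> = {
  Q4_0: 4.5,
  Q5_K_M: 5.7,
  Q8_0: 8.5,
};

// Rough on-disk size in MB from parameter count and quantization level.
function estimateModelMB(params: number, quant: keyof typeof BITS_PER_WEIGHT): number {
  const bytes = (params * BITS_PER_WEIGHT[quant]) / 8;
  return bytes / (1024 * 1024);
}
```

For example, a 1B-parameter model lands around 536 MB at Q4_0 and about 1 GB at Q8_0, which matches the size tiers quoted elsewhere in this guide.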
- LLM: GGUF format (via llama.cpp) — `LLMFramework.LlamaCpp`
- VLM: GGUF format (model + mmproj files) — `LLMFramework.LlamaCpp`
- STT: ONNX format (Whisper models) — `LLMFramework.ONNX`
- TTS: ONNX format (Piper voices) — `LLMFramework.ONNX`
- VAD: ONNX format (Silero VAD v5) — `LLMFramework.ONNX`
Rule of thumb: Device RAM should be 2× model size
- SmolLM2 360M (~400MB) → Need 800MB+ RAM
- Llama 3.2 1B (~1GB) → Need 2GB+ RAM
- Mistral 7B (~4GB) → Need 8GB+ RAM
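The 2× rule of thumb above can double as a load-time guard. A minimal sketch (the function name is hypothetical; obtaining available RAM is platform-specific):

```typescript
// Refuse to load a model when the device doesn't have roughly
// twice the model's size in available RAM (the 2x rule of thumb).
function fitsInMemory(modelBytes: number, availableRamBytes: number): boolean {
  return availableRamBytes >= 2 * modelBytes;
}
```

Checking this before calling the SDK's load method avoids out-of-memory crashes mid-load.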
Model Not Found
Solution: Download model first before loading
Check: Model exists at expected path
Insufficient Memory
Solution: Use smaller model or lower quantization
Check: Available device RAM vs model requirements
Slow Generation
Solution: Use streaming for better UX, lower quantization, or smaller model
Check: Tokens per second metric (target: 10+ tok/s)
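The tokens-per-second metric mentioned above is straightforward to compute from a generation's token count and wall-clock time. A sketch (helper names are hypothetical):

```typescript
// Tokens per second from a generation's token count and duration.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  return elapsedMs > 0 ? (tokenCount / elapsedMs) * 1000 : 0;
}

// The guide's suggested target is 10+ tok/s for a usable experience.
function isAcceptableSpeed(tokPerSec: number): boolean {
  return tokPerSec >= 10;
}
```

For example, 120 tokens generated in 10 seconds is 12 tok/s, which clears the target; below 10 tok/s, consider a smaller model or lower quantization.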
Download Fails
Solution: Implement retry with exponential backoff
Check: Network connectivity, storage space
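The retry-with-exponential-backoff suggestion above can be implemented as a generic wrapper around whatever download call your platform's SDK exposes. A sketch (all names here are hypothetical, not SDK APIs; the injectable `sleep` parameter just makes the schedule testable):

```typescript
// Retry an async operation with exponential backoff: 1s, 2s, 4s, ...
// between attempts, rethrowing the last error once attempts run out.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Wrap the model-download call in `retryWithBackoff(() => download(modelId))` so transient network failures don't surface to the user; remember to check storage space separately, since retrying won't fix a full disk.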
- Web: Use Q4_0 quantization (browser memory limits)
- Mobile: Use Q4_K_M or Q5_K_M (balanced)
- Desktop: Use Q5_K_M or Q8_0 (best quality)
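The per-platform guidance above can be encoded as a simple lookup when your app ships to multiple targets. A sketch, not an SDK API:

```typescript
type Platform = "web" | "mobile" | "desktop";

// Recommended quantization levels per platform, in preference order,
// following this guide's platform guidance.
function recommendedQuants(platform: Platform): string[] {
  switch (platform) {
    case "web":
      return ["Q4_0"]; // browser memory limits
    case "mobile":
      return ["Q4_K_M", "Q5_K_M"]; // balanced size/quality
    case "desktop":
      return ["Q5_K_M", "Q8_0"]; // best quality
  }
}
```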
- Use streaming to display tokens as they generate
- Show progress bars during model downloads
- Display generation metrics (tok/s) to users
- Implement cancellation for long-running operations
- Unload models when not in use
- Check available memory before loading
- Use smaller models on low-memory devices
- Monitor memory usage during inference
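One way to apply "use smaller models on low-memory devices" is to select the largest model tier whose RAM requirement the device meets, using the quick-recommendation tiers from earlier in this guide. A sketch (names and thresholds are illustrative):

```typescript
interface Tier {
  name: string;
  minRamGB: number;
}

// Ordered largest-first so `find` returns the best tier that fits.
const TIERS: Tier[] = [
  { name: "high-quality", minRamGB: 4 }, // e.g. Llama 3.2 3B
  { name: "balanced", minRamGB: 1 },     // e.g. Llama 3.2 1B
  { name: "lightweight", minRamGB: 0 },  // e.g. SmolLM2 360M
];

function pickTier(deviceRamGB: number): string {
  return TIERS.find((t) => deviceRamGB >= t.minRamGB)!.name;
}
```

Note this uses total device RAM as a proxy; checking memory actually available at load time (per the best practice above) is more robust.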
All platforms support development mode with verbose logging:
- Swift: `try RunAnywhere.initialize(environment: .development)`
- Kotlin: `RunAnywhere.initialize(environment = SDKEnvironment.DEVELOPMENT)`
- Web: `await RunAnywhere.initialize({ environment: 'development', debug: true })`
- React Native/Flutter: Same pattern as above
- iOS: `log stream --predicate 'subsystem CONTAINS "com.runanywhere"' --info --debug`
- Android: `adb logcat | grep "RunAnywhere"`
- Web: standard browser console (`console.log()`)

- Website: runanywhere.ai
- Docs: docs.runanywhere.ai
- GitHub: github.com/RunanywhereAI/runanywhere-sdks
- Discord: discord.gg/N359FBbDVd
Each platform has a complete demo app:
- iOS: examples/ios/RunAnywhereAI/
- Android: examples/android/RunAnywhereAI/
- Web: web-starter-app (Chat, Vision, Voice tabs)
- React Native: examples/react-native/RunAnywhereAI/
- Flutter: examples/flutter/RunAnywhereAI/
| Task | Command/Method | Platform |
|---|---|---|
| Initialize SDK | `RunAnywhere.initialize()` | All |
| Register Backend | `LlamaCPP.register()` / `ONNX.register()` | Web |
| Register Models | `RunAnywhere.registerModels(models)` | Web |
| Download Model | `ModelManager.downloadModel(id)` | Web |
| Load Model | `ModelManager.loadModel(id)` | Web |
| Generate Text | `TextGeneration.generate(prompt)` | All |
| Stream Generation | `TextGeneration.generateStream(prompt)` | All |
| VLM Process | `VLMWorkerBridge.shared.process(rgb, w, h, prompt)` | iOS/Web |
| Capture Camera | `VideoCapture.captureFrame(dim)` | Web |
| Transcribe Audio | `STT.transcribe(audio)` | All |
| Synthesize Speech | `TTS.synthesize(text)` | All |
| Voice Agent | `VoicePipeline.start()` | All |
Next Steps:
- Choose your platform and read the corresponding reference guide
- Review models.md to select appropriate models
- Follow the platform-specific setup instructions
- Start with a simple text generation example
- Explore advanced features (STT, TTS, voice agents) as needed