Releases: imonoonoko/Bit-TTT-Engine
v1.0.0 — Final Release
BitLlama v1.0.0. Development complete.
What is BitLlama?
A Pure Rust LLM inference engine with Soul learning and hierarchical memory.
- 7 model architectures: Llama-2/3, Gemma-2/3, Qwen2.5, Mistral, BitNet
- Soul learning: LoRA fine-tuning from conversations
- Memory system: 4-layer hierarchical memory + 7-stage Sleep consolidation
- Desktop GUI: Tauri 2.0 + Svelte 5, Japanese/English i18n
- Performance: 45.4 tok/s on a 7B model, ~90% of llama.cpp throughput
- 1121 tests, quality score 9.0/10
Changes since v0.16.0
- CJK memory search fix (character bigram fallback for Japanese queries)
- Soul learning tests (warmup, chat template, VRAM guard)
- Chat template application fix for GGUF tokenizer fallback
- README/ROADMAP updated to reflect project completion
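For illustration, the character-bigram fallback for CJK memory search can be sketched as follows (a minimal Python sketch; the function names and Jaccard scoring are hypothetical, not the engine's actual implementation):

```python
def bigrams(text: str) -> set[str]:
    """Split a string into overlapping character bigrams.

    Useful for CJK text, where whitespace tokenization leaves most
    Japanese queries as a single unsegmented token.
    """
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_score(query: str, document: str) -> float:
    """Jaccard overlap of character bigrams between query and document."""
    q, d = bigrams(query), bigrams(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

# A Japanese query matches a memory entry even without word boundaries:
print(bigram_score("東京の天気", "今日の東京の天気は晴れです"))
```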
Install
```shell
# Homebrew
brew tap imonoonoko/bitllama && brew install bitllama

# winget
winget install imonoonoko.BitLlama

# Or download binaries below
```
Built with Rust by @imonoonoko
Full Changelog: v0.16.0...v1.0.0
v0.16.0
Full Changelog: v0.15.0...v0.16.0
v0.15.0: Inference Guards + GUI Quality + Install Scripts
What's New
Inference Safety Guards
- New `inference_guard` module — NaN/Inf detection, severity classification, NaN-safe greedy decoding
- All inference paths protected (BitLlama, Llama4Bit, Desktop sampling, speculative)
- Temperature validation (NaN/Inf/negative → greedy fallback)
- Input length validation (empty input, context length overflow)
- +16 robustness tests (softmax stability, RoPE boundaries, KV cache)
GUI Quality Improvements
- Error recovery UX: retry buttons for model load failures and generation errors
- Accessibility: WCAG AA contrast, `aria-live`, `aria-current`, focus management
- First-run experience: wizard → model browser direct flow
- ARIA: tablist roles (`role="tablist"` / `role="tab"` / `aria-selected`)
TTT Integration
- TTT enable/disable for safetensors models (no longer GGUF-only)
- `TTTLayer` `use_ttt` flag for runtime inner-loop control
One-Click Install
- `scripts/install.sh` (Linux/macOS) + `scripts/install.ps1` (Windows)
- Homebrew tap (`imonoonoko/homebrew-bitllama`) + auto-update workflow
- winget manifests submitted (CLI #338557, Desktop #338558)
Stats
- 480 lib tests + 42 desktop tests passing
- All clippy checks clean (main, web, desktop)
Full Changelog: v0.14.0...v0.15.0
v0.14.0
Full Changelog: v0.13.0...v0.14.0
v0.13.0
Full Changelog: v0.12.0...v0.13.0
v0.12.0
Full Changelog: v0.11.0...v0.12.0
v0.11.0
Full Changelog: v0.10.0...v0.11.0
v0.9.0 — BitLlama Desktop: Phase 14 Complete
BitLlama Desktop is now a fully-featured local LLM application with hardware auto-detection, model management, and Japanese UI.
New Features
Desktop GUI (Phase 14: 11/11 tasks complete)
- Hardware auto-detection: RAM/VRAM detection with model recommendations in sidebar
- Welcome wizard: Language → HW detection → model recommendation → download → first chat in about 3 minutes
- Model browser: Local models tab + HuggingFace download tab with progress bar
- Model download manager: Background download with speed display and progress events
- Japanese/English i18n: Full UI translation — first local LLM tool with Japanese UI
- Chat history persistence: Conversations saved to localStorage
- Settings panel: GPU configuration, generation parameters, theme, language
- Custom branding: BitLlama icon set (dark circle + blue "B" + ternary dots)
- Error classification: User-friendly error messages with actionable guidance
Engine Improvements
- BitNet architecture foundation: `ModelArch::BitNet`, `ActivationType::ReLuSquared` (relu(x)²)
- Model size guards: warning/block for post-training conversion of models < 7B
- Memory optimization: pre-allocated tensors in the learn command, `--max-tokens` option for memory-constrained devices (Issue #9)
- GGUF variant support: web module now handles `UnifiedModel::Gguf` in all match expressions
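As a quick illustration of the `ReLuSquared` activation noted above (a minimal sketch, not engine code):

```python
def relu_squared(x: float) -> float:
    """ReLU-squared activation: max(0, x) ** 2.

    Used in place of SiLU/GeGLU by some BitNet-style feed-forward blocks.
    """
    r = max(0.0, x)
    return r * r

# Negative inputs map to 0.0; positive inputs are squared.
print([relu_squared(v) for v in (-2.0, 0.0, 1.5)])
```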
CI/Quality
- CI fully green: `cargo fmt` + `clippy` (main + web + desktop) + `cargo audit` + 4-platform builds
- `/check` command mirrors CI exactly (8-step verification)
- Security: updated `bytes` (RUSTSEC-2026-0007) and `time` (RUSTSEC-2026-0009)
Upgrade Notes
No breaking changes from v0.8.0. All existing CLI commands work as before.
New CLI option for learn command:
```shell
# Limit token count for memory-constrained devices (e.g. Termux/Android)
bitllama learn "text" --model model.gguf --max-tokens 128
```
Full Changelog
30 commits since v0.8.0 — see compare
v0.8.0 - Pure Rust CLI & Soul Learning
Bit-TTT-Engine v0.8.0
The Soul Edition — Pure Rust CLI with personality learning capabilities.
🎉 Highlights
Pure Rust CLI
```shell
bitllama run llama3                # Ollama-style inference
bitllama learn "My name is Onoko"  # Teach your AI
bitllama serve                     # OpenAI-compatible API
bitllama pull meta-llama/...       # Download from HuggingFace
```
Soul Learning
- In-context learning: Teach facts with `bitllama learn`
- Cross-session persistence: Knowledge survives restarts
- Minimal overhead: Only 3.8% speed impact
Multi-turn Conversations
- Full conversation history support
- All chat templates (Llama-2/3, Gemma, Mistral, Qwen)
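As an example of what a chat template does, here is a minimal sketch of Llama-2's `[INST]` format (the layout follows Llama-2's published convention; the helper function itself is illustrative, not the engine's API):

```python
def llama2_prompt(system: str, history: list[tuple[str, str]], user: str) -> str:
    """Render a multi-turn conversation in the Llama-2 [INST] chat format.

    `history` holds (user_message, assistant_reply) pairs from earlier
    turns; `user` is the newest message awaiting a reply.
    """
    out = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for i, (u, a) in enumerate(history):
        if i == 0:
            out += f"{u} [/INST] {a} </s>"          # first turn shares the system block
        else:
            out += f"<s>[INST] {u} [/INST] {a} </s>"
    if history:
        out += f"<s>[INST] {user} [/INST]"
    else:
        out += f"{user} [/INST]"
    return out
```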
📊 Performance
| Model | Speed | vs llama.cpp |
|---|---|---|
| Llama-2 7B Q4_K_M | 45.4 tok/s | 90% |
| Gemma-2 2B Q4_K_M | 75.1 tok/s | 74% |
📦 Installation
```shell
# From source (recommended)
cargo install --path crates/bit_llama

# Python bindings
pip install cortex_rust
```
What's New
- `bitllama run` — Interactive chat
- `bitllama learn` — Soul learning
- `bitllama soul` — Soul management
- `bitllama serve` — OpenAI API server
- `bitllama pull` — HuggingFace model download
- `bitllama list` — List local models
- True SSE streaming (mpsc channels)
- Multi-turn conversation support
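The mpsc-channel streaming pattern behind the SSE support can be sketched with a producer/consumer queue (an illustrative Python analogue of Rust's `std::sync::mpsc`, not the server's actual code):

```python
import queue
import threading

def generate_tokens(out: "queue.Queue[str | None]") -> None:
    """Producer: push generated tokens into the channel, then a sentinel."""
    for tok in ["Hello", ",", " world", "!"]:
        out.put(tok)
    out.put(None)  # end-of-stream marker

def sse_events(channel: "queue.Queue[str | None]"):
    """Consumer: yield Server-Sent-Events frames as tokens arrive."""
    while True:
        tok = channel.get()
        if tok is None:
            yield "data: [DONE]\n\n"
            return
        yield f"data: {tok}\n\n"

ch: "queue.Queue[str | None]" = queue.Queue()
threading.Thread(target=generate_tokens, args=(ch,)).start()
for frame in sse_events(ch):
    print(frame, end="")
```

Generation runs on its own thread, so the HTTP handler can stream each frame to the client the moment a token lands in the channel.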
Full Changelog: v0.7.0...v0.8.0
v0.6.0 - Performance & Python/CUDA
Release Notes: v0.6.0
Release Date: 2026-01-31
Theme: Performance Optimization & Python/CUDA Support
🎯 Highlights
This release focuses on performance optimization and ecosystem expansion:
- 🐍 Python Bridge: Use Bit-TTT from Python with `pip install`
- 🎮 CUDA GPU: 22x faster inference on NVIDIA GPUs
- 🦊 Gemma Support: Run Gemma and Gemma2 models
- ⚡ Flash Attention: Memory-efficient attention for long sequences
✨ New Features
🐍 Python Bridge (PyO3)
Install and use Bit-TTT directly from Python:
```python
from bit_ttt_engine import BitLlama

# Load model
model = BitLlama.load("gemma-2-2b-it-Q4_K_M.gguf")

# Generate text
output = model.generate("Hello, how are you?", max_tokens=100)
print(output)
```
Installation:
```shell
cd crates/rust_engine
pip install maturin
maturin develop --release
```
🎮 CUDA GPU Acceleration
- Automatic GPU detection
- 22x faster matmul on RTX 4060 Ti
- Hybrid CPU/GPU inference support
Build with CUDA:
```shell
cargo build --release --features cuda
```
🦊 Gemma/Gemma2 Architecture Support
- Auto-detect Gemma, Gemma2 from GGUF metadata
- GeGLU activation for Gemma models
- Tied embeddings support
- Verified with `gemma-2-2b-it-Q4_K_M.gguf`
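For reference, the GeGLU activation mentioned above gates one feed-forward projection with a GELU of another; a minimal sketch (illustrative only, using the tanh approximation of GELU):

```python
import math

def gelu(x: float) -> float:
    """Tanh-approximation GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def geglu(gate: list[float], up: list[float]) -> list[float]:
    """GeGLU: GELU(gate projection) * up projection, element-wise.

    `gate` and `up` stand in for the outputs of the two feed-forward
    projections of the same hidden state.
    """
    return [gelu(g) * u for g, u in zip(gate, up)]
```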
⚡ Performance Optimization (Phase 5)
| Feature | Benefit |
|---|---|
| Flash Attention | O(n) memory vs O(n²) for long sequences |
| Continuous Batching | Multi-request server deployments |
| Speculative Decoding | 2-3x speedup framework (draft model ready) |
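The O(n)-memory claim for Flash Attention rests on the online-softmax trick: scores are renormalized on the fly instead of materializing the full attention row. A single-query sketch with scalar values (illustrative, not the engine's kernel):

```python
import math

def streaming_attention(q: list[float], keys: list[list[float]], values: list[float]) -> float:
    """Attention output for one query, scanning keys/values in one pass.

    Keeps a running max and running normalizer so the full score vector
    is never stored: O(1) extra memory instead of O(n).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom, acc = -math.inf, 0.0, 0.0
    for k, v in zip(keys, values):
        score = sum(qi * ki for qi, ki in zip(q, k)) * scale
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescale old partial sums
        w = math.exp(score - new_max)
        denom = denom * correction + w
        acc = acc * correction + w * v
        running_max = new_max
    return acc / denom
```

The result matches a naive softmax-weighted sum, but nothing of size n is ever allocated, which is where the long-sequence memory savings come from.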
📊 Benchmarks
GPU vs CPU (RTX 4060 Ti)
| Operation | CPU | GPU | Speedup |
|---|---|---|---|
| MatMul (4096x4096) | 45ms | 2ms | 22x |
| Inference (TinyLlama) | 2.4 tok/s | TBD | - |
Memory Usage
| Model | Format | VRAM |
|---|---|---|
| TinyLlama 1.1B | Q4_K_M | 1.5 GB |
| Gemma2 2B | Q4_K_M | 2.5 GB |
🧪 Testing
- 14 new tests for Flash Attention, Scheduler, Speculative Decoding
- All CI checks passing
- E2E tested with Gemma2 GGUF
📦 Installation
From Source (Rust)
```shell
git clone https://github.com/imonoonoko/Bit-TTT-Engine.git
cd Bit-TTT-Engine
cargo build --release
```
From Source (Python)
```shell
cd crates/rust_engine
pip install maturin
maturin develop --release
```
Pre-built Binaries
Download from GitHub Releases.
⚠️ Known Issues
- CUDA + VS 2022 18.x: Requires the `-allow-unsupported-compiler` flag
- Flash Attention GPU: CPU-only for now, GPU kernel coming in v0.7.0
🔮 What's Next (v0.7.0)
- TTT effect validation benchmark
- GPU Flash Attention kernel
- LoRA fine-tuning support
- `model.adapt()` API
🙏 Contributors
Thanks to everyone who contributed to this release!
Full Changelog: CHANGELOG.md
Documentation: README.md