Releases: imonoonoko/Bit-TTT-Engine

v1.0.0 — Final Release

18 Feb 19:13


BitLlama v1.0.0. Development complete.

What is BitLlama?

A Pure Rust LLM inference engine with Soul learning and hierarchical memory.

  • 7 model architectures: Llama-2/3, Gemma-2/3, Qwen2.5, Mistral, BitNet
  • Soul learning: LoRA fine-tuning from conversations
  • Memory system: 4-layer hierarchical memory + 7-stage Sleep consolidation
  • Desktop GUI: Tauri 2.0 + Svelte 5, Japanese/English i18n
  • Performance: 45.4 tok/s (7B), 90% of llama.cpp
  • 1121 tests, quality score 9.0/10

Changes since v0.16.0

  • CJK memory search fix (character bigram fallback for Japanese queries)
  • Soul learning tests (warmup, chat template, VRAM guard)
  • Chat template application fix for GGUF tokenizer fallback
  • README/ROADMAP updated to reflect project completion
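Japanese text has no whitespace word boundaries, so a whitespace tokenizer finds nothing to match; a character-bigram fallback of the kind described above can be sketched roughly as follows (function names are illustrative, not the actual implementation):

```python
def bigrams(text: str) -> set[str]:
    """Overlapping character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_score(query: str, document: str) -> float:
    """Fraction of the query's bigrams that also occur in the document."""
    q = bigrams(query)
    if not q:
        return 0.0
    return len(q & bigrams(document)) / len(q)

# Whitespace tokenization finds no word boundary here, but bigram overlap does:
score = bigram_score("東京タワー", "昨日は東京タワーに行きました")  # → 1.0
```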

Install

# Homebrew
brew tap imonoonoko/bitllama && brew install bitllama

# winget
winget install imonoonoko.BitLlama

# Or download binaries below

Built with Rust by @imonoonoko

Full Changelog: v0.16.0...v1.0.0

v0.16.0

18 Feb 05:47


Full Changelog: v0.15.0...v0.16.0

v0.15.0: Inference Guards + GUI Quality + Install Scripts

13 Feb 10:15


What's New

Inference Safety Guards

  • New inference_guard module — NaN/Inf detection, severity classification, NaN-safe greedy decoding
  • All inference paths protected (BitLlama, Llama4Bit, Desktop sampling, speculative)
  • Temperature validation (NaN/Inf/negative → greedy fallback)
  • Input length validation (empty input, context length overflow)
  • +16 robustness tests (softmax stability, RoPE boundaries, KV cache)
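The guard behavior in the bullets above (temperature validation and NaN-safe greedy decoding) amounts to roughly the following; this is an illustrative Python sketch, not the actual Rust `inference_guard` module:

```python
import math

def safe_temperature(t: float) -> float:
    """NaN/Inf/negative temperature falls back to 0.0 (greedy decoding)."""
    if math.isnan(t) or math.isinf(t) or t < 0.0:
        return 0.0
    return t

def nan_safe_argmax(logits: list[float]) -> int:
    """Greedy token pick that skips NaN/Inf logits instead of propagating them."""
    best_i, best_v = 0, -math.inf
    for i, v in enumerate(logits):
        if math.isfinite(v) and v > best_v:
            best_i, best_v = i, v
    return best_i

nan_safe_argmax([float("nan"), 1.0, 2.0, float("inf")])  # → 2
```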

GUI Quality Improvements

  • Error recovery UX: retry buttons for model load failures and generation errors
  • Accessibility: WCAG AA contrast, aria-live, aria-current, focus management
  • First-run experience: the welcome wizard flows directly into the model browser
  • ARIA: tablist roles (role="tablist" / role="tab" / aria-selected)

TTT Integration

  • TTT enable/disable for safetensors models (no longer GGUF-only)
  • TTTLayer use_ttt flag for runtime inner-loop control

One-Click Install

  • scripts/install.sh (Linux/macOS) + scripts/install.ps1 (Windows)
  • Homebrew tap (imonoonoko/homebrew-bitllama) + auto-update workflow
  • winget manifests submitted (CLI #338557, Desktop #338558)

Stats

  • 480 lib tests + 42 desktop tests passing
  • All clippy checks clean (main, web, desktop)

Full Changelog: v0.14.0...v0.15.0

v0.14.0

11 Feb 11:20


Full Changelog: v0.13.0...v0.14.0

v0.13.0

11 Feb 11:35


Full Changelog: v0.12.0...v0.13.0

v0.12.0

07 Feb 15:46


Full Changelog: v0.11.0...v0.12.0

v0.11.0

07 Feb 15:05


Full Changelog: v0.10.0...v0.11.0

v0.9.0 — BitLlama Desktop: Phase 14 Complete

06 Feb 16:27


BitLlama Desktop is now a fully featured local LLM application with hardware auto-detection, model management, and a Japanese UI.

New Features

Desktop GUI (Phase 14: 11/11 tasks complete)

  • Hardware auto-detection: RAM/VRAM detection with model recommendations in sidebar
  • Welcome wizard: Language → HW detection → model recommendation → download → first chat in about 3 minutes
  • Model browser: Local models tab + HuggingFace download tab with progress bar
  • Model download manager: Background download with speed display and progress events
  • Japanese/English i18n: Full UI translation — first local LLM tool with Japanese UI
  • Chat history persistence: Conversations saved to localStorage
  • Settings panel: GPU configuration, generation parameters, theme, language
  • Custom branding: BitLlama icon set (dark circle + blue "B" + ternary dots)
  • Error classification: User-friendly error messages with actionable guidance

Engine Improvements

  • BitNet architecture foundation: ModelArch::BitNet, ActivationType::ReLuSquared (relu(x)²)
  • Model size guards: Warning/block for post-training conversion of models < 7B
  • Memory optimization: Pre-allocated tensors in learn command, --max-tokens option for memory-constrained devices (Issue #9)
  • GGUF variant support: Web module now handles UnifiedModel::Gguf in all match expressions
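For reference, the `ReLuSquared` activation listed above is just the square of a standard ReLU, i.e. max(0, x)². A one-line sketch (not the engine's tensor implementation):

```python
def relu_squared(x: float) -> float:
    """relu(x)^2 = max(0, x)^2, the activation used by BitNet-style models."""
    return max(0.0, x) ** 2

relu_squared(-2.0), relu_squared(3.0)  # → (0.0, 9.0)
```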

CI/Quality

  • CI fully green: cargo fmt + clippy (main + web + desktop) + cargo audit + 4-platform builds
  • /check command mirrors CI exactly (8-step verification)
  • Security: Updated bytes (RUSTSEC-2026-0007) and time (RUSTSEC-2026-0009)

Upgrade Notes

No breaking changes from v0.8.0. All existing CLI commands work as before.

New CLI option for learn command:

# Limit token count for memory-constrained devices (e.g. Termux/Android)
bitllama learn "text" --model model.gguf --max-tokens 128

Full Changelog

30 commits since v0.8.0 — see compare

v0.8.0 - Pure Rust CLI & Soul Learning

05 Feb 15:35


Bit-TTT-Engine v0.8.0

The Soul Edition — Pure Rust CLI with personality learning capabilities.

🎉 Highlights

Pure Rust CLI

bitllama run llama3              # Ollama-style inference
bitllama learn "My name is Onoko"  # Teach your AI
bitllama serve                   # OpenAI-compatible API
bitllama pull meta-llama/...     # Download from HuggingFace

Soul Learning

  • In-context learning: Teach facts with bitllama learn
  • Cross-session persistence: Knowledge survives restarts
  • Minimal overhead: Only 3.8% speed impact

Multi-turn Conversations

  • Full conversation history support
  • All chat templates (Llama-2/3, Gemma, Mistral, Qwen)

📊 Performance

| Model | Speed | vs llama.cpp |
| --- | --- | --- |
| Llama-2 7B Q4_K_M | 45.4 tok/s | 90% |
| Gemma-2 2B Q4_K_M | 75.1 tok/s | 74% |

📦 Installation

# From source (recommended)
cargo install --path crates/bit_llama

# Python bindings
pip install cortex_rust

What's New

  • bitllama run — Interactive chat
  • bitllama learn — Soul learning
  • bitllama soul — Soul management
  • bitllama serve — OpenAI API server
  • bitllama pull — HuggingFace model download
  • bitllama list — List local models
  • True SSE streaming (mpsc channels)
  • Multi-turn conversation support

Full Changelog: v0.7.0...v0.8.0

v0.6.0 - Performance & Python/CUDA

30 Jan 19:56


Release Notes: v0.6.0

Release Date: 2026-01-31
Theme: Performance Optimization & Python/CUDA Support


🎯 Highlights

This release focuses on performance optimization and ecosystem expansion:

  • 🐍 Python Bridge: Use Bit-TTT from Python with pip install
  • 🎮 CUDA GPU: 22x faster inference on NVIDIA GPUs
  • 🦊 Gemma Support: Run Gemma and Gemma2 models
  • Flash Attention: Memory-efficient attention for long sequences

✨ New Features

🐍 Python Bridge (PyO3)

Install and use Bit-TTT directly from Python:

from bit_ttt_engine import BitLlama

# Load model
model = BitLlama.load("gemma-2-2b-it-Q4_K_M.gguf")

# Generate text
output = model.generate("Hello, how are you?", max_tokens=100)
print(output)

Installation:

cd crates/rust_engine
pip install maturin
maturin develop --release

🎮 CUDA GPU Acceleration

  • Automatic GPU detection
  • 22x faster matmul on RTX 4060 Ti
  • Hybrid CPU/GPU inference support

Build with CUDA:

cargo build --release --features cuda

🦊 Gemma/Gemma2 Architecture Support

  • Auto-detect Gemma, Gemma2 from GGUF metadata
  • GeGLU activation for Gemma models
  • Tied embeddings support
  • Verified with gemma-2-2b-it-Q4_K_M.gguf

⚡ Performance Optimization (Phase 5)

| Feature | Benefit |
| --- | --- |
| Flash Attention | O(n) memory vs O(n²) for long sequences |
| Continuous Batching | Multi-request server deployments |
| Speculative Decoding | 2-3x speedup framework (draft model ready) |
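The speculative-decoding framework follows the usual draft/verify pattern: a small draft model proposes a few tokens cheaply, and the large target model validates them. A toy greedy sketch of that general technique (not BitLlama's implementation) looks like this:

```python
def speculative_step(draft, target, context, k=4):
    """One speculative step: the draft proposes k tokens greedily; the
    target keeps the longest agreeing prefix, then substitutes its own
    token at the first disagreement."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        expected = target(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's token at first mismatch
            break
    return accepted

# Toy "model": the next token is just len(context) mod 3.
target = lambda ctx: len(ctx) % 3
accepted = speculative_step(target, target, [0], k=4)  # perfect draft → [1, 2, 0, 1]
```

A real implementation verifies all k proposed positions in a single batched forward pass of the target model; when the draft usually agrees, that is where the 2-3x speedup comes from.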

📊 Benchmarks

GPU vs CPU (RTX 4060 Ti)

| Operation | CPU | GPU | Speedup |
| --- | --- | --- | --- |
| MatMul (4096x4096) | 45 ms | 2 ms | 22x |
| Inference (TinyLlama) | 2.4 tok/s | TBD | - |

Memory Usage

| Model | Format | VRAM |
| --- | --- | --- |
| TinyLlama 1.1B | Q4_K_M | 1.5 GB |
| Gemma2 2B | Q4_K_M | 2.5 GB |

🧪 Testing

  • 14 new tests for Flash Attention, Scheduler, Speculative Decoding
  • All CI checks passing
  • E2E tested with Gemma2 GGUF

📦 Installation

From Source (Rust)

git clone https://github.com/imonoonoko/Bit-TTT-Engine.git
cd Bit-TTT-Engine
cargo build --release

From Source (Python)

cd crates/rust_engine
pip install maturin
maturin develop --release

Pre-built Binaries

Download from GitHub Releases.


⚠️ Known Issues

  1. CUDA + VS 2022 18.x: Requires -allow-unsupported-compiler flag
  2. Flash Attention GPU: CPU-only for now, GPU kernel coming in v0.7.0

🔮 What's Next (v0.7.0)

  • TTT effect validation benchmark
  • GPU Flash Attention kernel
  • LoRA fine-tuning support
  • model.adapt() API

🙏 Contributors

Thanks to everyone who contributed to this release!


Full Changelog: CHANGELOG.md
Documentation: README.md