Releases: imonoonoko/Bit-TTT-Engine

v1.0.0 — Final Release

18 Feb 19:13


BitLlama v1.0.0. Development complete.

What is BitLlama?

A Pure Rust LLM inference engine with Soul learning and hierarchical memory.

  • 7 model architectures: Llama-2/3, Gemma-2/3, Qwen2.5, Mistral, BitNet
  • Soul learning: LoRA fine-tuning from conversations
  • Memory system: 4-layer hierarchical memory + 7-stage Sleep consolidation
  • Desktop GUI: Tauri 2.0 + Svelte 5, Japanese/English i18n
  • Performance: 45.4 tok/s (7B), 90% of llama.cpp
  • 1121 tests, quality score 9.0/10

Changes since v0.16.0

  • CJK memory search fix (character bigram fallback for Japanese queries)
  • Soul learning tests (warmup, chat template, VRAM guard)
  • Chat template application fix for GGUF tokenizer fallback
  • README/ROADMAP updated to reflect project completion
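Japanese text has no whitespace word boundaries, so a whitespace tokenizer finds nothing to match; a character-bigram fallback of the kind described above can be sketched roughly as follows (function names are illustrative, not the actual implementation):

```python
def bigrams(text: str) -> set[str]:
    """Overlapping character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_score(query: str, document: str) -> float:
    """Fraction of the query's bigrams that also occur in the document."""
    q = bigrams(query)
    if not q:
        return 0.0
    return len(q & bigrams(document)) / len(q)

# Whitespace tokenization finds no word boundary here, but bigram overlap does:
score = bigram_score("東京タワー", "昨日は東京タワーに行きました")  # → 1.0
```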

Install

# Homebrew
brew tap imonoonoko/bitllama && brew install bitllama

# winget
winget install imonoonoko.BitLlama

# Or download binaries below

Built with Rust by @imonoonoko

Full Changelog: v0.16.0...v1.0.0

v0.16.0

18 Feb 05:47


Full Changelog: v0.15.0...v0.16.0

v0.15.0: Inference Guards + GUI Quality + Install Scripts

13 Feb 10:15


What's New

Inference Safety Guards

  • New inference_guard module — NaN/Inf detection, severity classification, NaN-safe greedy decoding
  • All inference paths protected (BitLlama, Llama4Bit, Desktop sampling, speculative)
  • Temperature validation (NaN/Inf/negative → greedy fallback)
  • Input length validation (empty input, context length overflow)
  • +16 robustness tests (softmax stability, RoPE boundaries, KV cache)
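The guard behavior in the bullets above (temperature validation and NaN-safe greedy decoding) amounts to roughly the following; this is an illustrative Python sketch, not the actual Rust `inference_guard` module:

```python
import math

def safe_temperature(t: float) -> float:
    """NaN/Inf/negative temperature falls back to 0.0 (greedy decoding)."""
    if math.isnan(t) or math.isinf(t) or t < 0.0:
        return 0.0
    return t

def nan_safe_argmax(logits: list[float]) -> int:
    """Greedy token pick that skips NaN/Inf logits instead of propagating them."""
    best_i, best_v = 0, -math.inf
    for i, v in enumerate(logits):
        if math.isfinite(v) and v > best_v:
            best_i, best_v = i, v
    return best_i

nan_safe_argmax([float("nan"), 1.0, 2.0, float("inf")])  # → 2
```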

GUI Quality Improvements

  • Error recovery UX: retry buttons for model load failures and generation errors
  • Accessibility: WCAG AA contrast, aria-live, aria-current, focus management
  • First-run experience: the welcome wizard flows directly into the model browser
  • ARIA: tablist roles (role="tablist" / role="tab" / aria-selected)

TTT Integration

  • TTT enable/disable for safetensors models (no longer GGUF-only)
  • TTTLayer use_ttt flag for runtime inner-loop control

One-Click Install

  • scripts/install.sh (Linux/macOS) + scripts/install.ps1 (Windows)
  • Homebrew tap (imonoonoko/homebrew-bitllama) + auto-update workflow
  • winget manifests submitted (CLI #338557, Desktop #338558)

Stats

  • 480 lib tests + 42 desktop tests passing
  • All clippy checks clean (main, web, desktop)

Full Changelog: v0.14.0...v0.15.0

v0.14.0

11 Feb 11:20


Full Changelog: v0.13.0...v0.14.0

v0.13.0

11 Feb 11:35


Full Changelog: v0.12.0...v0.13.0

v0.12.0

07 Feb 15:46


Full Changelog: v0.11.0...v0.12.0

v0.11.0

07 Feb 15:05


Full Changelog: v0.10.0...v0.11.0

v0.9.0 — BitLlama Desktop: Phase 14 Complete

06 Feb 16:27


BitLlama Desktop is now a fully featured local LLM application with hardware auto-detection, model management, and a Japanese UI.

New Features

Desktop GUI (Phase 14: 11/11 tasks complete)

  • Hardware auto-detection: RAM/VRAM detection with model recommendations in sidebar
  • Welcome wizard: Language → HW detection → model recommendation → download → first chat in about 3 minutes
  • Model browser: Local models tab + HuggingFace download tab with progress bar
  • Model download manager: Background download with speed display and progress events
  • Japanese/English i18n: Full UI translation — first local LLM tool with Japanese UI
  • Chat history persistence: Conversations saved to localStorage
  • Settings panel: GPU configuration, generation parameters, theme, language
  • Custom branding: BitLlama icon set (dark circle + blue "B" + ternary dots)
  • Error classification: User-friendly error messages with actionable guidance

Engine Improvements

  • BitNet architecture foundation: ModelArch::BitNet, ActivationType::ReLuSquared (relu(x)²)
  • Model size guards: Warning/block for post-training conversion of models < 7B
  • Memory optimization: Pre-allocated tensors in learn command, --max-tokens option for memory-constrained devices (Issue #9)
  • GGUF variant support: Web module now handles UnifiedModel::Gguf in all match expressions
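For reference, the `ReLuSquared` activation listed above is just the square of a standard ReLU, i.e. max(0, x)². A one-line sketch (not the engine's tensor implementation):

```python
def relu_squared(x: float) -> float:
    """relu(x)^2 = max(0, x)^2, the activation used by BitNet-style models."""
    return max(0.0, x) ** 2

relu_squared(-2.0), relu_squared(3.0)  # → (0.0, 9.0)
```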

CI/Quality

  • CI fully green: cargo fmt + clippy (main + web + desktop) + cargo audit + 4-platform builds
  • /check command mirrors CI exactly (8-step verification)
  • Security: Updated bytes (RUSTSEC-2026-0007) and time (RUSTSEC-2026-0009)

Upgrade Notes

No breaking changes from v0.8.0. All existing CLI commands work as before.

New CLI option for learn command:

# Limit token count for memory-constrained devices (e.g. Termux/Android)
bitllama learn "text" --model model.gguf --max-tokens 128

Full Changelog

30 commits since v0.8.0 — see compare

v0.8.0 - Pure Rust CLI & Soul Learning

05 Feb 15:35


Bit-TTT-Engine v0.8.0

The Soul Edition — Pure Rust CLI with personality learning capabilities.

🎉 Highlights

Pure Rust CLI

bitllama run llama3              # Ollama-style inference
bitllama learn "My name is Onoko"  # Teach your AI
bitllama serve                   # OpenAI-compatible API
bitllama pull meta-llama/...     # Download from HuggingFace

Soul Learning

  • In-context learning: Teach facts with bitllama learn
  • Cross-session persistence: Knowledge survives restarts
  • Minimal overhead: Only 3.8% speed impact

Multi-turn Conversations

  • Full conversation history support
  • All chat templates (Llama-2/3, Gemma, Mistral, Qwen)

📊 Performance

| Model | Speed | vs llama.cpp |
| --- | --- | --- |
| Llama-2 7B Q4_K_M | 45.4 tok/s | 90% |
| Gemma-2 2B Q4_K_M | 75.1 tok/s | 74% |

📦 Installation

# From source (recommended)
cargo install --path crates/bit_llama

# Python bindings
pip install cortex_rust

What's New

  • bitllama run — Interactive chat
  • bitllama learn — Soul learning
  • bitllama soul — Soul management
  • bitllama serve — OpenAI API server
  • bitllama pull — HuggingFace model download
  • bitllama list — List local models
  • True SSE streaming (mpsc channels)
  • Multi-turn conversation support

Full Changelog: v0.7.0...v0.8.0

v0.6.0 - Performance & Python/CUDA

30 Jan 19:56


Release Notes: v0.6.0

Release Date: 2026-01-31
Theme: Performance Optimization & Python/CUDA Support


🎯 Highlights

This release focuses on performance optimization and ecosystem expansion:

  • 🐍 Python Bridge: Use Bit-TTT from Python with pip install
  • 🎮 CUDA GPU: 22x faster inference on NVIDIA GPUs
  • 🦊 Gemma Support: Run Gemma and Gemma2 models
  • Flash Attention: Memory-efficient attention for long sequences

✨ New Features

🐍 Python Bridge (PyO3)

Install and use Bit-TTT directly from Python:

from bit_ttt_engine import BitLlama

# Load model
model = BitLlama.load("gemma-2-2b-it-Q4_K_M.gguf")

# Generate text
output = model.generate("Hello, how are you?", max_tokens=100)
print(output)

Installation:

cd crates/rust_engine
pip install maturin
maturin develop --release

🎮 CUDA GPU Acceleration

  • Automatic GPU detection
  • 22x faster matmul on RTX 4060 Ti
  • Hybrid CPU/GPU inference support

Build with CUDA:

cargo build --release --features cuda

🦊 Gemma/Gemma2 Architecture Support

  • Auto-detect Gemma, Gemma2 from GGUF metadata
  • GeGLU activation for Gemma models
  • Tied embeddings support
  • Verified with gemma-2-2b-it-Q4_K_M.gguf

⚡ Performance Optimization (Phase 5)

| Feature | Benefit |
| --- | --- |
| Flash Attention | O(n) memory vs O(n²) for long sequences |
| Continuous Batching | Multi-request server deployments |
| Speculative Decoding | 2-3x speedup framework (draft model ready) |
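The speculative-decoding framework follows the usual draft/verify pattern: a small draft model proposes a few tokens cheaply, and the large target model validates them. A toy greedy sketch of that general technique (not BitLlama's implementation) looks like this:

```python
def speculative_step(draft, target, context, k=4):
    """One speculative step: the draft proposes k tokens greedily; the
    target keeps the longest agreeing prefix, then substitutes its own
    token at the first disagreement."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        expected = target(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # target's token at first mismatch
            break
    return accepted

# Toy "model": the next token is just len(context) mod 3.
target = lambda ctx: len(ctx) % 3
accepted = speculative_step(target, target, [0], k=4)  # perfect draft → [1, 2, 0, 1]
```

A real implementation verifies all k proposed positions in a single batched forward pass of the target model; when the draft usually agrees, that is where the 2-3x speedup comes from.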

📊 Benchmarks

GPU vs CPU (RTX 4060 Ti)

| Operation | CPU | GPU | Speedup |
| --- | --- | --- | --- |
| MatMul (4096x4096) | 45 ms | 2 ms | 22x |
| Inference (TinyLlama) | 2.4 tok/s | TBD | - |

Memory Usage

| Model | Format | VRAM |
| --- | --- | --- |
| TinyLlama 1.1B | Q4_K_M | 1.5 GB |
| Gemma2 2B | Q4_K_M | 2.5 GB |

🧪 Testing

  • 14 new tests for Flash Attention, Scheduler, Speculative Decoding
  • All CI checks passing
  • E2E tested with Gemma2 GGUF

📦 Installation

From Source (Rust)

git clone https://github.com/imonoonoko/Bit-TTT-Engine.git
cd Bit-TTT-Engine
cargo build --release

From Source (Python)

cd crates/rust_engine
pip install maturin
maturin develop --release

Pre-built Binaries

Download from GitHub Releases.


⚠️ Known Issues

  1. CUDA + VS 2022 18.x: Requires -allow-unsupported-compiler flag
  2. Flash Attention GPU: CPU-only for now, GPU kernel coming in v0.7.0

🔮 What's Next (v0.7.0)

  • TTT effect validation benchmark
  • GPU Flash Attention kernel
  • LoRA fine-tuning support
  • model.adapt() API

🙏 Contributors

Thanks to everyone who contributed to this release!


Full Changelog: CHANGELOG.md
Documentation: README.md