lilbee: a single-executable local search engine you can talk to, running every model on llama.cpp #24625
tobocop2 started this conversation in Show and tell
Thanks for llama.cpp. It's the inference engine lilbee runs every local model on. lilbee is a local-first search engine you can talk to: it runs and manages its own models, indexes your files and code, crawls the websites you point it at into a searchable library, and answers with a citation to the source. The model manager is built on llama.cpp, so one program browses Hugging Face, pulls models, and runs them on Metal, Vulkan, or CUDA. Without llama.cpp there is no lilbee.
lilbee ships as a single executable that bundles its own llama.cpp build. Hugging Face hosts GGUF models in far more architectures than the pinned runtime supports at any given time, so lilbee reads each model's architecture before downloading it and tags incompatible models in the catalog. That way you don't sit through a multi-GB download only to hit "unsupported architecture" at load time. I'm now moving to a llama-server and llama-swap setup so the same single binary scales from a laptop to a multi-GPU machine. I recently verified MiniMax M2 split across three GPUs, all managed by lilbee.
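A minimal Python sketch of a pre-download architecture check like the one described above. It assumes the catalog can fetch just the first few kilobytes of a GGUF file (for example with an HTTP Range request) and parses the GGUF header from those bytes: the `GGUF` magic, a version, a tensor count, a metadata count, then typed key/value pairs, all little-endian. It scans for the `general.architecture` key, which llama.cpp's converters typically write among the first keys. The function names, the prefix-fetching approach, and the caller-supplied `supported` set are illustrative assumptions, not lilbee's actual code, and GGUF v1 files are not handled.

```python
import io
import struct
from typing import BinaryIO, Iterable, Optional

# Byte sizes of fixed-width GGUF metadata value types (type IDs from the GGUF spec).
_FIXED_SIZES = {
    0: 1, 1: 1,    # UINT8, INT8
    2: 2, 3: 2,    # UINT16, INT16
    4: 4, 5: 4,    # UINT32, INT32
    6: 4, 7: 1,    # FLOAT32, BOOL
    10: 8, 11: 8,  # UINT64, INT64
    12: 8,         # FLOAT64
}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, n: int) -> bytes:
    data = f.read(n)
    if len(data) != n:
        # The prefix ended mid-header: the caller should fetch more bytes.
        raise EOFError("GGUF header truncated; fetch a larger prefix")
    return data


def _u32(f: BinaryIO) -> int:
    return struct.unpack("<I", _read(f, 4))[0]


def _u64(f: BinaryIO) -> int:
    return struct.unpack("<Q", _read(f, 8))[0]


def _string(f: BinaryIO) -> str:
    # GGUF strings are a uint64 byte length followed by UTF-8 bytes.
    return _read(f, _u64(f)).decode("utf-8")


def _skip_value(f: BinaryIO, vtype: int) -> None:
    if vtype in _FIXED_SIZES:
        _read(f, _FIXED_SIZES[vtype])
    elif vtype == _STRING:
        _read(f, _u64(f))
    elif vtype == _ARRAY:
        elem_type, count = _u32(f), _u64(f)
        if elem_type in _FIXED_SIZES:
            _read(f, _FIXED_SIZES[elem_type] * count)
        else:  # arrays of strings (or nested arrays) are skipped element by element
            for _ in range(count):
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def architecture_from_prefix(prefix: bytes) -> Optional[str]:
    """Return general.architecture from the first bytes of a GGUF file, or None."""
    f = io.BytesIO(prefix)
    if _read(f, 4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _u32(f) < 2:
        raise ValueError("GGUF v1 headers are not handled in this sketch")
    _u64(f)  # tensor count, unused here
    for _ in range(_u64(f)):  # number of metadata key/value pairs
        key, vtype = _string(f), _u32(f)
        if key == "general.architecture" and vtype == _STRING:
            return _string(f)
        _skip_value(f, vtype)
    return None


def is_supported(arch: Optional[str], supported: Iterable[str]) -> bool:
    """True if the architecture was found and the pinned runtime supports it."""
    return arch is not None and arch in set(supported)
```

A catalog built this way would tag an entry as incompatible when `is_supported(architecture_from_prefix(prefix), supported)` is false, and retry with a larger prefix on `EOFError`. Reading only the header keeps the check to a few kilobytes per model instead of a multi-GB pull.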
Site: https://lilbee.sh. Repo: https://github.com/tobocop2/lilbee. Multi-GPU serving is here: tobocop2/lilbee#267
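In a llama-server and llama-swap setup like the one described above, each model typically runs in its own llama-server process, which llama-swap starts and stops on demand. The hedged Python sketch below shows how a launcher might assemble the llama-server command line for a model split across several GPUs. `-m`, `--port`, `-ngl`, `--split-mode layer`, and `--tensor-split` are real llama-server flags. The helper names, weighting the split by each card's VRAM, and the `999` layer count are illustrative assumptions, not what lilbee#267 actually does.

```python
from typing import List, Sequence


def tensor_split(vram_gib: Sequence[float]) -> str:
    """Format per-GPU VRAM sizes as a --tensor-split proportion list."""
    if not vram_gib or any(v <= 0 for v in vram_gib):
        raise ValueError("need a positive VRAM size for every GPU")
    # llama.cpp treats these values as relative proportions, so raw GiB works.
    return ",".join(f"{v:g}" for v in vram_gib)


def llama_server_argv(model_path: str, port: int, vram_gib: Sequence[float]) -> List[str]:
    """Build a llama-server command line that offloads all layers to the GPUs."""
    split = tensor_split(vram_gib)  # also validates the GPU list
    argv = [
        "llama-server",
        "-m", model_path,
        "--port", str(port),
        "-ngl", "999",  # larger than typical layer counts, so every layer is offloaded
    ]
    if len(vram_gib) > 1:
        # Split whole layers across GPUs, weighted by each card's VRAM.
        argv += ["--split-mode", "layer", "--tensor-split", split]
    return argv
```

With three 24 GiB cards, `llama_server_argv("model.gguf", 8080, [24, 24, 24])` passes `--tensor-split 24,24,24`, an even three-way split; on a single-GPU laptop the split flags are simply omitted, so one code path covers both machines.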