Perform intelligent research over document collections using hybrid search and LLMs.
Features
- Hybrid search: Combines lexical and semantic search.
- Retrieve with chunks, generate with documents: Finds relevant documents via chunks but lets you chat with whole documents.
- User control: Choose which documents to include in the LLM context.
- Multiple LLM support: Easily switch between hundreds of commercial and open-source models via OpenRouter.
- Local usage: Uses local embedding models and can be adapted for local LLMs.
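Hybrid search means fusing two independent rankings: a lexical one (e.g. BM25) and a semantic one (embedding similarity). Weaviate handles this internally; the sketch below only illustrates the general idea with reciprocal rank fusion, one common fusion method, on made-up rankings:

```python
# Illustrative hybrid-search fusion: merge a lexical and a semantic
# ranking of document ids with reciprocal rank fusion (RRF).

def rrf_fuse(rankings, k=60):
    """Combine several ranked lists into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents high in any list get a large 1/(k + rank) boost.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_a", "doc_b", "doc_c"]   # e.g. BM25 order
semantic = ["doc_c", "doc_a", "doc_d"]  # e.g. embedding-similarity order
fused = rrf_fuse([lexical, semantic])
print(fused[0])  # doc_a: ranked highly in both lists
```

Documents that score well in both rankings rise to the top, which is why hybrid search often beats either method alone.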
To install the project and its dependencies:
```bash
git clone https://github.com/machinelearningZH/document-research-tool.git
cd document-research-tool
pip3 install uv
uv sync
```
Fill in your configuration values:

- Copy `.env_example` to `.env` and set your `OPEN_ROUTER_API_KEY` (get one from OpenRouter).
- Edit `config.yaml` to choose your preferred models, embedding settings, and search parameters.
- Set `embedding.platform` in `config.yaml` to `mps` for Apple Silicon, `cuda` for Nvidia GPUs, or `cpu` for CPU-only systems. If you get `RuntimeError: PyTorch is not linked with support for mps/cuda devices`, change it to `cpu`.
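The configuration steps above might correspond to a `config.yaml` fragment like the following. Only `embedding.platform` and `paths.weaviate_index_dir` are named in this README; the surrounding structure and values are illustrative:

```yaml
embedding:
  platform: mps   # mps (Apple Silicon), cuda (Nvidia GPU), or cpu

paths:
  weaviate_index_dir: .local/share/weaviate/
```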
Start the app:
```bash
uv run shiny run research_app.py
```
The app will be available at http://127.0.0.1:8000/.
**Tip:** To disable logging of user interactions, comment out the `log_interaction` function call in `research_app.py`.
- The app works out of the box with the provided sample data (several hundred «Kantonsratsprotokolle» available as Open Government Data).
- To use your own documents, run `01_index_data.ipynb` to preprocess your data and create a Weaviate search index.
- By default, Weaviate index data is stored in `.local/share/weaviate/`.
- For remote deployment, copy the index data to the same path or adjust `paths.weaviate_index_dir` in `config.yaml`.
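The preprocessing in `01_index_data.ipynb` splits documents into chunks that keep a reference to their parent document, so that search hits can later be resolved back to full documents. A minimal sketch of that idea (the notebook's actual chunking logic and parameters may differ):

```python
# Illustrative chunker: overlapping word windows, each tagged with its
# parent document id. Not the notebook's actual implementation.

def chunk_document(doc_id, text, size=200, overlap=50):
    """Split text into chunks of `size` words with `overlap` shared words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunk_words = words[start:start + size]
        chunks.append({"doc_id": doc_id, "text": " ".join(chunk_words)})
    return chunks

chunks = chunk_document("protokoll_001", "word " * 450)
print(len(chunks))  # 3 overlapping chunks, each pointing back to its document
```

Keeping `doc_id` on every chunk is what makes the "retrieve with chunks, generate with documents" pattern possible.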
Many cantonal customers have large, mixed-source document collections and need smarter research tools. This template enables rapid prototyping and adaptation to different corpora.
- LLMs make mistakes, so we focus on assistive systems that help domain experts search and draft answers rather than doing their work for them. We therefore want to give users maximum control and transparency.
- Most RAG templates combine retrieval and generation, limiting user control. Our approach separates search and answer generation, letting users select which sources to submit to the LLM, resulting in higher-quality answers.
- We never ingest document collections blindly. Most project time is spent understanding user needs and preparing data.
- Intelligent search (lexical/semantic) alone often delivers significant value.
- We search via chunks but provide full documents as context. We find that this particularly improves answers in legal workflows, where whole documents such as decisions often need to be taken into account for good results.
- Most administrative data is confidential, so our pilots (and this template too) are designed for easy local, self-hosted deployment.
- OpenRouter allows quick comparison of many hundreds of commercial and open-source LLMs, helping users assess answer quality for local deployments too.
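The separation described above can be sketched as follows: the user picks which retrieved documents to include, their full texts become the LLM context, and switching models via OpenRouter is just a different model string in an OpenAI-compatible request body. Document contents and the model id are placeholders; see `research_app.py` for the app's actual logic:

```python
# Illustrative generation step: user-selected full documents form the
# context of an OpenAI-compatible chat request (the format OpenRouter uses).

documents = {
    "doc_a": "Full text of the first Kantonsratsprotokoll ...",
    "doc_b": "Full text of the second protocol ...",
    "doc_c": "A retrieved document the user chose not to include ...",
}
selected = ["doc_a", "doc_b"]  # chosen by the user, not by the retriever

context = "\n\n---\n\n".join(documents[d] for d in selected)
payload = {
    "model": "openai/gpt-4o-mini",  # any OpenRouter model id could go here
    "messages": [
        {"role": "system", "content": "Answer only from the provided documents."},
        {"role": "user", "content": f"{context}\n\nQuestion: ..."},
    ],
}
print(len(payload["messages"]))  # system prompt plus user turn
```

Because the context is assembled from explicitly selected documents, the user can see and control exactly what the model bases its answer on.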
Chantal Amrhein, Patrick Arnecke – Statistisches Amt Zürich: Team Data
We welcome feedback and contributions! Email us or open an issue or pull request.
We use ruff for linting and formatting.
This project is licensed under the MIT License. See the LICENSE file for details.
This software (the Software) incorporates models (Models) from spacy.io and others and has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.
