A local-first arXiv feed with AI-written plain-language summaries — running entirely on your own hardware.
Curate research interests, and Research Digest keeps a local corpus of matching papers, each with a plain-English summary, a layman explanation, topical tags, and semantically related papers. Desktop grid for deep reading, mobile feed for quick scrolling, keyword filtering on both. No API keys, no cloud. Summaries and embeddings are generated by turbolab, a self-hosted OpenAI-compatible model server.
- 🧠 Real AI summaries — summary, layman explanation, difficulty, and tags from your own LLM via turbolab. Nothing leaves your network.
- 🔎 Search & relate — client-side keyword filter on every page; "related papers" precomputed from e5 embeddings.
- 🗃 Corpus-first — SQLite is the source of truth. The site renders from the DB, so a failed arXiv fetch never wipes good output.
- 🛡 Resilient pipeline — 429 backoff, upsert-never-delete, and a render step that refuses to publish an empty digest.
- 📱 Desktop + mobile — multi-column grid and a full-screen swipeable feed.
- ⚙️ Configurable — JSON interests, keyword scoring, look-back window.
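
The "upsert-never-delete" behaviour in the list above can be sketched with SQLite's `INSERT ... ON CONFLICT DO UPDATE` (SQLite 3.24 or newer). This is only an illustration: the `papers` table, its columns, and the `upsert_paper` helper are assumptions, not the project's actual schema, which lives in `db.py`.

```python
import sqlite3

def init_db(conn):
    # Hypothetical schema for illustration; the real one is defined in db.py.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        " arxiv_id TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " abstract TEXT NOT NULL,"
        " summary TEXT)"
    )

def upsert_paper(conn, arxiv_id, title, abstract):
    # Insert a new paper, or refresh the title/abstract of an existing one.
    # The summary column is left untouched and no row is ever deleted.
    conn.execute(
        "INSERT INTO papers (arxiv_id, title, abstract) VALUES (?, ?, ?) "
        "ON CONFLICT(arxiv_id) DO UPDATE SET "
        "title = excluded.title, abstract = excluded.abstract",
        (arxiv_id, title, abstract),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
init_db(conn)
upsert_paper(conn, "2401.00001", "Old title", "Old abstract")
conn.execute("UPDATE papers SET summary = 'kept' WHERE arxiv_id = '2401.00001'")
upsert_paper(conn, "2401.00001", "New title", "New abstract")
```

Re-fetching the same paper updates its metadata while the previously generated summary survives, which is what makes a repeated or partial fetch safe.
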
SQLite holds the corpus. The pipeline is a chain of independent, idempotent stages, and only the first one needs internet access; `summarize.py` and `embed.py` talk only to your own turbolab server:
| Stage | Network? | Does |
|---|---|---|
| `fetch.py` | arXiv | Upsert new papers (original abstract). 429 backoff; never deletes. |
| `summarize.py` | turbolab | Fill missing summaries/layman/difficulty/tags via `/v1/chat/completions`. |
| `embed.py` | turbolab | Fill missing vectors via `/v1/embeddings` (e5, with `passage: ` prefix). |
| `relate.py` | none | Precompute nearest-neighbour papers (cosine) for "related". |
| `render.py` | none | Build the static site from the DB. Atomic writes; refuses to publish empty. |

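
The `relate.py` row describes a cosine nearest-neighbour lookup. A dependency-free sketch of that idea follows; the `related` function, the dict-of-vectors input, and the toy two-dimensional vectors are all assumptions for illustration. The project lists numpy as a dependency, so the real stage may well be vectorised.

```python
import math

def cosine(a, b):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def related(vectors, k=3):
    """Map each paper id to the ids of its k most similar other papers."""
    out = {}
    for pid, vec in vectors.items():
        # Score every other paper; a paper is never related to itself.
        scored = [(cosine(vec, other), oid)
                  for oid, other in vectors.items() if oid != pid]
        scored.sort(key=lambda s: s[0], reverse=True)
        out[pid] = [oid for _, oid in scored[:k]]
    return out

# Toy embeddings: "a" and "b" point the same way, "c" is orthogonal to "a".
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbours = related(vectors, k=1)
```

Precomputing these lists once per run means the rendered pages need no embedding model or similarity search at view time.
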
run.sh chains them; a fetch failure is logged and the rest still run on the existing corpus.
Because nothing after fetch needs the internet, a multi-day arXiv rate limit just means "no new
papers"; the site stays fully live.
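
The 429 handling in the fetch stage can be pictured as exponential backoff around the request. The sketch below is an assumption about the general shape, not the code in `fetch.py`: `fetch_with_backoff`, its parameters, and the fake request used in the example are hypothetical, and the real delays and retry counts are not stated in this README.

```python
import random
import time

def fetch_with_backoff(do_request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call do_request() until it returns a status other than 429.

    do_request is any callable returning (status_code, body).
    """
    for attempt in range(max_tries):
        status, body = do_request()
        if status != 429:
            return status, body
        # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        sleep(delay)
    raise RuntimeError("still rate limited after %d tries" % max_tries)

# Example with a fake request that is rate limited twice, then succeeds.
# Passing sleep=delays.append records the waits instead of actually sleeping.
responses = iter([(429, ""), (429, ""), (200, "<feed/>")])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
```

If the retries are exhausted, the caller can log the error and move on, which matches the behaviour described above where later stages still run on the existing corpus.
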
You need a reachable turbolab server (chat model for summaries, e5 model for embeddings).
```bash
git clone https://github.com/usr-wwelsh/research-digest.git
cd research-digest
python3 -m venv venv && ./venv/bin/pip install -r requirements.txt

# point at your turbolab server (kept out of git)
echo 'export TURBOLAB_URL=http://YOUR_HOST:7860' > .env

./run.sh            # fetch -> summarize -> embed -> relate -> render
# or, with no internet access:
./run.sh --offline  # rebuild the site from the existing corpus
```

Open index.html (landing), latest.html (digest), archive.html, or feed.html (mobile).

Migrating from v1? Recover your old backlog from the archived HTML with zero arXiv calls:
```bash
./venv/bin/python migrate_from_html.py  # parses arxiv_archive/*.html into digest.db
./run.sh --offline                      # summarize + embed + render the backlog
```

(The original abstracts aren't in the old HTML, so salvaged papers are flagged
`needs_abstract_backfill`; `./venv/bin/python fetch.py --backfill` refetches them in batches.)

```json
{
  "interests": {
    "Efficient ML / Edge AI": {
      "query": "cat:cs.LG OR cat:cs.CV OR cat:cs.CL",
      "keywords": ["efficient", "edge", "quantization", "distillation"]
    }
  },
  "settings": {
    "papers_per_interest": 25,
    "recent_days": 7,
    "fetch_multiplier": 3
  },
  "turbolab": {
    "url": "http://localhost:7860",
    "passage_prefix": "passage: ",
    "query_prefix": "query: "
  }
}
```

The turbolab URL is normally set per-deployment via the gitignored `.env` (`TURBOLAB_URL`),
which overrides `config.json`, so a private LAN address never lands in the repo.

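
The override order described above (environment first, then `config.json`) could be resolved along these lines. The helper name `resolve_turbolab_url` and the trailing-slash trimming are assumptions for illustration; how the project actually loads `.env` is not shown in this README.

```python
import os

def resolve_turbolab_url(config, env=None):
    """Prefer TURBOLAB_URL from the environment, else fall back to config.json."""
    env = os.environ if env is None else env
    override = env.get("TURBOLAB_URL", "").strip()
    url = override or config["turbolab"]["url"]
    # Trim a trailing slash so endpoint paths can be appended cleanly.
    return url.rstrip("/")

# config mirrors the "turbolab" block of config.json shown above.
config = {"turbolab": {"url": "http://localhost:7860"}}
from_config = resolve_turbolab_url(config, env={})
from_env = resolve_turbolab_url(config, env={"TURBOLAB_URL": "http://192.168.1.20:7860/"})
```

Here `from_config` falls back to the committed default, while `from_env` picks up the private address that only exists in the deployment's environment.
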
| Setting | Default | Description |
|---|---|---|
| `papers_per_interest` | 25 | Papers kept per interest per fetch |
| `recent_days` | 7 | Look-back window (0 = all time) |
| `fetch_multiplier` | 3 | Over-fetch, then keyword-rank, then trim |

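
The "over-fetch, then keyword-rank, then trim" step can be sketched as follows. With the defaults above this would presumably mean fetching about 25 × 3 = 75 candidates per interest and keeping the best 25. The scoring rule here (one point per keyword found in the title or abstract) and the function names are assumptions; the project's actual scoring is not documented in this README.

```python
def keyword_score(paper, keywords):
    """Count how many keywords occur in the title or abstract (case-insensitive)."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    return sum(1 for kw in keywords if kw.lower() in text)

def rank_and_trim(papers, keywords, papers_per_interest):
    """Sort over-fetched papers by keyword score and keep the top N."""
    # sorted() is stable, so papers with equal scores keep their fetch order.
    ranked = sorted(papers, key=lambda p: keyword_score(p, keywords), reverse=True)
    return ranked[:papers_per_interest]

papers = [
    {"title": "A survey of transformers", "abstract": "Broad overview."},
    {"title": "Edge quantization", "abstract": "Efficient inference on edge devices."},
    {"title": "Distillation tricks", "abstract": "Smaller students."},
]
keywords = ["efficient", "edge", "quantization", "distillation"]
top = rank_and_trim(papers, keywords, papers_per_interest=2)
```

Over-fetching gives the ranking step a larger pool to choose from, so a broad category query still yields papers that match the interest's keywords.
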
arXiv query syntax: combine category codes with OR/AND, e.g. `cat:cs.LG OR cat:cs.AI`
(see arXiv's full category taxonomy).
From your Proxmox host:
```bash
bash <(curl -sL https://raw.githubusercontent.com/usr-wwelsh/Research-Digest/main/create-lxc.sh)
```

This creates a Debian LXC, installs Caddy + cloudflared, sets up the venv, configures the
weekly cron (Monday 8am), and serves on :8080. After it finishes, set `TURBOLAB_URL` in
`/opt/research-digest/.env`, edit `config.json`, and run `sudo -u www-data /opt/research-digest/run.sh`.

Idle footprint is ~50–80MB (Caddy + cloudflared) — the heavy ML lives in turbolab on another host, so the digest container stays tiny.
```
research-digest/
├── config.json           # interests, settings, turbolab block
├── db.py                 # SQLite schema + data access (source of truth)
├── fetch.py              # arXiv ingest (backoff, upsert, --backfill)
├── summarize.py          # turbolab summaries
├── embed.py              # turbolab e5 embeddings
├── relate.py             # nearest-neighbour "related papers"
├── render.py             # static site builder (atomic, refuses empty)
├── migrate_from_html.py  # one-time v1 backlog salvage
├── turbolab.py           # turbolab client (chat + embeddings)
├── templates/            # Jinja2 templates (autoescaped)
├── run.sh                # pipeline runner (cron entrypoint)
├── setup.sh              # LXC bootstrap
├── create-lxc.sh         # Proxmox LXC creator
├── Caddyfile             # static file server
└── digest.db             # the corpus (gitignored)
```

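
The "atomic, refuses empty" note on `render.py` in the tree above can be illustrated with a write-to-temp-then-rename pattern. This is a sketch under assumptions: the `publish` function, its `paper_count` argument, and the file mode are hypothetical and not taken from the project.

```python
import os
import tempfile

def publish(path, html, paper_count):
    """Write html to path atomically; refuse to publish an empty digest."""
    if paper_count == 0:
        raise ValueError("refusing to publish an empty digest")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so readers see either the old page or the new one, never a partial file.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(html)
        os.chmod(tmp, 0o644)   # mkstemp creates 0600; make it web-readable
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

Checking the paper count before touching the filesystem is what keeps a previously good page in place when a run produces nothing.
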
- Python 3.9+ · deps: `requests`, `jinja2`, `numpy` (no torch; the ML is in turbolab)
- A reachable turbolab server
- Internet only for the `fetch` stage
MIT — see LICENSE.
Built for researchers who want to stay current — on their own hardware, with no gatekeepers.