📚 Research Digest

A local-first arXiv feed with AI-written plain-language summaries — running entirely on your own hardware.

Curate your research interests, and Research Digest keeps a local corpus of matching papers, each with a plain-English summary, a layman explanation, topical tags, and semantically related papers. Desktop grid for deep reading, mobile feed for quick scrolling, full-text filtering on both. No API keys, no cloud. Summaries and embeddings are generated by turbolab, a self-hosted OpenAI-compatible model server.

✨ Features

🧠 Real AI summaries — summary, layman explanation, difficulty, and tags from your own LLM via turbolab. Nothing leaves your network.
🔎 Search & relate — client-side keyword filter on every page; "related papers" precomputed from e5 embeddings.
🗃 Corpus-first — SQLite is the source of truth. The site renders from the DB, so a failed arXiv fetch never wipes good output.
🛡 Resilient pipeline — 429 backoff, upsert-never-delete, and a render step that refuses to publish an empty digest.
📱 Desktop + mobile — multi-column grid and a full-screen swipeable feed.
⚙️ Configurable — JSON interests, keyword scoring, look-back window.
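
The "upsert-never-delete" behaviour listed above can be sketched with SQLite's `ON CONFLICT` clause. This is a minimal illustration, not the project's actual `db.py`: the `papers` table, its columns, and `upsert_paper` are hypothetical names, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical schema: the real db.py may use different tables and columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL,
    abstract TEXT NOT NULL,
    summary  TEXT            -- filled in later by the summarize stage
)
"""


def upsert_paper(conn, arxiv_id, title, abstract):
    """Insert a paper, or refresh its metadata if it is already stored.

    Only title and abstract are overwritten, so a summary written by a
    later stage survives a re-fetch. No rows are ever deleted.
    """
    conn.execute(
        """
        INSERT INTO papers (arxiv_id, title, abstract)
        VALUES (?, ?, ?)
        ON CONFLICT(arxiv_id) DO UPDATE SET
            title = excluded.title,
            abstract = excluded.abstract
        """,
        (arxiv_id, title, abstract),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_paper(conn, "2401.00001", "Old title", "Abstract v1")
# Simulate the summarize stage filling in a summary.
conn.execute("UPDATE papers SET summary = 'kept' WHERE arxiv_id = '2401.00001'")
# A second fetch of the same paper updates metadata but keeps the summary.
upsert_paper(conn, "2401.00001", "New title", "Abstract v2")
print(conn.execute("SELECT title, summary FROM papers").fetchall())
# [('New title', 'kept')]
```

Because the conflict branch touches only the fetched columns, re-running the fetch stage is idempotent with respect to everything the later stages have written.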

🧩 How it works

SQLite holds the corpus. The pipeline is a chain of independent, idempotent stages, and only the first one touches the public internet (summarize and embed talk only to your own turbolab server):

| Stage | Network? | What it does |
| --- | --- | --- |
| `fetch.py` | arXiv | Upsert new papers (original abstract). 429 backoff; never deletes. |
| `summarize.py` | turbolab | Fill missing summaries/layman/difficulty/tags via `/v1/chat/completions`. |
| `embed.py` | turbolab | Fill missing vectors via `/v1/embeddings` (e5, with `passage:` prefix). |
| `relate.py` | none | Precompute nearest-neighbour papers (cosine) for "related". |
| `render.py` | none | Build the static site from the DB. Atomic writes; refuses to publish empty. |
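
As a rough illustration of the relate stage, the sketch below ranks papers by cosine similarity of their embedding vectors. It is written in plain Python so it runs anywhere; the project lists numpy as a dependency, which would vectorise the same idea. The function names and the toy two-dimensional vectors are assumptions for the example, not the contents of `relate.py`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_papers(vectors, k=3):
    """Map each paper id to the ids of its k most similar other papers."""
    related = {}
    for pid, vec in vectors.items():
        scored = [
            (cosine(vec, other), other_id)
            for other_id, other in vectors.items()
            if other_id != pid  # a paper is never related to itself
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        related[pid] = [other_id for _, other_id in scored[:k]]
    return related


# Toy embeddings: paper-a and paper-b point in nearly the same direction.
vectors = {
    "paper-a": [1.0, 0.0],
    "paper-b": [0.9, 0.1],
    "paper-c": [0.0, 1.0],
}
neighbours = related_papers(vectors, k=1)
print(neighbours)
# {'paper-a': ['paper-b'], 'paper-b': ['paper-a'], 'paper-c': ['paper-b']}
```

Precomputing this mapping once per run means the rendered pages need no similarity search at view time.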

run.sh chains them; a fetch failure is logged and the rest still run on the existing corpus. Because nothing after fetch needs the internet, a multi-day arXiv rate limit just means "no new papers"; the site stays fully live.
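
One way to picture the 429 handling is an exponential backoff wrapper that gives up quietly, so the remaining stages can still run. This is a sketch under assumptions: `RateLimited`, `fetch_with_backoff`, and the delay values are invented for illustration, and the real `fetch.py` may behave differently. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch callable on an HTTP 429."""


def fetch_with_backoff(fetch, retries=4, base_delay=3.0, sleep=time.sleep):
    """Call fetch(), retrying with exponentially growing waits on RateLimited.

    Returns fetch()'s result, or None once the retries are exhausted so the
    caller can log the failure and carry on with the existing corpus.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                break  # out of retries: give up without raising
            sleep(base_delay * 2 ** attempt)  # 3s, 6s, 12s, ...
    return None


# Demo: a fake fetch that is rate limited twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return ["2401.00001"]


waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_backoff(flaky, sleep=waits.append)
print(result, waits)
# ['2401.00001'] [3.0, 6.0]
```

Returning `None` instead of raising matches the behaviour described above: a failed fetch is something to log, not a reason to stop the pipeline.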

🚀 Quick Start

You need a reachable turbolab server (chat model for summaries, e5 model for embeddings).

git clone https://github.com/usr-wwelsh/research-digest.git
cd research-digest
python3 -m venv venv && ./venv/bin/pip install -r requirements.txt

# point at your turbolab server (kept out of git)
echo 'export TURBOLAB_URL=http://YOUR_HOST:7860' > .env

./run.sh            # fetch -> summarize -> embed -> relate -> render
# or, skip the arXiv fetch:
./run.sh --offline  # rebuild the site from the existing corpus

Open index.html (landing), latest.html (digest), archive.html, or feed.html (mobile).

Migrating from v1? Recover your old backlog from the archived HTML with zero arXiv calls:

./venv/bin/python migrate_from_html.py   # parses arxiv_archive/*.html into digest.db
./run.sh --offline                       # summarize + embed + render the backlog

(The original abstracts aren't in old HTML, so salvaged papers are flagged needs_abstract_backfill; ./venv/bin/python fetch.py --backfill refetches them in batches.)

⚙️ Configuration

{
  "interests": {
    "Efficient ML / Edge AI": {
      "query": "cat:cs.LG OR cat:cs.CV OR cat:cs.CL",
      "keywords": ["efficient", "edge", "quantization", "distillation"]
    }
  },
  "settings": {
    "papers_per_interest": 25,
    "recent_days": 7,
    "fetch_multiplier": 3
  },
  "turbolab": {
    "url": "http://localhost:7860",
    "passage_prefix": "passage: ",
    "query_prefix": "query: "
  }
}
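
The two prefixes exist because e5 models expect stored documents to be prefixed with `passage: ` and search queries with `query: `. The sketch below shows how the `turbolab` block might feed an OpenAI-style `/v1/embeddings` request. It only builds the request and sends nothing; `embedding_request` and the model name `e5-model-name` are placeholders, not names from the project.

```python
import json


def embedding_request(base_url, model, texts, prefix):
    """Build the URL and JSON body for an OpenAI-style embeddings call."""
    url = base_url.rstrip("/") + "/v1/embeddings"
    # Every input string gets the e5 prefix before it is embedded.
    body = {"model": model, "input": [prefix + text for text in texts]}
    return url, json.dumps(body)


# Mirrors the "turbolab" block of config.json shown above.
turbolab = {
    "url": "http://localhost:7860",
    "passage_prefix": "passage: ",
    "query_prefix": "query: ",
}

url, body = embedding_request(
    turbolab["url"],
    "e5-model-name",  # placeholder: use whatever model your server exposes
    ["Title. Abstract text."],
    turbolab["passage_prefix"],
)
print(url)
# http://localhost:7860/v1/embeddings
print(body)
```

Papers stored in the corpus would use `passage_prefix`; `query_prefix` would apply to any text embedded as a search query.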

The turbolab URL is normally set per-deployment via the gitignored .env (TURBOLAB_URL), which overrides config.json — so a private LAN address never lands in the repo.
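
The override order can be sketched as follows, assuming `run.sh` sources `.env` so that `TURBOLAB_URL` reaches the Python stages as an ordinary environment variable. `resolve_turbolab_url` is a hypothetical helper, not a function from the project.

```python
import os


def resolve_turbolab_url(config, env=None):
    """Prefer TURBOLAB_URL from the environment, else fall back to config.json."""
    env = os.environ if env is None else env
    # An unset or empty variable falls through to the config value.
    return env.get("TURBOLAB_URL") or config.get("turbolab", {}).get("url")


config = {"turbolab": {"url": "http://localhost:7860"}}

print(resolve_turbolab_url(config, env={}))
# http://localhost:7860
print(resolve_turbolab_url(config, env={"TURBOLAB_URL": "http://10.0.0.5:7860"}))
# http://10.0.0.5:7860
```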

| Setting | Default | Description |
| --- | --- | --- |
| `papers_per_interest` | 25 | Papers kept per interest per fetch |
| `recent_days` | 7 | Look-back window (0 = all time) |
| `fetch_multiplier` | 3 | Over-fetch, then keyword-rank, then trim |
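
To make "over-fetch, then keyword-rank, then trim" concrete, here is a minimal sketch. The scoring rule (counting keyword occurrences in title plus abstract) is an assumption chosen for illustration; the project's actual keyword scoring is not documented here and may weight things differently.

```python
def keyword_score(paper, keywords):
    """Count case-insensitive keyword occurrences in the title and abstract."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    return sum(text.count(keyword.lower()) for keyword in keywords)


def select_papers(candidates, keywords, papers_per_interest=25):
    """Rank over-fetched candidates by keyword score and keep the top N.

    sorted() is stable, so papers with equal scores keep their fetch order.
    """
    ranked = sorted(
        candidates,
        key=lambda paper: keyword_score(paper, keywords),
        reverse=True,
    )
    return ranked[:papers_per_interest]


keywords = ["efficient", "edge", "quantization", "distillation"]
candidates = [
    {"title": "A survey of protein folding", "abstract": "Nothing relevant here."},
    {"title": "Knowledge distillation", "abstract": "Distillation of large models."},
    {"title": "Quantization for edge devices",
     "abstract": "Efficient quantization on edge hardware."},
]

kept = select_papers(candidates, keywords, papers_per_interest=2)
print([paper["title"] for paper in kept])
# ['Quantization for edge devices', 'Knowledge distillation']
```

With the defaults above, a fetch would pull 25 × 3 = 75 candidates per interest and keep the 25 best-scoring ones.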

arXiv query syntax: combine category codes with OR/AND, e.g. cat:cs.LG OR cat:cs.AI (full taxonomy).
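
For orientation, this is how such a query string could be turned into a request URL for the public arXiv API. The endpoint and parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) come from the arXiv API; whether `fetch.py` builds its requests this way, and the choice of `papers_per_interest * fetch_multiplier` as the result limit, are assumptions. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(query, papers_per_interest=25, fetch_multiplier=3):
    """Build an arXiv API URL that over-fetches the newest matching papers."""
    params = {
        "search_query": query,
        "start": 0,
        # Assumed: over-fetch so keyword ranking has candidates to trim.
        "max_results": papers_per_interest * fetch_multiplier,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes the colons and turns spaces into "+".
    return ARXIV_API + "?" + urlencode(params)


url = build_query_url("cat:cs.LG OR cat:cs.AI")
print(url)
```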

🔧 Self-hosted deployment (Proxmox LXC)

From your Proxmox host:

bash <(curl -sL https://raw.githubusercontent.com/usr-wwelsh/Research-Digest/main/create-lxc.sh)

This creates a Debian LXC, installs Caddy + cloudflared, sets up the venv, configures the weekly cron (Monday 8am), and serves on :8080. After it finishes, set TURBOLAB_URL in /opt/research-digest/.env, edit config.json, and run sudo -u www-data /opt/research-digest/run.sh.

Idle footprint is ~50–80MB (Caddy + cloudflared) — the heavy ML lives in turbolab on another host, so the digest container stays tiny.

📂 Project structure

research-digest/
├── config.json          # interests, settings, turbolab block
├── db.py                # SQLite schema + data access (source of truth)
├── fetch.py             # arXiv ingest (backoff, upsert, --backfill)
├── summarize.py         # turbolab summaries
├── embed.py             # turbolab e5 embeddings
├── relate.py            # nearest-neighbour "related papers"
├── render.py            # static site builder (atomic, refuses empty)
├── migrate_from_html.py # one-time v1 backlog salvage
├── turbolab.py          # turbolab client (chat + embeddings)
├── templates/           # Jinja2 templates (autoescaped)
├── run.sh               # pipeline runner (cron entrypoint)
├── setup.sh             # LXC bootstrap
├── create-lxc.sh        # Proxmox LXC creator
├── Caddyfile            # static file server
└── digest.db            # the corpus (gitignored)

🛠️ Requirements

Python 3.9+ · deps: requests, jinja2, numpy (no torch — the ML is in turbolab)
A reachable turbolab server
Internet only for the fetch stage

📝 License

MIT — see LICENSE.

🙏 Acknowledgments

arXiv for the open research repository
turbolab for self-hosted, OpenAI-compatible inference

Built for researchers who want to stay current — on their own hardware, with no gatekeepers.
