📚 Research Digest

A local-first arXiv feed with AI-written plain-language summaries — running entirely on your own hardware.

Curate research interests, and Research Digest keeps a local corpus of matching papers, each with a plain-English summary, a layman explanation, topical tags, and semantically-related papers. Desktop grid for deep reading, mobile feed for quick scrolling, full-text filtering on both. No API keys, no cloud. Summaries and embeddings are generated by turbolab, a self-hosted OpenAI-compatible model server.


✨ Features

  • 🧠 Real AI summaries — summary, layman explanation, difficulty, and tags from your own LLM via turbolab. Nothing leaves your network.
  • 🔎 Search & relate — client-side keyword filter on every page; "related papers" precomputed from e5 embeddings.
  • 🗃 Corpus-first — SQLite is the source of truth. The site renders from the DB, so a failed arXiv fetch never wipes good output.
  • 🛡 Resilient pipeline — 429 backoff, upsert-never-delete, and a render step that refuses to publish an empty digest.
  • 📱 Desktop + mobile — multi-column grid and a full-screen swipeable feed.
  • ⚙️ Configurable — JSON interests, keyword scoring, look-back window.
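
The "upsert-never-delete" behaviour can be sketched with SQLite's ON CONFLICT clause (SQLite 3.24 or newer). This is only an illustration: the papers table, its columns, and the upsert_paper helper are assumptions, not the schema in db.py.

```python
import sqlite3

# Hypothetical schema for illustration; the real one lives in db.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id TEXT PRIMARY KEY,
    title    TEXT NOT NULL,
    abstract TEXT NOT NULL,
    summary  TEXT            -- filled in later by the summarize stage
)
"""

def upsert_paper(conn, arxiv_id, title, abstract):
    """Insert a paper, or refresh its metadata if it already exists.

    The summary column is never touched here, so re-fetching a paper
    does not discard work done by later stages.
    """
    conn.execute(
        """
        INSERT INTO papers (arxiv_id, title, abstract)
        VALUES (?, ?, ?)
        ON CONFLICT(arxiv_id) DO UPDATE SET
            title = excluded.title,
            abstract = excluded.abstract
        """,
        (arxiv_id, title, abstract),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_paper(conn, "0000.00001", "Old title", "Abstract v1")
conn.execute("UPDATE papers SET summary = 'A summary' WHERE arxiv_id = '0000.00001'")
upsert_paper(conn, "0000.00001", "New title", "Abstract v2")  # same id: update, not duplicate
row = conn.execute("SELECT title, abstract, summary FROM papers").fetchone()
```

Because the fetch stage only ever inserts or updates, a failed or empty arXiv response leaves every existing row, and its summary, in place.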

🧩 How it works

SQLite holds the corpus. The pipeline is a chain of independent, idempotent stages, and only the first one needs the internet; summarize.py and embed.py talk only to your turbolab server:

| Stage | Network? | Does |
|---|---|---|
| fetch.py | arXiv | Upsert new papers (original abstract). 429 backoff; never deletes. |
| summarize.py | turbolab | Fill missing summaries/layman/difficulty/tags via /v1/chat/completions. |
| embed.py | turbolab | Fill missing vectors via /v1/embeddings (e5, with passage: prefix). |
| relate.py | none | Precompute nearest-neighbour papers (cosine) for "related". |
| render.py | none | Build the static site from the DB. Atomic writes; refuses to publish empty. |
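
The "related papers" step boils down to ranking every other paper by cosine similarity of its embedding. Here is a rough standard-library sketch, assuming embeddings are held in a dict keyed by paper id; the related_papers helper and the toy 2-D vectors are made up for illustration, and relate.py may well do this with numpy instead.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_papers(vectors, k=3):
    """Map each paper id to the ids of its k most similar other papers."""
    related = {}
    for pid, vec in vectors.items():
        scored = [
            (cosine(vec, other), oid)
            for oid, other in vectors.items()
            if oid != pid  # a paper is not related to itself
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        related[pid] = [oid for _, oid in scored[:k]]
    return related

# Toy 2-D "embeddings"; real e5 vectors have hundreds of dimensions.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
related = related_papers(vectors, k=1)
```

Precomputing this once per run means the static pages can list neighbours without any model call at view time.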

run.sh chains them; a fetch failure is logged and the rest still run on the existing corpus. Because no stage after fetch needs arXiv, a multi-day arXiv rate limit just means "no new papers", and the site stays fully live.
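
The 429 handling in the fetch stage might look roughly like the following. This is a sketch under assumptions, not the code in fetch.py: do_request is a hypothetical callable returning (status, retry_after, body), and the attempt count and base delay are arbitrary.

```python
import time

def fetch_with_backoff(do_request, max_attempts=5, base_delay=3.0, sleep=time.sleep):
    """Call do_request() until it stops returning HTTP 429.

    do_request is a hypothetical callable returning (status, retry_after, body),
    where retry_after is the Retry-After value in seconds, or None if absent.
    """
    for attempt in range(max_attempts):
        status, retry_after, body = do_request()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Honour the server's hint if present, else back off exponentially.
            delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
            sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")

# Simulated responses: two rate-limit replies, then success.
responses = iter([(429, None, ""), (429, 10, ""), (200, None, "<feed/>")])
waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
```

If the retries run out, the exception is the "fetch failure" that the runner logs before carrying on with the remaining stages.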


🚀 Quick Start

You need a reachable turbolab server (chat model for summaries, e5 model for embeddings).

git clone https://github.com/usr-wwelsh/research-digest.git
cd research-digest
python3 -m venv venv && ./venv/bin/pip install -r requirements.txt

# point at your turbolab server (kept out of git)
echo 'export TURBOLAB_URL=http://YOUR_HOST:7860' > .env

./run.sh            # fetch -> summarize -> embed -> relate -> render
# or, skip the arXiv fetch:
./run.sh --offline  # rebuild the site from the existing corpus

Open index.html (landing), latest.html (digest), archive.html, or feed.html (mobile).

Migrating from v1? Recover your old backlog from the archived HTML with zero arXiv calls:

./venv/bin/python migrate_from_html.py   # parses arxiv_archive/*.html into digest.db
./run.sh --offline                       # summarize + embed + render the backlog

(The original abstracts aren't in old HTML, so salvaged papers are flagged needs_abstract_backfill; ./venv/bin/python fetch.py --backfill refetches them in batches.)


⚙️ Configuration

{
  "interests": {
    "Efficient ML / Edge AI": {
      "query": "cat:cs.LG OR cat:cs.CV OR cat:cs.CL",
      "keywords": ["efficient", "edge", "quantization", "distillation"]
    }
  },
  "settings": {
    "papers_per_interest": 25,
    "recent_days": 7,
    "fetch_multiplier": 3
  },
  "turbolab": {
    "url": "http://localhost:7860",
    "passage_prefix": "passage: ",
    "query_prefix": "query: "
  }
}

The turbolab URL is normally set per-deployment via the gitignored .env (TURBOLAB_URL), which overrides config.json — so a private LAN address never lands in the repo.
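
The override order can be pictured like this. It is a minimal sketch that assumes run.sh sources .env so that TURBOLAB_URL is present in the process environment; resolve_turbolab_url is a hypothetical helper, not a function from turbolab.py, and the LAN hostname is invented.

```python
import json
import os

def resolve_turbolab_url(config_text, environ=os.environ):
    """Prefer TURBOLAB_URL from the environment, else fall back to config.json."""
    config = json.loads(config_text)
    configured = config.get("turbolab", {}).get("url")
    return environ.get("TURBOLAB_URL") or configured

CONFIG = '{"turbolab": {"url": "http://localhost:7860"}}'

# No environment variable: the committed default from config.json is used.
default_url = resolve_turbolab_url(CONFIG, environ={})

# With the variable set (hypothetical LAN host), it wins over config.json.
override_url = resolve_turbolab_url(
    CONFIG, environ={"TURBOLAB_URL": "http://turbolab.lan:7860"}
)
```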

| Setting | Default | Description |
|---|---|---|
| papers_per_interest | 25 | Papers kept per interest per fetch |
| recent_days | 7 | Look-back window in days (0 = all time) |
| fetch_multiplier | 3 | Over-fetch, then keyword-rank, then trim |
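
To make the "over-fetch, keyword-rank, trim" idea concrete, here is one way it could work. The scoring rule (counting whole-word keyword hits in title plus abstract) and the helper names are assumptions for illustration; the real ranking in fetch.py may differ.

```python
import re
from collections import Counter

def keyword_score(paper, keywords):
    """Count whole-word keyword occurrences in the title and abstract."""
    text = f"{paper['title']} {paper['abstract']}".lower()
    words = Counter(re.findall(r"[a-z0-9]+", text))
    return sum(words[kw.lower()] for kw in keywords)

def select_papers(candidates, keywords, papers_per_interest=25):
    """Rank the over-fetched candidates by score and keep the top N."""
    ranked = sorted(candidates, key=lambda p: keyword_score(p, keywords), reverse=True)
    return ranked[:papers_per_interest]

# With the defaults, the fetch would ask arXiv for 25 * 3 candidates.
papers_per_interest, fetch_multiplier = 25, 3
fetch_size = papers_per_interest * fetch_multiplier

keywords = ["efficient", "edge", "quantization", "distillation"]
candidates = [
    {"title": "A survey of galaxies", "abstract": "No relevant terms here."},
    {"title": "Distillation basics", "abstract": "Knowledge distillation explained."},
    {"title": "Quantization for edge devices", "abstract": "Efficient int8 quantization."},
]
kept = select_papers(candidates, keywords, papers_per_interest=2)
```

Matching on whole words rather than substrings avoids accidental hits such as "edge" inside "knowledge".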

arXiv query syntax: combine category codes with OR/AND, e.g. cat:cs.LG OR cat:cs.AI. See arXiv's category taxonomy for the full list of codes.


🔧 Self-hosted deployment (Proxmox LXC)

From your Proxmox host:

bash <(curl -sL https://raw.githubusercontent.com/usr-wwelsh/Research-Digest/main/create-lxc.sh)

This creates a Debian LXC, installs Caddy + cloudflared, sets up the venv, configures the weekly cron (Monday 8am), and serves on :8080. After it finishes, set TURBOLAB_URL in /opt/research-digest/.env, edit config.json, and run sudo -u www-data /opt/research-digest/run.sh.

Idle footprint is ~50–80MB (Caddy + cloudflared) — the heavy ML lives in turbolab on another host, so the digest container stays tiny.


📂 Project structure

research-digest/
├── config.json          # interests, settings, turbolab block
├── db.py                # SQLite schema + data access (source of truth)
├── fetch.py             # arXiv ingest (backoff, upsert, --backfill)
├── summarize.py         # turbolab summaries
├── embed.py             # turbolab e5 embeddings
├── relate.py            # nearest-neighbour "related papers"
├── render.py            # static site builder (atomic, refuses empty)
├── migrate_from_html.py # one-time v1 backlog salvage
├── turbolab.py          # turbolab client (chat + embeddings)
├── templates/           # Jinja2 templates (autoescaped)
├── run.sh               # pipeline runner (cron entrypoint)
├── setup.sh             # LXC bootstrap
├── create-lxc.sh        # Proxmox LXC creator
├── Caddyfile            # static file server
└── digest.db            # the corpus (gitignored)

🛠️ Requirements

  • Python 3.9+ · deps: requests, jinja2, numpy (no torch — the ML is in turbolab)
  • A reachable turbolab server
  • Internet only for the fetch stage

📝 License

MIT — see LICENSE.


🙏 Acknowledgments

  • arXiv for the open research repository
  • turbolab for self-hosted, OpenAI-compatible inference

Built for researchers who want to stay current — on their own hardware, with no gatekeepers.

