wh1le/better-web

Better Web

Currently in alpha. Code quality is rough because much of the code was AI-generated. The vibe-coded MVP proved useful, so I am working on a full refactor, MCP, and a better CLI in the dev branch.

Terminal-first web research tool. Search the web, scrape pages, score content quality, filter out junk — get clean markdown ready for LLM consumption.

Why

Search engines increasingly return SEO spam and low-quality content. LLM-powered search tools often hallucinate or give shallow answers. A simple research question shouldn't mean 25+ open tabs just to find a few good sources.

better-web automates the entire research workflow: query a private search engine, scrape results, score quality using multiple signals (domain reputation, AI detection, readability, semantic relevance), filter out the noise, and return focused, clean markdown — all in one command.

No GPU required; it runs on modest hardware. The only ML model used is a small sentence-transformer (~80MB) for relevance scoring.

Prerequisites

  • Python 3.12+
  • A local SearXNG instance for search queries

Quick SearXNG setup with Docker:

docker run -d --name searxng -p 8882:8080 searxng/searxng

Setup

With Nix (recommended):

nix develop && poetry install

With pipx (isolated install):

pipx install git+https://github.com/wh1le/better-web.git
playwright install chromium

Without Nix:

pip install poetry
poetry install
playwright install chromium

Configure your SearXNG URL in config.yaml under searx_engine (default http://localhost:8882/search).
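
To sanity-check the configured endpoint before running bw, you can build a search URL by hand. This is a minimal sketch, not better-web's own code: the helper name `build_search_url` is hypothetical, and it assumes your SearXNG instance has the JSON output format enabled in its settings.

```python
from urllib.parse import urlencode

# Default endpoint from config.yaml (searx_engine).
SEARX_ENGINE = "http://localhost:8882/search"


def build_search_url(query: str, base: str = SEARX_ENGINE) -> str:
    """Return a SearXNG search URL that asks for JSON results.

    Hypothetical helper for illustration only; `q` and `format` are
    standard SearXNG query parameters.
    """
    return f"{base}?{urlencode({'q': query, 'format': 'json'})}"


print(build_search_url("rust async runtime"))
```

Fetching the printed URL (with curl or `urllib.request.urlopen`) should return a JSON document containing a `results` list if the instance is up. A 403 response typically means the JSON format is not enabled on the instance.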

Usage

bw search "query"                     # search + scrape + score + copy
bw search "q1" "q2" --limit 20        # multi-query batch
bw search --quick "query"             # snippets only, no scraping
bw scrape "https://example.com"       # single URL to stdout
bw digest --raw                       # re-export latest research
bw preview                            # render page as clean markdown
bw update-blocklist                   # refresh domain blocklists
bin/explore                           # fzf picker -> preview in editor
bin/agent                             # fzf picker -> copy/claude

Scoring

Every page gets a score from 0 to 100 based on:

| Signal | Tool | What |
| --- | --- | --- |
| Domain reputation | tranco | Top-1M ranking, boost only (unranked = neutral) |
| Domain heuristics | tldextract | Junk TLDs, hyphen stuffing, SEO keywords, year in name |
| AI detection | zippy | Compression-based, no ML models, no API keys |
| Readability | textstat | Flesch Reading Ease, grade level |
| Relevance | sentence-transformers | Cosine similarity between query and content |
| HTML structure | built-in | Code blocks, comments, link density, nav ratio, ad scripts |
| Text heuristics | built-in | Keyword stuffing, repetitive bigrams, slop phrases, thin content |
| Content dedup | datasketch | MinHash LSH, removes near-duplicate pages |
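
Two of these signals, relevance and dedup, can be illustrated with the standard library alone. The sketch below is not the project's implementation. It assumes embeddings are already available as plain lists of floats (in better-web they would come from the sentence-transformer), and it replaces datasketch with a hand-rolled MinHash that compares signatures directly, leaving out the LSH bucketing step. All function names are hypothetical.

```python
import hashlib
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shingles(text: str, k: int = 3) -> set[str]:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching slots approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
print(estimated_jaccard(a, b))
```

Two pages whose estimated Jaccard similarity exceeds some threshold would be treated as near-duplicates. LSH adds an index on top of these signatures so that candidate pairs can be found without comparing every page against every other.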

Pages below min_quality_score (default 30) are filtered out. Remaining pages are sorted best-first and tier-labeled (HIGH/MED/LOW).
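
The filter-and-rank step can be sketched as follows. Only the default threshold of 30 comes from this README. The tier cutoffs, the `Page` type, and the function names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

MIN_QUALITY_SCORE = 30  # default from config.yaml (min_quality_score)

# Hypothetical tier cutoffs; the README does not state the real ones.
HIGH_CUTOFF = 70
MED_CUTOFF = 50


@dataclass
class Page:
    url: str
    score: float  # 0-100 quality score


def tier(score: float) -> str:
    """Map a score to a tier label using the assumed cutoffs."""
    if score >= HIGH_CUTOFF:
        return "HIGH"
    if score >= MED_CUTOFF:
        return "MED"
    return "LOW"


def rank(pages: list[Page], min_score: float = MIN_QUALITY_SCORE) -> list[tuple[str, Page]]:
    """Drop pages below min_score, sort best-first, and attach a tier label."""
    kept = [p for p in pages if p.score >= min_score]
    kept.sort(key=lambda p: p.score, reverse=True)
    return [(tier(p.score), p) for p in kept]


for label, page in rank([Page("a.example", 25), Page("b.example", 80), Page("c.example", 55)]):
    print(label, page.score, page.url)
```

A page scoring exactly 30 is kept here, since only pages below the threshold are dropped.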

Config

config.yaml — SearXNG URL, scrape timing, quality thresholds, blocklist sources. Static lists (TLDs, blocked domains, AI phrases) live in data/*.txt.

TODO

  • Support XDG configuration path at ~/.config/bw

License

MIT

About

Searches the web, scrapes, and turns noisy results into clean markdown ready for LLMs. (AI detection, MinHash, dedup, semantic relevance scoring)
