Skip to content

Filter on parameters.site#388

Merged
eyeseast merged 2 commits into
masterfrom
387-filter-site
May 5, 2026
Merged

Filter on parameters.site#388
eyeseast merged 2 commits into
masterfrom
387-filter-site

Conversation

@eyeseast
Copy link
Copy Markdown
Contributor

  • Filter on parameters.site, with noop if key is missing
  • Add a partial index

Closes #387

This will mostly apply to Klaxon and Scraper, but other add-ons that use site can use it, too. The index is partial, so it only covers addons with site and ignores others.

Comment thread documentcloud/addons/views.py
@duckduckgrayduck
Copy link
Copy Markdown
Contributor

I don't have Add-Ons/events/runs, certainly not enough to test the filter, so pushing this to staging first is the right move for testing before opening another PR to merge into prod

@eyeseast eyeseast mentioned this pull request Apr 29, 2026
@eyeseast
Copy link
Copy Markdown
Contributor Author

Indexing strategy review: B-tree vs GIN

TL;DR: the B-tree partial expression index is the right call here. GIN would not help this query and would cost more.

Why B-tree fits

I verified the SQL Django generates for parameters__site="x":

WHERE (parameters -> 'site') = '"..."'::jsonb

The index is on exactly that expression (parameters -> 'site') with WHERE parameters ? 'site' — a direct match, so the planner will use it for equality and IN (...). Single known JSON key, high-cardinality string values (URLs): textbook B-tree case.

Why GIN does not help this query

A GIN index on the whole parameters field (default jsonb_ops) accelerates @>, ?, ?|, ?&, @?, @@. It does not accelerate parameters -> 'site' = X. To benefit from GIN you'd have to rewrite the filter as Q(parameters__contains={"site": value}) (compiles to parameters @> '{...}'::jsonb), and even then GIN is slower than B-tree for plain equality on a single known key.

When to revisit

Switch to GIN if any of these become true:

  1. Multiple JSON keys need filtering (selector, categories, etc.) — one GIN covers all keys; otherwise it's one B-tree per key.
  2. We move to containment queries.
  3. We need key-existence queries (parameters ? 'foo') at query time.
  4. Substring/prefix search on values (would also need pg_trgm).

Tradeoff summary

B-tree partial (current) GIN (jsonb_ops) GIN (jsonb_path_ops)
Size Smallest (partial, single key) Largest (often 2–10× B-tree) Smaller than jsonb_ops
Write cost Low High Medium
Equality on one key ✅ Fast ❌ Not directly usable ❌ Not directly usable
@> containment
Key existence ?
ORDER BY / range
Multi-key flexibility ❌ One per key ✅ (containment only)

Minor notes (not index-related)

  • The AddOnRun.site filter joins through event__parameters__site. The index on AddOnEvent still helps on the inner side of the join, but worth a quick EXPLAIN ANALYZE against production-sized data to confirm the planner picks an index nested-loop rather than hashing over a seq scan.
  • AddIndexConcurrently with atomic = False is the right pattern for production safety. 👍

@eyeseast
Copy link
Copy Markdown
Contributor Author

Here's what this query looks like on staging: https://api.staging.documentcloud.org/api/addon_runs/?addon=48&expand=~all&site=https://www.muckrock.com

Note that if you take away the addon parameter, you get a bunch of runs for add-ons that don't have the site key at all, and so they match by default. That's fine in this case. We can tighten this later if we want. The intended use is for filtering add-on runs for Klaxon.

@duckduckgrayduck
Copy link
Copy Markdown
Contributor

I will just note that we may want the Klaxon front end to normalize url, like muckrock.com doesn't return any results while www.muckrock.com would. Some domains don't have a www, so this may be nuanced. This is why icontains may be necessary in the API itself.

@mitchelljkotler
Copy link
Copy Markdown
Member

I will just note that we may want the Klaxon front end to normalize url, like muckrock.com doesn't return any results while www.muckrock.com would. Some domains don't have a www, so this may be nuanced. This is why icontains may be necessary in the API itself.

muckrock.com and www.muckrock.com are technically different URLs. I don't think it should try to normalize them. Substring matching may still make sense though if that is the behavior we want.

@eyeseast
Copy link
Copy Markdown
Contributor Author

eyeseast commented May 5, 2026

I think the place we're going to have trouble is URLs with query params, especially for stuff like analytics. If I mean to watch https://www.muckrock.com/news/ and I happen to be on https://www.muckrock.com/news/?ref=bsky.com, I need to have that normalized on the frontend so I capture the most correct and consistent URL. I don't think we want to second-guess that on the API side, but we do need the UI to support it on the frontend.

@eyeseast eyeseast merged commit e24e11f into master May 5, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filter add-on runs and events based on paramenters.site

3 participants