feat: Playwright fallback extractor for JS-heavy URL ingestion #1392

Merged
SorraTheOrc merged 4 commits into main from copilot/ob-0mnht5h0070el7-playwright-fallback-retrieval
Apr 5, 2026

Copilot AI commented Apr 4, 2026

JS-rendered pages return empty content from the fast-path HTTP extractor, causing ob add <url> to ingest nothing useful. This adds an opt-in Playwright fallback that fires only when primary extraction falls below a configurable content-length threshold.

New modules

  • src/lib/ingestion/extractor.ts: Extractor interface + ExtractResult type shared by all extractors
  • src/lib/ingestion/extractor-fetch.ts: Fast-path primary extractor (fetch + HTML-to-text strip)
  • src/lib/ingestion/extractor-playwright.ts: PlaywrightExtractor (headless Chromium/Firefox/WebKit, fresh context per run to avoid credential leakage, configurable timeout, graceful degradation if playwright is absent or launch fails)
  • src/lib/ingestion/service.ts: IngestionService orchestrating primary → fallback with structured telemetry on every invocation
  • src/cli/commands/add.ts: ob add <url> CLI entry point
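
For orientation, the shared contract in extractor.ts might look roughly like the sketch below. The PR description does not show the real definitions, so this is an assumption-laden illustration: only the `text` field is confirmed (it appears as primaryResult.text), while `url`, `name`, the `extract` method name, and the `StaticExtractor` test double are hypothetical.

```typescript
// Hypothetical shapes; only `text` is confirmed by the PR description.
interface ExtractResult {
  url: string;  // assumed field: the URL that was fetched
  text: string; // extracted plain text
}

interface Extractor {
  readonly name: string;                        // assumed field
  extract(url: string): Promise<ExtractResult>; // assumed method name
}

// A trivial in-memory extractor (hypothetical), handy as a test double.
class StaticExtractor implements Extractor {
  readonly name = 'static';
  private readonly pages: Map<string, string>;

  constructor(pages: Map<string, string>) {
    this.pages = pages;
  }

  async extract(url: string): Promise<ExtractResult> {
    // Unknown URLs yield empty text, mimicking a page with no static content.
    return { url, text: this.pages.get(url) ?? '' };
  }
}
```

Keeping the interface this small means the fetch and Playwright extractors can be swapped or stubbed without the service knowing which one it holds.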

Opt-in / threshold behaviour

// Config-driven
const svc = new IngestionService(primary, playwright, { playwrightFallback: true, minContentLength: 200 });

// Or via environment variable (no config change needed)
// OB_PLAYWRIGHT_FALLBACK=1 ob add https://js-heavy-site.example.com

Fallback only runs when primaryResult.text.length < minContentLength (default 200). Without the flag, no browser process is ever spawned.
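
The trigger rule above can be sketched as a small pure function. This is a minimal illustration, not the code in service.ts: the function name `shouldRunFallback`, the `FallbackConfig` interface, and the way the config flag and environment variable are combined (either one enables the fallback, and only the value `1` counts) are assumptions.

```typescript
// Hypothetical config shape mirroring the options shown in the PR description.
interface FallbackConfig {
  playwrightFallback?: boolean;
  minContentLength?: number;
}

// `env` is passed in explicitly (instead of reading process.env) so the
// decision stays easy to test. Treating only "1" as enabled is an assumption.
function shouldRunFallback(
  primaryTextLength: number,
  config: FallbackConfig = {},
  env: Record<string, string | undefined> = {},
): boolean {
  const enabled =
    config.playwrightFallback === true || env['OB_PLAYWRIGHT_FALLBACK'] === '1';
  const minContentLength = config.minContentLength ?? 200; // documented default
  return enabled && primaryTextLength < minContentLength;
}
```

With the flag off the function returns false for any length, which matches the statement that no browser process is spawned unless the user opts in.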

Dependency

playwright is declared as an optional peer dependency, so existing installs are unaffected. The import is guarded by a dynamic import() at runtime; if the package is missing, a warning is logged and the primary result is returned.
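
A guarded dynamic import of this kind could look like the sketch below. It is an illustration under stated assumptions: `loadOptionalModule`, the `WarnLogger` interface, and the injectable `importer` parameter are hypothetical names, and the warning text is invented for the example.

```typescript
// Hypothetical minimal logger contract.
interface WarnLogger {
  warn(message: string): void;
}

type Importer = (specifier: string) => Promise<object>;

// Tries to load an optional package at runtime. Returns null (after logging
// a warning) instead of throwing when the import fails for any reason.
async function loadOptionalModule(
  specifier: string,
  logger: WarnLogger,
  importer: Importer = (s) => import(s),
): Promise<object | null> {
  try {
    return await importer(specifier);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    logger.warn(`Optional dependency "${specifier}" is unavailable: ${reason}`);
    return null;
  }
}
```

A caller would treat a null result as "keep the primary extraction result", so a missing package degrades the feature rather than failing the command.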

Testing

The 47 unit tests use constructor-injected stub browser objects, so they need no real browser and no network access. PlaywrightExtractor's loadPlaywrightFn is injectable via the constructor, which sidesteps the limitations of spying on ESM modules.

const extractor = new PlaywrightExtractor(
  { browser: 'chromium', timeoutMs: 5000 },
  logger,
  async () => fakeBrowserModule   // injected stub — no real playwright import
);
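
To make the stub idea concrete, here is one way a fake browser module and a fresh-context-per-run routine might be written. This is a sketch, not the PR's test code: the `*Like` interfaces are a hand-picked structural subset of the Playwright API (launch, newContext, newPage, goto, innerText, close), and `makeFakeBrowserModule` and `renderText` are hypothetical helpers.

```typescript
// Minimal structural subset of the Playwright API used by this sketch.
interface PageLike {
  goto(url: string, options?: { timeout?: number }): Promise<unknown>;
  innerText(selector: string): Promise<string>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>;
  close(): Promise<void>;
}
interface BrowserModuleLike {
  chromium: { launch(): Promise<BrowserLike> };
}

// Hypothetical stub: serves fixed body text and counts opens and closes.
function makeFakeBrowserModule(bodyText: string) {
  const counters = { contextsOpened: 0, contextsClosed: 0, browsersClosed: 0 };
  const fakeBrowserModule: BrowserModuleLike = {
    chromium: {
      launch: async () => ({
        newContext: async () => {
          counters.contextsOpened += 1;
          return {
            newPage: async () => ({
              goto: async () => null,
              innerText: async () => bodyText,
            }),
            close: async () => { counters.contextsClosed += 1; },
          };
        },
        close: async () => { counters.browsersClosed += 1; },
      }),
    },
  };
  return { fakeBrowserModule, counters };
}

// One run = one fresh context, always closed, even if navigation throws.
async function renderText(
  mod: BrowserModuleLike,
  url: string,
  timeoutMs: number,
): Promise<string> {
  const browser = await mod.chromium.launch();
  try {
    const context = await browser.newContext();
    try {
      const page = await context.newPage();
      await page.goto(url, { timeout: timeoutMs });
      return await page.innerText('body');
    } finally {
      await context.close();
    }
  } finally {
    await browser.close();
  }
}
```

Because the counters live on the stub, a test can assert that every run opened and closed exactly one context without starting a browser.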

Telemetry shape emitted on every ingest() call:

{
  event: 'playwright_fallback',
  triggered: boolean,
  primaryContentLength: number,
  fallbackContentLength: number,
  durationMs: number,
  success: boolean,
  errorType: 'launch_failed' | 'timeout' | 'navigation_error' | null,
  provider: 'playwright'
}
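
The shape above translates directly into a TypeScript type. The builder below is a hypothetical helper added for illustration; in particular, reporting `fallbackContentLength` as 0 when the fallback did not run and deriving `success` from the absence of an error are assumptions, since the PR description does not define those cases.

```typescript
type FallbackErrorType = 'launch_failed' | 'timeout' | 'navigation_error' | null;

interface PlaywrightFallbackEvent {
  event: 'playwright_fallback';
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength: number;
  durationMs: number;
  success: boolean;
  errorType: FallbackErrorType;
  provider: 'playwright';
}

// Hypothetical input for the builder; optional fields get assumed defaults.
interface FallbackOutcome {
  triggered: boolean;
  primaryContentLength: number;
  durationMs: number;
  fallbackContentLength?: number;
  errorType?: FallbackErrorType;
}

function buildFallbackEvent(outcome: FallbackOutcome): PlaywrightFallbackEvent {
  const errorType = outcome.errorType ?? null;
  return {
    event: 'playwright_fallback',
    triggered: outcome.triggered,
    primaryContentLength: outcome.primaryContentLength,
    fallbackContentLength: outcome.fallbackContentLength ?? 0, // assumed default
    durationMs: outcome.durationMs,
    success: errorType === null, // assumption: success means no error recorded
    errorType,
    provider: 'playwright',
  };
}
```

Typing `errorType` as a closed union lets the compiler reject any error category that downstream telemetry consumers would not recognise.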

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh gh issue create --repo TheWizardsCode/ContextHub --title Sample --body-file - (http block)
    • Triggering command: /usr/bin/gh gh issue create --repo TheWizardsCode/ContextHub --title Sample --body-file - /wt k/ContextHub/Conrev-parse bash i/mo… show-toplevel /tmp/wl-wt-err-wo4xgR/local-repo/.worklog/tmp-worktree-W2QQZD/wt in/bash -z --show-toplevel i/mock-bin/bash bash (http block)

Copilot AI linked an issue Apr 4, 2026 that may be closed by this pull request
8 tasks
Copilot AI and others added 3 commits April 4, 2026 23:34
Implements:
- src/lib/ingestion/extractor.ts: Extractor interface + ExtractResult type
- src/lib/ingestion/extractor-fetch.ts: fast-path HTTP fetch extractor
- src/lib/ingestion/extractor-playwright.ts: PlaywrightExtractor (opt-in,
  dynamic import, fresh context per run, configurable timeout, graceful
  degradation)
- src/lib/ingestion/service.ts: IngestionService with primary + fallback
  orchestration, telemetry emission, OB_PLAYWRIGHT_FALLBACK env var support
- src/cli/commands/add.ts: 'ob add <url>' CLI command
- package.json: playwright as optional peer dependency
- 47 unit tests (no real browser required in CI)
- tests/fixtures/playwright-fallback/: HTML fixtures for documentation

Agent-Logs-Url: https://github.com/TheWizardsCode/ContextHub/sessions/511715b4-dc90-4fe1-8664-cd673580f894

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
- extractor-fetch.ts: fix bad-tag-filter (allow spaces before closing >) and
  double-escaping (use single-pass entity decoder)
- extractor-playwright.ts: replace globalThis any cast with typed interface,
  add explanatory comment for the optional-peer-dep import cast
- add-command.test.ts: rename s -> chunk for clarity

Agent-Logs-Url: https://github.com/TheWizardsCode/ContextHub/sessions/511715b4-dc90-4fe1-8664-cd673580f894

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
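
The two fixes named in this commit can be illustrated with a short sketch. It is not the code in extractor-fetch.ts: `decodeEntitiesOnce` and `htmlToText` are hypothetical names, the entity table is deliberately tiny (numeric entities are not handled), and regex-based stripping like this is suitable for text extraction only, not for sanitizing HTML.

```typescript
// Small named-entity table; numeric entities are out of scope for this sketch.
const NAMED_ENTITIES: Record<string, string> = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  apos: "'",
};

// Single pass: each entity is decoded once and the output is never rescanned,
// so "&amp;lt;" becomes "&lt;" rather than "<" (no double unescaping).
function decodeEntitiesOnce(input: string): string {
  return input.replace(/&([a-z]+);/gi, (match: string, entityName: string) => {
    return NAMED_ENTITIES[entityName.toLowerCase()] ?? match;
  });
}

function htmlToText(html: string): string {
  const withoutBlocks = html
    // Closing tags may carry whitespace or stray attributes, e.g. "</script >".
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, ' ')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, ' ');
  const withoutTags = withoutBlocks.replace(/<[^>]*>/g, ' ');
  return decodeEntitiesOnce(withoutTags).replace(/\s+/g, ' ').trim();
}
```

Decoding after tag removal, and only once, keeps escaped markup in the source as literal text in the output.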
- extractor-fetch.ts: use [^>]* in script/style closing tag regex to handle
  any attributes (fixes CodeQL bad-tag-filter); document numeric entity
  limitation
- add.ts: add parsePositiveInt helper with validation and user-friendly
  error messages for --min-content-length and --timeout options

Agent-Logs-Url: https://github.com/TheWizardsCode/ContextHub/sessions/511715b4-dc90-4fe1-8664-cd673580f894

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
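
A validation helper along the lines of parsePositiveInt might look like the sketch below. The real signature and error wording in add.ts are not shown in this PR, so the second parameter (`optionName`) and the messages here are assumptions.

```typescript
// Parses a CLI option value as a strictly positive integer.
// Throws an Error with a user-facing message on invalid input.
function parsePositiveInt(raw: string, optionName: string): number {
  const trimmed = raw.trim();
  // Digits only: rejects signs, decimals, exponents, and empty strings.
  if (!/^\d+$/.test(trimmed)) {
    throw new Error(`${optionName} must be a positive integer, got "${raw}"`);
  }
  const value = Number(trimmed);
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`${optionName} must be a positive integer, got "${raw}"`);
  }
  return value;
}
```

For example, `parsePositiveInt('200', '--min-content-length')` returns 200, while `'0'`, `'-5'`, `'1.5'`, and `'abc'` all throw.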
Copilot AI changed the title from "[WIP] Add Playwright fallback retrieval for OpenBrain" to "feat: Playwright fallback extractor for JS-heavy URL ingestion" Apr 4, 2026
Copilot AI requested a review from SorraTheOrc April 4, 2026 23:41
SorraTheOrc marked this pull request as ready for review April 5, 2026 23:52
SorraTheOrc merged commit 0d95378 into main Apr 5, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

OpenBrain Playwright Fallback Retrieval

2 participants