Enhance LinkVet relevance filtering logic #22

@robertmorgan-se

Description

Problem

LinkVet currently relies heavily on HTTP HEAD requests to check MIME types. Sites that block HEAD requests cause many valid links to be discarded, so valuable content is missed. MIME-type checking alone is also insufficient for judging relevance, especially on mixed-content pages (text with embedded videos).

Goals

  • Reduce false negatives (relevant links that get discarded).
  • Improve filtering accuracy by leveraging page content and structured domain allow-/deny-lists.
  • Provide better coverage for mixed-content resources.

Proposed Enhancements

  • Domain-based filtering
    Expand and maintain a structured allow-/deny-list of content domains, with trusted sources (e.g., khanacademy.org, docs.microsoft.com, wikipedia.org) categorized by content type (video, article, forum). By default, only known educational sources are considered.

  • Content-based relevance heuristics
    Improve relevance by inspecting actual content instead of relying solely on MIME types:

    • For non-YouTube pages:

      • Perform a partial GET (roughly the first 32 KB).
      • Extract the <title>, meta description, and visible <h1>/<p> content.
      • Compare against the lesson title and tags using a fuzzy matching algorithm (e.g., Jaro-Winkler or cosine similarity on tokenized phrases).
      • Discard links where similarity falls below a defined threshold (e.g., 0.75), or where the page contains mostly ads, service offerings, or lists without instructional depth.
    • For YouTube videos:

      • Call the YouTube Data API to fetch video metadata.
      • Only keep videos that:
        • Are 3–35 minutes long
        • Have human-authored captions (not auto-generated)
        • Have titles or descriptions that closely match lesson keywords
      • Optionally verify topic fit by checking the top-level YouTube category.
  • Fallback on HEAD failures
    When a site blocks HEAD requests or omits the Content-Type header, default to partial GET with the above heuristics instead of rejecting the link outright.
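
A minimal sketch of the domain lookup and the non-YouTube content check described above, in Python with only the standard library. It is an illustration under stated assumptions, not LinkVet's actual code: the names `DOMAIN_RULES`, `lookup_domain`, `fetch_partial`, `page_similarity`, and `is_relevant` are hypothetical, the rule table holds only the example domains from this issue, and cosine similarity over token counts stands in for whichever fuzzy matcher is eventually chosen. The fallback path would call `fetch_partial` when HEAD fails or returns no Content-Type.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Hypothetical structured list: domain -> (verdict, content type).
DOMAIN_RULES = {
    "khanacademy.org": ("allow", "video"),
    "wikipedia.org": ("allow", "article"),
    "docs.microsoft.com": ("allow", "article"),
}
PARTIAL_BYTES = 32 * 1024    # "first 32KB or so" from the proposal
SIMILARITY_THRESHOLD = 0.75  # example threshold from the proposal


def lookup_domain(url):
    """Return the rule for the URL's host or one of its parent domains."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    for i in range(len(parts) - 1):
        rule = DOMAIN_RULES.get(".".join(parts[i:]))
        if rule:
            return rule
    return None


def fetch_partial(url, timeout=10):
    """Partial GET: request the first PARTIAL_BYTES via a Range header and
    cap the read as well, since servers may ignore Range."""
    req = Request(url, headers={"Range": f"bytes=0-{PARTIAL_BYTES - 1}"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read(PARTIAL_BYTES).decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Collects text inside <title>, <h1>, <p> plus the meta description."""
    KEEP = {"title", "h1", "p"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._depth += 1
        elif tag == "meta":
            a = dict(attrs)
            if (a.get("name") or "").lower() == "description":
                self.chunks.append(a.get("content") or "")

    def handle_endtag(self, tag):
        if tag in self.KEEP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words token counts of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def page_similarity(html, lesson_title, tags):
    """Score the extracted page text against the lesson title and tags."""
    parser = _TextExtractor()
    parser.feed(html)
    page_text = " ".join(parser.chunks)
    return cosine_similarity(page_text, " ".join([lesson_title, *tags]))


def is_relevant(html, lesson_title, tags, threshold=SIMILARITY_THRESHOLD):
    return page_similarity(html, lesson_title, tags) >= threshold
```

One design caveat: a long page compared against a short lesson title tends to score low under plain cosine similarity, so the 0.75 threshold would likely need tuning against real links before it is used to discard anything.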

Expected Outcomes

  • Higher relevance rate for accepted links.
  • Improved learner experience through better-quality content recommendations.

Metadata

Labels

bug (Something isn't working), enhancement (New feature or request), help wanted (Extra attention is needed), type: roadmap (Items for platform roadmap and long-term planning)