HackerNoon Scraper

HackerNoon Scraper collects structured articles from HackerNoon categories so you can analyze tech stories, authors, and engagement at scale. It turns long-form posts into clean JSON records for research, trend tracking, and content intelligence. If you need a reliable HackerNoon scraper for category-based discovery, this project keeps the workflow simple and repeatable.


Telegram   WhatsApp   Gmail   Website

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a hackernoon-scraper, you've just found your team. Let’s Chat. 👆👆

Introduction

This project extracts HackerNoon articles and their rich metadata (content, author details, images, and engagement metrics) into a consistent, developer-friendly output. It solves the problem of manually browsing categories and copying content by automating collection and normalization into a single dataset format. It’s built for developers, analysts, marketers, and researchers who need structured HackerNoon data for reporting, monitoring, or downstream pipelines.

Category-Based Article Intelligence

  • Scrapes articles by category (e.g., AI, Programming, Finance, Web3, Top Stories).
  • Captures full article text plus summaries (excerpt and TL;DR) where available.
  • Enriches each record with author profile attributes and trust/brand flags.
  • Includes engagement signals (comments and page views) for prioritization and ranking.
  • Outputs clean JSON suitable for dashboards, search indexing, or ML pipelines.

Features

| Feature | Description |
| --- | --- |
| Multi-category scraping | Collect articles from supported categories with a single configuration. |
| Full content extraction | Retrieves the full `articleBody` along with `excerpt` and TL;DR when present. |
| Author enrichment | Captures author name, handle, avatar, bio, and brand/trust indicators. |
| Engagement metrics | Includes `commentsCount` and `pageViews` for popularity and performance analysis. |
| Media capture | Extracts the main image, its dimensions, and the social preview image for content previews. |
| Configurable limits | Control maximum records per run for faster iterations and predictable output size. |
| Clean JSON output | Produces normalized records ready for storage, analytics, or integration. |
| Safe defaults | Sensible defaults for category selection and limits to avoid oversized runs. |

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| id | Unique article identifier. |
| title | Article headline/title. |
| slug | URL slug for the article. |
| link | Full canonical URL to the article. |
| excerpt | Short summary/preview text. |
| tldr | “Too Long; Didn’t Read” summary when available. |
| articleBody | Full article content/body text. |
| createdAt | Publication timestamp/date-time. |
| parentCategory | Primary category the article belongs to. |
| tags | Array of topic tags associated with the article. |
| commentsCount | Total number of comments on the article. |
| pageViews | Estimated reads/page views value. |
| arweave | Content reference identifier if present. |
| mainImage | Main article image URL. |
| mainImageHeight | Main image height in pixels. |
| mainImageWidth | Main image width in pixels. |
| socialPreviewImage | Social sharing preview image URL. |
| author_name | Author display name. |
| author_handle | Author username/handle. |
| author_avatar | Author profile avatar URL. |
| author_bio | Author biography text. |
| author_isBrand | Indicates if the author account represents a brand. |
| author_isTrusted | Indicates if the author is verified/trusted. |
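Because every record carries this full field set even when the page omits optional values, the schema can be sketched as a null-safe projection. The field names come from the table above; the helper itself is an illustrative sketch, not the project's actual normalizer.

```python
# Sketch: null-safe record construction for the stable schema above.
# Field names come from this README; the helper is illustrative only.

ARTICLE_FIELDS = [
    "id", "title", "slug", "link", "excerpt", "tldr", "articleBody",
    "createdAt", "parentCategory", "tags", "commentsCount", "pageViews",
    "arweave", "mainImage", "mainImageHeight", "mainImageWidth",
    "socialPreviewImage", "author_name", "author_handle", "author_avatar",
    "author_bio", "author_isBrand", "author_isTrusted",
]

def to_record(raw: dict) -> dict:
    """Project raw parser output onto the stable schema, null-filling gaps."""
    record = {field: raw.get(field) for field in ARTICLE_FIELDS}
    record["tags"] = raw.get("tags") or []            # keep tags an array
    record["commentsCount"] = raw.get("commentsCount") or 0
    return record
```

A partial input such as `{"id": "x", "title": "t"}` still yields all 23 keys, so downstream consumers never hit a missing-field error.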

Example Output

```json
[
  {
    "id": "rxhvxiLNxsRwnMGivMFc",
    "title": "The Future of AI in Healthcare",
    "slug": "the-future-of-ai-in-healthcare",
    "link": "https://hackernoon.com/the-future-of-ai-in-healthcare",
    "excerpt": "Exploring how AI is revolutionizing medical diagnosis...",
    "tldr": "AI is transforming healthcare through improved diagnostics and personalized treatment.",
    "articleBody": "Full article content here...",
    "createdAt": "2023-09-30T21:36:56.367Z",
    "parentCategory": "ai",
    "tags": [
      "artificial-intelligence",
      "healthcare",
      "machine-learning"
    ],
    "commentsCount": 15,
    "pageViews": 348080,
    "arweave": "QuKn6Hew8wrwpJ9Zt0OFoeVt5yQwBQyZf30TtejOOno",
    "mainImage": "https://hackernoon.imgix.net/images/...",
    "mainImageHeight": 1024,
    "mainImageWidth": 1536,
    "socialPreviewImage": "https://hackernoon.imgix.net/images/...",
    "author_name": "Dr. Sarah Johnson",
    "author_handle": "sarahj_ai",
    "author_avatar": "https://cdn.hackernoon.com/images/...",
    "author_bio": "AI researcher and healthcare innovation expert",
    "author_isBrand": false,
    "author_isTrusted": true
  }
]
```
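Consuming this output is a one-liner with the standard library. The snippet below ranks records by `pageViews`; the field names match the example output, while the two inline records are made-up stand-ins for a real output file.

```python
import json

# Minimal sketch: consume the scraper's JSON output and rank by pageViews.
# Field names match the example output above; the inline data is made up.
raw = '''[
  {"title": "A", "pageViews": 348080, "commentsCount": 15},
  {"title": "B", "pageViews": 12000, "commentsCount": 40}
]'''

articles = json.loads(raw)
top = sorted(articles, key=lambda a: a.get("pageViews", 0), reverse=True)
print([a["title"] for a in top])  # most-viewed articles first
```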

Directory Structure Tree

```
HackerNoon Scraper/
├── src/
│   ├── main.py
│   ├── runner.py
│   ├── cli.py
│   ├── config/
│   │   ├── categories.json
│   │   ├── settings.example.json
│   │   └── settings.py
│   ├── core/
│   │   ├── http_client.py
│   │   ├── rate_limiter.py
│   │   ├── logger.py
│   │   └── errors.py
│   ├── extractors/
│   │   ├── category_listing.py
│   │   ├── article_parser.py
│   │   ├── author_parser.py
│   │   └── media_parser.py
│   ├── normalizers/
│   │   ├── article_normalizer.py
│   │   └── text_cleaner.py
│   ├── outputs/
│   │   ├── exporters.py
│   │   ├── json_writer.py
│   │   └── schema.py
│   └── utils/
│       ├── dates.py
│       ├── validators.py
│       └── strings.py
├── data/
│   ├── input.sample.json
│   └── output.sample.json
├── tests/
│   ├── test_article_parser.py
│   ├── test_normalizer.py
│   └── test_validators.py
├── scripts/
│   ├── run_local.sh
│   └── smoke_test.py
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
```

Use Cases

  • Growth marketers use it to track category trends and top-performing stories, so they can spot topics worth writing or sponsoring sooner.
  • Data analysts use it to build engagement dashboards from pageViews and commentsCount, so they can rank content by impact instead of guesswork.
  • Researchers use it to collect full article text across AI/Web3/Finance, so they can run NLP, clustering, and sentiment studies at scale.
  • Content teams use it to benchmark authors and brands across categories, so they can identify credible contributors and collaboration targets.
  • Developers use it to feed search indexes or internal knowledge bases, so they can enable fast discovery and reuse of tech writing.
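The author-benchmarking use case reduces to a simple aggregation over `author_handle` and `pageViews` (both from the schema above). This is an illustrative sketch with made-up data, not part of the project itself.

```python
from collections import defaultdict

# Sketch for the "benchmark authors" use case: total engagement per
# author_handle. Field names match the schema; the data is made up.
articles = [
    {"author_handle": "sarahj_ai", "pageViews": 348080},
    {"author_handle": "sarahj_ai", "pageViews": 52000},
    {"author_handle": "dev_kim", "pageViews": 9000},
]

totals = defaultdict(int)
for a in articles:
    totals[a["author_handle"]] += a.get("pageViews", 0)

# Highest-impact contributors first
leaderboard = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The same pattern extends to `commentsCount`, per-category averages, or `author_isTrusted` filters.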

FAQs

Q: How do I choose a category and limit the number of articles? Set category to a supported value (e.g., AI, Programming, Finance, Web3, or Top Stories) and use max_posts to cap the output size. A practical range is 50–500 records to balance speed and completeness.

Q: Why does “Top Stories” return fewer articles than other categories? Top Stories typically takes longer to process due to heavier rendering and ranking logic. To keep runs predictable, the scraper enforces a stricter cap (commonly around 150) even if max_posts is higher.

Q: What happens if an article is missing fields like TL;DR or images? The scraper keeps a stable schema and returns empty or null-safe values for optional fields. This prevents downstream pipelines from breaking while still preserving everything that is present on the page.

Q: Can I store outputs for later analysis and avoid duplicates? Yes—store results by id (and optionally slug) as unique keys. When re-running, deduplicate using id and update engagement fields (like pageViews) if you want freshness over time.
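The deduplication strategy from the answer above can be sketched as an upsert keyed on `id` that refreshes engagement fields on re-runs. The in-memory dict is a stand-in; a real pipeline would use a database keyed the same way.

```python
# Sketch of the FAQ's dedup strategy: key storage on `id` and refresh
# engagement fields (pageViews, commentsCount) on re-runs. A plain dict
# stands in for a real datastore here.
store: dict[str, dict] = {}

def upsert(record: dict) -> None:
    existing = store.get(record["id"])
    if existing is None:
        store[record["id"]] = record
    else:
        # keep the original record but refresh engagement signals
        for field in ("pageViews", "commentsCount"):
            if field in record:
                existing[field] = record[field]

upsert({"id": "a1", "title": "T", "pageViews": 100})
upsert({"id": "a1", "pageViews": 150})  # re-run: same article, fresher views
```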


Performance Benchmarks and Results

Primary Metric: Typical throughput of ~35–70 articles/min on category pages, depending on media density and the amount of content per article.

Reliability Metric: ~97–99% successful record creation when running with conservative pacing and retry logic; most failures are transient network/render errors.

Efficiency Metric: Stable memory usage for long runs by streaming normalized records to the output writer instead of holding full pages in memory.

Quality Metric: High completeness for core fields (title, link, createdAt, articleBody, author_handle) with consistent normalization of tags and category labeling across runs.


Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery. Bitbash nailed it."

Syed
Digital Strategist
★★★★★
