HackerNoon Scraper collects structured articles from HackerNoon categories so you can analyze tech stories, authors, and engagement at scale. It turns long-form posts into clean JSON records for research, trend tracking, and content intelligence. If you need a reliable HackerNoon scraper for category-based discovery, this project keeps the workflow simple and repeatable.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a hackernoon-scraper, you've just found your team — Let’s Chat. 👆👆
This project extracts HackerNoon articles and their rich metadata (content, author details, images, and engagement metrics) into a consistent, developer-friendly output. It solves the problem of manually browsing categories and copying content by automating collection and normalization into a single dataset format. It’s built for developers, analysts, marketers, and researchers who need structured HackerNoon data for reporting, monitoring, or downstream pipelines.
- Scrapes articles by category (e.g., AI, Programming, Finance, Web3, Top Stories).
- Captures full article text plus summaries (excerpt and TL;DR) where available.
- Enriches each record with author profile attributes and trust/brand flags.
- Includes engagement signals (comments and page views) for prioritization and ranking.
- Outputs clean JSON suitable for dashboards, search indexing, or ML pipelines.
| Feature | Description |
|---|---|
| Multi-category scraping | Collect articles from supported categories with a single configuration. |
| Full content extraction | Retrieves the full articleBody along with excerpt and TL;DR when present. |
| Author enrichment | Captures author name, handle, avatar, bio, and brand/trust indicators. |
| Engagement metrics | Includes commentsCount and pageViews for popularity and performance analysis. |
| Media capture | Extracts main image, dimensions, and social preview image for content previews. |
| Configurable limits | Control maximum records per run for faster iterations and predictable output size. |
| Clean JSON output | Produces normalized records ready for storage, analytics, or integration. |
| Safe defaults | Sensible defaults for category selection and limits to avoid oversized runs. |
| Field Name | Field Description |
|---|---|
| id | Unique article identifier. |
| title | Article headline/title. |
| slug | URL slug for the article. |
| link | Full canonical URL to the article. |
| excerpt | Short summary/preview text. |
| tldr | “Too Long; Didn’t Read” summary when available. |
| articleBody | Full article content/body text. |
| createdAt | Publication timestamp/date-time. |
| parentCategory | Primary category the article belongs to. |
| tags | Array of topic tags associated with the article. |
| commentsCount | Total number of comments on the article. |
| pageViews | Estimated reads/page views value. |
| arweave | Content reference identifier if present. |
| mainImage | Main article image URL. |
| mainImageHeight | Main image height in pixels. |
| mainImageWidth | Main image width in pixels. |
| socialPreviewImage | Social sharing preview image URL. |
| author_name | Author display name. |
| author_handle | Author username/handle. |
| author_avatar | Author profile avatar URL. |
| author_bio | Author biography text. |
| author_isBrand | Indicates if author account represents a brand. |
| author_isTrusted | Indicates if author is verified/trusted. |
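As a sketch of how the field table above might map to typed records downstream, here is a Python dataclass mirroring those names. The field names come from the table; the defaults (empty lists, zero counts, `None` for optional fields) are assumptions consistent with the null-safe behavior described in the FAQ, not part of the scraper itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Article:
    """One normalized HackerNoon record, mirroring the field table above."""
    id: str
    title: str
    slug: str
    link: str
    excerpt: Optional[str] = None
    tldr: Optional[str] = None
    articleBody: Optional[str] = None
    createdAt: Optional[str] = None
    parentCategory: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    commentsCount: int = 0
    pageViews: int = 0
    arweave: Optional[str] = None
    mainImage: Optional[str] = None
    mainImageHeight: Optional[int] = None
    mainImageWidth: Optional[int] = None
    socialPreviewImage: Optional[str] = None
    author_name: Optional[str] = None
    author_handle: Optional[str] = None
    author_avatar: Optional[str] = None
    author_bio: Optional[str] = None
    author_isBrand: bool = False
    author_isTrusted: bool = False
```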
[
{
"id": "rxhvxiLNxsRwnMGivMFc",
"title": "The Future of AI in Healthcare",
"slug": "the-future-of-ai-in-healthcare",
"link": "https://hackernoon.com/the-future-of-ai-in-healthcare",
"excerpt": "Exploring how AI is revolutionizing medical diagnosis...",
"tldr": "AI is transforming healthcare through improved diagnostics and personalized treatment.",
"articleBody": "Full article content here...",
"createdAt": "2023-09-30T21:36:56.367Z",
"parentCategory": "ai",
"tags": [
"artificial-intelligence",
"healthcare",
"machine-learning"
],
"commentsCount": 15,
"pageViews": 348080,
"arweave": "QuKn6Hew8wrwpJ9Zt0OFoeVt5yQwBQyZf30TtejOOno",
"mainImage": "https://hackernoon.imgix.net/images/...",
"mainImageHeight": 1024,
"mainImageWidth": 1536,
"socialPreviewImage": "https://hackernoon.imgix.net/images/...",
"author_name": "Dr. Sarah Johnson",
"author_handle": "sarahj_ai",
"author_avatar": "https://cdn.hackernoon.com/images/...",
"author_bio": "AI researcher and healthcare innovation expert",
"author_isBrand": false,
"author_isTrusted": true
}
]
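Because the output is a plain JSON array like the sample above, consuming it needs nothing beyond the standard library. A minimal sketch for ranking exported records by reads (the function name `top_articles` is illustrative, not part of the project):

```python
import json

def top_articles(json_text, n=5):
    """Parse exported records and return the n most-read titles."""
    records = json.loads(json_text)
    ranked = sorted(records, key=lambda r: r.get("pageViews", 0), reverse=True)
    return [r["title"] for r in ranked[:n]]
```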
HackerNoon Scraper/
├── src/
│ ├── main.py
│ ├── runner.py
│ ├── cli.py
│ ├── config/
│ │ ├── categories.json
│ │ ├── settings.example.json
│ │ └── settings.py
│ ├── core/
│ │ ├── http_client.py
│ │ ├── rate_limiter.py
│ │ ├── logger.py
│ │ └── errors.py
│ ├── extractors/
│ │ ├── category_listing.py
│ │ ├── article_parser.py
│ │ ├── author_parser.py
│ │ └── media_parser.py
│ ├── normalizers/
│ │ ├── article_normalizer.py
│ │ └── text_cleaner.py
│ ├── outputs/
│ │ ├── exporters.py
│ │ ├── json_writer.py
│ │ └── schema.py
│ └── utils/
│ ├── dates.py
│ ├── validators.py
│ └── strings.py
├── data/
│ ├── input.sample.json
│ └── output.sample.json
├── tests/
│ ├── test_article_parser.py
│ ├── test_normalizer.py
│ └── test_validators.py
├── scripts/
│ ├── run_local.sh
│ └── smoke_test.py
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
- Growth marketers use it to track category trends and top-performing stories, so they can spot topics worth writing or sponsoring sooner.
- Data analysts use it to build engagement dashboards from pageViews and commentsCount, so they can rank content by impact instead of guesswork.
- Researchers use it to collect full article text across AI/Web3/Finance, so they can run NLP, clustering, and sentiment studies at scale.
- Content teams use it to benchmark authors and brands across categories, so they can identify credible contributors and collaboration targets.
- Developers use it to feed search indexes or internal knowledge bases, so they can enable fast discovery and reuse of tech writing.
Q: How do I choose a category and limit the number of articles?
Set category to a supported value (e.g., AI, Programming, Finance, Web3, or Top Stories) and use max_posts to cap the output size. A practical range is 50–500 records to balance speed and completeness.
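A minimal input sketch using the two settings named above (exact input schema may differ in your deployment):

```json
{
  "category": "AI",
  "max_posts": 200
}
```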
Q: Why does “Top Stories” return fewer articles than other categories?
Top Stories typically takes longer to process due to heavier rendering and ranking logic. To keep runs predictable, the scraper enforces a stricter cap (commonly around 150) even if max_posts is higher.
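The capping behavior described above amounts to taking the minimum of the requested limit and the category cap. A sketch, assuming a cap of 150 (the text says "commonly around 150") and a hyphenated category key:

```python
TOP_STORIES_CAP = 150  # illustrative value based on the FAQ answer above

def effective_limit(category, max_posts):
    """Return the record cap actually applied for a run."""
    if category.lower().replace(" ", "-") == "top-stories":
        return min(max_posts, TOP_STORIES_CAP)
    return max_posts
```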
Q: What happens if an article is missing fields like TL;DR or images?
The scraper keeps a stable schema and returns empty or null-safe values for optional fields. This prevents downstream pipelines from breaking while still preserving everything that is present on the page.
Q: Can I store outputs for later analysis and avoid duplicates?
Yes—store results by id (and optionally slug) as unique keys. When re-running, deduplicate using id and update engagement fields (like pageViews) if you want freshness over time.
Primary Metric: Typical throughput of ~35–70 articles/min on category pages, depending on media density and the amount of content per article.
Reliability Metric: ~97–99% successful record creation when running with conservative pacing and retry logic; most failures are transient network/render errors.
Efficiency Metric: Stable memory usage for long runs by streaming normalized records to the output writer instead of holding full pages in memory.
Quality Metric: High completeness for core fields (title, link, createdAt, articleBody, author_handle) with consistent normalization of tags and category labeling across runs.
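Consistent tag normalization like that mentioned above typically means lowercasing, trimming, and hyphenating, matching the `machine-learning`-style tags in the sample output. A minimal sketch (the exact rules used by the project's `text_cleaner.py` may differ):

```python
import re

def normalize_tag(tag):
    """Lowercase, trim, and hyphenate a tag: ' Machine Learning ' -> 'machine-learning'."""
    return re.sub(r"\s+", "-", tag.strip().lower())
```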
