HackerNoon Scraper collects structured articles from HackerNoon categories so you can analyze tech stories, authors, and engagement at scale. It turns long-form posts into clean JSON records for research, trend tracking, and content intelligence. If you need a reliable HackerNoon scraper for category-based discovery, this project keeps the workflow simple and repeatable.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a hackernoon-scraper, you've just found your team — Let’s Chat. 👆👆
This project extracts HackerNoon articles and their rich metadata (content, author details, images, and engagement metrics) into a consistent, developer-friendly output. It solves the problem of manually browsing categories and copying content by automating collection and normalization into a single dataset format. It’s built for developers, analysts, marketers, and researchers who need structured HackerNoon data for reporting, monitoring, or downstream pipelines.
- Scrapes articles by category (e.g., AI, Programming, Finance, Web3, Top Stories).
- Captures full article text plus summaries (excerpt and TL;DR) where available.
- Enriches each record with author profile attributes and trust/brand flags.
- Includes engagement signals (comments and page views) for prioritization and ranking.
- Outputs clean JSON suitable for dashboards, search indexing, or ML pipelines.
| Feature | Description |
|---|---|
| Multi-category scraping | Collect articles from supported categories with a single configuration. |
| Full content extraction | Retrieves the full articleBody along with excerpt and TL;DR when present. |
| Author enrichment | Captures author name, handle, avatar, bio, and brand/trust indicators. |
| Engagement metrics | Includes commentsCount and pageViews for popularity and performance analysis. |
| Media capture | Extracts main image, dimensions, and social preview image for content previews. |
| Configurable limits | Control maximum records per run for faster iterations and predictable output size. |
| Clean JSON output | Produces normalized records ready for storage, analytics, or integration. |
| Safe defaults | Sensible defaults for category selection and limits to avoid oversized runs. |
| Field Name | Field Description |
|---|---|
| id | Unique article identifier. |
| title | Article headline/title. |
| slug | URL slug for the article. |
| link | Full canonical URL to the article. |
| excerpt | Short summary/preview text. |
| tldr | “Too Long; Didn’t Read” summary when available. |
| articleBody | Full article content/body text. |
| createdAt | Publication timestamp/date-time. |
| parentCategory | Primary category the article belongs to. |
| tags | Array of topic tags associated with the article. |
| commentsCount | Total number of comments on the article. |
| pageViews | Estimated reads/page views value. |
| arweave | Content reference identifier if present. |
| mainImage | Main article image URL. |
| mainImageHeight | Main image height in pixels. |
| mainImageWidth | Main image width in pixels. |
| socialPreviewImage | Social sharing preview image URL. |
| author_name | Author display name. |
| author_handle | Author username/handle. |
| author_avatar | Author profile avatar URL. |
| author_bio | Author biography text. |
| author_isBrand | Indicates if author account represents a brand. |
| author_isTrusted | Indicates if author is verified/trusted. |
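As a sketch of how the field table above might map to typed records downstream, here is a Python dataclass mirroring those names. The field names come from the table; the defaults (empty lists, zero counts, `None` for optional fields) are assumptions consistent with the null-safe behavior described in the FAQ, not part of the scraper itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Article:
    """One normalized HackerNoon record, mirroring the field table above."""
    id: str
    title: str
    slug: str
    link: str
    excerpt: Optional[str] = None
    tldr: Optional[str] = None
    articleBody: Optional[str] = None
    createdAt: Optional[str] = None
    parentCategory: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    commentsCount: int = 0
    pageViews: int = 0
    arweave: Optional[str] = None
    mainImage: Optional[str] = None
    mainImageHeight: Optional[int] = None
    mainImageWidth: Optional[int] = None
    socialPreviewImage: Optional[str] = None
    author_name: Optional[str] = None
    author_handle: Optional[str] = None
    author_avatar: Optional[str] = None
    author_bio: Optional[str] = None
    author_isBrand: bool = False
    author_isTrusted: bool = False
```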
[
{
"id": "rxhvxiLNxsRwnMGivMFc",
"title": "The Future of AI in Healthcare",
"slug": "the-future-of-ai-in-healthcare",
"link": "https://hackernoon.com/the-future-of-ai-in-healthcare",
"excerpt": "Exploring how AI is revolutionizing medical diagnosis...",
"tldr": "AI is transforming healthcare through improved diagnostics and personalized treatment.",
"articleBody": "Full article content here...",
"createdAt": "2023-09-30T21:36:56.367Z",
"parentCategory": "ai",
"tags": [
"artificial-intelligence",
"healthcare",
"machine-learning"
],
"commentsCount": 15,
"pageViews": 348080,
"arweave": "QuKn6Hew8wrwpJ9Zt0OFoeVt5yQwBQyZf30TtejOOno",
"mainImage": "https://hackernoon.imgix.net/images/...",
"mainImageHeight": 1024,
"mainImageWidth": 1536,
"socialPreviewImage": "https://hackernoon.imgix.net/images/...",
"author_name": "Dr. Sarah Johnson",
"author_handle": "sarahj_ai",
"author_avatar": "https://cdn.hackernoon.com/images/...",
"author_bio": "AI researcher and healthcare innovation expert",
"author_isBrand": false,
"author_isTrusted": true
}
]
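Because the output is a plain JSON array like the sample above, consuming it needs nothing beyond the standard library. A minimal sketch for ranking exported records by reads (the function name `top_articles` is illustrative, not part of the project):

```python
import json

def top_articles(json_text, n=5):
    """Parse exported records and return the n most-read titles."""
    records = json.loads(json_text)
    ranked = sorted(records, key=lambda r: r.get("pageViews", 0), reverse=True)
    return [r["title"] for r in ranked[:n]]
```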
HackerNoon Scraper/
├── src/
│ ├── main.py
│ ├── runner.py
│ ├── cli.py
│ ├── config/
│ │ ├── categories.json
│ │ ├── settings.example.json
│ │ └── settings.py
│ ├── core/
│ │ ├── http_client.py
│ │ ├── rate_limiter.py
│ │ ├── logger.py
│ │ └── errors.py
│ ├── extractors/
│ │ ├── category_listing.py
│ │ ├── article_parser.py
│ │ ├── author_parser.py
│ │ └── media_parser.py
│ ├── normalizers/
│ │ ├── article_normalizer.py
│ │ └── text_cleaner.py
│ ├── outputs/
│ │ ├── exporters.py
│ │ ├── json_writer.py
│ │ └── schema.py
│ └── utils/
│ ├── dates.py
│ ├── validators.py
│ └── strings.py
├── data/
│ ├── input.sample.json
│ └── output.sample.json
├── tests/
│ ├── test_article_parser.py
│ ├── test_normalizer.py
│ └── test_validators.py
├── scripts/
│ ├── run_local.sh
│ └── smoke_test.py
├── .env.example
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
- Growth marketers use it to track category trends and top-performing stories, so they can spot topics worth writing or sponsoring sooner.
- Data analysts use it to build engagement dashboards from pageViews and commentsCount, so they can rank content by impact instead of guesswork.
- Researchers use it to collect full article text across AI/Web3/Finance, so they can run NLP, clustering, and sentiment studies at scale.
- Content teams use it to benchmark authors and brands across categories, so they can identify credible contributors and collaboration targets.
- Developers use it to feed search indexes or internal knowledge bases, so they can enable fast discovery and reuse of tech writing.
Q: How do I choose a category and limit the number of articles?
Set category to a supported value (e.g., AI, Programming, Finance, Web3, or Top Stories) and use max_posts to cap the output size. A practical range is 50–500 records to balance speed and completeness.
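A minimal input sketch using the two settings named above (exact input schema may differ in your deployment):

```json
{
  "category": "AI",
  "max_posts": 200
}
```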
Q: Why does “Top Stories” return fewer articles than other categories?
Top Stories typically takes longer to process due to heavier rendering and ranking logic. To keep runs predictable, the scraper enforces a stricter cap (commonly around 150) even if max_posts is higher.
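The capping behavior described above amounts to taking the minimum of the requested limit and the category cap. A sketch, assuming a cap of 150 (the text says "commonly around 150") and a hyphenated category key:

```python
TOP_STORIES_CAP = 150  # illustrative value based on the FAQ answer above

def effective_limit(category, max_posts):
    """Return the record cap actually applied for a run."""
    if category.lower().replace(" ", "-") == "top-stories":
        return min(max_posts, TOP_STORIES_CAP)
    return max_posts
```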
Q: What happens if an article is missing fields like TL;DR or images?
The scraper keeps a stable schema and returns empty or null-safe values for optional fields. This prevents downstream pipelines from breaking while still preserving everything that is present on the page.
Q: Can I store outputs for later analysis and avoid duplicates?
Yes—store results by id (and optionally slug) as unique keys. When re-running, deduplicate using id and update engagement fields (like pageViews) if you want freshness over time.
Primary Metric: Typical throughput of ~35–70 articles/min on category pages, depending on media density and the amount of content per article.
Reliability Metric: ~97–99% successful record creation when running with conservative pacing and retry logic; most failures are transient network/render errors.
Efficiency Metric: Stable memory usage for long runs by streaming normalized records to the output writer instead of holding full pages in memory.
Quality Metric: High completeness for core fields (title, link, createdAt, articleBody, author_handle) with consistent normalization of tags and category labeling across runs.
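Consistent tag normalization like that mentioned above typically means lowercasing, trimming, and hyphenating, matching the `machine-learning`-style tags in the sample output. A minimal sketch (the exact rules used by the project's `text_cleaner.py` may differ):

```python
import re

def normalize_tag(tag):
    """Lowercase, trim, and hyphenate a tag: ' Machine Learning ' -> 'machine-learning'."""
    return re.sub(r"\s+", "-", tag.strip().lower())
```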
