Add robots.txt to block Baidu and aggressive crawlers#214

Merged
JuanVqz merged 2 commits into main from feature/robots-txt-block-baidu-crawlers on May 6, 2026
Conversation


@JuanVqz JuanVqz commented May 6, 2026

Problem

The web dyno is hitting R14 (memory quota exceeded) continuously, averaging 1,167 MB against a 1,024 MB quota (114%), with a peak of 2,142 MB (209%) observed around Tue May 5 01:00-02:00 UTC. R14 fired 180 times in the observed window.

Analysis of production logs shows two Chinese IP ranges responsible for a significant share of requests to the legacy gem pages:

  • 116.179.32.x / 116.179.33.x / 116.179.37.x - Baidu crawler datacenter (multiple IPs)
  • 220.181.108.x / 220.181.51.x - Baidu crawler datacenter (multiple IPs)

These IPs are hammering /gems/:id/compatibility/:rails_version and /gems/compat_table at high frequency. A single compat page can run 26-74 DB queries and load hundreds of ActiveRecord objects per request. Under continuous crawler load this keeps memory pressure elevated.

Sample of crawler hits from a single log window:

116.179.32.208  GET /gems/Imlib2-Ruby/compatibility/rails-7-1       55ms
116.179.32.219  GET /gems/rr/compatibility/rails-8-0                54ms
220.181.108.114 GET /gems/firebase_dynamic_link                     23ms
220.181.108.178 GET /gems/err_supply/compatibility/rails-8-0        54ms
116.179.32.227  GET /gems/rswag/compatibility/rails-6-1            203ms
116.179.33.76   GET /gems/compat_table?gemmy_ids=10186             214ms
183.209.102.248 GET /gems/aws-sdk-cognitoidentityprovider/...      404ms
220.181.108.92  GET /gems/aws-sdk-ses/compatibility/rails-7-2      297ms

The 223.199.x and 14.135.x ranges (also top hitters) appear to be scanning for random paths and favicon files unrelated to the app - pure noise.

Change

Adds public/robots.txt with the following rules (a full sketch follows this list):

  • Crawl-delay: 10 for all bots
  • Full Disallow: / for Baiduspider (all variants: main, render, image, video, news)
  • Full Disallow: / for SemrushBot, AhrefsBot, MJ12bot, DotBot - none of these provide SEO value for this app and all contribute to the crawler load
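
Putting those together, the file would look roughly like the sketch below. The exact contents of the merged file aren't shown in this PR view, and the Baiduspider variant tokens are assumed to match Baidu's published user-agent names:

  # Aggressive crawlers that provide no SEO value here: full block.
  # Grouped User-agent lines share the Disallow rule below them.
  User-agent: Baiduspider
  User-agent: Baiduspider-render
  User-agent: Baiduspider-image
  User-agent: Baiduspider-video
  User-agent: Baiduspider-news
  User-agent: SemrushBot
  User-agent: AhrefsBot
  User-agent: MJ12bot
  User-agent: DotBot
  Disallow: /

  # Everyone else: allowed, but throttled.
  User-agent: *
  Crawl-delay: 10
  Disallow:

Worth noting: Googlebot ignores Crawl-delay (its crawl rate is managed via Search Console), so that directive mainly affects the smaller crawlers that honor it.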

Caveats

  • Baidu respects robots.txt but not immediately. Expect 24-72h before crawl traffic drops.
  • This does not block the IPs at the network level. If crawler traffic continues after robots.txt propagates, a Rack::Attack rule or Heroku WAF rule targeting those CIDR blocks would be the next step (see the sketch after this list).
  • Googlebot and other legitimate crawlers are unaffected.
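
For the record, a minimal sketch of what that Rack::Attack escalation could look like. This is not part of this PR; the /24 blocks below are inferred from the ranges observed above and would need verifying against current logs before use:

  # config/initializers/rack_attack.rb
  require "ipaddr"

  # Datacenter ranges observed hammering the compat pages.
  BLOCKED_CRAWLER_RANGES = %w[
    116.179.32.0/24
    116.179.33.0/24
    116.179.37.0/24
    220.181.108.0/24
    220.181.51.0/24
  ].map { |cidr| IPAddr.new(cidr) }

  # Matching requests get Rack::Attack's default 403 before reaching the app.
  Rack::Attack.blocklist("baidu datacenter ranges") do |req|
    BLOCKED_CRAWLER_RANGES.any? { |range| range.include?(req.ip) }
  end

Recent rack-attack versions also ship Rack::Attack.blocklist_ip, which accepts a CIDR string directly and would be a shorter route for a single range.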

@JuanVqz JuanVqz self-assigned this May 6, 2026
@JuanVqz JuanVqz marked this pull request as ready for review May 6, 2026 15:24
Combines the existing sitemap + /lockfiles block with the new
Baidu/aggressive-bot rules. Deletes public/robots.txt, which would
have shadowed the route-served ERB and lost the dynamic sitemap URL.
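
For context on that last point: when the static file server is enabled (as on Heroku), files under public/ are served ahead of the router, so a static public/robots.txt would win over a GET /robots.txt route. A route-served version typically looks something like the sketch below; the controller, view, and sitemap_url helper names are hypothetical, since the PR doesn't show the app's actual files:

  # config/routes.rb
  get "robots.txt", to: "robots#show", defaults: { format: "text" }

  # app/controllers/robots_controller.rb
  class RobotsController < ApplicationController
    def show
      # Renders app/views/robots/show.text.erb, which can interpolate
      # the dynamic sitemap URL that a static file would hard-code.
    end
  end

  # app/views/robots/show.text.erb
  User-agent: *
  Crawl-delay: 10
  Disallow: /lockfiles
  Sitemap: <%= sitemap_url %>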
@JuanVqz JuanVqz merged commit 8e57f22 into main on May 6, 2026
2 checks passed