Add robots.txt to block Baidu and aggressive crawlers#214

Merged
JuanVqz merged 2 commits into main from feature/robots-txt-block-baidu-crawlers on May 6, 2026
Conversation


@JuanVqz JuanVqz commented May 6, 2026

Problem

The web dyno is hitting R14 (memory quota exceeded) continuously, averaging 1,167 MB against a 1,024 MB quota (114%), with a peak of 2,142 MB (209%) observed around Tue May 5 01:00-02:00 UTC. R14 fired 180 times in the observed window.

Analysis of production logs shows two Chinese IP ranges responsible for a significant share of requests to the legacy gem pages:

  • 116.179.32.x / 116.179.33.x / 116.179.37.x - Baidu crawler datacenter (multiple IPs)
  • 220.181.108.x / 220.181.51.x - Baidu crawler datacenter (multiple IPs)

These IPs are hammering /gems/:id/compatibility/:rails_version and /gems/compat_table at high frequency. A single compat page can run 26-74 DB queries and load hundreds of ActiveRecord objects per request. Under continuous crawler load this keeps memory pressure elevated.

Sample of crawler hits from a single log window:

116.179.32.208  GET /gems/Imlib2-Ruby/compatibility/rails-7-1       55ms
116.179.32.219  GET /gems/rr/compatibility/rails-8-0                54ms
220.181.108.114 GET /gems/firebase_dynamic_link                     23ms
220.181.108.178 GET /gems/err_supply/compatibility/rails-8-0        54ms
116.179.32.227  GET /gems/rswag/compatibility/rails-6-1            203ms
116.179.33.76   GET /gems/compat_table?gemmy_ids=10186             214ms
183.209.102.248 GET /gems/aws-sdk-cognitoidentityprovider/...      404ms
220.181.108.92  GET /gems/aws-sdk-ses/compatibility/rails-7-2      297ms

The 223.199.x and 14.135.x ranges (also top hitters) appear to be scanning for random paths and favicon files unrelated to the app - pure noise.

Change

Adds public/robots.txt with the following rules (a full sketch follows this list):

  • Crawl-delay: 10 for all bots
  • Full Disallow: / for Baiduspider (all variants: main, render, image, video, news)
  • Full Disallow: / for SemrushBot, AhrefsBot, MJ12bot, DotBot - none of these provide SEO value for this app and all contribute to the crawler load
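
Putting those together, the file would look roughly like the sketch below. The exact contents of the merged file aren't shown in this PR view, and the Baiduspider variant tokens are assumed to match Baidu's published user-agent names:

  # Aggressive crawlers that provide no SEO value here: full block.
  # Grouped User-agent lines share the Disallow rule below them.
  User-agent: Baiduspider
  User-agent: Baiduspider-render
  User-agent: Baiduspider-image
  User-agent: Baiduspider-video
  User-agent: Baiduspider-news
  User-agent: SemrushBot
  User-agent: AhrefsBot
  User-agent: MJ12bot
  User-agent: DotBot
  Disallow: /

  # Everyone else: allowed, but throttled.
  User-agent: *
  Crawl-delay: 10
  Disallow:

Worth noting: Googlebot ignores Crawl-delay (its crawl rate is managed via Search Console), so that directive mainly affects the smaller crawlers that honor it.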

Caveats

  • Baidu respects robots.txt but not immediately. Expect 24-72h before crawl traffic drops.
  • This does not block the IPs at the network level. If crawler traffic continues after robots.txt propagates, a Rack::Attack rule or Heroku WAF rule targeting those CIDR blocks would be the next step (see the sketch after this list).
  • Googlebot and other legitimate crawlers are unaffected.
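
For the record, a minimal sketch of what that Rack::Attack escalation could look like. This is not part of this PR; the /24 blocks below are inferred from the ranges observed above and would need verifying against current logs before use:

  # config/initializers/rack_attack.rb
  require "ipaddr"

  # Datacenter ranges observed hammering the compat pages.
  BLOCKED_CRAWLER_RANGES = %w[
    116.179.32.0/24
    116.179.33.0/24
    116.179.37.0/24
    220.181.108.0/24
    220.181.51.0/24
  ].map { |cidr| IPAddr.new(cidr) }

  # Matching requests get Rack::Attack's default 403 before reaching the app.
  Rack::Attack.blocklist("baidu datacenter ranges") do |req|
    BLOCKED_CRAWLER_RANGES.any? { |range| range.include?(req.ip) }
  end

Recent rack-attack versions also ship Rack::Attack.blocklist_ip, which accepts a CIDR string directly and would be a shorter route for a single range.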

@JuanVqz JuanVqz self-assigned this May 6, 2026
@JuanVqz JuanVqz marked this pull request as ready for review May 6, 2026 15:24
Combines the existing sitemap + /lockfiles block with the new
Baidu/aggressive-bot rules. Deletes public/robots.txt, which would
have shadowed the route-served ERB and lost the dynamic sitemap URL.
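
For context on that last point: when the static file server is enabled (as on Heroku), files under public/ are served ahead of the router, so a static public/robots.txt would win over a GET /robots.txt route. A route-served version typically looks something like the sketch below; the controller, view, and sitemap_url helper names are hypothetical, since the PR doesn't show the app's actual files:

  # config/routes.rb
  get "robots.txt", to: "robots#show", defaults: { format: "text" }

  # app/controllers/robots_controller.rb
  class RobotsController < ApplicationController
    def show
      # Renders app/views/robots/show.text.erb, which can interpolate
      # the dynamic sitemap URL that a static file would hard-code.
    end
  end

  # app/views/robots/show.text.erb
  User-agent: *
  Crawl-delay: 10
  Disallow: /lockfiles
  Sitemap: <%= sitemap_url %>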
@JuanVqz JuanVqz merged commit 8e57f22 into main on May 6, 2026
2 checks passed