Add robots.txt to block Baidu and aggressive crawlers #214
Merged
Combines the existing sitemap + /lockfiles block with the new Baidu/aggressive-bot rules. Deletes public/robots.txt which would have shadowed the route-served ERB and lost the dynamic sitemap URL.
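A minimal sketch of what the combined, route-served template could end up looking like (the view path and the sitemap_url helper are assumptions about this app, not details spelled out in the PR):

```erb
<%# Sketch only: the view path (e.g. app/views/robots/index.text.erb) and the
    sitemap_url helper are assumptions, not something this PR confirms. %>
<%# Existing /lockfiles block, plus the new blanket crawl delay %>
User-agent: *
Crawl-delay: 10
Disallow: /lockfiles

<%# New: block Baiduspider and all of its variants %>
User-agent: Baiduspider
User-agent: Baiduspider-render
User-agent: Baiduspider-image
User-agent: Baiduspider-video
User-agent: Baiduspider-news
Disallow: /

<%# New: block aggressive SEO crawlers that add load without SEO value %>
User-agent: SemrushBot
User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: DotBot
Disallow: /

<%# Existing dynamic sitemap URL, the reason a static public/robots.txt
    cannot be allowed to shadow this route %>
Sitemap: <%= sitemap_url %>
```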
Problem
Web dyno is hitting R14 (memory quota exceeded) continuously, averaging 1,167 MB against a 1,024 MB quota (114%), with a peak of 2,142 MB (209%) observed around Tue May 5 01:00-02:00 UTC. R14 fired 180 times in the observed window.
Analysis of production logs shows two Chinese IP ranges responsible for a significant share of requests to the legacy gem pages:
These IPs are hammering /gems/:id/compatibility/:rails_version and /gems/compat_table at high frequency. A single compat page can run 26-74 DB queries and load hundreds of ActiveRecord objects per request. Under continuous crawler load this keeps memory pressure elevated.

Sample of crawler hits from a single log window:
The 223.199.x and 14.135.x ranges (also top hitters) appear to be scanning for random paths and favicon files unrelated to the app - pure noise.
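For context, a rough Ruby sketch of one way to tally per-IP hits against the gem pages from an archived router log (the router.log filename and the Heroku router's fwd="..." field are assumptions about this setup, not details from the PR):

```ruby
# Sketch only: counts requests to /gems/* per forwarding IP in a saved
# Heroku router log. Filename and log format are assumptions.
counts = Hash.new(0)

File.foreach("router.log") do |line|
  next unless line.include?('path="/gems/')
  ip = line[/fwd="([^"]+)"/, 1]
  counts[ip] += 1 if ip
end

# Print the 20 noisiest source IPs, busiest first
counts.sort_by { |_, n| -n }.first(20).each do |ip, n|
  puts format("%6d  %s", n, ip)
end
```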
Change

Adds public/robots.txt with:

- Crawl-delay: 10 for all bots
- Disallow: / for Baiduspider (all variants: main, render, image, video, news)
- Disallow: / for SemrushBot, AhrefsBot, MJ12bot, DotBot - none of these provide SEO value for this app and all contribute to the crawler load

Caveats