feat: add robots.txt to prevent crawler traps #365
Open
MAVRICK-1 wants to merge 1 commit into openml:master from
Conversation
This commit adds a robots.txt file to the public directory of the Next.js application. The robots.txt file disallows crawling of search pages with query parameters to prevent web crawlers from getting stuck in crawler traps. Fixes openml#335
PGijsbers (Contributor) reviewed on Dec 5, 2025
Thanks for taking the time to contribute!
There are two issues with this update:
- as noted in the original issue, the current dataset pages include queries (though they are under /search, so they wouldn't be blocked by this)
- all pages are currently under the /search path; this robots file is configuring paths which do not exist on the server and are not being crawled.
It's probably easier to hold off on this update until the new frontend, which has fixed URLs for datasets, is live, so that we can block all queries for crawlers, unless you have a suggestion.
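For context, a minimal sketch of what that suggestion might look like once the new frontend exposes fixed dataset URLs; these directives are an assumption for illustration and are not part of this PR:

```txt
# Hypothetical rules for the new frontend (assumed, not in this PR):
# keep fixed resource pages crawlable, but block any URL that carries a
# query string so crawlers cannot get trapped in filter permutations.
User-agent: *
Disallow: /*?
```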
This pull request introduces a new robots.txt file to control how search engines crawl and index certain pages of the application. The main change is to restrict search engine access to specific query-based URLs while allowing access to the main resource pages.

Search engine crawling restrictions:
- Added a robots.txt file to disallow crawling of query parameter URLs for datasets, tasks, flows, and runs (e.g., /datasets?*), helping prevent indexing of filtered or search result pages.

Fixes #335
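The diff itself is not reproduced here; based on the rules described above, the added public/robots.txt presumably looks roughly like the sketch below. The exact directives are inferred from this summary, not quoted from the commit.

```txt
# Inferred sketch of public/robots.txt (not the literal committed file):
# block query-parameter URLs for each resource type so crawlers do not
# loop through endless filter/search permutations.
User-agent: *
Disallow: /datasets?*
Disallow: /tasks?*
Disallow: /flows?*
Disallow: /runs?*
```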