
Harden SSRF in HandwrittenText/generate.py #23

Open
eprifti wants to merge 1 commit into alphanumericslab:main from eprifti:security/harden-ssrf-in-handwritten-text

Conversation

eprifti commented Apr 18, 2026

Summary

get_handwritten() in codes/ecg-image-generator/HandwrittenText/generate.py calls requests.get(link) on a caller-provided URL (the --link CLI flag) with no scheme check, no timeout, no response-size cap, and redirects enabled. This creates an SSRF surface:

  • Non-HTTP schemes (file://, gopher://, dict://) are accepted by validators.url in versions <0.20
  • Links can target the cloud metadata service at 169.254.169.254 on AWS/GCP/Azure (e.g. http://169.254.169.254/latest/meta-data/ on AWS)
  • A slow or non-responsive server hangs the generator indefinitely
  • Open redirects can pivot to internal hosts
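The summary lists a missing response-size cap, which the diff in this PR does not add. As a rough sketch only, a cap could be enforced by streaming the body and counting bytes. The names `read_capped` and `MAX_BYTES` and the 2 MiB limit are assumptions made for illustration; none of them appear in generate.py.

```python
from typing import Iterable

MAX_BYTES = 2 * 1024 * 1024  # hypothetical 2 MiB cap; not part of this PR


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> bytes:
    """Join chunks, raising ValueError once the total exceeds max_bytes."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"response body exceeds {max_bytes} bytes")
    return bytes(buf)


# Possible use with requests (stream=True defers downloading the body):
#   with requests.get(link, timeout=10, allow_redirects=False, stream=True) as r:
#       r.raise_for_status()
#       body = read_capped(r.iter_content(chunk_size=8192))
```

Because the helper only consumes an iterable of byte chunks, it can be exercised without a network connection, for example `read_capped([b"ab", b"cd"], max_bytes=3)` raises `ValueError`.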

Change

#Extract n medical terms
-if(validators.url(link)):
-    #Parse URL
-    r = requests.get(link)
+if(validators.url(link) and link.lower().startswith(("http://", "https://"))):
+    #Parse URL — restrict scheme, set timeout, disable redirects to prevent SSRF/DoS
+    r = requests.get(link, timeout=10, allow_redirects=False)
+    r.raise_for_status()

Four-line hardening that leaves the normal path (a direct 2xx response) unchanged:

  • Scheme allowlist (http/https) before the request is made
  • 10 second timeout
  • No automatic redirect following
  • Raise on 4xx/5xx so error pages don't feed into the HTML parser
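For reviewers who want to try the four checks in isolation, here is a minimal sketch of them as a standalone helper. `has_allowed_scheme`, `fetch_html`, and `ALLOWED_SCHEMES` are hypothetical names, not code from this PR. Unlike the patched branch, which falls through to the local .txt branch on a rejected link, this sketch raises `ValueError`, and it parses the scheme with `urllib.parse` rather than `startswith`.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # hypothetical constant, not in the PR


def has_allowed_scheme(link: str) -> bool:
    """Return True only for http:// and https:// links (case-insensitive)."""
    return urlsplit(link).scheme.lower() in ALLOWED_SCHEMES


def fetch_html(link: str, timeout: float = 10.0) -> str:
    """Hypothetical helper mirroring the patched branch of get_handwritten()."""
    if not has_allowed_scheme(link):
        raise ValueError(f"unsupported URL scheme: {link!r}")

    # Imported here so the scheme check above is usable without requests.
    import requests

    r = requests.get(link, timeout=timeout, allow_redirects=False)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return r.text
```

One thing to keep in mind: `raise_for_status()` only raises on 4xx and 5xx, so with `allow_redirects=False` a 3xx response is returned as-is and its (usually short) body would still reach the HTML parser.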

Related

Test plan

  • Existing CLI invocation with a normal biomedical HTML URL still works (e.g. the Wikipedia/PubMed examples implied by the validators.url branch)
  • Invocation with a non-http(s) URL now falls through to the local .txt branch
  • Invocation with a server that never responds times out after ~10s instead of hanging

Happy to adjust the timeout or add a host allowlist if you'd prefer stricter defaults.
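If stricter defaults are wanted, one possible shape for the host check is sketched below. It rejects links whose host resolves to a non-public address, which covers 169.254.169.254, loopback, and private ranges even when the scheme is http. The function name `resolves_to_internal_address` is an assumption for illustration, and the sketch is not part of this PR. It also does not protect against DNS rebinding, since requests resolves the name again when it connects.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolves_to_internal_address(link: str) -> bool:
    """Return True if the link's host is missing, unresolvable, or non-public."""
    host = urlsplit(link).hostname
    if host is None:
        return True  # no host to check; treat as unsafe
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable; treat as unsafe
    for info in infos:
        # info[4][0] is the address string; drop any IPv6 scope id ("%eth0").
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        if not addr.is_global:
            return True  # loopback, private, link-local, reserved, ...
    return False
```

A caller could combine this with the scheme check, fetching only when the scheme is http or https and `resolves_to_internal_address(link)` is false.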

get_handwritten() called requests.get(link) on a caller-provided URL
with no scheme check, no timeout, and redirects enabled, so it could
reach cloud metadata endpoints (e.g. 169.254.169.254) and arbitrary
hosts. On older validators (<0.20) the URL validator also accepts
non-HTTP schemes, further widening the surface.

- Require scheme to be http:// or https:// before fetching
- timeout=10 to prevent hangs
- allow_redirects=False to prevent open-redirect pivots
- raise_for_status() so 4xx/5xx errors surface instead of feeding error
  pages into the HTML parser

Reported in alphanumericslab#22.