120 changes: 119 additions & 1 deletion py-src/data_formulator/tables_routes.py
@@ -937,4 +937,122 @@
    except Exception as e:
        logger.error(f"Error listing table metadata: {str(e)}")
        safe_msg, status_code = sanitize_db_error_message(e)
        return jsonify({"status": "error", "message": safe_msg}), status_code

Check warning — Code scanning / CodeQL

Information exposure through an exception (Medium): Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix (AI, 18 days ago)
In general, the fix is to ensure that detailed exception information (messages or stack traces) is only logged server-side and not returned to clients. The client should receive either a generic message or a carefully curated/sanitized message that cannot expose internal implementation details. Status codes can remain specific, but message bodies must be safe.

The best targeted fix here is to change sanitize_db_error_message so that, for unknown errors, it does not include error_msg in the returned string. Instead, it should return a static, generic error like "An unexpected error occurred." while still logging the full error with logger.error for debugging. Optionally, to be extra safe, even the “safe” patterns can be changed to return generic messages rather than echoing the full error_msg, but the minimal change required to stop the identified leak is to fix the default branch at line 673. No changes are needed in the callers (e.g., at lines 629, 693, 939) because they already rely on this helper for sanitization and they log separately.

Concretely, in py-src/data_formulator/tables_routes.py, within sanitize_db_error_message, keep converting the exception to a string and keep the pattern matching, but change the final return statement to drop error_msg from the message. Optionally, we can also slightly adjust the log to include more context (but still server-side only). This change requires no new imports or additional functions.
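To make the described change concrete, here is a hedged sketch of what sanitize_db_error_message looks like after the autofix. The pattern list and messages below are illustrative placeholders, not the project's actual ones; only the default branch at the end mirrors the suggested patch:

```python
import logging
import re

logger = logging.getLogger("tables_routes")

# Illustrative patterns only; the real list lives in tables_routes.py.
SAFE_ERROR_PATTERNS = [
    (r"table .* (does not exist|not found)", "Table not found.", 404),
    (r"permission denied", "Permission denied.", 403),
]

def sanitize_db_error_message(e: Exception):
    """Map a raw DB exception to a (client-safe message, status code) pair."""
    error_msg = str(e)
    for pattern, safe_msg, status_code in SAFE_ERROR_PATTERNS:
        if re.search(pattern, error_msg, re.IGNORECASE):
            return safe_msg, status_code

    # Log the full error for debugging (server-side only)
    logger.error(f"Unexpected error occurred: {error_msg}")

    # Return a generic error message for unknown errors (do not expose internal details)
    return "An unexpected error occurred.", 500

msg, code = sanitize_db_error_message(RuntimeError("connection to 10.0.0.5 refused"))
```

Note that error_msg never reaches the returned message in the default branch, which is exactly the flow the CodeQL alert flags.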

Suggested changeset 1: py-src/data_formulator/tables_routes.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/tables_routes.py b/py-src/data_formulator/tables_routes.py
--- a/py-src/data_formulator/tables_routes.py
+++ b/py-src/data_formulator/tables_routes.py
@@ -669,8 +669,8 @@
     # Log the full error for debugging
     logger.error(f"Unexpected error occurred: {error_msg}")
     
-    # Return a generic error message for unknown errors
-    return f"An unexpected error occurred: {error_msg}", 500
+    # Return a generic error message for unknown errors (do not expose internal details)
+    return "An unexpected error occurred.", 500
 
 
 @tables_bp.route('/data-loader/list-data-loaders', methods=['GET'])
EOF
Copilot is powered by AI and may make mistakes. Always verify output.

Check warning — Code scanning / CodeQL

Information exposure through an exception (Medium): Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix (AI, 18 days ago)

In general, the fix is to ensure that client responses never include raw exception messages for unexpected errors. Instead, log the full error (possibly with stack trace) server-side and send a generic, non-sensitive message to the client, while only returning more specific messages when they are explicitly known to be safe.

The best targeted fix here is to change sanitize_db_error_message so that for unknown errors (those not matching any safe_error_patterns), it returns a fully generic message such as "An unexpected error occurred. Please try again later." without embedding error_msg. We should still log the detailed error for server-side diagnosis; to improve diagnostics we can include a full stack trace using logger.exception (which logs the traceback) or traceback.format_exc() instead of only the message string. We must keep the function’s signature and return type unchanged, because it is used in multiple places (e.g., at lines 628, 693, and 939). The only edits needed are within sanitize_db_error_message in py-src/data_formulator/tables_routes.py; no new imports are required since logging, traceback, and re are already imported.

Concretely:

  • Leave the “safe pattern” handling unchanged.
  • Replace the current “unexpected error” handling (lines 669–673) with:
    • a logger.exception("Unexpected error occurred while processing request") call (or similar) to capture the stack trace;
    • a generic user-facing message that does not contain error_msg.

This removes the tainted exception message from the user-facing safe_msg while preserving observability on the server.
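The difference between logger.error(msg) and logger.exception(msg) is easy to demonstrate in isolation. The snippet below (logger name and table name are made up) captures log output in a buffer and shows that the traceback, including the tainted detail, stays server-side while the client-facing message remains a constant:

```python
import io
import logging

buf = io.StringIO()
logger = logging.getLogger("tables_routes_demo")
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.ERROR)
logger.propagate = False

def handle_request():
    try:
        raise KeyError("users_secret_table")  # stand-in for a DB failure
    except KeyError:
        # logger.exception = logger.error + the active traceback,
        # so operators keep full context in the server log ...
        logger.exception("Unexpected error occurred while processing request")
        # ... while the client only ever sees this constant string.
        return "An unexpected error occurred. Please try again later.", 500

safe_msg, status = handle_request()
server_log = buf.getvalue()
```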
Suggested changeset 1: py-src/data_formulator/tables_routes.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/tables_routes.py b/py-src/data_formulator/tables_routes.py
--- a/py-src/data_formulator/tables_routes.py
+++ b/py-src/data_formulator/tables_routes.py
@@ -666,11 +666,11 @@
         if re.search(pattern, error_msg, re.IGNORECASE):
             return safe_msg, status_code
             
-    # Log the full error for debugging
-    logger.error(f"Unexpected error occurred: {error_msg}")
+    # Log the full error for debugging (including stack trace)
+    logger.exception("Unexpected error occurred while processing request")
     
-    # Return a generic error message for unknown errors
-    return f"An unexpected error occurred: {error_msg}", 500
+    # Return a generic error message for unknown errors without exposing internal details
+    return "An unexpected error occurred. Please try again later.", 500
 
 
 @tables_bp.route('/data-loader/list-data-loaders', methods=['GET'])
EOF


@tables_bp.route('/fetch-url', methods=['GET'])
def fetch_url_proxy():
    """Proxy endpoint to fetch external URL content server-side.

    This bypasses browser CORS restrictions for URLs that do not include
    the required Access-Control-Allow-Origin header (e.g. public S3 files).
    Only http:// and https:// URLs are allowed. Requests to private/internal
    network addresses are blocked to prevent Server-Side Request Forgery (SSRF).
    All resolved IP addresses are validated and the connection is pinned to the
    validated IP to prevent DNS-rebinding attacks.
    """
    import ipaddress
    import socket as _socket
    import requests as _requests
    from requests.adapters import HTTPAdapter
    from urllib.parse import urlparse, urlunparse

    url = request.args.get('url', '').strip()
    if not url:
        return jsonify({"status": "error", "message": "Missing 'url' query parameter"}), 400

    if not (url.startswith('http://') or url.startswith('https://')):
        return jsonify({"status": "error", "message": "Only http:// and https:// URLs are supported"}), 400

    # SSRF protection: resolve all IP addresses for the hostname and ensure none
    # of them point to a private or reserved range. Validating every resolved
    # address prevents an attacker from mixing a public IP with a private one in
    # round-robin DNS to bypass the check. The connection URL is then rebuilt
    # using the resolved IP directly so the OS cannot re-resolve the hostname
    # to a different (possibly private) IP between validation and connection.
    try:
        parsed = urlparse(url)
        hostname = parsed.hostname or ''

        if not hostname or hostname.lower() == 'localhost':
            return jsonify({"status": "error", "message": "Requests to internal addresses are not allowed"}), 403

        try:
            addr_infos = _socket.getaddrinfo(hostname, None)
            if not addr_infos:
                return jsonify({"status": "error", "message": "Unable to resolve hostname"}), 400

            for info in addr_infos:
                raw_ip = info[4][0]
                try:
                    ip_obj = ipaddress.ip_address(raw_ip)
                except ValueError:
                    continue
                if (ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local
                        or ip_obj.is_reserved or ip_obj.is_unspecified):
                    return jsonify({"status": "error", "message": "Requests to internal addresses are not allowed"}), 403

            # Use the first resolved address for the pinned connection.
            resolved_ip = addr_infos[0][4][0]
        except _socket.gaierror:
            return jsonify({"status": "error", "message": "Unable to resolve hostname"}), 400

        default_port = 443 if parsed.scheme == 'https' else 80
        port = parsed.port or default_port
        # Omit the port from the netloc when it equals the scheme default to
        # avoid confusing servers that do not expect an explicit default port.
        ip_netloc = resolved_ip if port == default_port else f"{resolved_ip}:{port}"
        # Reconstruct the URL with the validated IP in the netloc so subsequent
        # network I/O uses the already-verified address (no re-resolution).
        safe_url = urlunparse((parsed.scheme, ip_netloc, parsed.path or '/',
                               parsed.params, parsed.query, ''))

    except Exception as e:
        logger.warning(f"SSRF check error for URL {url}: {e}")
        return jsonify({"status": "error", "message": "Invalid URL"}), 400

    _STREAM_CHUNK_SIZE = 8192

    class _HostnameVerifyAdapter(HTTPAdapter):
        """Adapter that verifies TLS certificates against *hostname* even when the
        connection URL contains an IP address. This is needed because we pin the
        connection to the pre-resolved IP (to prevent DNS-rebinding) while still
        needing to match the cert's CN/SAN, which is issued for the hostname."""

        def __init__(self, hostname: str, *args, **kwargs):
            self._hostname = hostname
            super().__init__(*args, **kwargs)

        def init_poolmanager(self, num_pools, maxsize, block=False,
                             **connection_pool_kw):
            connection_pool_kw['assert_hostname'] = self._hostname
            super().init_poolmanager(num_pools, maxsize, block,
                                     **connection_pool_kw)

    try:
        session = _requests.Session()
        if parsed.scheme == 'https':
            session.mount('https://', _HostnameVerifyAdapter(hostname))

        resp = session.get(
            safe_url, timeout=30, stream=True,
            headers={'Host': hostname},
            verify=True,
        )
        resp.raise_for_status()

        content_type = resp.headers.get('Content-Type', 'text/plain')
        return Response(
            resp.iter_content(chunk_size=_STREAM_CHUNK_SIZE),
            status=resp.status_code,
            content_type=content_type,
        )
    except _requests.exceptions.Timeout:
        logger.warning(f"Timeout fetching URL: {url}")
        return jsonify({"status": "error", "message": "Request timed out"}), 504
    except _requests.exceptions.HTTPError as e:
        logger.warning(f"HTTP error fetching URL {url}: {e}")
        return jsonify({"status": "error", "message": str(e)}), 502

Check warning — Code scanning / CodeQL

Information exposure through an exception (Medium): Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix (AI, 18 days ago)

In general, to fix this kind of problem you should avoid returning raw exception messages (or stack traces) to the client. Instead, log the detailed error on the server for diagnostics and return a generic message in the API response.

For this specific case, we should keep the logger.warning(f"HTTP error fetching URL {url}: {e}") line so operators still see the actual HTTP error, but change the JSON response to use a neutral message that does not include str(e). Functionality does not depend on the exact error string; the HTTP status code 502 is already being used to signal a bad upstream response. So on lines 1053–1055 in py-src/data_formulator/tables_routes.py, we will replace the return statement inside the _requests.exceptions.HTTPError handler so it no longer exposes e and instead returns a constant like "Upstream HTTP error".

No new imports or helper methods are required; we only adjust the JSON payload in this one except block.
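To see why str(e) is worth scrubbing here: raise_for_status() embeds the requested URL in the exception message, so echoing it back would reveal exactly what the proxy tried to reach. A self-contained illustration — the class below is a stand-in for requests.exceptions.HTTPError (so no network or third-party import is needed), and the URL is invented:

```python
class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError; the message mirrors the
    format Response.raise_for_status() produces."""

upstream = HTTPError("502 Server Error: Bad Gateway for url: http://10.0.0.5:8080/admin")

leaky_message = str(upstream)          # what the old handler returned: exposes the URL
safe_message = "Upstream HTTP error"   # what the autofix returns: a constant
```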

Suggested changeset 1: py-src/data_formulator/tables_routes.py

Autofix patch — run the following command in your local git repository to apply it:
cat << 'EOF' | git apply
diff --git a/py-src/data_formulator/tables_routes.py b/py-src/data_formulator/tables_routes.py
--- a/py-src/data_formulator/tables_routes.py
+++ b/py-src/data_formulator/tables_routes.py
@@ -1052,7 +1052,7 @@
         return jsonify({"status": "error", "message": "Request timed out"}), 504
     except _requests.exceptions.HTTPError as e:
         logger.warning(f"HTTP error fetching URL {url}: {e}")
-        return jsonify({"status": "error", "message": str(e)}), 502
+        return jsonify({"status": "error", "message": "Upstream HTTP error"}), 502
     except Exception as e:
         logger.error(f"Error fetching URL {url}: {e}")
         return jsonify({"status": "error", "message": "Failed to fetch URL"}), 502
\ No newline at end of file
EOF
    except Exception as e:
        logger.error(f"Error fetching URL {url}: {e}")
        return jsonify({"status": "error", "message": "Failed to fetch URL"}), 502
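The two SSRF steps in the handler above — rejecting internal ranges, then pinning the connection to a validated IP — can be exercised without any network I/O. This sketch lifts the range checks and the urlunparse rebuild out of the route (is_internal and pin_url are illustrative names; the hostname and 203.0.113.10 are placeholder values):

```python
import ipaddress
from urllib.parse import urlparse, urlunparse

def is_internal(raw_ip: str) -> bool:
    # The same range checks the handler applies to every resolved address.
    ip_obj = ipaddress.ip_address(raw_ip)
    return (ip_obj.is_private or ip_obj.is_loopback or ip_obj.is_link_local
            or ip_obj.is_reserved or ip_obj.is_unspecified)

def pin_url(url: str, resolved_ip: str) -> str:
    # Rebuild the URL around the validated IP so later network I/O cannot
    # re-resolve the hostname (the DNS-rebinding defence described above).
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == 'https' else 80
    port = parsed.port or default_port
    ip_netloc = resolved_ip if port == default_port else f"{resolved_ip}:{port}"
    return urlunparse((parsed.scheme, ip_netloc, parsed.path or '/',
                       parsed.params, parsed.query, ''))

print(is_internal("169.254.169.254"))  # cloud metadata endpoint -> True
print(is_internal("10.0.0.5"))         # RFC 1918 private range -> True
print(pin_url("https://data.example.com/report.csv?sig=x", "203.0.113.10"))
# -> https://203.0.113.10/report.csv?sig=x
```

With the hostname gone from the netloc, the Host header and the _HostnameVerifyAdapter in the route are what keep virtual hosting and TLS verification working against the original name.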
8 changes: 6 additions & 2 deletions src/app/useDataRefresh.tsx
@@ -7,7 +7,7 @@ import { DataFormulatorState, dfActions, selectRefreshConfigs } from './dfSlice'
 import { AppDispatch } from './store';
 import { DictTable } from '../components/ComponentType';
 import { createTableFromText } from '../data/utils';
-import { fetchWithIdentity, getUrls, computeContentHash } from './utils';
+import { fetchWithIdentity, getUrls, computeContentHash, buildProxiedUrl } from './utils';
 
 interface RefreshResult {
     tableId: string;
@@ -55,7 +55,11 @@ export function useDataRefresh() {
             }
 
             try {
-                const response = await fetch(source.url);
+                // For external http/https URLs, route through the backend proxy to avoid
+                // CORS failures (e.g. public S3 buckets without Access-Control-Allow-Origin).
+                const fetchUrl = buildProxiedUrl(source.url);
+
+                const response = await fetch(fetchUrl);
                 if (!response.ok) {
                     throw new Error(`HTTP ${response.status}: ${response.statusText}`);
                 }
15 changes: 15 additions & 0 deletions src/app/utils.tsx
@@ -73,9 +73,24 @@
 
         // Workspace
         OPEN_WORKSPACE: `/api/tables/open-workspace`,
+
+        // URL proxy - fetches external URLs server-side (bypasses browser CORS)
+        FETCH_URL_PROXY: `/api/tables/fetch-url`,
     };
 }
 
+/**
+ * Build a fetch URL that routes external http/https URLs through the backend
+ * proxy to avoid CORS failures (e.g. public S3 buckets without CORS headers).
+ * Relative and non-http URLs are returned unchanged.
+ */
+export function buildProxiedUrl(url: string): string {
+    if (url.startsWith('http://') || url.startsWith('https://')) {
+        return `${getUrls().FETCH_URL_PROXY}?url=${encodeURIComponent(url)}`;
+    }
+    return url;
+}
+
 /**
  * Get the current namespaced identity from the Redux store, or fall back to browser ID.
  * Returns identity in "type:id" format (e.g., "user:alice@example.com" or "browser:550e8400-...")
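For readers tracing the round-trip from the backend side, the decision rule in buildProxiedUrl can be mirrored in Python. This is a sketch: FETCH_URL_PROXY is hard-coded here rather than read from getUrls(), and urllib's quote(..., safe='') is a close analogue of encodeURIComponent (they differ only on a few punctuation characters):

```python
from urllib.parse import quote

FETCH_URL_PROXY = "/api/tables/fetch-url"  # mirrors getUrls().FETCH_URL_PROXY

def build_proxied_url(url: str) -> str:
    # Only absolute http(s) URLs are rerouted through the proxy;
    # relative paths go straight to fetch() unchanged.
    if url.startswith("http://") or url.startswith("https://"):
        return f"{FETCH_URL_PROXY}?url={quote(url, safe='')}"
    return url

print(build_proxied_url("https://example.com/data.csv"))
# -> /api/tables/fetch-url?url=https%3A%2F%2Fexample.com%2Fdata.csv
```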
8 changes: 6 additions & 2 deletions src/views/UnifiedDataUploadDialog.tsx
@@ -43,7 +43,7 @@ import { DataSourceConfig, DictTable } from '../components/ComponentType';
 import { createTableFromFromObjectArray, createTableFromText, loadTextDataWrapper, loadBinaryDataWrapper } from '../data/utils';
 import { DataLoadingChat } from './DataLoadingChat';
 import { DatasetSelectionView, DatasetMetadata } from './TableSelectionView';
-import { getUrls, fetchWithIdentity } from '../app/utils';
+import { getUrls, fetchWithIdentity, buildProxiedUrl } from '../app/utils';
 import { DBManagerPane } from './DBTableManager';
 import { MultiTablePreview } from './MultiTablePreview';
 import {
@@ -821,7 +821,11 @@ export const UnifiedDataUploadDialog: React.FC<UnifiedDataUploadDialogProps> = (
             const baseName = parts[parts.length - 1]?.split('?')[0] || 'dataset';
             const tableName = getUniqueTableName(baseName.replace(/\.[^.]+$/, ''), existingNames);
 
-            fetch(fullUrl)
+            // For external http/https URLs, route through the backend proxy to avoid
+            // CORS failures (e.g. public S3 buckets without Access-Control-Allow-Origin).
+            const fetchUrl = buildProxiedUrl(urlToUse) || fullUrl;
+
+            fetch(fetchUrl)
                 .then(res => {
                     if (!res.ok) {
                         throw new Error(`HTTP ${res.status}: ${res.statusText}`);