
Conversation

@vzucher (Contributor) commented Dec 1, 2025


🚀 Bright Data Python SDK v2.0 - Major Release

Overview

Complete rewrite of the Bright Data Python SDK with modern async-first architecture, dataclass payloads, Jupyter notebooks for data scientists, and enterprise-grade features.


✨ What's New

🎓 For Data Scientists

  • 5 Jupyter Notebooks - Interactive tutorials from quickstart to batch processing
  • Pandas Integration - Native DataFrame support with examples (see the sketch after this list)
  • Cost Tracking - Budget management and cost analytics
  • Progress Bars - tqdm integration for batch operations
  • Caching Support - joblib integration for development workflows
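
A small sketch of the notebook-style workflow these pieces enable; client.scrape() and to_dict() are taken from the commit notes further down, while the URLs and column handling are purely illustrative:

import pandas as pd
from tqdm import tqdm
from brightdata import BrightDataClient

client = BrightDataClient(token="...")
urls = ["https://example.com/a", "https://example.com/b"]

rows = []
for url in tqdm(urls, desc="scraping"):   # progress bar over a small batch
    result = client.scrape(url)           # generic Web Unlocker scrape
    rows.append(result.to_dict())         # result models serialize to plain dicts

df = pd.DataFrame(rows)                   # one row per URL: success, cost, timing, ...
print(df.head())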

🎨 Dataclass Payloads (Major Upgrade)

  • Runtime Validation - Catch errors at instantiation time
  • Helper Properties - .asin, .is_remote_search, .domain, etc.
  • IDE Autocomplete - Full IntelliSense/type hints support
  • to_dict() Method - Easy API conversion (see the example after this list)
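
A minimal sketch of the dataclass-payload idea; the class name, fields, and helper below are illustrative stand-ins, not the SDK's actual payload definitions:

from dataclasses import dataclass

@dataclass
class ProductPayload:                     # hypothetical payload, for illustration only
    url: str

    def __post_init__(self):
        # runtime validation: errors surface at instantiation, not at request time
        if not self.url.startswith("https://"):
            raise ValueError(f"expected an https URL, got {self.url!r}")

    @property
    def domain(self) -> str:
        # helper property in the spirit of .asin / .domain / .is_remote_search
        return self.url.split("/")[2]

    def to_dict(self) -> dict:
        # easy conversion to the JSON body the API expects
        return {"url": self.url}

payload = ProductPayload(url="https://www.amazon.com/dp/B08N5WRWNW")
print(payload.domain)      # www.amazon.com
print(payload.to_dict())   # {'url': 'https://www.amazon.com/dp/B08N5WRWNW'}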

🖥️ CLI Tool

  • New brightdata command for terminal usage
  • Scrape & search operations from command line
  • Multiple output formats (JSON, pretty, minimal)

🏗️ Architecture Improvements

  • Async-first design with sync wrappers for compatibility (pattern sketch below)
  • Single shared AsyncEngine - 8x efficiency improvement
  • 100% type safety - Dataclasses + TypedDict definitions
  • 502+ comprehensive tests - Unit, integration, and E2E
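
The sync wrappers sit on top of the async core; a generic illustration of that pattern (not the SDK's actual internals):

import asyncio

class ExampleService:
    async def fetch_async(self, url: str) -> str:
        # async-first implementation: all real work happens here
        await asyncio.sleep(0)            # stand-in for an aiohttp request
        return f"<html>payload for {url}</html>"

    def fetch(self, url: str) -> str:
        # thin sync wrapper: runs the async method on an event loop
        return asyncio.run(self.fetch_async(url))

service = ExampleService()
print(service.fetch("https://example.com"))   # blocking call, async under the hood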

🆕 New Platform Support

  • Facebook Scraper - Posts (profile/group/URL), Comments, Reels
  • Instagram Scraper - Profiles, Posts, Comments, Reels discovery

🛡️ Enterprise Features

  • Rich result objects with timing, cost tracking, method tracking
  • SSL error handling with platform-specific guidance
  • .env file support via python-dotenv (example below)
  • Function-level monitoring for analytics
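
A short sketch of how these features combine; the attribute and method names on the result object (success, cost, error, save_to_file) come from the result-model commit notes below and should be read as assumptions about the exact API:

from dotenv import load_dotenv            # pip install python-dotenv
from brightdata import BrightDataClient

load_dotenv()                             # reads BRIGHTDATA_API_TOKEN from a local .env file
client = BrightDataClient()               # token picked up automatically from the environment

result = client.search.google(query="bright data")
if result.success:                        # rich result object: status, cost, timing, method
    print(f"cost: {result.cost}")
    result.save_to_file("google_serp.json")   # serialization helper on result models
else:
    print(f"request failed: {result.error}")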

📊 Stats

Metric | Value
-- | --
Production Code | ~9,000 lines
Test Code | ~4,000 lines
Tests Passing | 502+
Type Safety | 100%
Supported Platforms | Amazon, LinkedIn, ChatGPT, Facebook, Instagram, Generic
Search Engines | Google, Bing, Yandex


🔄 Migration

The new SDK uses BrightDataClient instead of bdclient:

# Before (v1)
from brightdata import bdclient
client = bdclient(api_token="...")
results = client.search("query")

# After (v2)
from brightdata import BrightDataClient
client = BrightDataClient(token="...")
result = client.search.google(query="query")



📚 Documentation

  • 5 Jupyter notebooks in /notebooks/
  • 10+ example scripts in /examples/
  • Full API reference in /docs/
  • Comprehensive README with usage examples

🧪 Testing

All tests passing:

pytest tests/ --cov=brightdata
# 502+ tests, comprehensive coverage

vzucher and others added 30 commits November 10, 2025 21:51
docs: add comprehensive SDK refactoring plan and structure documentation
- Add BaseResult class with common fields (success, cost, error, timing)
- Add ScrapeResult, SearchResult, and CrawlResult service-specific classes
- Implement serialization methods (to_dict, to_json, save_to_file)
- Add timing breakdown methods for performance optimization
- Include comprehensive data validation with __post_init__
- Add type safety with Literal types for enums
- Implement security checks for file operations
- Add custom __repr__ methods for better debugging
- Include full docstrings with Attributes, Args, Returns, Raises
- Add 20 unit tests covering all functionality
…models

Implement high-level WebUnlockerService wrapper around Bright Data's Web Unlocker proxy service. This is the fastest, most cost-effective option for basic HTML extraction without JavaScript rendering.

Features:
- WebUnlockerService: async-first service with sync wrappers
- Unified result models: BaseResult, ScrapeResult, SearchResult, CrawlResult
- BrightData client with scrape() method
- AsyncEngine: HTTP client with aiohttp
- Comprehensive validation utilities
- Exception hierarchy with proper error handling
- CI/CD workflow with lint and pytest
- Pre-commit hooks with Black, Ruff, and mypy
- Python 3.9+ compatibility (timezone.utc instead of UTC)

Breaking changes: None
… comprehensive authentication

Implement the main SDK entry point (BrightDataClient) that provides a unified,
intuitive interface for all Bright Data services with robust authentication and
configuration management.

Features:
- Single-line client initialization with automatic token loading
- Hierarchical service access pattern (client.scrape.amazon, client.search.google)
- Multi-source token authentication (4 env var fallbacks)
- Connection testing and account info retrieval
- Both async and sync API support
- Backward compatibility with legacy BrightData alias

Authentication & Configuration:
- Auto-loads tokens from BRIGHTDATA_API_TOKEN, BRIGHTDATA_API_KEY,
  BRIGHTDATA_TOKEN, or BD_API_TOKEN environment variables
- Token validation with clear, actionable error messages
- Optional token validation on initialization
- Customer ID support
- Configurable timeouts and zone names
- Token whitespace trimming and format validation

Service Architecture:
- ScrapeService: Unified scraping interface with amazon, linkedin, chatgpt,
  and generic sub-services
- SearchService: SERP API access for google, bing searches
- CrawlerService: Web discovery and sitemap extraction
- GenericScraper: Direct Web Unlocker API access (fully functional)
- Lazy initialization and caching of service instances

Connection Management:
- test_connection(): Safe connection testing (never raises exceptions)
- get_account_info(): Retrieve zones, usage stats, and account metadata
- Both async and sync versions available
- Connection state tracking and caching
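
For illustration, the sync side of the connection-management API described above might be used like this (a sketch built from the method names in this commit; return types are assumptions):

from brightdata import BrightDataClient

client = BrightDataClient(token="...")

# test_connection() is documented as never raising; assumed to report failure via its return value
if client.test_connection():
    info = client.get_account_info()      # zones, usage stats, account metadata
    print(info)
else:
    print("could not reach Bright Data with the configured token")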

Philosophical Principles:
- Client is single source of truth for configuration
- Authentication "just works" with minimal setup
- Fails fast and clearly when credentials missing/invalid
- Follows principle of least surprise (common SDK patterns)

Testing:
- 60 comprehensive tests across 3 test suites
- Unit tests: 29/29 passing (100%)
- Integration tests: 16/16 passing (100%)
- E2E tests: 15/15 passing (100%)
- Tests cover token loading, validation, errors, services, connection,
  hierarchical access, backward compatibility, and philosophical principles

Files Changed:
- src/brightdata/client.py: 639 lines - Main client implementation
- src/brightdata/__init__.py: Updated exports
- tests/unit/test_client.py: 283 lines - Comprehensive unit tests
- tests/integration/test_client_integration.py: 224 lines - API integration tests
- tests/e2e/test_client_e2e.py: 320 lines - End-to-end workflow tests

Breaking Changes: None
- Maintains backward compatibility with BrightData alias
- Legacy scrape_url() methods still work

Documentation:
- Comprehensive docstrings on all public methods
- Full type hints throughout
- Clear error messages with actionable guidance
- Usage examples in docstrings

Future Work:
- Implement specialized scrapers (Amazon, LinkedIn, ChatGPT)
- Implement SERP API methods (Google, Bing search)
- Implement Crawler API methods (discover, sitemap)
- Add more E2E workflow tests
feat: implement BrightDataClient with hierarchical service access and…
Implement foundational service layer providing common interface for platform-specific
scrapers with unified scrape (URL-based) and search (parameter-based) patterns.

Core Components:
- BaseWebScraper: Abstract base with trigger/poll/fetch workflow
- Registry pattern: @register decorator for auto-discovery
- AmazonScraper: products(), reviews(), scrape()
- LinkedInScraper: profiles(), companies(), jobs(), scrape()
- ChatGPTScraper: prompt(), prompts()

Key Features:
- Unified signatures across platforms
- Auto-discovery via get_scraper_for(url)
- Data normalization hooks
- Cost tracking and timing metrics
- Both async and sync APIs
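
A sketch of the auto-discovery flow mentioned above; get_scraper_for(url) is named in this commit, while the import path and what exactly it returns are assumptions:

# hypothetical import path; the commit only names the function itself
from brightdata.scrapers import get_scraper_for

scraper = get_scraper_for("https://www.amazon.com/dp/B08N5WRWNW")
# registry lookup driven by the @register decorator; an amazon.com URL
# should resolve to the AmazonScraper registered for that domain
print(scraper)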

Testing:
- 42 new unit tests (100% passing)
- CLI-tested with real Bright Data API
- Total: 122/122 tests passing

Resolves: BRI-17
Add unified SERP API supporting Google, Bing, and Yandex with normalized results
across engines for SEO analysis and competitive intelligence.

Core Components:
- BaseSERPService: Common search patterns for all engines
- GoogleSERPService: Full Google search with SERP features
- BingSERPService & YandexSERPService: Multi-engine support
- SearchService: Integrated into client.search namespace

Features:
- Normalized result format across engines (ranking positions, titles, URLs)
- SERP feature extraction (featured snippets, knowledge panels, People Also Ask)
- Location and language targeting per engine
- Device type support (desktop/mobile/tablet)
- Returns SearchResult with query metadata and timing

Interface:
  result = client.search.google(query="python", location="US", num_results=20)
  result = client.search.bing(query="python", location="UK")
  result = client.search.yandex(query="python", location="Russia")

Philosophy:
- SERP data normalized for easy cross-engine comparison
- Engine quirks handled transparently
- Ranking positions included for competitive context

Testing:
- 30 comprehensive unit tests (100% passing)
- URL building, normalization, feature extraction validated
- Total: 152/152 tests across all 5 task specs

Files:
- src/brightdata/api/serp.py (554 lines)
- src/brightdata/client.py (updated SearchService)
- tests/unit/test_serp.py (30 tests)
…T, and SERP services

Implement production-ready async-first SDK with hierarchical service access, comprehensive platform support, and 100% type safety. Features include: BrightDataClient with multi-source token auth and connection testing; unified result models (ScrapeResult, SearchResult, CrawlResult) with timing/cost tracking; WebUnlockerService for generic web scraping; platform-specific scrapers with registry pattern (Amazon products/reviews/sellers, LinkedIn posts/jobs/profiles/companies
…ntations

Move production-ready SDK from new-sdk/ to repository root for cleaner structure.
Archive old-sdk/ and ref-sdk/ (added to .gitignore) as they are superseded by the
new implementation. The repository now shows only the modern async-first SDK with
237 passing tests, complete LinkedIn/Amazon/ChatGPT support, and FAANG-level quality.

Changes:
- Moved new-sdk/* to root directory
- Removed old-sdk/ and ref-sdk/ from git tracking
- Updated .gitignore to exclude archived implementations
- Root now contains production SDK directly

Result: Clean repository structure with world-class SDK at root level.
Update demo_sdk.py to showcase complete API:
- Generic web scraping
- Amazon (products, reviews, sellers) - URL-based
- LinkedIn scrape (posts, jobs, profiles, companies) - URL-based
- LinkedIn search (jobs, profiles, posts) - parameter-based discovery
- SERP (Google, Bing, Yandex)
- ChatGPT search with prompts
- Batch operations
- Sync vs async mode comparison
- 12 interactive menu options covering all features
- Fix: Remove direct _session access - add public methods post_to_url() and get_from_url()
- Fix: Unify duplicate _trigger_async methods in base.py
- Fix: Add structured logging to registry to prevent silent failures
- Feat: Implement rate limiting with aiolimiter (10 req/s default)

All 32 direct _session accesses replaced with public methods
Rate limiting configurable via client parameters
Logging added for better debugging
Code quality improved to enterprise standards
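
The rate limiting mentioned above is built on aiolimiter; a generic illustration of a 10 req/s limiter around async calls (not the SDK's internal code, which exposes this via client parameters):

import asyncio
from aiolimiter import AsyncLimiter       # pip install aiolimiter

limiter = AsyncLimiter(10, 1)             # at most 10 acquisitions per 1-second window

async def limited_fetch(i: int) -> str:
    async with limiter:                   # waits once the 10 req/s budget is spent
        await asyncio.sleep(0)            # stand-in for the real HTTP request
        return f"response {i}"

async def main():
    results = await asyncio.gather(*(limited_fetch(i) for i in range(25)))
    print(len(results), "requests completed under the rate limit")

asyncio.run(main())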
This commit implements comprehensive refactoring to improve code quality, maintainability, and developer experience:

**Code Organization:**
- Extract service classes from client.py into dedicated modules
  - ScrapeService, GenericScraper → api/scrape_service.py
  - SearchService → api/search_service.py
  - CrawlerService → api/crawler_service.py
- Refactor BaseWebScraper (600+ → 277 lines)
  - Extract HTTP operations to DatasetAPIClient (api_client.py)
  - Extract workflow logic to WorkflowExecutor (workflow.py)
- Simplify LinkedIn scraper structure
  - Remove empty placeholder files (companies.py, jobs.py, profiles.py, posts.py)
  - Consolidate URL-based methods in scraper.py, search methods in search.py

**API Improvements:**
- Standardize environment variable to BRIGHTDATA_API_TOKEN only
  - Remove BRIGHTDATA_API_KEY, BRIGHTDATA_TOKEN, BD_API_TOKEN
- Add .env file support via python-dotenv
- Remove sync parameter from all async methods
  - Standardize on trigger/poll/fetch workflow for async operations
  - Sync methods are now simple wrappers around async counterparts
- Implement dependency injection for search services
  - LinkedInSearchScraper and ChatGPTSearchService accept optional engine parameter

**Model Changes:**
- Rename timing fields for clarity
  - request_sent_at → trigger_sent_at
  - data_received_at → data_fetched_at
- Replace fallback_used boolean with method string field
  - Provides explicit method information ("web_scraper", "web_unlocker", etc.)

**Naming Consistency:**
- Rename LinkedInSearchService → LinkedInSearchScraper
  - Consistent naming pattern with LinkedInScraper

**Error Handling:**
- Add SSL certificate error handling for macOS
  - Custom SSLError with platform-specific guidance
  - Helpful error messages with fix instructions

**Files Changed:**
- New: api/scrape_service.py, api/search_service.py, api/crawler_service.py
- New: scrapers/api_client.py, scrapers/workflow.py
- New: utils/ssl_helpers.py
- Modified: client.py, models.py, base.py, all scraper implementations
- Removed: scrapers/linkedin/{companies,jobs,profiles,posts}.py

All changes maintain backward compatibility where possible, with clear migration paths documented in docstrings and error messages.

BREAKING CHANGE: Multiple environment variable names removed, sync parameter removed from async methods, timing field names changed, fallback_used field replaced with method field
vzucher and others added 29 commits November 20, 2025 16:54
- Add HTTP status code constants (HTTP_OK, HTTP_UNAUTHORIZED, etc.)
- Replace all magic numbers for HTTP status codes with named constants
- Move imports to top of files (except intentional lazy loading)
- Replace hardcoded cost values with platform-specific constants
- Improve exception handling with specific exception types
- Add platform-specific cost constants (COST_PER_RECORD_LINKEDIN, etc.)

This refactoring improves code maintainability, readability, and follows
Python best practices by eliminating magic numbers and organizing imports.

Files modified:
- constants.py: Added HTTP status codes and platform-specific cost constants
- client.py: Use HTTP constants, move warnings import, improve exception handling
- core/engine.py: Use HTTP constants for status code checks
- core/zone_manager.py: Use HTTP constants, move aiohttp import, improve exceptions
- api/serp/base.py: Use HTTP_OK constant
- api/web_unlocker.py: Use HTTP_OK constant, improve exception handling
- scrapers/api_client.py: Use HTTP_OK constant
- scrapers/base.py: Move os and concurrent.futures imports to top
- api/base.py: Move asyncio import to top
- utils/ssl_helpers.py: Move aiohttp import to top with try/except
- scrapers/workflow.py: Move poll_until_ready import to top, use DEFAULT_COST_PER_RECORD
- All scraper files: Use platform-specific cost constants instead of hardcoded values
- Update test_client.py to expect new default zone names (web_unlocker1, serp_api1, browser_api1)
- Fix test_amazon.py dataset IDs to match actual values in scraper
- Update test_scrapers.py to verify COST_PER_RECORD uses DEFAULT_COST_PER_RECORD constant

These changes align tests with the refactoring that replaced magic numbers
with constants and updated default zone names to match Bright Data conventions.
refactor: improve code quality with constants and best practices
…text when the client is already used as a context manager. This causes the session lifecycle issues.
@shahar-brd merged commit 4108b23 into brightdata:main Dec 1, 2025
5 checks passed