A serverless ETL pipeline that extracts, transforms, and loads weightlifting workout data from the CrossFit Invictus blog. The pipeline processes WordPress blog posts, structures workout sessions by day, and stores the data in AWS S3 and DynamoDB.
This project automates the extraction of weightlifting workout programs from the CrossFit Invictus WordPress blog. It uses AWS Step Functions to orchestrate a series of Lambda functions that:
- Fetch blog posts from the WordPress REST API
- Extract and clean HTML content
- Parse and structure workouts by day
- Segment workouts into individual exercises
- Store structured data in DynamoDB and S3
The pipeline includes idempotency mechanisms to prevent duplicate processing and ensure data consistency.
The pipeline is built using:
- AWS Lambda - Serverless compute for data processing
- AWS Step Functions - Orchestrates the ETL workflow
- AWS S3 - Stores raw posts and processed JSON data
- AWS DynamoDB - Stores structured workout sessions
- AWS Secrets Manager - Securely stores WordPress API credentials
- Serverless Framework - Infrastructure as code
graph TB
WP[WordPress API]
EB[EventBridge Rule]
SF[Step Functions]
L1[get_invictus_post]
L2[dump_post_to_bucket]
L3[strip_post_html]
L4[group_post_by_day]
L5[segment_days]
L6[sessions_to_date_records]
L7[clean_session_records]
L8[save_sessions_to_bucket]
S3[S3 Bucket]
DDB[(DynamoDB)]
SNS[SNS]
Phone[Phone]
EB -->|Triggers| SF
SF -->|Get Posts| L1
L1 -->|API Call| WP
WP -->|Returns| L1
L1 -->|Posts| SF
SF -->|For Each Post| L2
L2 -->|Save Raw| S3
L2 -->|Post Data| L3
L3 -->|Plain Text| L4
L4 -->|Grouped| L5
L5 -->|Segmented| L6
L6 -->|Date Records| L7
L7 -->|Cleaned| SF
SF -->|Parallel| DDB
SF -->|Parallel| L8
L8 -->|Save Weekly| S3
SF -->|Complete| SNS
SNS -->|SMS| Phone
GetPost → DumpPostToStagingBucket → StripPostHTML → GroupByDay →
GetDaySegments → SessionsToDateRecordsJSON → CleanSessionRecords →
PersistRecords (DynamoDB + S3)
The codebase follows a service layer architecture pattern:
- Services (services/): Abstraction layer for AWS services and external APIs
  - s3_service.py: S3 operations (put, get, check existence)
  - dynamodb_service.py: DynamoDB operations
  - idempotency_service.py: Idempotency tracking logic
  - invictus_api_service.py: External WordPress API calls
- Handlers (handler.py): Thin Lambda handler functions that orchestrate services
- Transforms (transforms.py): Pure transformation functions for data processing
- Utils (utils/): Shared utilities (decorators, exceptions, validators)
- Config (config.py): Type-safe environment variable validation
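As a rough illustration of the Config layer, environment validation could look like the sketch below. The `Settings` dataclass, `load_settings` function, and `ConfigError` exception are hypothetical names and are not taken from config.py; only the variable names and defaults come from the environment variable table in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


class ConfigError(ValueError):
    """Raised when a required environment variable is missing or invalid."""


@dataclass(frozen=True)
class Settings:
    bucket: str
    api_url: str
    category_id: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build a Settings object from environment variables, failing fast on bad input."""
    env = os.environ if env is None else env

    api_url = env.get("INVICTUS_WEIGHTLIFTING_API", "").strip()
    if not api_url:
        raise ConfigError("INVICTUS_WEIGHTLIFTING_API is required but not set")

    raw_cat = env.get("INVICTUS_WEIGHTLIFTING_API_CAT_ID", "213")
    try:
        category_id = int(raw_cat)
    except ValueError:
        raise ConfigError(
            f"INVICTUS_WEIGHTLIFTING_API_CAT_ID must be an integer, got {raw_cat!r}"
        ) from None

    return Settings(
        bucket=env.get("INVICTUS_BUCKET", "invictus-test-213"),
        api_url=api_url,
        category_id=category_id,
    )
```

Passing a plain dict instead of reading `os.environ` directly keeps a loader like this easy to unit test.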
- Type Safety: Comprehensive type hints throughout the codebase
- Error Handling: Custom exception classes and standardized error responses
- Idempotency: Built-in idempotency checks for all write operations
- Logging: Structured logging with correlation IDs for request tracking
- Testing: Unit tests for service layer with mock fixtures
- Configuration: Environment variable validation with clear error messages
- Client Initialization: Boto3 clients initialized per invocation (not at module level)
- Function Configuration: Timeout, memory, DLQ, and X-Ray tracing configured
- Python Runtime: Upgraded to Python 3.11 for better performance
- Handler Decorators: Standardized error handling and response formatting
- Service Abstraction: AWS service calls abstracted into service layer
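The handler decorator and correlation ID features listed above might be combined along the lines of this minimal sketch. The decorator name `lambda_handler`, the `PipelineError` class, and the response shape are assumptions for illustration; the project's actual decorators in utils/ may differ.

```python
import functools
import logging
import uuid
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

Handler = Callable[[Dict[str, Any], Any], Any]


class PipelineError(Exception):
    """Hypothetical base class for errors the pipeline raises on purpose."""


def lambda_handler(func: Handler) -> Handler:
    """Wrap a handler so success and failure share one response shape."""

    @functools.wraps(func)
    def wrapper(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # Reuse an upstream correlation ID if present, otherwise create one.
        correlation_id = event.get("correlation_id") or str(uuid.uuid4())
        try:
            body = func(event, context)
            return {"statusCode": 200, "correlation_id": correlation_id, "body": body}
        except PipelineError as exc:
            logger.warning("handled error [%s]: %s", correlation_id, exc)
            return {"statusCode": 400, "correlation_id": correlation_id, "error": str(exc)}
        except Exception:
            logger.exception("unhandled error [%s]", correlation_id)
            return {"statusCode": 500, "correlation_id": correlation_id,
                    "error": "internal error"}

    return wrapper


@lambda_handler
def strip_post_html(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Toy handler body; the real function does more than measure length."""
    if "content" not in event:
        raise PipelineError("event is missing 'content'")
    return {"length": len(event["content"])}
```

Whether a handler should return an error payload, as here, or re-raise so that Step Functions can apply its own retry rules is a design choice this sketch does not settle.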
- Python 3.11
- Node.js (for Serverless Framework)
- AWS CLI configured with appropriate credentials
- Serverless Framework CLI
- uv package manager (recommended) or pip
git clone <repository-url>
cd weightlifting-WOD-ETL
Using uv (recommended):
# Create virtual environment
python3.9 -m venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Or using pip:
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
npm install
Create a .env file in the project root:
INVICTUS_USER=your_wordpress_username
INVICTUS_PASS=your_wordpress_password
Alternatively, use Make commands for a streamlined setup:
# Complete setup (creates venv and installs all dependencies)
make setup
# Or step by step
make venv
make install
The project includes a Makefile with convenient commands for common tasks. Run make help to see all available commands.
make setup # Complete project setup (venv + dependencies)
make install # Install all dependencies (Python + Node)
make install-python # Install Python dependencies only
make install-node # Install Node.js dependencies only
make venv # Create Python virtual environment
make test # Run all tests
make test-infra # Run infrastructure tests only
make test-cov # Run tests with coverage report
make test-verbose # Run tests with verbose output
make plan # Preview deployment changes (dry-run, like terraform plan)
make plan-prod # Preview production deployment changes
make deploy # Deploy to dev stage
make deploy-prod # Deploy to prod stage
make deploy-function FUNC=get_invictus_post # Deploy single function
# Invoke function locally
make invoke FUNC=get_invictus_post EVENT=test_events/get_invictus_post.json
# View logs
make logs FUNC=get_invictus_post
make logs-tail FUNC=get_invictus_post
make lint # Run linting checks
make format # Format code with autopep8
make clean # Remove build artifacts and caches
make remove # Remove virtual environment
The main configuration is in serverless.yml with modular configuration files in serverless/. Key settings:
- Runtime: Python 3.9
- Region: Configurable via AWS_REGION environment variable (default: us-east-1)
- Stage: dev (configurable via --stage flag or STAGE environment variable)
- S3 Bucket: Configurable via INVICTUS_BUCKET environment variable
- Python Requirements: Uses requirements-prod.txt for Lambda packaging
- Docker: Uses Docker for non-Linux pip installations (via serverless-python-requirements plugin)
Using Make (recommended):
# Run all tests
make test
# Run infrastructure tests only
make test-infra
# Run with coverage
make test-cov
Using direct commands:
# Activate virtual environment
source .venv/bin/activate
# Run all tests
pytest
# Run infrastructure tests only
pytest -m infrastructure
# Run with coverage
pytest --cov
Using Make (recommended):
# Preview deployment changes (dry-run, similar to terraform plan)
make plan
# Preview production deployment changes
make plan-prod
# Deploy to dev stage
make deploy
# Deploy to prod stage
make deploy-prod
# Deploy a single function
make deploy-function FUNC=get_invictus_post
Using direct commands:
# Preview deployment (package without deploying)
npx serverless package --stage dev
# View compiled configuration
npx serverless print --stage dev
# Deploy the entire stack
npx serverless deploy
# Deploy to a specific stage
npx serverless deploy --stage prod
# Deploy a single function
npx serverless deploy function -f get_invictus_post
Using Make (recommended):
make invoke FUNC=get_invictus_post EVENT=test_events/get_invictus_post.json
Using direct commands:
npx serverless invoke local -f get_invictus_post --path test_events/get_invictus_post.json
Using Make (recommended):
# View logs
make logs FUNC=get_invictus_post
# Tail logs
make logs-tail FUNC=get_invictus_post
Using direct commands:
# View logs
npx serverless logs -f get_invictus_post
# Tail logs
npx serverless logs -f get_invictus_post --tail
The project includes comprehensive tests using pytest and moto for AWS service mocking.
- tests/test_infrastructure.py - Tests for DynamoDB tables, Secrets Manager configuration, and TTL settings
- tests/test_idempotency.py - Unit tests for idempotency key generation, DynamoDB checks, and S3 idempotency
- tests/test_idempotency_integration.py - Integration tests for idempotency behavior in Lambda functions
Using Make (recommended):
# Run all tests
make test
# Run with verbose output
make test-verbose
# Run infrastructure tests only
make test-infra
# Run with coverage report
make test-cov
Using direct commands:
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_infrastructure.py
# Run with coverage report
pytest --cov --cov-report=html
.
├── handler.py # Main Lambda handler functions (includes idempotency logic)
├── transforms.py # Data transformation functions
├── serverless.yml # Serverless Framework main configuration
├── serverless/ # Modular Serverless configuration
│ ├── environment.yml # Environment variables
│ ├── functions.yml # Lambda function definitions
│ ├── iam.yml # IAM role statements
│ └── resources.yml # AWS resources (DynamoDB, Secrets Manager, etc.)
├── SemiStructureInvictusPost_stateMachine.yml # Step Functions state machine definition
├── requirements.txt # Python dependencies
├── requirements-prod.txt # Production Python dependencies (used for Lambda packaging)
├── package.json # Node.js dependencies
├── pytest.ini # Pytest configuration
├── Makefile # Make commands for common tasks
├── tests/ # Test files
│ ├── __init__.py
│ ├── test_infrastructure.py # Infrastructure tests
│ ├── test_idempotency.py # Idempotency unit tests
│ └── test_idempotency_integration.py # Idempotency integration tests
└── test_events/ # Sample event data for testing
    ├── get_invictus_post.json
    ├── clean_session_records.json
    ├── group_post_by_day.json
    ├── segment_days.json
    ├── segmented_sessions.json
    ├── save_sessions_to_bucket.json
    ├── test_idempotency.json
    └── weekly/ # Weekly session examples
        └── 2021-01-03__2021-01-08--5-day-weightlifting-program.json
- get_invictus_post - Fetches blog posts from WordPress API
- dump_post_to_bucket - Saves raw posts to S3 (with idempotency checks)
- strip_post_html - Removes HTML markup from post content
- group_post_by_day - Groups content by workout day
- segment_days - Segments workouts into exercise components
- sessions_to_date_records - Converts sessions to date-based records
- clean_session_records - Cleans and normalizes session data
- save_sessions_to_bucket - Saves processed sessions to S3 (with idempotency checks)
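To make the HTML stripping and day grouping steps more concrete, here is a small standard-library sketch. It assumes that each workout day starts with a line beginning with a weekday name, which is a guess about the post layout; the helper names `strip_html` and `group_by_day` are hypothetical and do not mirror the code in transforms.py.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumed layout: a day section starts with a line that begins with a weekday name.
DAY_HEADING = re.compile(
    r"^(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b", re.IGNORECASE
)


class _TextExtractor(HTMLParser):
    """Collects text nodes and turns block-level tags into line breaks."""

    BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3", "h4"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html: str) -> List[str]:
    """Return the non-empty text lines of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return [line.strip() for line in text.splitlines() if line.strip()]


def group_by_day(lines: List[str]) -> Dict[str, List[str]]:
    """Group lines under the most recent weekday heading; earlier lines are dropped."""
    days: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        match = DAY_HEADING.match(line)
        if match:
            current = match.group(1).capitalize()
            days[current] = []
        elif current is not None:
            days[current].append(line)
    return days
```

Both functions are pure, so they can be exercised with the sample payloads in test_events/ without any AWS access.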
- WorkoutPostsTable - Stores structured workout sessions
  - Partition Key: date (String)
  - Sort Key: session (String)
- IdempotencyTable - Prevents duplicate processing
  - Partition Key: idempotency_key (String)
  - TTL enabled on ttl attribute (default: 24 hours)
  - Automatically expires records to prevent table growth
- WordPressCredentialsSecret - Stores WordPress API credentials
- SemiStructureInvictusPostStateMachine - Orchestrates the ETL workflow
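The date and session key schema above suggests records shaped roughly like the output of the sketch below. The input structure (a mapping of weekday name to session name to text), the `workout` attribute, and the rule of resolving each weekday to the first matching date on or after an anchor date are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def next_weekday(anchor: date, day_name: str) -> date:
    """Return the first date on or after `anchor` that falls on `day_name`."""
    offset = (WEEKDAYS.index(day_name) - anchor.weekday()) % 7
    return anchor + timedelta(days=offset)


def to_date_records(anchor: date, days: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Flatten {day: {session: text}} into rows keyed by date and session."""
    records = []
    for day_name, sessions in days.items():
        iso_date = next_weekday(anchor, day_name).isoformat()
        for session_name, text in sessions.items():
            # `date` and `session` match the table keys; `workout` is hypothetical.
            records.append({"date": iso_date, "session": session_name, "workout": text})
    return sorted(records, key=lambda r: (r["date"], r["session"]))
```

With an anchor of 2021-01-03 (a Sunday), a Monday session resolves to 2021-01-04 and a Friday session to 2021-01-08, which lines up with the date range in the sample file name under test_events/weekly/.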
| Variable | Description | Default |
|---|---|---|
| INVICTUS_BUCKET | S3 bucket for storing data | invictus-test-213 |
| INVICTUS_WEIGHTLIFTING_API | WordPress API endpoint URL | Required |
| INVICTUS_WEIGHTLIFTING_API_CAT_ID | WordPress category ID | 213 |
| INVICTUS_USER | WordPress API username | From .env or Secrets Manager |
| INVICTUS_PASS | WordPress API password | From .env or Secrets Manager |
| DYNAMODB_TABLE | DynamoDB table name | Auto-generated |
| IDEMPOTENCY_TABLE | Idempotency table name | Auto-generated |
| AWS_REGION | AWS region for deployment | us-east-1 |
| AWS_PROFILE | AWS CLI profile for deployment | serverless-invictus-agent |
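For orientation, the endpoint and category ID settings could be combined into a request URL as in this sketch. It assumes INVICTUS_WEIGHTLIFTING_API points at a WordPress REST API posts collection; `categories`, `per_page`, and `page` are standard query parameters of that endpoint, while the function name `build_posts_url` and the example host are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_posts_url(api_url: str, category_id: int, per_page: int = 10, page: int = 1) -> str:
    """Append category and paging filters to a WordPress posts endpoint."""
    parts = urlsplit(api_url)
    query = urlencode({"categories": category_id, "per_page": per_page, "page": page})
    # Preserve any query string already present on the configured endpoint.
    merged = f"{parts.query}&{query}" if parts.query else query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, merged, parts.fragment))


# Example with a placeholder host (not the real blog URL):
url = build_posts_url("https://example.com/wp-json/wp/v2/posts", 213)
```

Authentication with INVICTUS_USER and INVICTUS_PASS is left out here because this README does not state which scheme the API expects.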
The pipeline implements idempotency to prevent duplicate processing and ensure data consistency:
- DynamoDB-based idempotency: Uses an IdempotencyTable to track completed operations
- S3-based idempotency: Checks for existing objects before writing
- Fail-open design: If an idempotency check itself errors, the operation proceeds anyway, so infrastructure issues in the tracking layer do not block the pipeline
- TTL-based expiration: Idempotency records expire after 24 hours to prevent table growth
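The fail-open and TTL behavior described in this list can be sketched as follows. The lookup and write steps are passed in as plain callables so the example runs without AWS; `run_once`, `build_record`, and their parameters are hypothetical names, and the real idempotency_service.py may be structured differently. The `ttl` value is written as epoch seconds, which is the format DynamoDB TTL expects.

```python
import logging
import time
from typing import Callable, Dict, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")
TTL_SECONDS = 24 * 60 * 60  # 24-hour default stated in this README


def build_record(key: str, now: Optional[float] = None) -> Dict[str, object]:
    """Item for the idempotency table; `ttl` is an epoch-seconds expiry time."""
    now = time.time() if now is None else now
    return {"idempotency_key": key, "ttl": int(now) + TTL_SECONDS}


def run_once(
    key: str,
    already_done: Callable[[str], bool],
    mark_done: Callable[[Dict[str, object]], None],
    operation: Callable[[], T],
) -> Optional[T]:
    """Run `operation` unless `key` is recorded; tracking errors never block it."""
    try:
        if already_done(key):
            return None  # duplicate: skip the write
    except Exception:
        # Fail open: a broken check must not stop the pipeline.
        logger.warning("idempotency check failed for %s; proceeding", key)

    result = operation()

    try:
        mark_done(build_record(key))
    except Exception:
        logger.warning("could not record idempotency key %s", key)
    return result
```

The trade-off of failing open is that a write may run twice while the tracking table is unreachable, so the write itself should tolerate repetition.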
Idempotency keys are generated using SHA256 hashes of operation names and unique identifiers (e.g., S3 paths, post slugs).
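A minimal sketch of that key derivation is shown below. The function name `make_idempotency_key` and the colon separator between the two parts are assumptions; this README does not specify how the operation name and identifier are concatenated before hashing.

```python
import hashlib


def make_idempotency_key(operation: str, identifier: str) -> str:
    """Hash an operation name and a unique identifier into a fixed-length key."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing to the same key.
    payload = f"{operation}:{identifier}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Example with a made-up S3 path:
key = make_idempotency_key("dump_post_to_bucket", "raw/some-post-slug.json")
```

The result is a 64-character hexadecimal string, so it fits a String partition key such as idempotency_key regardless of how long the original path or slug is.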
The project uses:
- autopep8 for code formatting
- pycodestyle for linting
Using Make:
# Format code
make format
# Run linting
make lint
Using direct commands:
# Format code
autopep8 --in-place --aggressive --aggressive --max-line-length=120 handler.py transforms.py tests/
# Run linting
pycodestyle --max-line-length=120 handler.py transforms.py tests/
- Add the function handler to handler.py or transforms.py
- Define the function in serverless/functions.yml
- Update the Step Functions definition in SemiStructureInvictusPost_stateMachine.yml if needed
- Add tests in tests/
- Consider adding idempotency checks for functions that write data
Import errors with moto:
- Ensure you're using Python 3.9
- Reinstall dependencies: uv pip install -r requirements.txt or make install-python
AWS credential errors:
- Verify AWS CLI is configured: aws configure
- Check Serverless profile: serverless.yml uses serverless-invictus-agent profile by default
- Set AWS_PROFILE environment variable if using a different profile
Deployment failures:
- Check IAM permissions for the deployment role
- Verify all environment variables are set in .env or AWS Secrets Manager
- Check CloudFormation stack events in AWS Console
- Ensure S3 bucket exists and is accessible
Idempotency issues:
- Verify IDEMPOTENCY_TABLE environment variable is set
- Check DynamoDB table exists and has correct schema
- Review CloudWatch logs for idempotency warnings