This project is a homework assignment from Datascentics that implements a data processing pipeline for analyzing book recommendation data with PySpark and Python.
The pipeline processes book recommendation data following a medallion architecture (Bronze → Silver → Gold) and generates visualizations to analyze the most popular books, authors, user locations, and age demographics.
- ETL Pipeline: Implements a complete Extract, Transform, Load process using PySpark
- Data Quality: Removes null values from critical fields, filters ratings to the valid range, and cleanses user data
- Analytics: Aggregates data to find top books, authors, and user demographics
- Visualizations: Creates interactive charts showing:
  - Most popular books (by number of ratings)
  - Most popular authors
  - Geographic distribution of users who rated books
  - Age distribution of users who rated books
- Books.csv: Book information including ISBN, title, author
- Ratings.csv: User ratings for books
- Users.csv: User demographic information
Source of the input CSV files (DataFrames): Kaggle Book Recommendation Dataset
Silver layer:
- Removes null values from critical fields
- Filters ratings to the valid range (0-10)
- Cleanses user data

Gold layer:
- Combines books and ratings data
- Calculates rating counts and average ratings per book
- Produces data that is ready for analytics and visualization
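The cleaning and aggregation steps above can be sketched in plain Python. This is only an illustration of the logic, not the project's PySpark code: it uses in-memory lists of dicts so that it runs without a Spark session, and the field names (`isbn`, `user_id`, `rating`, `title`) and the function names `clean_ratings` and `aggregate_by_book` are hypothetical.

```python
from collections import defaultdict

# Valid rating range, as stated in the cleaning rules above.
RATING_MIN, RATING_MAX = 0, 10


def clean_ratings(ratings):
    """Cleaning step: drop rows with missing key fields or out-of-range ratings."""
    cleaned = []
    for row in ratings:
        isbn, user_id, rating = row.get("isbn"), row.get("user_id"), row.get("rating")
        if isbn is None or user_id is None or rating is None:
            continue  # null in a critical field
        if not RATING_MIN <= rating <= RATING_MAX:
            continue  # outside the valid 0-10 range
        cleaned.append(row)
    return cleaned


def aggregate_by_book(books, ratings):
    """Aggregation step: join ratings to books on ISBN, then count and average."""
    titles = {b["isbn"]: b["title"] for b in books}
    totals = defaultdict(lambda: [0, 0])  # isbn -> [rating_count, rating_sum]
    for r in ratings:
        if r["isbn"] in titles:  # inner join: skip ratings for unknown books
            totals[r["isbn"]][0] += 1
            totals[r["isbn"]][1] += r["rating"]
    return {
        isbn: {"title": titles[isbn], "rating_count": n, "avg_rating": s / n}
        for isbn, (n, s) in totals.items()
    }


# Tiny made-up sample (not taken from the dataset).
books = [{"isbn": "111", "title": "Book A"}, {"isbn": "222", "title": "Book B"}]
ratings = [
    {"isbn": "111", "user_id": 1, "rating": 8},
    {"isbn": "111", "user_id": 2, "rating": 6},
    {"isbn": "222", "user_id": 1, "rating": 11},    # dropped: out of range
    {"isbn": "222", "user_id": None, "rating": 5},  # dropped: null user
    {"isbn": "999", "user_id": 3, "rating": 7},     # dropped by the join
]
gold = aggregate_by_book(books, clean_ratings(ratings))
```

In PySpark the same idea would typically be expressed with a filter, a join on the ISBN column, and a grouped aggregation; whether the project uses an inner join as assumed here is not stated in this README.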
- Clone or download this repository
- Install dependencies:
pip install -r requirements.txt
- Ensure the data files are in the bronze/ directory
Run the complete pipeline:
python main.py

The code has been formatted and checked using Ruff.
Main class that orchestrates the entire data processing pipeline.
- load_bronze(): Load raw data from CSV files
- transform_silver(): Clean and filter data
- aggregate_gold(): Create aggregated analytics data
- get_top_books(): Retrieve most popular books
- get_top_authors(): Retrieve most popular authors
- get_top_locations_of_users_who_rated_books(): Get user location analytics
- get_top_age_of_users_who_rated_books(): Get user age analytics
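As a rough illustration of what the user location and age analytics might compute, the sketch below counts each user who rated at least one book once per attribute value. It is plain Python rather than PySpark, and the helper `top_user_attribute`, the field names, and the choice to count distinct users (instead of individual ratings) are all assumptions, not details confirmed by this README.

```python
from collections import Counter


def top_user_attribute(ratings, users, attribute, n=10):
    """Count distinct rating users per attribute value; return the n most common."""
    raters = {r["user_id"] for r in ratings}  # each user is counted once
    counts = Counter(
        u[attribute]
        for u in users
        if u["user_id"] in raters and u.get(attribute) is not None
    )
    return counts.most_common(n)


# Made-up sample data for illustration only.
users = [
    {"user_id": 1, "location": "Prague", "age": 30},
    {"user_id": 2, "location": "Brno", "age": 30},
    {"user_id": 3, "location": "Prague", "age": None},  # age missing, skipped for age
    {"user_id": 4, "location": "Prague", "age": 41},    # never rated a book
]
ratings = [
    {"user_id": 1, "isbn": "111"},
    {"user_id": 1, "isbn": "222"},
    {"user_id": 2, "isbn": "111"},
    {"user_id": 3, "isbn": "111"},
]

top_locations = top_user_attribute(ratings, users, "location", n=2)
top_ages = top_user_attribute(ratings, users, "age", n=2)
```

Counting distinct users keeps one very active reader from dominating the location or age distribution; counting ratings instead would answer a different question.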
- show_top_books_graph(): Display popular books chart
- show_top_authors_graph(): Display popular authors chart
- show_top_locations_of_users_graph(): Display user location distribution
- show_top_ages_of_users_graph(): Display user age distribution