Book Recommendation Data Pipeline

This project is a task/homework assignment from Datascentics company that implements a data processing pipeline for analyzing book recommendation data using PySpark and Python.

Overview

The pipeline processes book recommendation data following a medallion architecture (Bronze → Silver → Gold) and generates visualizations to analyze the most popular books, authors, user locations, and age demographics.

Features

ETL Pipeline: Implements a complete Extract, Transform, Load process using PySpark
Data Quality: Cleans and filters data to ensure quality
Analytics: Aggregates data to find top books, authors, and user demographics
Visualizations: Creates interactive charts showing:
- Top most popular books (by number of ratings)
- Top most popular authors
- Geographic distribution of users who rated books
- Age distribution of users who rated books

Data Architecture

Bronze Layer (Raw Data)

Books.csv: Book information including ISBN, title, author
Ratings.csv: User ratings for books
Users.csv: User demographic information

Source of the input CSV files (DataFrames): Kaggle Book Recommendation Dataset

Silver Layer (Cleaned Data)

Removes null values from critical fields
Filters ratings to valid range (0-10)
Cleanses user data

Gold Layer (Aggregated Data)

Combines books and ratings data
Calculates rating counts and average ratings per book
Ready for analytics and visualization

Installation

Clone or download this repository
Install dependencies:
```
pip install -r requirements.txt
```
Ensure the data files are in the bronze/ directory

Usage

Run the complete pipeline:

python main.py

The code has been formatted and checked using Ruff

Main class and it's methods

`BooksPipeline`

Main class that orchestrates the entire data processing pipeline.

Core Methods:

load_bronze(): Load raw data from CSV files
transform_silver(): Clean and filter data
aggregate_gold(): Create aggregated analytics data
get_top_books(): Retrieve most popular books
get_top_authors(): Retrieve most popular authors
get_top_locations_of_users_who_rated_books(): Get user location analytics
get_top_age_of_users_who_rated_books(): Get user age analytics

Visualization Methods:

show_top_books_graph(): Display popular books chart
show_top_authors_graph(): Display popular authors chart
show_top_locations_of_users_graph(): Display user location distribution
show_top_ages_of_users_graph(): Display user age distribution

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bronze		bronze
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Book Recommendation Data Pipeline

Overview

Features

Data Architecture

Bronze Layer (Raw Data)

Silver Layer (Cleaned Data)

Gold Layer (Aggregated Data)

Installation

Usage

Main class and it's methods

`BooksPipeline`

Core Methods:

Visualization Methods:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Book Recommendation Data Pipeline

Overview

Features

Data Architecture

Bronze Layer (Raw Data)

Silver Layer (Cleaned Data)

Gold Layer (Aggregated Data)

Installation

Usage

Main class and it's methods

BooksPipeline

Core Methods:

Visualization Methods:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`BooksPipeline`

Packages