This repository contains a comprehensive data analysis project exploring the history of the Formula 1 World Championship (1950-2024) using NoSQL document-oriented databases. Developed for the SMBUD course at Politecnico di Milano, the project leverages MongoDB's advanced aggregation framework to extract meaningful insights from large, interconnected racing datasets.
The primary goal of this project is to demonstrate the flexibility of document databases for analyzing complex, real-world data. Instead of a traditional relational model, the project uses a NoSQL approach to query historical F1 data and uncover driver performance patterns, race trends, and circuit statistics across seven decades of motorsport.
The analysis is built upon the publicly available Kaggle F1 dataset, which was preprocessed and imported into MongoDB. The database is structured into four primary collections:
- 🏁 results: Contains individual race entries, including finishing positions, grid spots, points scored, and fastest lap data.
- 🏎️ drivers: Biographical and career information for every driver in F1 history.
- 🌍 circuits: Geographic and technical details regarding the race tracks.
- 📅 races: Event-specific data, including season calendars, rounds, and historical dates.
Note: While a fully embedded subdocument structure was considered, separate collections were maintained and linked via $lookup to avoid excessive redundancy while preserving NoSQL flexibility.
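As a rough illustration of this linking approach, the sketch below joins each result with its driver and race. The field names (driverId, raceId, positionOrder, surname, year, name) are assumptions based on the Kaggle CSV headers and may differ after preprocessing; the pipeline name is hypothetical and this is not necessarily how the repository's own scripts are written.

```javascript
// Sketch only: field names are assumed to match the Kaggle CSV headers.
// Adjust them if the imported collections use different keys.
const resultsWithContext = [
  // Attach the matching driver document to every result.
  { $lookup: { from: "drivers", localField: "driverId", foreignField: "driverId", as: "driver" } },
  { $unwind: "$driver" },
  // Attach the race the result belongs to.
  { $lookup: { from: "races", localField: "raceId", foreignField: "raceId", as: "race" } },
  { $unwind: "$race" },
  // Keep a compact, readable shape.
  {
    $project: {
      _id: 0,
      season: "$race.year",
      grandPrix: "$race.name",
      driver: "$driver.surname",
      finish: "$positionOrder"
    }
  },
  { $limit: 5 }
];

// `db` only exists inside mongosh or the Compass shell, so the call is guarded.
if (typeof db !== "undefined") {
  printjson(db.results.aggregate(resultsWithContext).toArray());
}
```

Each $lookup returns an array, so $unwind is used to flatten the single matching driver and race into plain subdocuments before projecting.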
The project relies heavily on MongoDB aggregation pipelines for data transformation and analytics.
- Advanced Data Aggregation: Extensive use of operators such as $group, $match, $sort, and $project to filter and summarize historical trends.
- Cross-Collection Joins: Utilized $lookup to combine data across the results, drivers, races, and circuits collections, simulating relational joins in a document database.
- Analytical Queries:
  - Calculated the historical Average Driver Position across various seasons and cars.
  - Analyzed historical reliability by tracking Non-Finishers (DNFs) and identifying races/eras with the highest attrition rates.
  - Evaluated specific circuit characteristics and their impact on race outcomes.
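To make the attrition analysis more concrete, here is a minimal sketch of a per-season DNF rate. It assumes that retirements are marked with positionText equal to "R" and that races carries a year field, as in the Kaggle CSVs; the pipeline name is hypothetical and the repository's actual queries may define a DNF differently (for example via a status code). The $round operator requires MongoDB 4.2 or later.

```javascript
// Sketch only: assumes positionText === "R" marks a retirement and that
// results.raceId references races.raceId, as in the Kaggle CSV files.
const attritionBySeason = [
  // Bring in the race so each result knows its season.
  { $lookup: { from: "races", localField: "raceId", foreignField: "raceId", as: "race" } },
  { $unwind: "$race" },
  // Count all entries and the retirements per season.
  {
    $group: {
      _id: "$race.year",
      entries: { $sum: 1 },
      dnfs: { $sum: { $cond: [{ $eq: ["$positionText", "R"] }, 1, 0] } }
    }
  },
  // Derive the share of entries that did not finish.
  {
    $project: {
      _id: 0,
      season: "$_id",
      entries: 1,
      dnfs: 1,
      dnfRate: { $round: [{ $divide: ["$dnfs", "$entries"] }, 3] }
    }
  },
  // Highest attrition first.
  { $sort: { dnfRate: -1 } },
  { $limit: 10 }
];

// `db` only exists inside mongosh or the Compass shell, so the call is guarded.
if (typeof db !== "undefined") {
  printjson(db.results.aggregate(attritionBySeason).toArray());
}
```

Grouping by race instead of season (using "$raceId" as the group key) would surface individual events with unusually high attrition.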
- MongoDB installed locally or via Atlas.
- MongoDB Compass (GUI for importing data and running queries).
- Clone the repository:
  git clone https://github.com/paolorv/F1_DataAnalysis.git
- Download the source dataset from Kaggle.
- Open MongoDB Compass, create a new database (e.g., f1_db), and import the .csv/.json files into their respective collections (results, drivers, circuits, races).
- Open the query scripts provided in this repository and run the aggregation pipelines in the Compass embedded shell or through a MongoDB driver.