This project builds a machine learning pipeline to predict the quality (OK / KO) of Pastel de Nata products for the Nata Visionaries brotherhood. The model will classify each production based on recipe and process data, saving pastries from being destroyed.
The project will consist of 5 Jupyter notebooks as required:
- Load `learn.csv`
- Understand data structure & feature types
- Summary statistics, distributions, correlations
- Initial observations & potential data issues
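
The loading and summary steps above could be sketched roughly as follows. This is only an illustration: the toy frame stands in for `pd.read_csv("learn.csv")`, and every column name except `origin` (including the `quality` target and the city values) is a made-up assumption, not the real schema.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts an exploration notebook would report."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                            # rows, columns
        "dtypes": df.dtypes.astype(str).to_dict(),    # feature types
        "missing": df.isna().sum().to_dict(),         # missing values per column
        "describe": numeric.describe(),               # summary statistics
        "corr": numeric.corr(),                       # pairwise correlations
    }


# Hypothetical stand-in for the real file; only `origin` is a known column name.
demo = pd.DataFrame({
    "origin": ["Lisboa", "Porto", "Lisboa", None],
    "sugar": [10.0, 12.5, None, 11.0],
    "oven_temp": [250, 260, 255, 248],
    "quality": ["OK", "KO", "OK", "OK"],
})

report = summarize(demo)
print(report["shape"])
print(report["missing"])
```
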
- Handle missing values / outliers
- Encode categorical features (`origin`)
- Scale / normalize numerical values if needed
- Save processed dataset (if needed for later notebooks)
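
One possible way to bundle the imputation, encoding, and scaling steps above is a scikit-learn `ColumnTransformer`. The sketch below assumes median / most-frequent imputation and one-hot encoding, which are choices made for illustration rather than decisions taken from the project; the numeric column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute + scale numeric columns, impute + one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros instead of failing.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


# Hypothetical toy data; only `origin` is a column name taken from this README.
demo = pd.DataFrame({
    "origin": ["Lisboa", "Porto", "Lisboa", np.nan],
    "sugar": [10.0, 12.5, np.nan, 11.0],
    "oven_temp": [250.0, 260.0, 255.0, 248.0],
})

pre = build_preprocessor(["sugar", "oven_temp"], ["origin"])
X = pre.fit_transform(demo)
print(X.shape)
```

Keeping the steps in one fitted object makes it easier to reapply exactly the same transformation to `predict.csv` later.
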
- Engineer new relevant features
- Drop irrelevant / redundant variables
- Justify transformations
- Export cleaned + engineered dataset
- Train/test split
- Experiment with ML classifiers (e.g. RF, XGBoost, etc.)
- Compare accuracy
- Optional hyperparameter search
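
The modelling steps above might look something like this minimal sketch. It uses synthetic data from `make_classification` in place of the engineered dataset, and logistic regression plus a random forest as stand-in candidates; XGBoost or other classifiers could be added to the `candidates` dict in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed features and the OK / KO target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stratified split keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each candidate and record its held-out accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)

# Optional: a very small grid search, scored on accuracy with 3-fold CV.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)
```
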
- Load raw data directly from files (`learn.csv`, `predict.csv`)
- Rebuild best data prep + feature steps
- Load best model configuration
- Train and export Kaggle `sampred.csv`-style predictions
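
For the export step, one option is to copy the column names from the sample submission so the output layout cannot drift. The sketch below assumes, purely for illustration, that `sampred.csv` has an identifier column followed by a prediction column; the `id` / `target` names and the toy values are hypothetical and should be checked against the real file.

```python
import pandas as pd


def make_submission(template: pd.DataFrame, ids, labels) -> pd.DataFrame:
    """Build a frame with the same first two column names as the template."""
    # Assumption: first column is the row identifier, second is the prediction.
    id_col, target_col = template.columns[:2]
    return pd.DataFrame({id_col: list(ids), target_col: list(labels)})


# Hypothetical stand-in for pd.read_csv("sampred.csv").
template = pd.DataFrame({"id": [1, 2], "target": ["OK", "KO"]})

submission = make_submission(
    template, ids=[101, 102, 103], labels=["OK", "OK", "KO"]
)

# With no path given, to_csv returns the CSV text; pass a filename to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```
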
- `learn.csv` → full training dataset including target
- `predict.csv` → same features, no target, must predict
- `sampred.csv` → example submission format
- Output must match `sampred.csv` format
- Submit early and often -> leaderboard feedback required
- Kaggle closes: 17 Dec
- Final notebook submission: 20 Dec (follow naming rules!)
- Markdown explanations in every notebook
- No data leakage
- NB1 and NB9 must run standalone from raw files
- Use accuracy as main metric
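
A common way to respect the no-leakage rule above is to put preprocessing inside a scikit-learn pipeline, so that scalers and encoders are refit on the training folds only during cross-validation. The following is a minimal sketch on synthetic data, with `StandardScaler` and logistic regression chosen only as placeholders for the real steps.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and target.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# The scaler is part of the estimator, so each CV fold fits it on its own
# training portion and never sees the held-out rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Fitting the scaler on the full dataset before splitting would let statistics from the validation rows influence training, which is the kind of leakage the rule is meant to prevent.
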
This README is a quick roadmap to keep the project organized and aligned with the grading rules.