- Lai Hoang Hiep - 22028295
- Hoang Duc Duong - 22028259
- Pham Mai Anh - 22028225
This repository showcases the collaborative efforts of our team in the Kaggle competition, focusing on predicting problematic internet usage among young individuals.
Kaggle Competition: Child Mind Institute — Problematic Internet Use
The main aim of this competition is to use the given training data to predict the Severity Impairment Index (SII).
The target variable is originally derived from the field PCIAT-PCIAT_Total. Our task is to train a model based on other features to predict SII effectively.
- Remove NaN columns.
- Fill missing values with a constant (using `SimpleImputer` from sklearn).
- KNN Imputation (`KNNImputer` from sklearn).
- Remove outliers.
- Feature engineering (e.g., pulse pressure, fat-to-muscle ratio).
- Correlation-based feature removal.
- Neural network encoder-decoder for time-series data.
- Tree-based algorithms: Random Forest, LightGBM, LightGBM Regressor, XGBoost, CatBoost.
- Optimization: Grid Search, Bayesian Optimization, K-Fold Cross-Validation, threshold rounder optimization.
- `confusion_matrix`, `classification_report` from `sklearn`.
- Quadratic Weighted Kappa (QWK).
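
The imputation and feature-engineering steps listed above could look roughly like the sketch below. It is a minimal illustration, not the notebooks' actual code: the column names (`systolic_bp`, `diastolic_bp`, `fat_mass`, `muscle_mass`), the toy values, and `n_neighbors=2` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy frame; the column names are hypothetical stand-ins for the competition fields.
df = pd.DataFrame({
    "systolic_bp": [120.0, np.nan, 110.0, 130.0],
    "diastolic_bp": [80.0, 70.0, np.nan, 85.0],
    "fat_mass": [10.0, 12.0, 9.0, np.nan],
    "muscle_mass": [25.0, 30.0, np.nan, 28.0],
})

# Strategy 1: replace every missing value with a constant sentinel (-1).
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
df_constant = pd.DataFrame(constant_imputer.fit_transform(df), columns=df.columns)

# Strategy 2: fill each gap with the mean of the nearest rows,
# where distance is measured on the features both rows have observed.
knn_imputer = KNNImputer(n_neighbors=2)  # n_neighbors is an assumed value
df_knn = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

# Engineered features computed on the imputed frame.
df_knn["pulse_pressure"] = df_knn["systolic_bp"] - df_knn["diastolic_bp"]
df_knn["fat_muscle_ratio"] = df_knn["fat_mass"] / df_knn["muscle_mass"]
```

Constant filling keeps "missingness" visible to tree models as a distinct value, while KNN imputation produces plausible values that make ratio-style features meaningful; which one works better is an empirical question.
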
We iterated through more than 40 model versions (46 in total), three of which stand out as milestones.
- Data: Drop unlabeled records, remove columns with >50% missing values, fill missing values with -1, process Parquet data by averaging columns and merging into CSV by ID, and split data into 80% training, 20% testing.
- Metric: Evaluate with `confusion_matrix`, `classification_report` from `sklearn`.
- Model: Random Forest with grid search for hyperparameter tuning.
- Result: The model is strongly biased toward class 0 and struggles with classes 1 and 2, so its predictions are imbalanced; accuracy and per-class scores leave room for improvement.
- Data: Replace missing values using the KNN algorithm. Use an adversarial neural network with an encoder-decoder structure.
- Metric: Evaluate with `confusion_matrix`, `classification_report` from `sklearn`.
- Model: Implement LightGBM. Use L1/L2 regularization and early stopping to mitigate overfitting.
- Result: Improved accuracy with better focus on minority classes (class 1). However, submission scores remain low, highlighting potential issues with feature usage or evaluation strategies.
- Data: Retain the previous data processing approach but engineer new features (e.g., pulse pressure, fat-to-muscle ratio) based on medical metrics to enhance feature informativeness.
- Metric: Evaluate with `confusion_matrix`, `classification_report` from `sklearn`, and Quadratic Weighted Kappa (QWK) as the optimization metric.
- Model: `LGBMRegressor` for ordinal classification, implement threshold rounding, apply K-Fold Cross-Validation for generalization, and optimize hyperparameters using Bayesian Optimization.
- Result: Submission scores improve significantly, achieving the highest private test score and a bronze medal rank (364). Model shows improved awareness of ordinal class relationships, reducing extreme misclassifications.
| Metric | Value |
|---|---|
| Weighted Kappa (test) | 0.5407461505690463 |
| Weighted Kappa (train) | 0.8167463778309185 |
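
Threshold rounding, as used with the regressor in the final milestone, turns continuous predictions into ordinal classes by choosing cut points that maximize QWK. The sketch below shows one common way to do this, assuming four classes (0 to 3), synthetic predictions in place of real out-of-fold model output, and a Nelder-Mead search; the helper names (`apply_thresholds`, `neg_qwk`, `fit_thresholds`) are hypothetical and not taken from the notebooks.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score


def apply_thresholds(preds, thresholds):
    """Map continuous predictions to classes 0..3 using three cut points."""
    # np.digitize needs increasing bins, so sort defensively.
    return np.digitize(preds, np.sort(thresholds))


def neg_qwk(thresholds, y_true, preds):
    """Negative Quadratic Weighted Kappa, so that a minimizer maximizes QWK."""
    labels = apply_thresholds(preds, thresholds)
    return -cohen_kappa_score(y_true, labels, weights="quadratic")


def fit_thresholds(y_true, preds, initial=(0.5, 1.5, 2.5)):
    """Search for cut points that maximize QWK, starting from plain rounding."""
    result = minimize(
        neg_qwk,
        x0=np.array(initial),
        args=(y_true, preds),
        method="Nelder-Mead",  # derivative-free: QWK is piecewise constant
    )
    return np.sort(result.x)


# Synthetic stand-in for out-of-fold regressor predictions (assumption).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=500)
preds = y_true + rng.normal(0.0, 0.6, size=500)

thresholds = fit_thresholds(y_true, preds)
baseline_qwk = -neg_qwk(np.array([0.5, 1.5, 2.5]), y_true, preds)
tuned_qwk = -neg_qwk(thresholds, y_true, preds)
```

Fitting the cut points on out-of-fold predictions rather than on training predictions is the safer choice, since the gap between train and test kappa in the table above suggests the model fits the training data considerably better than unseen data.
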
- FINAL:
  - `2nd-improvement.ipynb`: The final version submitted to the Kaggle competition.
- Other Versions:
  - `Baseline.ipynb`: Baseline version.
  - `1st-Improvement.ipynb`: 1st improvement version.
  - Other: Notable versions implemented with high scores but not submitted.
Our hard work and dedication are reflected in the results. After implementing 46 versions, we reached rank 364 with an official private score of 0.439, earning a bronze medal.
We are especially proud that all of our work was implemented independently. While we explored ideas from others for inspiration, we ensured a thorough understanding and executed everything ourselves, staying true to our commitment to originality and integrity.
