A machine learning pipeline for predicting heart disease risk from clinical features using statistical modeling techniques.
This project focuses on data distribution analysis, preprocessing, and classification performance evaluation, combining exploratory analysis with supervised learning.
The objective is to analyze cardiovascular health indicators and build a predictive model that identifies the presence of heart disease.
The workflow includes:
✔️ Numerical feature distribution analysis
✔️ Data preprocessing & feature preparation
✔️ Logistic Regression modeling
✔️ Performance evaluation using confusion matrix
heart_disease_classification/
│
├── heart_disase_classification.ipynb
│
│
└── README.md
Understanding the distribution of medical features is critical before training predictive models.
Observed Patterns:
- Age and Max Heart Rate follow near-normal distributions.
- Cholesterol shows wider variance and potential outliers.
- Oldpeak is heavily right-skewed, so scaling or a transformation (for example, a log or power transform) may be worth considering before modeling.
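The patterns listed above can be checked numerically as well as visually. The following is a minimal sketch, not the notebook's actual code: it uses synthetic stand-in data, and the column names (`Age`, `MaxHR`, `Cholesterol`, `Oldpeak`) and distribution parameters are assumptions chosen only to mimic the shapes described. It reports sample skewness and the number of points outside the 1.5 × IQR fences for each numerical feature.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data. Column names and parameters are assumptions,
# not values taken from the project's dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Age": rng.normal(54, 9, n),
    "MaxHR": rng.normal(140, 25, n),
    "Cholesterol": rng.normal(200, 60, n),
    "Oldpeak": rng.exponential(1.0, n),  # right-skewed by construction
})

# Sample skewness: values near 0 suggest symmetry,
# large positive values indicate a right skew.
skewness = df.skew()

# Count points outside the 1.5 * IQR fences for each column.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
below = df.lt(q1 - 1.5 * iqr, axis="columns")
above = df.gt(q3 + 1.5 * iqr, axis="columns")
outlier_counts = (below | above).sum()

summary = pd.DataFrame({"skewness": skewness, "iqr_outliers": outlier_counts})
print(summary.round(2))
```

With the real dataset, replacing the synthetic `df` with the loaded DataFrame should give a quick table to compare against the histograms.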
The confusion matrix below shows the performance of the baseline classification model.
Interpretation:
- The model correctly identifies a large share of the positive (heart disease) cases.
- Some false positives and false negatives remain, suggesting room for improvement with advanced models.
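As a rough illustration of how such a confusion matrix can be produced and read, here is a minimal sketch using scikit-learn. It is not the notebook's code: the data comes from `make_classification` as a stand-in for the clinical features, and the split ratio and `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the clinical features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline logistic regression (max_iter raised to help convergence).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall: share of actual positives that were found.
# Precision: share of predicted positives that were correct.
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(cm)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In a screening context, false negatives (missed disease cases) are usually the costlier error, so recall is worth tracking alongside the raw matrix.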
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- Jupyter Notebook
git clone https://github.com/your-username/heart_disease_classification.git
cd heart_disease_classification
Install dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn
Run:
heart_disase_classification.ipynb
- Feature scaling experiments
- Hyperparameter tuning
- Tree-based models (Random Forest / XGBoost)
- ROC-AUC & Precision-Recall analysis
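Several of these items could be combined in one experiment. The sketch below is one possible starting point, not a planned implementation: it uses synthetic data, and the `C` grid, fold count, and scoring choice are assumptions. It puts feature scaling and logistic regression in a scikit-learn pipeline, tunes regularization strength with `GridSearchCV`, and reports ROC-AUC and average precision (a summary of the precision-recall curve) on a held-out split. Tree-based models are not covered here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart disease features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fit only on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# Predicted probability of the positive class on the held-out split.
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
avg_precision = average_precision_score(y_test, proba)

print("best params:", search.best_params_)
print(f"ROC-AUC={roc_auc:.3f} average precision={avg_precision:.3f}")
```

Keeping the scaler in the pipeline avoids leaking test-fold statistics into training during cross-validation.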
Arzu Selda Avcı
Computer Engineering, Final Year
Data Science & AI Enthusiast