HR Analytics - Employee Job Change Prediction

Description

This project uses machine learning to predict whether an employee wants to change jobs, based on personal and professional information. Its distinguishing feature is that every ML algorithm is implemented from scratch in NumPy rather than with an existing framework such as scikit-learn, in order to understand in depth how the algorithms work.

Introduction

Problem description

Employee Job Change Prediction is a binary classification problem, where:

Input: Information about an employee (city, gender, experience, education level, etc.)
Output: A prediction of whether the employee wants to change jobs (0 = does not want to, 1 = wants to change jobs)

Motivation and practical applications

Reducing attrition: Helps a company identify employees at risk of leaving early enough to take retention measures
Optimizing recruitment costs: Saves the cost of recruiting and training new employees
Improving the work environment: Reveals the factors that influence the decision to leave
Workforce planning: Forecasts future hiring needs

Specific goals

Build the ML algorithms from scratch in NumPy:
- Logistic Regression with Gradient Descent
- K-Nearest Neighbors (KNN)
- Gaussian Naive Bayes
- Decision Tree
Implement the core components:
- Loss functions (Binary Cross-Entropy)
- Optimization algorithms (Gradient Descent)
- Evaluation metrics (Accuracy, Precision, Recall, F1-Score)
- Cross-validation
Achieve good performance on the HR Analytics dataset

Dataset

Data source

The dataset comes from the Kaggle competition HR Analytics: Job Change of Data Scientists, which simulates real-world data about employees in the information technology field.

Feature descriptions

The dataset contains the following features:

Feature	Description	Data type
`enrollee_id`	Unique ID of the candidate	Integer
`city`	City code	Categorical
`city_development_index`	City development index (0-1)	Float
`gender`	Gender	Categorical (Male/Female/Other)
`relevent_experience`	Whether the candidate has relevant experience	Categorical
`enrolled_university`	University enrollment	Categorical
`education_level`	Education level	Categorical
`major_discipline`	Major	Categorical
`experience`	Years of experience	Categorical
`company_size`	Company size	Categorical
`company_type`	Company type	Categorical
`last_new_job`	Years since the last new job	Categorical
`training_hours`	Training hours	Float
`target`	Whether the candidate wants to change jobs (0/1)	Binary

Dataset size and characteristics

Training set: 19,158 samples
Test set: 2,129 samples
Number of features: 12 (after preprocessing)
Class imbalance:
- Class 0 (does not want to change jobs): 75.07% (14,381 samples)
- Class 1 (wants to change jobs): 24.93% (4,777 samples)

Key characteristics:

Many missing values (NaN) that must be handled
Categorical features must be encoded as numbers
Pronounced class imbalance (roughly 3:1)

Method

Advanced NumPy techniques used

The project relies heavily on advanced NumPy techniques for performance:

Vectorization (no for loops for array operations)

✅ KNN.predict(): Fully vectorized, computes distances for all test samples at once
✅ KNN.predict_proba(): Fully vectorized, computes probabilities for all samples
✅ Logistic Regression: All operations are vectorized
✅ Gaussian Naive Bayes: Uses vectorized operations
✅ Metrics functions: Fully vectorized

np.einsum (Einstein summation)

✅ Logistic Regression:
- Forward pass: np.einsum('ij,j->i', X, weights)
- Gradients: np.einsum('ij,i->j', X, error)
✅ KNN Distance:
- Norm calculation: np.einsum('ij,ij->i', x, x)
- Dot product: np.einsum('ik,jk->ij', x1, x2)
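
As a quick check of what the four subscript strings above compute, the following sketch compares each `np.einsum` call with its more familiar equivalent. The arrays are random placeholders and the variable names are illustrative, not taken from the project code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # (n_samples, n_features)
Z = rng.normal(size=(4, 3))    # a second sample matrix with the same features
w = rng.normal(size=3)         # (n_features,)
err = rng.normal(size=5)       # (n_samples,)

forward = np.einsum('ij,j->i', X, w)         # matrix-vector product, same as X @ w
grad = np.einsum('ij,i->j', X, err)          # same as X.T @ err
sq_norms = np.einsum('ij,ij->i', X, X)       # row-wise squared norms, same as (X ** 2).sum(axis=1)
pair_dots = np.einsum('ik,jk->ij', X, Z)     # all pairwise dot products, same as X @ Z.T
```

Each subscript string names the axes of the inputs and the axes kept in the output; any index missing from the output is summed over.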

Broadcasting

✅ Automatic broadcasting in normalization: (X - mean) / std
✅ Broadcasting in distance calculations
✅ Broadcasting in loss functions

Fancy Indexing & Masking

✅ Boolean masking: X[y == c] in Naive Bayes
✅ Advanced indexing: y_train[k_indices] in KNN
✅ Conditional operations: (y_true == 1) & (y_pred == 1) in metrics

Array Manipulation

✅ reshape(): Reshapes arrays for broadcasting
✅ argsort(): Sorts along a given axis
✅ column_stack(): Stacks arrays as columns
✅ concatenate(): Joins arrays

Memory-Efficient Operations

✅ In-place operations: std[std == 0] = 1
✅ Memory optimization: Uses views instead of copies where possible
✅ Efficient distance calculation: Uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b instead of materializing every pairwise difference

Data processing pipeline

Data Exploration (01_data_exploration.ipynb):
- Analyze feature distributions
- Check for missing values
- Analyze correlations between features
- Visualize the data
Data Preprocessing (02_preprocessing.ipynb):
- Missing value handling:
  - Uses fill_missing_values(), implemented from scratch in NumPy
  - Fills missing values according to the empirical distribution of the observed values
  - Allocates the missing entries to each value in proportion to its frequency (rather than using only the mode/median)
- Encoding categorical features:
  - City: Extract the number from the "city_XXX" format (e.g. "city_103" → 103)
  - Other features: Use a mapping dictionary per feature
  - Apply the mapping with apply_mapping(), built on np.vectorize
  - Convert all categorical features to integers (0, 1, 2, ...)
- Saving data: Use np.savetxt() with a custom format to preserve the data types
Feature Scaling (in 03_modeling.ipynb):
- Normalize the data with the hand-written normalize() function
- Uses z-score normalization (standardization): $(x - \mu) / \sigma$
- Store the normalization parameters so they can be applied to the test data
Modeling (03_modeling.ipynb):
- Normalize the data with normalize() (the Feature Scaling step above)
- Train/test split: 80/20 with random_state=42 (implemented from scratch in NumPy)
- Train and evaluate the models
- Cross-validation to select hyperparameters
- Compare model performance
- Create the submission file

Algorithms used

1. Logistic Regression

Mathematical formulation:

Sigmoid function: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Linear output: $$z = X \cdot w + b$$
Binary Cross-Entropy Loss: $$L = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
Gradient Descent: $$\frac{\partial L}{\partial w} = \frac{1}{n}X^T(\hat{y} - y)$$ $$\frac{\partial L}{\partial b} = \frac{1}{n}\sum(\hat{y} - y)$$

Update rules: $$w := w - \alpha \frac{\partial L}{\partial w}$$ $$b := b - \alpha \frac{\partial L}{\partial b}$$

NumPy implementation (using np.einsum and vectorization):

# Forward pass: np.einsum computes the matrix-vector product X @ w
linear_output = np.einsum('ij,j->i', X, self.weights) + self.bias
y_pred = 1 / (1 + np.exp(-np.clip(linear_output, -500, 500)))

# Loss (vectorized); clip the predictions so that log(0) cannot occur
y_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
loss = -np.mean(y * np.log(y_clipped) + (1 - y) * np.log(1 - y_clipped))

# Gradients via np.einsum (equivalent to X.T @ error)
error = y_pred - y
dw = np.einsum('ij,i->j', X, error) / n_samples
db = np.mean(error)

# Update
self.weights -= self.learning_rate * dw
self.bias -= self.learning_rate * db

NumPy techniques used:

np.einsum: Expresses the matrix products ('ij,j->i' for the forward pass, 'ij,i->j' for the gradients)
Broadcasting: The scalar bias is broadcast over all samples
Vectorized operations: Every computation is vectorized, with no for loops over samples or features
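
To show how the steps above fit into one training loop, here is a minimal, self-contained sketch. The class name `LogisticRegressionSketch`, its default hyperparameters, and the `losses` attribute are assumptions made for illustration; the project's actual class in src/models.py may be organized differently.

```python
import numpy as np

class LogisticRegressionSketch:
    """Illustrative batch gradient descent for logistic regression (hypothetical class)."""

    def __init__(self, learning_rate=0.1, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations

    @staticmethod
    def _sigmoid(z):
        # Clip the input so that np.exp cannot overflow
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.losses = []
        for _ in range(self.n_iterations):
            y_pred = self._sigmoid(np.einsum('ij,j->i', X, self.weights) + self.bias)
            # Binary cross-entropy on clipped predictions to avoid log(0)
            p = np.clip(y_pred, 1e-15, 1 - 1e-15)
            self.losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
            error = y_pred - y
            self.weights -= self.learning_rate * np.einsum('ij,i->j', X, error) / n_samples
            self.bias -= self.learning_rate * np.mean(error)
        return self

    def predict_proba(self, X):
        return self._sigmoid(np.einsum('ij,j->i', X, self.weights) + self.bias)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
```

The only Python-level loop is the one over gradient steps; everything inside it operates on whole arrays. The recorded `losses` list is the kind of data a loss curve would be plotted from.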

2. K-Nearest Neighbors (KNN)

Mathematical formulation:

Euclidean Distance: $$d(x_i, x_j) = \sqrt{\sum_{k=1}^{n}(x_{ik} - x_{jk})^2}$$
Prediction: Majority vote among the k nearest neighbors

NumPy implementation (fully vectorized with np.einsum):

# Euclidean distance, vectorized over all (train, test) pairs
# Uses the identity: ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 * (x1 . x2)
x1_norm = np.einsum('ij,ij->i', X_train, X_train)[:, np.newaxis]  # (n_train, 1)
x2_norm = np.einsum('ij,ij->i', X_test, X_test)[np.newaxis, :]    # (1, n_test)
x1_x2 = np.einsum('ik,jk->ij', X_train, X_test)                    # (n_train, n_test)
# np.maximum(..., 0) guards against tiny negative values caused by rounding
distances = np.sqrt(np.maximum(x1_norm + x2_norm - 2 * x1_x2, 0))

# k nearest training points for every test sample (vectorized)
k_indices = np.argsort(distances, axis=0)[:self.k, :]  # (k, n_test)
k_nearest_labels = y_train[k_indices]                   # (k, n_test)

# Majority vote (labels are 0/1, so the sum counts class-1 neighbors)
class_1_counts = np.sum(k_nearest_labels, axis=0)       # (n_test,)
predictions = (class_1_counts >= (self.k / 2)).astype(int)  # a tie (even k) goes to class 1

NumPy techniques used:

np.einsum: Computes the squared norms and dot products ('ij,ij->i', 'ik,jk->ij')
Broadcasting: Combines arrays of different shapes automatically
Fancy indexing: y_train[k_indices] gathers the labels of the k neighbors
Vectorized operations: Computes all test samples at once, with no for loops
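
The fragment above depends on `self.k` and on arrays defined elsewhere. As a rough standalone illustration, the same steps can be wrapped in a function and run on a toy dataset. The function name `knn_predict` and the toy points are assumptions, not part of the project.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Binary KNN with 0/1 labels, vectorized over all test samples (illustrative)."""
    train_sq = np.einsum('ij,ij->i', X_train, X_train)[:, np.newaxis]   # (n_train, 1)
    test_sq = np.einsum('ij,ij->i', X_test, X_test)[np.newaxis, :]      # (1, n_test)
    cross = np.einsum('ik,jk->ij', X_train, X_test)                     # (n_train, n_test)
    distances = np.sqrt(np.maximum(train_sq + test_sq - 2 * cross, 0))
    k_indices = np.argsort(distances, axis=0)[:k, :]                    # (k, n_test)
    votes = y_train[k_indices].sum(axis=0)                              # class-1 neighbors per test sample
    return (votes >= k / 2).astype(int)

# Two well-separated toy clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.05, 0.05], [5.0, 5.1]])
pred = knn_predict(X_train, y_train, X_test, k=3)
```

Note that the full (n_train, n_test) distance matrix is held in memory at once, so for large inputs the test set may need to be processed in chunks.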

3. Gaussian Naive Bayes

Mathematical formulation:

Bayes' Theorem: $$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$$
Naive assumption (features are conditionally independent given the class): $$P(x|y) = \prod_{i=1}^{n} P(x_i|y)$$
Gaussian PDF: $$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$

NumPy implementation (vectorized):

# Per-class mean and variance (vectorized over features)
X_c = X[y == c]  # Boolean mask selects the rows of class c
self.means_[i] = np.mean(X_c, axis=0)  # Vectorized mean
self.vars_[i] = np.var(X_c, axis=0)    # Vectorized variance
self.vars_[i] = np.maximum(self.vars_[i], 1e-9)  # Element-wise floor to avoid division by zero

# Likelihood, vectorized over all samples
# Broadcasting: (n_samples, n_features) with (n_features,)
prob = (1.0 / np.sqrt(2 * np.pi * var)) * np.exp(-0.5 * ((x - mean) ** 2) / var)

# Softmax over the per-class log-scores; subtracting the row maximum first avoids overflow in exp
probabilities = probabilities - np.max(probabilities, axis=1, keepdims=True)
probabilities = np.exp(probabilities)
probabilities = probabilities / np.sum(probabilities, axis=1, keepdims=True)

NumPy techniques used:

Boolean masking: X[y == c] filters the data by class
Broadcasting: mean and var are broadcast across all samples
Element-wise clamping: np.maximum(self.vars_[i], 1e-9) keeps every variance strictly positive
keepdims: Keeps the reduced axis so that broadcasting lines up
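
The fragments above can be combined into an end-to-end sketch. This version works in log space throughout (summing log-densities instead of multiplying densities), which is a common way to avoid underflow. It is an assumption about one reasonable arrangement, not necessarily how the project's class does it. The function names `gnb_fit` and `gnb_predict_proba` are hypothetical.

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate per-class means, variances and priors (illustrative sketch)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])      # (n_classes, n_features)
    variances = np.array([X[y == c].var(axis=0) for c in classes])   # (n_classes, n_features)
    variances = np.maximum(variances, 1e-9)                          # keep variances strictly positive
    priors = np.array([np.mean(y == c) for c in classes])            # (n_classes,)
    return classes, means, variances, priors

def gnb_predict_proba(X, means, variances, priors):
    """Posterior class probabilities, computed in log space."""
    # (n_samples, 1, n_features) - (n_classes, n_features) -> (n_samples, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    log_joint = np.log(priors) + log_pdf.sum(axis=2)                 # (n_samples, n_classes)
    log_joint -= log_joint.max(axis=1, keepdims=True)                # stabilize before exp
    probs = np.exp(log_joint)
    return probs / probs.sum(axis=1, keepdims=True)
```

A prediction is then `classes[np.argmax(probs, axis=1)]`. The short list comprehensions loop only over the classes (two here), not over samples.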

4. Decision Tree

Mathematical formulation:

Entropy: $$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
Information Gain: $$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$

NumPy implementation:

# Entropy (vectorized)
probs = counts / len(y)
probs = probs[probs > 0]  # Boolean mask drops zero probabilities, avoiding log2(0)
entropy = -np.sum(probs * np.log2(probs))

# Information gain
gain = parent_entropy - (n_left/n * left_entropy + n_right/n * right_entropy)

# Split the data with a boolean mask
left_mask = X[:, feature] <= threshold
right_mask = ~left_mask
y_left = y[left_mask]  # Boolean-mask indexing
y_right = y[right_mask]

NumPy techniques used:

Boolean masking: X[:, feature] <= threshold builds the mask
Boolean indexing: y[left_mask] selects the subset
Vectorized comparisons: Compare the whole array at once
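
To make the entropy and information-gain formulas concrete, the sketch below searches every feature and every observed value for the binary split with the highest gain. The exhaustive search and the names `entropy` and `best_split` are illustrative assumptions; the project's tree may choose candidate thresholds differently.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a label array."""
    _, counts = np.unique(y, return_counts=True)   # counts are all > 0, so log2 is safe
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs))

def best_split(X, y):
    """Return (feature, threshold, gain) of the best binary split, or (None, None, 0.0)."""
    n = len(y)
    parent_entropy = entropy(y)
    best_feature, best_threshold, best_gain = None, None, 0.0
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left_mask = X[:, feature] <= threshold
            n_left = left_mask.sum()
            n_right = n - n_left
            if n_left == 0 or n_right == 0:
                continue  # a split that leaves one side empty carries no information
            gain = parent_entropy - (n_left / n * entropy(y[left_mask])
                                     + n_right / n * entropy(y[~left_mask]))
            if gain > best_gain:
                best_feature, best_threshold, best_gain = feature, threshold, gain
    return best_feature, best_threshold, best_gain
```

Building a tree then amounts to applying `best_split` recursively to the left and right subsets until a stopping rule such as max_depth is reached.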

Evaluation Metrics

All metrics are implemented from scratch in NumPy:

Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
F1-Score: $\frac{2 \times Precision \times Recall}{Precision + Recall}$
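
The four formulas can be computed from confusion-matrix counts obtained with boolean masks. The sketch below is one possible arrangement. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels (illustrative)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: a metric with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Example: 2 true positives, 1 false negative, 1 false positive, 4 true negatives
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                                 [1, 1, 0, 1, 0, 0, 0, 0])
```

In this example accuracy is 6/8 while precision, recall and F1 are all 2/3, which shows how accuracy can look better than the positive-class metrics when class 0 dominates.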

Normalization

Z-score normalization implemented from scratch in NumPy:

Standardization: $(x - \mu) / \sigma$
Compute mean and std from the training data
Store the parameters so they can be applied to the test data
Use broadcasting to normalize the whole array at once

def normalize(X, method='standard'):
    if method == 'standard':
        mean = np.mean(X, axis=0)  # per-feature mean, shape (n_features,)
        std = np.std(X, axis=0)
        std[std == 0] = 1  # avoid division by zero (boolean-mask assignment)
        # Broadcasting: (n_samples, n_features) - (n_features,)
        return (X - mean) / std, (mean, std)
    raise ValueError(f"unsupported method: {method!r}")

NumPy techniques:

Broadcasting: X - mean broadcasts mean across all rows
Boolean-mask assignment: std[std == 0] = 1 handles features with std = 0
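
The point about storing the parameters deserves a concrete example: the statistics are computed once on the training data and then reused unchanged on any other data. The two helper names below, `fit_standardizer` and `apply_standardizer`, are hypothetical and only illustrate that separation.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std on the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # constant features: divide by 1 instead of 0
    return mean, std

def apply_standardizer(X, mean, std):
    """Apply previously fitted parameters to any array with the same columns."""
    return (X - mean) / std

X_train = np.array([[1.0, 10.0, 5.0],
                    [3.0, 30.0, 5.0],
                    [5.0, 50.0, 5.0]])   # the third column is constant
X_test = np.array([[3.0, 30.0, 7.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = apply_standardizer(X_train, mean, std)
X_test_scaled = apply_standardizer(X_test, mean, std)   # reuses the training statistics
```

Fitting the statistics on the training portion only keeps information about the test rows from leaking into the model.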

Train/Test Split

train_test_split implemented from scratch in NumPy:

Shuffle the indices with a random seed
Split according to test_size (default 0.2 = 20%)
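
The two steps above can be sketched as follows. The function name `train_test_split_sketch`, the use of `np.random.default_rng`, and the rounding rule for the test size are assumptions; the project's own function may differ in these details.

```python
import numpy as np

def train_test_split_sketch(X, y, test_size=0.2, random_state=42):
    """Shuffle the row indices with a seed, then cut off the first test_size fraction."""
    rng = np.random.default_rng(random_state)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Fancy indexing with the shuffled index arrays keeps X and y aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_sketch(X, y)
```

Because the same index arrays are applied to both X and y, each row stays paired with its label, and a fixed seed makes the split reproducible.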

Cross-Validation

K-fold cross-validation implemented from scratch in NumPy:

Split the data into k folds using np.random.permutation() and slicing
Train on k-1 folds, test on the remaining fold
Repeat k times and average the scores
Normalize the data automatically within each fold
Use fancy indexing to split the data: X[train_indices], X[test_indices]
Aggregate the per-fold scores with np.mean() and np.std()
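
A compact sketch of this procedure is shown below. The names `k_fold_indices` and `cross_validate_sketch`, and the callback parameters `fit_predict` and `score`, are hypothetical. `np.array_split` is used here in place of manual slicing so that fold sizes differ by at most one.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, random_state=42):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validate_sketch(fit_predict, score, X, y, k=5, random_state=42):
    """Return (mean, std) of the per-fold scores.

    fit_predict(X_train, y_train, X_test) -> predictions for X_test
    score(y_true, y_pred) -> float
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, random_state):
        # Normalize inside the fold, using training-fold statistics only
        mean = X[train_idx].mean(axis=0)
        std = X[train_idx].std(axis=0)
        std = np.where(std == 0, 1.0, std)
        X_tr = (X[train_idx] - mean) / std
        X_te = (X[test_idx] - mean) / std
        y_pred = fit_predict(X_tr, y[train_idx], X_te)
        scores.append(score(y[test_idx], y_pred))
    return np.mean(scores), np.std(scores)
```

Passing the model as a `fit_predict` callback keeps the loop independent of any particular classifier. The loop over folds stays in Python, since each fold requires a separate training run.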

Installation & Setup

System requirements

Python >= 3.7
NumPy >= 1.21.0
Matplotlib >= 3.5.0
Jupyter >= 1.0.0

Installation

Clone the repository (if available):

git clone https://github.com/hungle123-dev/hr-analytics-project.git
cd hr-analytics-project

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Start Jupyter Notebook:

jupyter notebook

Usage

1. Data Exploration

Open and run notebooks/01_data_exploration.ipynb:

Analyze the data distributions
Check for missing values
Visualize the features

jupyter notebook notebooks/01_data_exploration.ipynb

2. Data Preprocessing

Run notebooks/02_preprocessing.ipynb:

Handle missing values
Encode categorical features
Save the processed (not yet normalized) data to data/processed/

jupyter notebook notebooks/02_preprocessing.ipynb

3. Modeling

Run notebooks/03_modeling.ipynb:

Normalize the data
Train the ML models
Evaluate and compare performance
Create the submission file

jupyter notebook notebooks/03_modeling.ipynb

Note: Run the notebooks in order (01 → 02 → 03), because each notebook depends on the output of the previous one.

Results

Results obtained

After training and evaluating the 4 models, results on the held-out test split:

Model	Accuracy	Precision	Recall	F1-Score
Gaussian Naive Bayes	0.7484	0.4940	0.5147	0.5041
Decision Tree (depth=5)	0.7726	0.5581	0.4086	0.4718
KNN (k=11)	0.7591	0.5245	0.3267	0.4026
Logistic Regression	0.7614	0.5432	0.2511	0.3434

Best model: Gaussian Naive Bayes, with F1-Score = 0.5041

Notes:

KNN uses k=11 (the best value from cross-validation)
Decision Tree uses max_depth=5 (the best value from cross-validation)

Analysis

Class imbalance: The dataset is imbalanced (75% class 0, 25% class 1), which biases the models toward predicting class 0.
Recall varies across models:
- Gaussian Naive Bayes has the highest recall (0.5147): it detects about 51% of the employees who want to change jobs
- Decision Tree has moderate recall (0.4086)
- KNN and Logistic Regression have lower recall (0.3267 and 0.2511)
Gaussian Naive Bayes performs best by F1-Score: Despite its "naive" assumption (independent features), it achieves the highest F1-Score, although Decision Tree has the highest accuracy (0.7726) and precision (0.5581). Possible reasons include:
- It may suit data with many features
- It may work well with normalized data
Result visualizations

The figures generated and saved in results/figures/ are:

1. Logistic Regression

loss_curve.png: Loss versus iteration, showing the model's convergence

2. K-Nearest Neighbors (KNN)

knn_performance.png: F1-Score for different values of k (3, 5, 7, 9, 11) from 5-fold cross-validation

3. Gaussian Naive Bayes

naive_bayes_cv.png: Bar chart of the F1-Score of each fold in 5-fold cross-validation
naive_bayes_confusion_matrix.png: Confusion matrix as a blue heatmap, showing the counts of correct and incorrect predictions

4. Decision Tree

decision_tree_performance.png: F1-Score for different max_depth values (5, 10, 15, 20)
decision_tree_confusion_matrix.png: Confusion matrix as a green heatmap

5. Overall comparison

model_comparison.png: Grouped bar chart comparing all 4 models on 4 metrics (Accuracy, Precision, Recall, F1-Score)

Project Structure

hr-analytics-project/
│
├── data/
│   ├── raw/                    # Raw data
│   │   ├── aug_train.csv       # Original training data
│   │   ├── aug_test.csv        # Original test data
│   │   └── sample_submission.csv
│   └── processed/              # Processed data
│       ├── aug_train.csv       # Training data after preprocessing
│       └── aug_test.csv        # Test data after preprocessing
│
├── notebooks/
│   ├── 01_data_exploration.ipynb    # Data analysis and exploration
│   ├── 02_preprocessing.ipynb       # Data preprocessing
│   └── 03_modeling.ipynb            # Model building and evaluation
│
├── src/
│   ├── __init__.py
│   ├── models.py                    # ML models (Logistic Regression, KNN, Naive Bayes, Decision Tree)
│   └── visualization.py             # Data visualization functions
│
├── results/
│   ├── figures/                      # Plots and images
│   └── submission.csv                # Final submission file
│
├── requirements.txt                  # Dependencies
└── README.md                         # This file

What each file/folder does

data/raw/: Raw, unprocessed data
data/processed/: Processed data (missing values filled, categorical features encoded), not yet normalized
notebooks/01_data_exploration.ipynb: EDA, getting to know the data
notebooks/02_preprocessing.ipynb: Missing value handling and categorical feature encoding
notebooks/03_modeling.ipynb: Normalization, train/test split, model training and evaluation
src/models.py: The ML algorithms implemented from scratch in NumPy with vectorization and np.einsum
src/visualization.py: Helper functions for plotting
results/submission.csv: Final predictions
results/figures/: Result plots and images:
- loss_curve.png: Loss curve of Logistic Regression
- knn_performance.png: KNN performance vs k
- naive_bayes_cv.png: Naive Bayes CV scores
- naive_bayes_confusion_matrix.png: Naive Bayes confusion matrix
- decision_tree_performance.png: Decision Tree performance vs max_depth
- decision_tree_confusion_matrix.png: Decision Tree confusion matrix
- model_comparison.png: Comparison of all models

Challenges & Solutions

Difficulties encountered when using NumPy

1. Handling Missing Values

Problem: NumPy has no built-in function like pandas' for filling missing values. The fill must follow the empirical distribution of the data rather than only the mode/median.

Solution:

def fill_missing_values(arr, missing_value=''):
    # 1. Locate the missing entries
    empty_idx = np.where(arr == missing_value)[0]
    
    # 2. Compute the distribution of the observed values
    non_missing = arr[arr != missing_value]
    unique, counts = np.unique(non_missing, return_counts=True)
    ratios = counts / counts.sum()
    
    # 3. Allocate the missing entries in proportion to those ratios
    raw_counts = len(empty_idx) * ratios
    counts_int = [int(c) for c in raw_counts]
    # Distribute the remainder...
    
    # 4. Write the fill values into the array (fill_values comes from the elided step above)
    arr[empty_idx] = fill_values
    return arr
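
The listing above elides how the remainder is distributed and how `fill_values` is built. One way to complete it, shown purely as an assumption, is the largest-remainder rule followed by a seeded shuffle. The function name `fill_missing_proportional`, the `random_state` parameter, and the decision to work on a copy are all illustrative choices rather than the project's actual code.

```python
import numpy as np

def fill_missing_proportional(arr, missing_value='', random_state=0):
    """Fill missing entries so the fills follow the observed value frequencies (sketch)."""
    arr = arr.copy()                                  # leave the caller's array untouched
    missing_idx = np.where(arr == missing_value)[0]
    values, counts = np.unique(arr[arr != missing_value], return_counts=True)
    if missing_idx.size == 0 or values.size == 0:
        return arr                                    # nothing to fill, or nothing to fill with

    # Ideal (fractional) number of fills per value, then its integer part
    raw = missing_idx.size * counts / counts.sum()
    alloc = np.floor(raw).astype(int)

    # Largest-remainder rule (assumed): leftover slots go to the largest fractional parts
    leftover = missing_idx.size - alloc.sum()
    order = np.argsort(-(raw - alloc), kind='stable')
    alloc[order[:leftover]] += 1

    fill_values = np.repeat(values, alloc)
    np.random.default_rng(random_state).shuffle(fill_values)   # avoid filling in sorted blocks
    arr[missing_idx] = fill_values
    return arr

column = np.array(['a', 'a', 'a', 'b', '', '', '', ''])
filled = fill_missing_proportional(column)
```

In this toy column the observed ratio is 3:1, so the four missing entries receive three 'a' values and one 'b', and the overall 3:1 ratio is preserved.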

2. Encoding Categorical Features

Problem: Categorical values must be converted to numbers without scikit-learn's LabelEncoder. Some features have a special format (e.g. "city_103").

Solution:

# City: extract the number from the "city_XXX" format
city_n = np.array([int(c.split("_")[1]) for c in city_col])

# Other features: use a mapping dictionary
def apply_mapping(arr, mapping_dict):
    # Values missing from the dictionary map to -1
    mapper = np.vectorize(lambda x: mapping_dict.get(x, -1))
    return mapper(arr)

# Build a mapping for each feature
gender_encod = {'Male': 0, 'Female': 1, 'Other': 2, ...}
encoded = apply_mapping(categorical_col, gender_encod)

3. Overflow in the Sigmoid Function

Problem: exp(-z) can overflow when z is a large negative number.

Solution:

# Clip the input to avoid overflow
z = np.clip(z, -500, 500)
return 1 / (1 + np.exp(-z))

4. Log(0) in the Loss Function

Problem: log(0) = -∞, which turns the loss into inf or NaN.

Solution:

# Clip the predictions to avoid log(0)
y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

5. Distance Computation in KNN - Vectorization

Problem: Computing the distance from each test sample to every training sample is very slow with a for loop.

Solution, fully vectorized with np.einsum:

# Fully vectorized: compute the distances for all pairs at once
# Uses the identity: ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2 * (x1 . x2)
x1_norm = np.einsum('ij,ij->i', X_train, X_train)[:, np.newaxis]
x2_norm = np.einsum('ij,ij->i', X_test, X_test)[np.newaxis, :]
x1_x2 = np.einsum('ik,jk->ij', X_train, X_test)
distances = np.sqrt(np.maximum(x1_norm + x2_norm - 2 * x1_x2, 0))

# k neighbors for all test samples at once
k_indices = np.argsort(distances, axis=0)[:self.k, :]
k_nearest_labels = y_train[k_indices]

# Vectorized majority vote
class_1_counts = np.sum(k_nearest_labels, axis=0)
predictions = (class_1_counts >= (self.k / 2)).astype(int)

Benefits:

Much faster than a Python for loop, because the work runs in NumPy's optimized C code
Avoids the (n_train, n_test, n_features) array of pairwise differences; only the (n_train, n_test) distance matrix is materialized

6. Class Imbalance

Problem: The dataset is imbalanced (75% vs 25%).

Current approach:

Evaluate with F1-Score rather than Accuracy
Class weights could improve this further (not yet implemented)
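
Since class weights are listed as not yet implemented, the following is only a sketch of what they could look like. The weighting rule n_samples / (n_classes * count_c) is an assumption (a common "balanced" heuristic), and the names `balanced_class_weights` and `weighted_bce` are hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_bce(y, y_pred, w0, w1):
    """Binary cross-entropy where each sample is scaled by the weight of its true class."""
    p = np.clip(y_pred, 1e-15, 1 - 1e-15)
    sample_weight = np.where(y == 1, w1, w0)
    return -np.mean(sample_weight * (y * np.log(p) + (1 - y) * np.log(1 - p)))

# Class counts taken from the dataset description: 14,381 vs 4,777
y_all = np.concatenate([np.zeros(14381, dtype=int), np.ones(4777, dtype=int)])
weights = balanced_class_weights(y_all)
```

With these counts the minority class receives roughly three times the weight of the majority class, so errors on class 1 contribute more to the loss. The gradient of a weighted loss would need the same per-sample factor applied to `error`.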

7. Saving Data with the Correct Types

Problem: By default np.savetxt writes every column as a float.

Solution:

# Use the fmt parameter
fmt_list = ['%d', '%.6f', '%d', ...]  # One format per column
np.savetxt(file, data, fmt=','.join(fmt_list))
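
To see the effect of a per-column format, the sketch below writes a small array to an in-memory buffer. The three-column layout and its values are made up for illustration and are not the project's real column list.

```python
import io
import numpy as np

# Hypothetical layout: integer ID, float index, integer category code
data = np.array([[1, 0.624, 3],
                 [2, 0.92, 0]])
fmt_list = ['%d', '%.6f', '%d']

buffer = io.StringIO()
# A single multi-format string: one specifier per column, already joined by commas
np.savetxt(buffer, data, fmt=','.join(fmt_list))
text = buffer.getvalue()
```

Passing `fmt=fmt_list` together with `delimiter=','` should give the same result. The number of format specifiers has to match the number of columns, otherwise np.savetxt raises an error.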

8. Vectorization and np.einsum

Problem: For loops in KNN and DecisionTree are very slow, so the computations need to be vectorized.

Solution:

# KNN: vectorized distance computation with np.einsum
x1_norm = np.einsum('ij,ij->i', X_train, X_train)[:, np.newaxis]
x2_norm = np.einsum('ij,ij->i', X_test, X_test)[np.newaxis, :]
x1_x2 = np.einsum('ik,jk->ij', X_train, X_test)
distances = np.sqrt(np.maximum(x1_norm + x2_norm - 2 * x1_x2, 0))

# Logistic Regression: using np.einsum
linear_output = np.einsum('ij,j->i', X, weights)  # Equivalent to X @ weights
dw = np.einsum('ij,i->j', X, error)  # Equivalent to X.T @ error

Benefits:

np.einsum: States explicitly which axes are multiplied and summed, which helps readability for more complex contractions
Vectorization: Much faster than Python for loops
Memory: Avoids large intermediate arrays such as the (n_train, n_test, n_features) array of pairwise differences

Future Improvements

1. Handling Class Imbalance

Implement class weights in Logistic Regression
Try SMOTE (Synthetic Minority Oversampling Technique), implemented from scratch in NumPy
Tune the decision threshold instead of using the default 0.5

2. Feature Engineering

Create interaction features (e.g. experience × training_hours)
Polynomial features (degree 2, 3)
Feature selection based on correlation or importance

3. Model Improvements

Regularization (L1/L2) for Logistic Regression
Ensemble methods: Voting Classifier, Bagging (implemented from scratch)
A simple Neural Network (1-2 layers) implemented from scratch in NumPy

4. Hyperparameter Tuning

Grid Search or Random Search implemented from scratch
Tune learning rate, max_depth, k, etc.

5. Advanced Algorithms

Support Vector Machine (SVM), implemented from scratch
Random Forest, implemented from scratch (many Decision Trees)
Gradient Boosting, implemented from scratch

6. Evaluation Improvements

ROC Curve and AUC Score, implemented from scratch
Precision-Recall Curve
Stratified K-Fold Cross-Validation to handle class imbalance

Contributors

Author information

Author: Lê Võ Xuân Hưng

University: University of Science, Vietnam National University Ho Chi Minh City
Faculty: Information Technology
Department: Data Science
Course: Data Science Programming
Assignment: HW02 - HR Analytics Project

Contact

Email: lvxhung23@clc.fitus.edu.vn

License

This project is created for educational purposes as part of the Data Science Programming course at VNU-HCMUS.

License: MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Acknowledgments

Dataset from Kaggle: HR Analytics: Job Change of Data Scientists
The instructors and teaching assistants of the Data Science Programming course
The open-source community for documentation and tutorials
