Zelestra X AWS ML Ascend Challenge - 2nd Edition

Ranked 61 / 700 (Top 9%)

🧭 Abstract

This project was developed as part of the Zelestra X AWS ML Ascend Challenge - 2nd Edition, where participants were tasked with building a machine learning pipeline to predict performance degradation and optimize solar panel efficiency.

Out of ~700 participants worldwide, this pipeline ranked 61st (Top 9%) on the official leaderboard.

The approach centers around an 8-model stacked ensemble with a ridge regression meta-learner, designed to capture distinct feature interactions, learn non-linear dependencies, and make robust predictions through k-fold cross-validation.

⚙️ Problem Statement

Develop a scalable machine learning pipeline that predicts solar panel performance degradation over time, considering multiple feature interactions such as environmental, operational, and system-level parameters.

The challenge demanded:

High predictive accuracy on unseen data.
Stability across multiple validation folds.
Interpretability and generalizability.

🧩 Architecture Overview

1. Base Layer — 8 Distinct Learners

The ensemble’s foundation consists of 8 diverse models, each designed to learn feature interactions in a unique way — maximizing representational diversity and ensuring that different data patterns are captured across learners.

| Category | Models | Distinct Learning Strategy |
| --- | --- | --- |
| Boosting Models (4) | XGBoost, LightGBM, CatBoost, Explainable Boosting Machine (EBM) | Iteratively refine predictions through boosting, capturing non-linear feature interactions and explaining feature importance. |
| Neural Models (4) | Multilayer Perceptron (MLP), FT-Transformer, TabNet, Deep & Cross Network (DCN) | Leverage deep representations, attention mechanisms, and explicit feature crossing to capture complex dependencies and high-order feature interactions. |

Together, these 8 models maximize feature interaction coverage — where one model may overfit a specific interaction, another balances it by generalizing across deeper or orthogonal feature patterns.

This architectural diversity forms the backbone of the stacked ensemble, providing the ridge regression meta-learner with rich, non-redundant base predictions.

2. Data Preprocessing & Cleaning

This preprocessing pipeline cleans and validates both numerical and categorical features so that they are model-ready. It includes dedicated imputation steps for TabNet, which cannot handle missing values.

Step 1: Define Numeric Columns to Clean

A list of numeric columns (columns_to_clean_alphanumeric) is created, including features like temperature, irradiance, humidity, voltage, and others.
These are the columns where invalid entries (like alphanumeric or corrupt text values) are most likely to occur.

Step 2: Remove Alphanumeric and Invalid Entries

def remove_alphanumeric(df, col):
    df[col] = pd.to_numeric(df[col], errors='coerce')

This function converts each column to numeric format. If any cell contains non-numeric or invalid text (e.g., "badval", "N/A", "error"), it’s automatically converted to NaN.

➡️ Effect: Ensures that all numerical columns are truly numeric, replacing garbage or corrupt entries with missing values that can be handled systematically.

Step 3: Replace Unrealistic Zero Values

def remove_zeros(df, col):
    df[col] = df[col].replace(0, np.nan)

For specific columns like module_temperature, a reading of exactly zero is treated as physically unrealistic (an operating panel is very unlikely to report a module temperature of exactly 0°C). Hence, zeros are replaced with NaN to indicate missing or invalid data.

➡️ Effect: Prevents models from learning false patterns caused by impossible zero readings.
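As a rough illustration, Steps 2 and 3 can be run together on a toy frame. The sample values, the shortened column list, and the name `columns_to_clean_zeros` are assumptions made for this sketch; they are not taken from the competition data or the project code.

```python
import numpy as np
import pandas as pd

def remove_alphanumeric(df, col):
    # Entries that cannot be parsed as numbers become NaN (modifies df in place).
    df[col] = pd.to_numeric(df[col], errors="coerce")

def remove_zeros(df, col):
    # Exact zeros are treated as missing readings (modifies df in place).
    df[col] = df[col].replace(0, np.nan)

# Hypothetical subset of the numeric columns; the real list is longer.
columns_to_clean_alphanumeric = ["humidity", "module_temperature"]
# Assumed name for the list of columns where zero is not a valid reading.
columns_to_clean_zeros = ["module_temperature"]

# Object dtype mimics how a column with corrupt text entries typically loads.
df = pd.DataFrame(
    {
        "humidity": ["40", "error", "0"],
        "module_temperature": ["41.2", "badval", "0"],
    },
    dtype=object,
)

for col in columns_to_clean_alphanumeric:
    remove_alphanumeric(df, col)
for col in columns_to_clean_zeros:
    remove_zeros(df, col)

missing_per_column = df.isna().sum().to_dict()
```

In this sketch the zero in `humidity` survives as a legitimate reading, while the zero in `module_temperature` is turned into a missing value because only that column is listed for zero cleaning.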

Step 4: Create TabNet-Compatible Copies

X_tabnet = X.copy()
X_test_tabnet = X_test.copy()

Separate copies are created because TabNet models cannot handle NaN values directly. These copies will undergo imputation to replace missing values safely.

Step 5: Impute Numerical Columns with Median

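num_cols = X_tabnet.select_dtypes(include=[np.number]).columns
for col in num_cols:
    # The median is computed on the training data only and reused for the test set.
    median_val = X_tabnet[col].median()
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
    X_tabnet[col] = X_tabnet[col].fillna(median_val)
    X_test_tabnet[col] = X_test_tabnet[col].fillna(median_val)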

All missing numeric values are replaced with the median of the respective column (calculated from training data).

➡️ Effect: The median is robust to outliers, so extreme readings distort the fill value less than they would with mean-based filling.

Step 6: Impute Categorical Columns with "Unknown"

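cat_cols = X_tabnet.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
    X_tabnet[col] = X_tabnet[col].fillna("Unknown")
    X_test_tabnet[col] = X_test_tabnet[col].fillna("Unknown")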

Missing categorical values are replaced with the placeholder "Unknown" to retain these records rather than dropping them.

➡️ Effect: Prevents data loss and ensures that categorical encoders or models don’t fail due to null labels.
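Steps 4 to 6 could be wrapped in a single helper, sketched below under a few assumptions: the function name `impute_for_tabnet` and the toy frames are hypothetical, and the cast to `object` is an addition of this sketch so that "Unknown" is also accepted by columns stored with the `category` dtype.

```python
import numpy as np
import pandas as pd

def impute_for_tabnet(X, X_test):
    """Return NaN-free copies; every fill value is derived from the training frame."""
    X_tab, X_test_tab = X.copy(), X_test.copy()

    # Numeric columns: fill with the training median of each column.
    num_cols = X_tab.select_dtypes(include=[np.number]).columns
    medians = X_tab[num_cols].median()
    X_tab[num_cols] = X_tab[num_cols].fillna(medians)
    X_test_tab[num_cols] = X_test_tab[num_cols].fillna(medians)

    # Categorical columns: fill with the "Unknown" placeholder.
    cat_cols = X_tab.select_dtypes(include=["object", "category"]).columns
    for col in cat_cols:
        # Cast to object so the placeholder is valid even for category dtype.
        X_tab[col] = X_tab[col].astype(object).fillna("Unknown")
        X_test_tab[col] = X_test_tab[col].astype(object).fillna("Unknown")
    return X_tab, X_test_tab

# Toy data (illustrative values only).
X = pd.DataFrame({
    "irradiance": [100.0, np.nan, 300.0],
    "installation_type": pd.Series(["fixed", None, "dual-axis"], dtype=object),
})
X_test = pd.DataFrame({
    "irradiance": [np.nan, 900.0],
    "installation_type": pd.Series([None, "fixed"], dtype=object),
})

X_tabnet, X_test_tabnet = impute_for_tabnet(X, X_test)
```

Here the missing test value of `irradiance` is filled with the training median rather than a median computed on the test set, which keeps test-set statistics out of the preprocessing.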

3. Feature-Engineered Columns and Their Significance

1. `temp_diff`

Definition: temperature - module_temperature
Significance: Highlights thermal stress; large differences indicate inefficiency in cooling or poor heat dissipation, directly affecting output power.

2. `irradiance_efficiency`

Definition: (voltage × current) / irradiance
Significance: Measures the conversion efficiency of solar irradiance into electrical energy — useful for detecting performance degradation or inefficiency.

3. `irradiance_norm`

Definition: irradiance / module_temperature
Significance: Normalizes irradiance against temperature effects; helps the model learn temperature-induced efficiency losses.

4. `cleaned_irradiance`

Definition: irradiance × (1 - soiling_ratio)
Significance: Adjusts irradiance for panel soiling; represents the effective irradiance actually contributing to energy generation.

5. `irradiance_through_clouds`

Definition: irradiance × (1 - cloud_coverage / 100)
Significance: Estimates irradiance after cloud interference; models the effect of cloudy conditions or shading on solar performance.

6. `temp_to_wind_ratio`

Definition: temperature / (wind_speed + 1e-3)
Significance: Indicates cooling efficiency; higher ratios suggest poor heat dissipation when wind speed is low — crucial for predicting thermal performance.

7. `irradiance_by_pressure`

Definition: irradiance / (pressure + 1e-3)
Significance: Captures atmospheric attenuation — lower pressure can increase irradiance intensity due to thinner air, affecting solar input quality.

8. `inverse_maintenance`

Definition: 1 / (maintenance_count + 1)
Significance: Inversely related to maintenance frequency; higher values imply fewer maintenance events, which may correlate with performance degradation.

9. `cleanliness_adjusted_irradiance`

Definition: irradiance × inverse_maintenance
Significance: Combines maintenance and irradiance effects into a single metric — a proxy for cleanliness and upkeep impact on irradiance capture.

10. `humidity_pressure_index`

Definition: humidity / (pressure + 1e-3)
Significance: Captures the joint influence of atmospheric moisture and pressure — can affect irradiance scattering and panel cooling efficiency.

11. `voltage_current_ratio`

Definition: voltage / (current + 1e-3)
Significance: Represents electrical load behavior; deviations from expected ratios can signal inefficiency, shading, or partial system faults.

12. `is_dual_axis`

Definition: (installation_type == "dual-axis").astype(int)
Significance: Encodes mechanical tracking capability — dual-axis installations generally achieve higher irradiance capture and output stability.

13. `stress_index`

Definition: irradiance × module_temperature
Significance: Reflects thermal and irradiance stress on modules; helps model nonlinear efficiency losses or degradation under extreme operating conditions.

14. `has_error`

Definition: error_code.notnull().astype(int)
Significance: Binary indicator for operational faults; helps the model explain anomalies in power output linked to logged system errors.
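The 14 definitions above translate fairly directly into pandas. The following is a minimal sketch, assuming the raw columns carry exactly the names used in the definitions; the function name `add_engineered_features` and the single sample row are hypothetical.

```python
import pandas as pd

EPS = 1e-3  # small constant used in the ratio features to avoid division by zero

def add_engineered_features(df):
    """Return a copy of df with the 14 engineered columns appended."""
    out = df.copy()
    out["temp_diff"] = out["temperature"] - out["module_temperature"]
    out["irradiance_efficiency"] = out["voltage"] * out["current"] / out["irradiance"]
    out["irradiance_norm"] = out["irradiance"] / out["module_temperature"]
    out["cleaned_irradiance"] = out["irradiance"] * (1 - out["soiling_ratio"])
    out["irradiance_through_clouds"] = out["irradiance"] * (1 - out["cloud_coverage"] / 100)
    out["temp_to_wind_ratio"] = out["temperature"] / (out["wind_speed"] + EPS)
    out["irradiance_by_pressure"] = out["irradiance"] / (out["pressure"] + EPS)
    out["inverse_maintenance"] = 1 / (out["maintenance_count"] + 1)
    out["cleanliness_adjusted_irradiance"] = out["irradiance"] * out["inverse_maintenance"]
    out["humidity_pressure_index"] = out["humidity"] / (out["pressure"] + EPS)
    out["voltage_current_ratio"] = out["voltage"] / (out["current"] + EPS)
    out["is_dual_axis"] = (out["installation_type"] == "dual-axis").astype(int)
    out["stress_index"] = out["irradiance"] * out["module_temperature"]
    out["has_error"] = out["error_code"].notnull().astype(int)
    return out

# One illustrative row; the values are made up for the example.
row = pd.DataFrame({
    "temperature": [30.0], "module_temperature": [45.0],
    "voltage": [30.0], "current": [5.0], "irradiance": [800.0],
    "soiling_ratio": [0.1], "cloud_coverage": [25.0], "wind_speed": [2.0],
    "pressure": [1013.0], "maintenance_count": [3], "humidity": [50.0],
    "installation_type": ["dual-axis"], "error_code": [None],
})

features = add_engineered_features(row)
```

Because `cleanliness_adjusted_irradiance` depends on `inverse_maintenance`, the order of the assignments matters. Features without the `1e-3` guard (`irradiance_efficiency`, `irradiance_norm`) follow the definitions as written and will produce infinities or NaN where the denominator is zero or missing.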

4. Cross-Validation Strategy (K-Fold)

To ensure robustness, 5-Fold Cross Validation is employed across all base learners:

The dataset is split into 5 folds.
Each model trains on 4 folds and validates on the 5th.
This process repeats 5 times, ensuring that every data point is used once for validation.
Out-of-fold predictions are generated for each base model — these become input features for the meta-learner.

This ensures:

No data leakage between training and validation stages.
Reliable generalization assessment for the meta-layer.
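The out-of-fold procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the project's training loop: the two scikit-learn regressors stand in for the 8 base learners, the data is synthetic, and the shuffling and seed are choices made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def out_of_fold_predictions(model_factories, X, y, n_splits=5, seed=42):
    """Return an (n_samples, n_models) matrix of out-of-fold predictions."""
    oof = np.zeros((X.shape[0], len(model_factories)))
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kfold.split(X):
        for j, make_model in enumerate(model_factories):
            model = make_model()                    # fresh, unfitted model per fold
            model.fit(X[train_idx], y[train_idx])   # train on the other 4 folds
            # Each row is predicted only by a model that never saw it in training.
            oof[valid_idx, j] = model.predict(X[valid_idx])
    return oof

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Stand-ins for the real base learners.
model_factories = [
    LinearRegression,
    lambda: GradientBoostingRegressor(n_estimators=50, random_state=0),
]

oof = out_of_fold_predictions(model_factories, X, y)
```

Each column of `oof` then serves as one input feature for the meta-learner, with one column per base model.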

5. Meta-Learner — Ridge Regression Layer

The Ridge Regression meta-learner sits at the top of the stack, combining predictions from all 8 base learners.

Key aspects:

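Uses L2 regularization to shrink the model weights, which keeps them stable when base predictions are highly correlated.
Learns one global weight per base model, reflecting how much that model's out-of-fold predictions can be trusted overall.
Aggregates base model outputs into a single, stable prediction.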
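A minimal sketch of the meta-layer with scikit-learn's `Ridge` is shown below. The base predictions are simulated as the target plus noise of different strengths, three columns stand in for the real eight, and `alpha=1.0` is an assumed value rather than the setting used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
noise_levels = (0.1, 0.5, 1.0)  # hypothetical base models, from accurate to noisy

def simulate_base_predictions(target):
    # Each column mimics one base model: the true target plus model-specific noise.
    return np.column_stack(
        [target + rng.normal(scale=s, size=target.shape[0]) for s in noise_levels]
    )

y_train = rng.normal(size=500)
y_test = rng.normal(size=100)
oof_preds = simulate_base_predictions(y_train)   # stands in for out-of-fold predictions
test_preds = simulate_base_predictions(y_test)   # stands in for base predictions on test data

meta = Ridge(alpha=1.0)        # assumed regularization strength
meta.fit(oof_preds, y_train)   # fit on out-of-fold predictions only

weights = meta.coef_           # one global weight per base model
final = meta.predict(test_preds)
```

In this setup the most accurate simulated model receives the largest weight, which is the behaviour the meta-learner is expected to show on the real out-of-fold matrix as well.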

🧠 Why Stacked Ensemble?

Unlike a single model that learns one representation of data, stacking learns how models complement one another.

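Boosting models (XGBoost, LightGBM, CatBoost) capture non-linear feature interactions.
EBM contributes an additive, interpretable view of feature importance.
FT-Transformer and TabNet model feature dependencies through attention mechanisms.
MLP and DCN learn deep representations, with DCN adding explicit feature crossing.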

The ridge meta-learner then assigns a weight to each model by fitting their out-of-fold predictions against the true targets.

Combining diverse learners in this way tends to reduce variance and offset the biases of individual models, which typically improves out-of-sample performance.

🧪 Future Scope

Frontend Integration: Couple this pipeline to a web-based interface for real-time regression on .csv inputs.
Model Expansion: Experiment with adding Factorization Machines or further neural architectures as additional base learners.
Interpretability: Introduce SHAP and LIME visualizations to interpret feature contributions.
Efficiency Improvements: Optimize ensemble parallelization and memory footprint.

📊 Results Summary

| Item | Details |
| --- | --- |
| Rank | 61 / 700 (Top 9%) |
| Pipeline Type | 8-Model Stacked Ensemble |
| Meta-Learner | Ridge Regression |
| Cross Validation | 5-Fold CV |
| Primary Use Case | Solar Panel Performance Degradation Prediction |

Author

Steve Sojan
Machine Learning Engineer | AI for Social Impact
📫 https://www.linkedin.com/in/stevesojan/
