stevesojan/zelestra_challenge


Zelestra X AWS ML Ascend Challenge - 2nd Edition

Ranked 61 / 700 (Top 9%)

Leaderboard Result

Flowchart for the Pipeline


🧭 Abstract

This project was developed as part of the Zelestra X AWS ML Ascend Challenge - 2nd Edition, where participants were tasked with building a machine learning pipeline to predict performance degradation and optimize solar panel efficiency.

Out of ~700 participants worldwide, this pipeline ranked 61st (Top 9%) on the official leaderboard.

The approach centers around an 8-model stacked ensemble with a ridge regression meta-learner, designed to capture distinct feature interactions, learn non-linear dependencies, and make robust predictions through k-fold cross-validation.


⚙️ Problem Statement

Develop a scalable machine learning pipeline that predicts solar panel performance degradation over time, considering multiple feature interactions such as environmental, operational, and system-level parameters.

The challenge demanded:

  • High predictive accuracy on unseen data.
  • Stability across multiple validation folds.
  • Interpretability and generalizability.

🧩 Architecture Overview

1. Base Layer — 8 Distinct Learners

The ensemble’s foundation consists of 8 diverse models, each designed to learn feature interactions in a unique way — maximizing representational diversity and ensuring that different data patterns are captured across learners.

| Category | Models | Distinct Learning Strategy |
|---|---|---|
| Boosting Models (4) | XGBoost, LightGBM, CatBoost, Explainable Boosting Machine (EBM) | Iteratively refine predictions through boosting, capturing non-linear feature interactions and explaining feature importance. |
| Neural Models (4) | Multilayer Perceptron (MLP), FT-Transformer, TabNet, Deep & Cross Network (DCN) | Leverage deep representations, attention mechanisms, and explicit feature crossing to capture complex dependencies and high-order feature interactions. |

Together, these 8 models maximize feature interaction coverage — where one model may overfit a specific interaction, another balances it by generalizing across deeper or orthogonal feature patterns.

This architectural diversity forms the backbone of the stacked ensemble, providing the ridge regression meta-learner with rich, non-redundant base predictions.


2. Data Preprocessing & Cleaning

This preprocessing pipeline ensures that both numerical and categorical features are properly cleaned, validated, and made model-ready — particularly optimized for TabNet, which cannot handle missing values.


Step 1: Define Numeric Columns to Clean

A list of numeric columns (columns_to_clean_alphanumeric) is created, including features like temperature, irradiance, humidity, voltage, and others.
These are the columns where invalid entries (like alphanumeric or corrupt text values) are most likely to occur.


Step 2: Remove Alphanumeric and Invalid Entries

import pandas as pd
def remove_alphanumeric(df, col):
    # Coerce non-numeric entries to NaN; modifies df in place
    df[col] = pd.to_numeric(df[col], errors='coerce')

This function converts each column to numeric format. If any cell contains non-numeric or invalid text (e.g., "badval", "N/A", "error"), it’s automatically converted to NaN.

➡️ Effect: Ensures that all numerical columns are truly numeric, replacing garbage or corrupt entries with missing values that can be handled systematically.
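As a minimal sketch of Steps 1 and 2 together, the snippet below applies the coercion to a toy frame. The two column names and the sample values are hypothetical stand-ins; the real contents of columns_to_clean_alphanumeric are not listed in this README.

```python
import pandas as pd

def remove_alphanumeric(df, col):
    # Coerce non-numeric entries to NaN; modifies df in place.
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Hypothetical toy data with the kinds of corrupt entries described above.
demo = pd.DataFrame({
    "temperature": ["25.4", "badval", "31"],
    "humidity": ["40", "N/A", "error"],
})

# Assumed subset of the real column list, for illustration only.
columns_to_clean_alphanumeric = ["temperature", "humidity"]
for col in columns_to_clean_alphanumeric:
    remove_alphanumeric(demo, col)

# Count how many entries were turned into missing values per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

After the loop both columns are numeric, and each corrupt string has become a missing value that later imputation steps can handle.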

Step 3: Replace Unrealistic Zero Values

import numpy as np
def remove_zeros(df, col):
    # Treat exact zeros as missing readings; modifies df in place
    df[col] = df[col].replace(0, np.nan)

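For specific columns such as module_temperature, an exact zero is treated as physically implausible for a panel in operation and is more likely a sensor dropout or placeholder than a real reading. Zeros in these columns are therefore replaced with NaN to mark them as missing. Note that this is a dataset-specific heuristic, since a module can genuinely reach 0°C in cold conditions.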

➡️ Effect: Prevents models from learning false patterns caused by impossible zero readings.
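A short sketch of the zero-replacement step on toy data follows. The sample readings are invented for illustration; only the function body and the module_temperature column name come from the text above.

```python
import numpy as np
import pandas as pd

def remove_zeros(df, col):
    # Treat exact zeros as missing readings; modifies df in place.
    df[col] = df[col].replace(0, np.nan)

# Hypothetical readings: two zeros that should be flagged as missing.
demo = pd.DataFrame({"module_temperature": [41.2, 0.0, 38.7, 0.0]})
remove_zeros(demo, "module_temperature")

n_missing = int(demo["module_temperature"].isna().sum())
print(n_missing)
```

Only the columns where zero is considered invalid should be passed to this function; applying it to a column such as cloud_coverage, where zero is a legitimate value, would discard real data.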

Step 4: Create TabNet-Compatible Copies

X_tabnet = X.copy()
X_test_tabnet = X_test.copy()

Separate copies are created because TabNet models cannot handle NaN values directly. These copies will undergo imputation to replace missing values safely.

Step 5: Impute Numerical Columns with Median

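num_cols = X_tabnet.select_dtypes(include=[np.number]).columns
for col in num_cols:
    # Median is computed on the training copy only and reused for the test copy
    median_val = X_tabnet[col].median()
    # Assign the result back: chained fillna(..., inplace=True) may not update the frame in recent pandas
    X_tabnet[col] = X_tabnet[col].fillna(median_val)
    X_test_tabnet[col] = X_test_tabnet[col].fillna(median_val)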

All missing numeric values are replaced with the median of the respective column (calculated from training data).

➡️ Effect: Median imputation is less sensitive to outliers than mean-based filling, and reusing the training median on the test set keeps test statistics out of the preprocessing.
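The following sketch illustrates why the median is preferred here, using a hypothetical voltage column with one extreme value. It also shows a vectorized variant of the loop (passing a Series of medians to fillna), which is an alternative formulation and not necessarily what the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test copies with missing values and one outlier.
X_tabnet = pd.DataFrame({"voltage": [10.0, 12.0, np.nan, 500.0]})
X_test_tabnet = pd.DataFrame({"voltage": [np.nan, 11.0]})

# Medians are computed from the training copy only.
train_medians = X_tabnet.median(numeric_only=True)
train_means = X_tabnet.mean(numeric_only=True)

# fillna accepts a Series keyed by column name, so one call fills every numeric column.
X_tabnet = X_tabnet.fillna(train_medians)
X_test_tabnet = X_test_tabnet.fillna(train_medians)

print(train_medians["voltage"], train_means["voltage"])
```

In this toy case the median stays at a typical reading while the mean is pulled far upward by the single outlier.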

Step 6: Impute Categorical Columns with "Unknown"

cat_cols = X_tabnet.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    # Assign the result back instead of using chained inplace fillna
    X_tabnet[col] = X_tabnet[col].fillna("Unknown")
    X_test_tabnet[col] = X_test_tabnet[col].fillna("Unknown")

Missing categorical values are replaced with the placeholder "Unknown" to retain these records rather than dropping them.

➡️ Effect: Prevents data loss and ensures that categorical encoders or models don’t fail due to null labels.
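One caveat worth sketching: if a column has pandas category dtype, filling with a label that is not already a registered category raises an error. The helper below, fill_unknown, is a hypothetical name introduced for illustration; it registers the placeholder first and then fills. The column names and values in the toy frame are also assumptions.

```python
import pandas as pd

def fill_unknown(series, placeholder="Unknown"):
    # For categorical columns, register the placeholder as a category before filling.
    if isinstance(series.dtype, pd.CategoricalDtype):
        if placeholder not in series.cat.categories:
            series = series.cat.add_categories([placeholder])
    return series.fillna(placeholder)

# Hypothetical frame with one categorical column and one plain string column.
demo = pd.DataFrame({
    "installation_type": pd.Categorical(["fixed", None, "dual-axis"]),
    "string_id": ["A1", None, None],
})

for col in ["installation_type", "string_id"]:
    demo[col] = fill_unknown(demo[col])

print(demo)
```

For plain object columns the helper behaves exactly like the loop above, so it can be used for both dtypes.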

🔧 3. Feature Engineered Columns and Their Significance

1. temp_diff

Definition: temperature - module_temperature
Significance: Highlights thermal stress; large differences indicate inefficiency in cooling or poor heat dissipation, directly affecting output power.


2. irradiance_efficiency

Definition: (voltage × current) / irradiance
Significance: Measures the conversion efficiency of solar irradiance into electrical energy — useful for detecting performance degradation or inefficiency.


3. irradiance_norm

Definition: irradiance / module_temperature
Significance: Normalizes irradiance against temperature effects; helps the model learn temperature-induced efficiency losses.


4. cleaned_irradiance

Definition: irradiance × (1 - soiling_ratio)
Significance: Adjusts irradiance for panel soiling; represents the effective irradiance actually contributing to energy generation.


5. irradiance_through_clouds

Definition: irradiance × (1 - cloud_coverage / 100)
Significance: Estimates irradiance after cloud interference; models the effect of cloudy conditions or shading on solar performance.


6. temp_to_wind_ratio

Definition: temperature / (wind_speed + 1e-3)
Significance: Indicates cooling efficiency; higher ratios suggest poor heat dissipation when wind speed is low — crucial for predicting thermal performance.


7. irradiance_by_pressure

Definition: irradiance / (pressure + 1e-3)
Significance: Captures atmospheric attenuation — lower pressure can increase irradiance intensity due to thinner air, affecting solar input quality.


8. inverse_maintenance

Definition: 1 / (maintenance_count + 1)
Significance: Inversely related to maintenance frequency; higher values imply fewer maintenance events, which may correlate with performance degradation.


9. cleanliness_adjusted_irradiance

Definition: irradiance × inverse_maintenance
Significance: Combines maintenance and irradiance effects into a single metric — a proxy for cleanliness and upkeep impact on irradiance capture.


10. humidity_pressure_index

Definition: humidity / (pressure + 1e-3)
Significance: Captures the joint influence of atmospheric moisture and pressure — can affect irradiance scattering and panel cooling efficiency.


11. voltage_current_ratio

Definition: voltage / (current + 1e-3)
Significance: Represents electrical load behavior; deviations from expected ratios can signal inefficiency, shading, or partial system faults.


12. is_dual_axis

Definition: (installation_type == "dual-axis").astype(int)
Significance: Encodes mechanical tracking capability — dual-axis installations generally achieve higher irradiance capture and output stability.


13. stress_index

Definition: irradiance × module_temperature
Significance: Reflects thermal and irradiance stress on modules; helps model nonlinear efficiency losses or degradation under extreme operating conditions.


14. has_error

Definition: error_code.notnull().astype(int)
Significance: Binary indicator for operational faults; helps the model explain anomalies in power output linked to logged system errors.
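
The fourteen definitions above can be collected into a single function. This is a sketch under stated assumptions: the function name add_engineered_features and the sample rows are hypothetical, while the column names and formulas are taken from the definitions in this section.

```python
import pandas as pd

EPS = 1e-3  # small constant from the definitions above, avoids division by zero

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the 14 engineered columns appended."""
    out = df.copy()
    out["temp_diff"] = out["temperature"] - out["module_temperature"]
    out["irradiance_efficiency"] = out["voltage"] * out["current"] / out["irradiance"]
    out["irradiance_norm"] = out["irradiance"] / out["module_temperature"]
    out["cleaned_irradiance"] = out["irradiance"] * (1 - out["soiling_ratio"])
    out["irradiance_through_clouds"] = out["irradiance"] * (1 - out["cloud_coverage"] / 100)
    out["temp_to_wind_ratio"] = out["temperature"] / (out["wind_speed"] + EPS)
    out["irradiance_by_pressure"] = out["irradiance"] / (out["pressure"] + EPS)
    out["inverse_maintenance"] = 1 / (out["maintenance_count"] + 1)
    out["cleanliness_adjusted_irradiance"] = out["irradiance"] * out["inverse_maintenance"]
    out["humidity_pressure_index"] = out["humidity"] / (out["pressure"] + EPS)
    out["voltage_current_ratio"] = out["voltage"] / (out["current"] + EPS)
    out["is_dual_axis"] = (out["installation_type"] == "dual-axis").astype(int)
    out["stress_index"] = out["irradiance"] * out["module_temperature"]
    out["has_error"] = out["error_code"].notnull().astype(int)
    return out

# Hypothetical sample rows, chosen so the results are easy to check by hand.
sample = pd.DataFrame({
    "temperature": [30.0, 25.0],
    "module_temperature": [50.0, 40.0],
    "voltage": [20.0, 18.0],
    "current": [2.0, 1.5],
    "irradiance": [800.0, 600.0],
    "soiling_ratio": [0.25, 0.1],
    "cloud_coverage": [50.0, 20.0],
    "wind_speed": [4.999, 2.0],
    "pressure": [1013.0, 1008.0],
    "maintenance_count": [3, 0],
    "humidity": [45.0, 60.0],
    "installation_type": ["dual-axis", "fixed"],
    "error_code": [None, "E01"],
})
features = add_engineered_features(sample)
print(features[["temp_diff", "cleaned_irradiance", "is_dual_axis", "has_error"]])
```

Because has_error depends on error_code being null, it should be computed before any step that fills missing categorical values with "Unknown".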

4. Cross-Validation Strategy (K-Fold)

To ensure robustness, 5-Fold Cross Validation is employed across all base learners:

  • The dataset is split into 5 folds.
  • Each model trains on 4 folds and validates on the 5th.
  • This process repeats 5 times, ensuring that every data point is used once for validation.
  • Out-of-fold predictions are generated for each base model — these become input features for the meta-learner.

This ensures:

  • Limited leakage into the meta-layer: every out-of-fold prediction comes from a model that never saw that row during training.
  • Reliable generalization assessment for the meta-layer.
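
The out-of-fold procedure described above can be sketched with scikit-learn as follows. The two regressors are lightweight stand-ins for the actual eight base learners, the data is synthetic, and the helper name out_of_fold_predictions is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def out_of_fold_predictions(models, X, y, n_splits=5, seed=42):
    """Return an (n_samples, n_models) matrix of out-of-fold predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros((X.shape[0], len(models)))
    for train_idx, valid_idx in kf.split(X):
        for j, model in enumerate(models):
            # Fit a fresh copy on 4 folds, predict the held-out fold.
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            oof[valid_idx, j] = fitted.predict(X[valid_idx])
    return oof

# Synthetic regression data standing in for the challenge dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Stand-in base learners; the real pipeline uses 4 boosting and 4 neural models.
base_models = [
    DecisionTreeRegressor(max_depth=4, random_state=0),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
]
oof = out_of_fold_predictions(base_models, X, y)
print(oof.shape)
```

Each row of the resulting matrix is filled exactly once, by models that did not train on that row, and each column becomes one input feature for the meta-learner.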

5. Meta-Learner: Ridge Regression Layer

The Ridge Regression meta-learner sits at the top of the stack, combining predictions from all 8 base learners.

Key aspects:

  • Uses L2 regularization to shrink the weights, which keeps them stable when base model predictions are highly correlated.
  • Learns one global weight per base model, so more reliable models contribute more to the final prediction.
  • Aggregates base model outputs into a single, stable prediction.
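
As a rough illustration of how the meta-learner weights its inputs, the sketch below fits a ridge regression on a synthetic out-of-fold matrix with three simulated base models of differing accuracy. The data, the noise levels, and alpha=1.0 are assumptions chosen for the example and are not the values used in the competition pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)  # synthetic target

# Simulated out-of-fold predictions: target plus noise of increasing size.
oof = np.column_stack([
    y + rng.normal(scale=0.1, size=n),  # accurate base model
    y + rng.normal(scale=0.5, size=n),  # moderately noisy base model
    y + rng.normal(scale=1.0, size=n),  # very noisy base model
])

# alpha controls the strength of the L2 penalty on the weights.
meta = Ridge(alpha=1.0).fit(oof, y)
weights = meta.coef_
print(weights)

# At inference time the base models' test predictions are stacked the same way.
final_prediction = meta.predict(oof[:5])
```

The most accurate simulated model receives the largest coefficient, which is the behavior the meta-layer relies on when combining the real base learners.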

🧠 Why Stacked Ensemble?

Unlike a single model that learns one representation of data, stacking learns how models complement one another.

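  • Boosting models (XGBoost, LightGBM, CatBoost) capture non-linear interactions through iteratively refined trees.
  • EBM contributes an additive, interpretable view of feature effects.
  • MLP and DCN learn deep representations and explicit feature crosses.
  • FT-Transformer and TabNet apply attention mechanisms to tabular features.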

The ridge meta-learner then fits one weight per base model on the out-of-fold predictions, minimizing L2-penalized squared error.

Combining diverse learners in this way is intended to reduce variance and offset the biases of individual models, which typically improves out-of-sample performance over any single base model.


🧪 Future Scope

  • Frontend Integration: Couple this pipeline to a web-based interface for real-time regression on .csv inputs.
  • Model Expansion: Experiment with adding Factorization Machines or further neural architectures as additional base learners.
  • Interpretability: Introduce SHAP and LIME visualizations to interpret feature contributions.
  • Efficiency Improvements: Optimize ensemble parallelization and memory footprint.

📊 Results Summary

| Metric | Description |
|---|---|
| Rank | 61 / 700 (Top 9%) |
| Pipeline Type | 8-Model Stacked Ensemble |
| Meta-Learner | Ridge Regression |
| Cross Validation | 5-Fold CV |
| Primary Use Case | Solar Panel Performance Degradation Prediction |

Author

Steve Sojan
Machine Learning Engineer | AI for Social Impact
📫 https://www.linkedin.com/in/stevesojan/
