This project was developed as part of the Zelestra X AWS ML Ascend Challenge - 2nd Edition, where participants were tasked with building a machine learning pipeline to predict performance degradation and optimize solar panel efficiency.
Out of ~700 participants worldwide, this pipeline ranked 61st (Top 9%) on the official leaderboard.
The approach centers on an 8-model stacked ensemble with a ridge regression meta-learner, designed to capture distinct feature interactions, learn non-linear dependencies, and produce robust predictions validated with 5-fold cross-validation.
Develop a scalable machine learning pipeline that predicts solar panel performance degradation over time, considering multiple feature interactions such as environmental, operational, and system-level parameters.
The challenge demanded:
- High predictive accuracy on unseen data.
- Stability across multiple validation folds.
- Interpretability and generalizability.
The ensemble’s foundation consists of 8 diverse models, each designed to learn feature interactions in a unique way — maximizing representational diversity and ensuring that different data patterns are captured across learners.
| Category | Model | Distinct Learning Strategy |
|---|---|---|
| Boosting Models (4) | XGBoost, LightGBM, CatBoost, Explainable Boosting Machine (EBM) | Iteratively refine predictions through boosting, capturing non-linear feature interactions and explaining feature importance. |
| Neural Models (4) | Multilayer Perceptron (MLP), FT-Transformer, TabNet, Deep & Cross Network (DCN) | Leverage deep representations, attention mechanisms, and explicit feature crossing to capture complex dependencies and high-order feature interactions. |
Together, these 8 models maximize feature interaction coverage — where one model may overfit a specific interaction, another balances it by generalizing across deeper or orthogonal feature patterns.
This architectural diversity forms the backbone of the stacked ensemble, providing the ridge regression meta-learner with rich, non-redundant base predictions.
This preprocessing pipeline ensures that both numerical and categorical features are properly cleaned, validated, and made model-ready — particularly optimized for TabNet, which cannot handle missing values.
A list of numeric columns (columns_to_clean_alphanumeric) is created, including features like temperature, irradiance, humidity, voltage, and others.
These are the columns where invalid entries (like alphanumeric or corrupt text values) are most likely to occur.
```python
def remove_alphanumeric(df, col):
    # Coerce to numeric; anything that cannot be parsed becomes NaN.
    df[col] = pd.to_numeric(df[col], errors='coerce')
```
This function converts the given column to numeric format. Any cell that contains non-numeric or invalid text (e.g., "badval", "N/A", "error") is automatically converted to NaN.
➡️ Effect: Ensures that all numerical columns are truly numeric, replacing garbage or corrupt entries with missing values that can be handled systematically.
```python
def remove_zeros(df, col):
    # Treat exact zeros as missing readings.
    df[col] = df[col].replace(0, np.nan)
```
For specific columns like module_temperature, a reading of exactly zero is treated as physically unrealistic (an operating module is very unlikely to report exactly 0°C), so it most likely marks a sensor or logging fault. Zeros are therefore replaced with NaN to indicate missing or invalid data.
➡️ Effect: Prevents models from learning false patterns caused by impossible zero readings.
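As a quick illustration of the two cleaning steps above, the sketch below applies them to a toy frame. The helper names follow the ones described above; the sample values and the choice of `humidity` as a second column are made up for the example.

```python
import numpy as np
import pandas as pd

def remove_alphanumeric(df, col):
    # Coerce to numeric; unparseable entries become NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce")

def remove_zeros(df, col):
    # Treat exact zeros as missing readings.
    df[col] = df[col].replace(0, np.nan)

# Toy data (made up for illustration only).
df = pd.DataFrame({
    "module_temperature": ["41.5", "0", "badval", "38.2"],
    "humidity": ["55", "N/A", "61.3", "error"],
})

for col in ["module_temperature", "humidity"]:
    remove_alphanumeric(df, col)
remove_zeros(df, "module_temperature")

# Count of missing values per column after cleaning.
n_missing = df.isna().sum()
```

Both columns end up as floating-point columns, with the corrupt strings and the zero temperature reading marked as missing, ready for the imputation step.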
```python
X_tabnet = X.copy()
X_test_tabnet = X_test.copy()
```
Separate copies are created because TabNet models cannot handle NaN values directly. These copies will undergo imputation to replace missing values safely.
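```python
# Impute numeric columns with medians computed on the training copy only.
num_cols = X_tabnet.select_dtypes(include=[np.number]).columns
for col in num_cols:
    median_val = X_tabnet[col].median()
    # Assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas.
    X_tabnet[col] = X_tabnet[col].fillna(median_val)
    X_test_tabnet[col] = X_test_tabnet[col].fillna(median_val)
```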
All missing numeric values are replaced with the median of the respective column (calculated from training data).
➡️ Effect: Median imputation preserves distribution robustness and reduces the influence of outliers compared to mean-based filling.
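To make the outlier argument concrete, here is a minimal sketch comparing the two fill strategies on a single column. The readings are invented for the example and are not taken from the challenge data.

```python
import numpy as np
import pandas as pd

# Made-up readings with one extreme outlier and one missing value.
readings = pd.Series([480.0, 500.0, 520.0, 5000.0, np.nan])

# The mean is pulled far upward by the outlier; the median is not.
mean_fill = readings.fillna(readings.mean())
median_fill = readings.fillna(readings.median())

print("mean-filled value:  ", mean_fill.iloc[-1])
print("median-filled value:", median_fill.iloc[-1])
```

The mean-based fill lands at 1625.0, well outside the typical range of the other readings, while the median-based fill lands at 510.0.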
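```python
# Fill missing categorical values with a placeholder label.
# Note: for 'category' dtype columns, "Unknown" must already be a category
# (e.g. via .cat.add_categories), otherwise fillna raises an error.
cat_cols = X_tabnet.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    X_tabnet[col] = X_tabnet[col].fillna("Unknown")
    X_test_tabnet[col] = X_test_tabnet[col].fillna("Unknown")
```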
Missing categorical values are replaced with the placeholder "Unknown" to retain these records rather than dropping them.
➡️ Effect: Prevents data loss and ensures that categorical encoders or models don’t fail due to null labels.
Definition: temperature - module_temperature
Significance: Highlights thermal stress; large differences indicate inefficiency in cooling or poor heat dissipation, directly affecting output power.
Definition: (voltage × current) / irradiance
Significance: Measures the conversion efficiency of solar irradiance into electrical energy — useful for detecting performance degradation or inefficiency.
Definition: irradiance / module_temperature
Significance: Normalizes irradiance against temperature effects; helps the model learn temperature-induced efficiency losses.
Definition: irradiance × (1 - soiling_ratio)
Significance: Adjusts irradiance for panel soiling; represents the effective irradiance actually contributing to energy generation.
Definition: irradiance × (1 - cloud_coverage / 100)
Significance: Estimates irradiance after cloud interference; models the effect of cloudy conditions or shading on solar performance.
Definition: temperature / (wind_speed + 1e-3)
Significance: Indicates cooling efficiency; higher ratios suggest poor heat dissipation when wind speed is low — crucial for predicting thermal performance.
Definition: irradiance / (pressure + 1e-3)
Significance: Captures atmospheric attenuation — lower pressure can increase irradiance intensity due to thinner air, affecting solar input quality.
Definition: 1 / (maintenance_count + 1)
Significance: Inversely related to maintenance frequency; higher values imply fewer maintenance events, which may correlate with performance degradation.
Definition: irradiance × inverse_maintenance
Significance: Combines maintenance and irradiance effects into a single metric — a proxy for cleanliness and upkeep impact on irradiance capture.
Definition: humidity / (pressure + 1e-3)
Significance: Captures the joint influence of atmospheric moisture and pressure — can affect irradiance scattering and panel cooling efficiency.
Definition: voltage / (current + 1e-3)
Significance: Represents electrical load behavior; deviations from expected ratios can signal inefficiency, shading, or partial system faults.
Definition: (installation_type == "dual-axis").astype(int)
Significance: Encodes mechanical tracking capability — dual-axis installations generally achieve higher irradiance capture and output stability.
Definition: irradiance × module_temperature
Significance: Reflects thermal and irradiance stress on modules; helps model nonlinear efficiency losses or degradation under extreme operating conditions.
Definition: error_code.notnull().astype(int)
Significance: Binary indicator for operational faults; helps the model explain anomalies in power output linked to logged system errors.
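A minimal sketch of how several of the definitions above could be computed with pandas. The input column names come from the definitions themselves; the output column names (`temp_diff`, `power_per_irradiance`, and so on) and the helper `add_engineered_features` are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

EPS = 1e-3  # same small constant that appears in the definitions above

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with derived columns appended.

    Output column names are hypothetical; inputs follow the definitions.
    """
    out = df.copy()
    out["temp_diff"] = out["temperature"] - out["module_temperature"]
    out["power_per_irradiance"] = out["voltage"] * out["current"] / out["irradiance"]
    out["effective_irradiance"] = out["irradiance"] * (1 - out["soiling_ratio"])
    out["cloud_adjusted_irradiance"] = out["irradiance"] * (1 - out["cloud_coverage"] / 100)
    out["temp_wind_ratio"] = out["temperature"] / (out["wind_speed"] + EPS)
    out["inverse_maintenance"] = 1 / (out["maintenance_count"] + 1)
    out["is_dual_axis"] = (out["installation_type"] == "dual-axis").astype(int)
    out["has_error"] = out["error_code"].notnull().astype(int)
    return out

# Two made-up rows to exercise the function.
sample = pd.DataFrame({
    "temperature": [25.0, 30.0],
    "module_temperature": [40.0, 50.0],
    "voltage": [30.0, 20.0],
    "current": [5.0, 4.0],
    "irradiance": [500.0, 800.0],
    "soiling_ratio": [0.1, 0.2],
    "cloud_coverage": [20.0, 50.0],
    "wind_speed": [2.0, 0.0],
    "maintenance_count": [1, 3],
    "installation_type": ["dual-axis", "fixed"],
    "error_code": ["E01", None],
})
features = add_engineered_features(sample)
```

Ratios whose definition has no `1e-3` guard, such as the one dividing by irradiance, produce `inf` or NaN when the denominator is zero, so such rows may need separate handling before training.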
To ensure robustness, 5-Fold Cross Validation is employed across all base learners:
- The dataset is split into 5 folds.
- Each model trains on 4 folds and validates on the 5th.
- This process repeats 5 times, ensuring that every data point is used once for validation.
- Out-of-fold predictions are generated for each base model — these become input features for the meta-learner.
This ensures:
- No data leakage between training and validation stages.
- Reliable generalization assessment for the meta-layer.
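The out-of-fold procedure described above can be sketched as follows. This is a NumPy-only illustration under stated assumptions: `LeastSquares` and `MeanPredictor` are simple stand-ins for the real base learners, `out_of_fold_predictions` is a hypothetical helper, and the shuffling seed is arbitrary. Any object with `fit` and `predict` methods could be plugged in, and scikit-learn's `KFold` would serve the same purpose as the manual split.

```python
import numpy as np

class LeastSquares:
    """Stand-in base learner: ordinary least squares with an intercept."""
    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

class MeanPredictor:
    """Stand-in base learner: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def out_of_fold_predictions(model_factories, X, y, n_splits=5, seed=0):
    """Return an (n_samples, n_models) matrix of out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.full((len(X), len(model_factories)), np.nan)
    for val_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False  # hold this fold out of training
        for j, make_model in enumerate(model_factories):
            model = make_model().fit(X[train_mask], y[train_mask])
            # Each row is predicted only by a model that never saw it.
            oof[val_idx, j] = model.predict(X[val_idx])
    return oof

# Synthetic regression data for the demonstration.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

oof = out_of_fold_predictions([LeastSquares, MeanPredictor], X, y)
```

Each column of `oof` holds one base model's predictions, and every row was filled by a model trained without that row, which is what keeps the meta-learner's inputs free of leakage.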
The Ridge Regression meta-learner sits at the top of the stack, combining predictions from all 8 base learners.
Key aspects:
- Uses L2 regularization to balance contributions of each model.
- Learns one global weight per base model, so models whose out-of-fold predictions track the target more closely receive more influence.
- Aggregates base model outputs into a single, stable prediction.
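As a rough sketch of the meta-learning step, the block below fits ridge regression in closed form on a synthetic stand-in for the out-of-fold matrix. The regularization strength `alpha=1.0`, the helpers `fit_ridge` and `predict_ridge`, and the three simulated base models are all assumptions made for illustration; the source does not state which library or hyperparameters were used, and a library estimator such as scikit-learn's `Ridge` would be a common choice in practice.

```python
import numpy as np

def fit_ridge(Z, y, alpha=1.0):
    """Closed-form ridge: minimise ||y - Zw - b||^2 + alpha * ||w||^2."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc, yc = Z - z_mean, y - y_mean  # centring leaves the intercept unpenalised
    w = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ yc)
    b = y_mean - z_mean @ w
    return w, b

def predict_ridge(Z, w, b):
    return Z @ w + b

# Synthetic stand-in for out-of-fold predictions from three base models.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
Z = np.column_stack([
    y + rng.normal(scale=0.1, size=200),  # accurate base model
    y + rng.normal(scale=0.5, size=200),  # noisier base model
    rng.normal(size=200),                 # uninformative base model
])

w, b = fit_ridge(Z, y, alpha=1.0)
stacked = predict_ridge(Z, w, b)
print("meta-learner weights:", np.round(w, 3))
```

The fitted weights are global: the accurate model receives most of the weight and the uninformative one receives a weight near zero, while the L2 penalty keeps the weights from growing large when base predictions are highly correlated.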
Unlike a single model that learns one representation of the data, stacking learns how models complement one another.
- Boosting models (XGBoost, LightGBM, CatBoost, EBM) capture non-linear feature interactions through iterative refinement.
- Neural models (MLP, FT-Transformer, TabNet, DCN) contribute deep representations, attention mechanisms, and explicit feature crossing.
The ridge meta-learner then fits weights for these models on their out-of-fold predictions by minimizing an L2-regularized squared error.
This architecture is intended to reduce variance and correct bias at the same time, which should improve out-of-sample performance.
- Frontend Integration: Couple this pipeline to a web-based interface for real-time regression on `.csv` inputs.
- Model Expansion: Experiment with adding Factorization Machines or additional neural network architectures as base learners.
- Interpretability: Introduce SHAP and LIME visualizations to interpret feature contributions.
- Efficiency Improvements: Optimize ensemble parallelization and memory footprint.
| Metric | Description |
|---|---|
| Rank | 61 / 700 (Top 9%) |
| Pipeline Type | 8-Model Stacked Ensemble |
| Meta-Learner | Ridge Regression |
| Cross Validation | 5-Fold CV |
| Primary Use Case | Solar Panel Performance Degradation Prediction |
[Steve Sojan]
Machine Learning Engineer | AI for Social Impact
📫 [https://www.linkedin.com/in/stevesojan/]
