GitHub - SasySpanish/NOAA-GSOD-Global-Weather-Analysis-with-PySpark-on-Databricks: Cloud-friendly data analysis of global weather data (NOAA GSOD) with PySpark on Databricks, storing in Data Lake — temperature trends, precipitation extremes, heatwaves, anomalies, interactive Plotly visualizations.

Global Weather Analysis Project

This project explores global weather patterns using NOAA's Global Surface Summary of the Day (GSOD) dataset, processed entirely on Databricks Free Edition. It demonstrates a complete ETL and analytical workflow for handling historical weather data (2000–2024), focusing on temperature trends, precipitation extremes, and variability across continents, countries, and cities.

Data Ingestion from NOAA on Amazon S3

The journey starts with ingesting raw daily weather summaries directly from NOAA's public S3 bucket (s3://noaa-gsod-pds/).
To stay within the free cluster's memory and compute limits, we avoided downloading the full archive and instead targeted a curated selection of 77 weather stations worldwide.
Data was read using PySpark, parsed from gzipped fixed-width files, and immediately cleaned (missing values → null, Fahrenheit → Celsius conversion, basic quality filtering).
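
The cleaning rules above can be sketched in plain Python to make them concrete. This is a minimal illustration, not the project's actual PySpark code. The sentinel values follow the GSOD field documentation as commonly published (assumed here), the inches-to-millimetres step is an assumption implied by the millimetre thresholds used later, and `clean_record` and the output key names are hypothetical.

```python
from typing import Optional

# Missing-value sentinels for a few GSOD fields (assumed from the GSOD docs).
MISSING_SENTINELS = {"TEMP": 9999.9, "MAX": 9999.9, "MIN": 9999.9, "PRCP": 99.99}


def to_null(field: str, raw: Optional[float]) -> Optional[float]:
    """Return None when the raw value is absent or equals the field's sentinel."""
    if raw is None:
        return None
    sentinel = MISSING_SENTINELS.get(field)
    if sentinel is not None and abs(raw - sentinel) < 1e-6:
        return None
    return raw


def fahrenheit_to_celsius(temp_f: Optional[float]) -> Optional[float]:
    """Convert Fahrenheit to Celsius, passing nulls through unchanged."""
    if temp_f is None:
        return None
    return round((temp_f - 32.0) * 5.0 / 9.0, 2)


def inches_to_mm(value_in: Optional[float]) -> Optional[float]:
    """Convert inches to millimetres, passing nulls through unchanged."""
    if value_in is None:
        return None
    return round(value_in * 25.4, 2)


def clean_record(raw: dict) -> dict:
    """Apply null handling and unit conversion to one raw daily record."""
    return {
        "temp_c": fahrenheit_to_celsius(to_null("TEMP", raw.get("TEMP"))),
        "tmax_c": fahrenheit_to_celsius(to_null("MAX", raw.get("MAX"))),
        "tmin_c": fahrenheit_to_celsius(to_null("MIN", raw.get("MIN"))),
        "prcp_mm": inches_to_mm(to_null("PRCP", raw.get("PRCP"))),
    }
```

In Spark the same logic would normally be written as column expressions rather than per-row Python functions, so that it runs inside the engine.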

Structured Division: 6 Continents → 5 Countries → 3 Cities

The dataset was intentionally divided into six continents:

Europe
Asia
Africa
North America
South America
Oceania

For each continent, five representative countries were selected, and within each country three major cities (stations) were chosen, giving a final set of 77 stations. That is fewer than the nominal 6 × 5 × 3 = 90, so not every slot in the grid is filled.
This hierarchical structure enabled modular processing (one notebook per continent) and ensured balanced geographic coverage without overwhelming resources.

Examples:

Europe → Italy (Bolzano, Rome Ciampino, Palermo), France (Paris CDG, Marseille, Brest), Germany, Spain, UK…
Asia → China (Beijing, Shanghai, Guangzhou), India (Delhi, Mumbai, Chennai), Japan, Russia, Indonesia…
Africa → South Africa (Cape Town, Johannesburg, Durban), Nigeria…
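
One way to represent this hierarchy is a nested mapping that each continent notebook can slice. The sketch below is illustrative only: it contains just the stations named in the examples above, and `STATIONS`, `flatten`, and `stations_for` are hypothetical names, not taken from the repository.

```python
# Partial hierarchy: only the stations listed in the examples above.
STATIONS = {
    "Europe": {
        "Italy": ["Bolzano", "Rome Ciampino", "Palermo"],
        "France": ["Paris CDG", "Marseille", "Brest"],
    },
    "Asia": {
        "China": ["Beijing", "Shanghai", "Guangzhou"],
        "India": ["Delhi", "Mumbai", "Chennai"],
    },
    "Africa": {
        "South Africa": ["Cape Town", "Johannesburg", "Durban"],
    },
}


def flatten(hierarchy: dict) -> list:
    """Turn the nested mapping into (continent, country, city) rows."""
    return [
        (continent, country, city)
        for continent, countries in hierarchy.items()
        for country, cities in countries.items()
        for city in cities
    ]


def stations_for(hierarchy: dict, continent: str) -> list:
    """Return (country, city) pairs for one continent, e.g. for one notebook."""
    return [
        (country, city)
        for country, cities in hierarchy.get(continent, {}).items()
        for city in cities
    ]
```

The flattened rows could serve as a small lookup table for tagging each daily record with its continent and country.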

Storage in Delta Lake

After initial cleaning and enrichment (adding continent, country, hemisphere, season), data was saved in Delta Lake format — first per continent, then unioned into a single global table.
Delta provided schema enforcement, ACID transactions, and efficient querying even on limited hardware, making iterative analysis fast and reliable.
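
The hemisphere and season columns added during enrichment can be derived from a station's latitude and the observation month. The sketch below assumes meteorological seasons (December to February is northern winter) mirrored for the Southern Hemisphere. The README does not state which definition was used, and the function names are hypothetical.

```python
# Meteorological seasons for the Northern Hemisphere, keyed by month number.
_NORTHERN_SEASONS = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Seasons are offset by half a year south of the equator.
_OPPOSITE = {
    "Winter": "Summer",
    "Summer": "Winter",
    "Spring": "Autumn",
    "Autumn": "Spring",
}


def hemisphere(latitude: float) -> str:
    """Classify a station by the sign of its latitude (equator counts as Northern)."""
    return "Northern" if latitude >= 0 else "Southern"


def season(month: int, latitude: float) -> str:
    """Return the meteorological season for a month at the given latitude."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    northern = _NORTHERN_SEASONS[month]
    return northern if latitude >= 0 else _OPPOSITE[northern]
```

For example, July maps to summer for a station near 42°N and to winter for one near 34°S.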

Feature Engineering & Statistical Calculations

Using PySpark aggregations, we computed a rich set of features by year, continent, country, and city:

Average / max / min temperature
Days >35°C, days <0°C (frost)
Precipitation days (>10 mm, >50 mm)
Diurnal temperature range (Tmax – Tmin)
Temperature variability (standard deviation)
Humidity %, wind speed averages
Extreme events (thunder, snow, hail from FRSHTT flags)
Anomalies vs 2000–2009 baseline per station
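
Several of the features listed above reduce to grouped counts and averages over daily records. The following plain-Python sketch shows the idea for one (station, year) grouping. It assumes records are already cleaned to Celsius and millimetres, applies the >35°C threshold to the daily maximum and the frost threshold to the daily minimum (the README does not say which temperature is used), and reads FRSHTT as six 0/1 digits in the order fog, rain, snow, hail, thunder, tornado. All field and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

FLAG_NAMES = ("fog", "rain", "snow", "hail", "thunder", "tornado")


def frshtt_flags(code: str) -> dict:
    """Decode a FRSHTT string; zfill restores leading zeros lost in parsing."""
    return {name: digit == "1" for name, digit in zip(FLAG_NAMES, code.zfill(6))}


def _count(days: list, key: str, predicate) -> int:
    """Count days whose non-null value for `key` satisfies the predicate."""
    return sum(1 for d in days if d[key] is not None and predicate(d[key]))


def yearly_features(records: list) -> dict:
    """Aggregate daily records into features keyed by (station, year)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["station"], rec["year"])].append(rec)

    features = {}
    for key, days in groups.items():
        temps = [d["temp_c"] for d in days if d["temp_c"] is not None]
        ranges = [
            d["tmax_c"] - d["tmin_c"]
            for d in days
            if d["tmax_c"] is not None and d["tmin_c"] is not None
        ]
        flags = [frshtt_flags(d["frshtt"]) for d in days]
        features[key] = {
            "avg_temp_c": mean(temps) if temps else None,
            # Sample standard deviation needs at least two values.
            "temp_std_c": stdev(temps) if len(temps) > 1 else None,
            "hot_days": _count(days, "tmax_c", lambda t: t > 35),
            "frost_days": _count(days, "tmin_c", lambda t: t < 0),
            "rain_days_10mm": _count(days, "prcp_mm", lambda p: p > 10),
            "rain_days_50mm": _count(days, "prcp_mm", lambda p: p > 50),
            "avg_diurnal_range_c": mean(ranges) if ranges else None,
            "thunder_days": sum(f["thunder"] for f in flags),
            "snow_days": sum(f["snow"] for f in flags),
            "hail_days": sum(f["hail"] for f in flags),
        }
    return features
```

In PySpark the equivalent would typically be a single `groupBy(...).agg(...)` with conditional sums, which avoids collecting daily rows to the driver.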

These metrics form the basis for all downstream insights.
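
The anomaly metric can be illustrated as the difference between each station-year mean and that station's own 2000–2009 average. The sketch below is one plausible reading of "anomalies vs 2000–2009 baseline per station". The input shape and the `anomalies` function are hypothetical, and stations with no baseline years are simply skipped here, which is an assumption.

```python
from statistics import mean

BASELINE_YEARS = range(2000, 2010)  # 2000 through 2009 inclusive


def anomalies(annual_means: dict) -> dict:
    """Map (station, year) -> annual mean minus the station's baseline mean.

    `annual_means` maps (station, year) to a mean temperature in Celsius.
    Stations without any baseline-period values are left out of the result.
    """
    baseline_values = {}
    for (station, year), value in annual_means.items():
        if year in BASELINE_YEARS and value is not None:
            baseline_values.setdefault(station, []).append(value)

    baselines = {s: mean(values) for s, values in baseline_values.items()}

    return {
        (station, year): value - baselines[station]
        for (station, year), value in annual_means.items()
        if value is not None and station in baselines
    }
```

Using a per-station baseline keeps the comparison local: a warm station and a cold station are each measured against their own history rather than a global mean.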

Visualizations with Seaborn & Matplotlib

Final analysis and plots were created by converting Spark results to Pandas DataFrames and using Seaborn + Matplotlib for publication-quality charts.
All plots are available in the repository. Key visualizations include:

Top 6 Continents by Average Temperature in 2024
Heatmap of average days with >10 mm rain per continent and year
Trend of annual mean temperature per continent (2014–2024)
Temperature Variability (Standard Deviation) per Continent and Year (2014–2024)
Average Diurnal Temperature Range (Tmax – Tmin) per Country – Last 5 Years

The visualizations reveal strong tropical influence on highest temperatures, elevated heavy-rain days in Europe and parts of South America/Asia, greater variability in Asia and North America, and a general pattern of stable-to-slightly-warming continental averages over the decade.
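
A phrase like "stable-to-slightly-warming" can be made quantitative with an ordinary least-squares slope over the annual means. The sketch below shows that calculation in plain Python. It is an illustration of the technique only: the README does not say that a regression was fitted, and `linear_trend` and `trend_per_decade` are hypothetical names.

```python
def linear_trend(years: list, values: list) -> float:
    """Least-squares slope of values against years (degrees per year)."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired observations")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


def trend_per_decade(years: list, values: list) -> float:
    """Express the fitted slope in degrees per decade."""
    return linear_trend(years, values) * 10.0
```

With only eleven annual points (2014–2024), such a slope is sensitive to individual warm or cold years, so it is best read alongside the variability plots rather than on its own.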

All code runs end to end on Databricks Free Edition, from S3 ingestion to the final Seaborn charts.

Author

Developed by Salvatore Spagnuolo
