This project explores global weather patterns using NOAA's Global Surface Summary of the Day (GSOD) dataset, processed entirely on Databricks Free Edition. It demonstrates a complete ETL and analytical workflow for handling historical weather data (2000–2024), focusing on temperature trends, precipitation extremes, and variability across continents, countries, and cities.
The journey starts with ingesting raw daily weather summaries directly from NOAA's public S3 bucket (s3://noaa-gsod-pds/).
To stay within the free cluster's memory and compute limits, we avoided downloading the full archive and instead targeted a curated selection of 77 weather stations worldwide.
Data was read using PySpark, parsed from gzipped fixed-width files, and immediately cleaned (missing values → null, Fahrenheit → Celsius conversion, basic quality filtering).
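The cleaning rules above can be sketched in plain Python to make the logic explicit. This is a minimal illustration, not the notebook code: the sentinel values (9999.9 for temperatures, 99.99 for precipitation) and the field names `TEMP`, `MAX`, `MIN`, `PRCP` follow the GSOD format description, while the `station` and `date` keys, the inch-to-millimetre conversion, and the specific quality rule are assumptions made for this example. The same rules could be expressed as Spark column expressions.

```python
from typing import Optional

# Missing-value sentinels used by GSOD for the fields handled here.
TEMP_MISSING = 9999.9  # TEMP, MAX, MIN
PRCP_MISSING = 99.99   # PRCP


def fahrenheit_to_celsius(value_f: Optional[float]) -> Optional[float]:
    """Convert Fahrenheit to Celsius; the missing sentinel becomes None."""
    if value_f is None or value_f == TEMP_MISSING:
        return None
    return round((value_f - 32.0) * 5.0 / 9.0, 2)


def inches_to_mm(value_in: Optional[float]) -> Optional[float]:
    """Convert inches to millimetres; the missing sentinel becomes None."""
    if value_in is None or value_in == PRCP_MISSING:
        return None
    return round(value_in * 25.4, 2)


def clean_record(raw: dict) -> Optional[dict]:
    """Return a cleaned daily record, or None if it fails the quality check."""
    cleaned = {
        "station": raw["station"],  # hypothetical key name
        "date": raw["date"],        # hypothetical key name
        "temp_c": fahrenheit_to_celsius(raw.get("TEMP")),
        "tmax_c": fahrenheit_to_celsius(raw.get("MAX")),
        "tmin_c": fahrenheit_to_celsius(raw.get("MIN")),
        "prcp_mm": inches_to_mm(raw.get("PRCP")),
    }
    # Assumed quality rule: require a mean temperature and a sane max/min pair.
    if cleaned["temp_c"] is None:
        return None
    tmax, tmin = cleaned["tmax_c"], cleaned["tmin_c"]
    if tmax is not None and tmin is not None and tmax < tmin:
        return None
    return cleaned
```

Mapping sentinels to `None` before converting units keeps placeholder values such as 9999.9 from leaking into averages later on.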
The dataset was intentionally divided into six continents:
- Europe
- Asia
- Africa
- North America
- South America
- Oceania
For each continent, around five representative countries were selected, and within each country up to three major cities (stations) were chosen, giving the final set of 77 stations.
This hierarchical structure enabled modular processing (one notebook per continent) and ensured balanced geographic coverage without overwhelming resources.
Examples:
- Europe → Italy (Bolzano, Rome Ciampino, Palermo), France (Paris CDG, Marseille, Brest), Germany, Spain, UK…
- Asia → China (Beijing, Shanghai, Guangzhou), India (Delhi, Mumbai, Chennai), Japan, Russia, Indonesia…
- Africa → South Africa (Cape Town, Johannesburg, Durban), Nigeria…
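One way to represent this continent → country → city hierarchy is a nested mapping that can be flattened into rows or filtered per continent. The sketch below uses only the stations named in the examples above; the variable and function names are hypothetical, and real station identifiers are deliberately left out because they are not listed here.

```python
# Partial hierarchy, limited to the stations named in the examples above.
STATIONS = {
    "Europe": {
        "Italy": ["Bolzano", "Rome Ciampino", "Palermo"],
        "France": ["Paris CDG", "Marseille", "Brest"],
    },
    "Asia": {
        "China": ["Beijing", "Shanghai", "Guangzhou"],
        "India": ["Delhi", "Mumbai", "Chennai"],
    },
    "Africa": {
        "South Africa": ["Cape Town", "Johannesburg", "Durban"],
    },
}


def flatten_stations(hierarchy):
    """Flatten the nested mapping into (continent, country, city) tuples."""
    return [
        (continent, country, city)
        for continent, countries in hierarchy.items()
        for country, cities in countries.items()
        for city in cities
    ]


def stations_for(hierarchy, continent):
    """Return the (country, city) pairs for one continent, e.g. for one notebook."""
    return [
        (country, city)
        for country, cities in hierarchy.get(continent, {}).items()
        for city in cities
    ]
```

Keeping the hierarchy in one structure means each per-continent notebook can call something like `stations_for(STATIONS, "Europe")` instead of maintaining its own station list.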
After initial cleaning and enrichment (adding continent, country, hemisphere, season), data was saved in Delta Lake format — first per continent, then unioned into a single global table.
Delta provided schema enforcement, ACID transactions, and efficient querying even on limited hardware, making iterative analysis fast and reliable.
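The hemisphere and season enrichment can be derived from a station's latitude and the observation date. The following is a minimal sketch assuming meteorological seasons (December to February is winter in the Northern Hemisphere, mirrored in the Southern Hemisphere); the function names and the choice of season definition are assumptions, since the exact rule used in the notebooks is not stated.

```python
from datetime import date

# Meteorological seasons for the Northern Hemisphere, keyed by month.
_NORTHERN_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Seasons are offset by six months south of the equator.
_OPPOSITE = {
    "Winter": "Summer",
    "Summer": "Winter",
    "Spring": "Autumn",
    "Autumn": "Spring",
}


def hemisphere(latitude: float) -> str:
    """Classify a station by the sign of its latitude (equator counts as Northern)."""
    return "Northern" if latitude >= 0 else "Southern"


def season(day: date, latitude: float) -> str:
    """Return the meteorological season for a date at the given latitude."""
    northern = _NORTHERN_SEASON[day.month]
    if hemisphere(latitude) == "Northern":
        return northern
    return _OPPOSITE[northern]
```

With this rule, a January observation is tagged as winter for a station at a positive latitude and as summer for one at a negative latitude, so seasonal comparisons across hemispheres stay meaningful.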
Using PySpark aggregations, we computed a rich set of features by year, continent, country, and city:
- Average / max / min temperature
- Days >35°C, days <0°C (frost)
- Precipitation days (>10 mm, >50 mm)
- Diurnal temperature range (Tmax – Tmin)
- Temperature variability (standard deviation)
- Humidity %, wind speed averages
- Extreme events (lightning, snow, hail from FRSHTT flags)
- Anomalies vs 2000–2009 baseline per station
These metrics form the basis for all downstream insights.
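To make the metric definitions concrete, here is a plain-Python sketch of the per-station, per-year aggregation and the anomaly calculation. It is an illustration under stated assumptions rather than the project's PySpark code: hot days are counted on Tmax and frost days on Tmin, variability is the sample standard deviation of daily mean temperature, thunder is read from the fifth character of the FRSHTT flag string (Fog, Rain, Snow, Hail, Thunder, Tornado), and all key and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def _count(rows, key, predicate):
    """Count rows whose value for `key` is present and satisfies `predicate`."""
    return sum(1 for r in rows if r[key] is not None and predicate(r[key]))


def yearly_features(daily_rows):
    """Aggregate cleaned daily rows into one feature dict per (station, year)."""
    groups = defaultdict(list)
    for row in daily_rows:
        groups[(row["station"], row["year"])].append(row)

    features = {}
    for key, rows in groups.items():
        temps = [r["temp_c"] for r in rows if r["temp_c"] is not None]
        dtr = [
            r["tmax_c"] - r["tmin_c"]
            for r in rows
            if r["tmax_c"] is not None and r["tmin_c"] is not None
        ]
        features[key] = {
            "avg_temp": mean(temps) if temps else None,
            "temp_std": stdev(temps) if len(temps) > 1 else None,
            "mean_dtr": mean(dtr) if dtr else None,
            "hot_days": _count(rows, "tmax_c", lambda v: v > 35),
            "frost_days": _count(rows, "tmin_c", lambda v: v < 0),
            "rain_days_10mm": _count(rows, "prcp_mm", lambda v: v > 10),
            "rain_days_50mm": _count(rows, "prcp_mm", lambda v: v > 50),
            # zfill restores leading zeros if the flags were parsed as a number.
            "thunder_days": sum(
                1 for r in rows if str(r["frshtt"]).zfill(6)[4] == "1"
            ),
        }
    return features


def temperature_anomalies(features, baseline_years=range(2000, 2010)):
    """Yearly mean temperature minus the station's 2000-2009 baseline mean."""
    per_station = defaultdict(list)
    for (station, year), feats in features.items():
        if year in baseline_years and feats["avg_temp"] is not None:
            per_station[station].append(feats["avg_temp"])
    baselines = {s: mean(values) for s, values in per_station.items()}

    return {
        (station, year): feats["avg_temp"] - baselines[station]
        for (station, year), feats in features.items()
        if station in baselines and feats["avg_temp"] is not None
    }
```

Computing the baseline per station, rather than per continent, keeps a warm coastal city and a cold inland one from distorting each other's anomalies.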
Final analysis and plots were created by converting Spark results to Pandas DataFrames and using Seaborn + Matplotlib for publication-quality charts.
Every plot is available here.
Key visualizations include:
- Heatmap of average days with >10 mm rain per continent and year
- Temperature variability (standard deviation) per continent and year (2014–2024)
- Average diurnal temperature range (Tmax – Tmin) per country, last 5 years
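A heatmap such as the continent-by-year rain chart needs its input arranged as a grid, with one row per continent and one column per year. The reshaping step is sketched below in plain Python so the logic is visible; the function name and the `(continent, year, value)` row layout are assumptions. In practice the same result would typically come from a pandas pivot passed to `seaborn.heatmap`.

```python
def pivot_for_heatmap(rows):
    """Arrange (continent, year, value) rows into a continent x year grid.

    Returns the sorted row labels, the sorted column labels, and a matrix
    in which missing combinations are None.
    """
    continents = sorted({continent for continent, _, _ in rows})
    years = sorted({year for _, year, _ in rows})
    lookup = {(continent, year): value for continent, year, value in rows}
    matrix = [
        [lookup.get((continent, year)) for year in years]
        for continent in continents
    ]
    return continents, years, matrix
```

Filling missing continent and year combinations with `None` makes gaps in coverage show up as empty cells instead of shifting later values into the wrong column.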

The visualizations reveal strong tropical influence on highest temperatures, elevated heavy-rain days in Europe and parts of South America/Asia, greater variability in Asia and North America, and a general pattern of stable-to-slightly-warming continental averages over the decade.
All code runs end-to-end on Databricks Free Edition, from S3 ingestion to the final Seaborn charts.
Developed by Salvatore Spagnuolo

