NOAA Global Historical Climatology Network
This dataset contains weather measurements for the last 120 years. Each row is a measurement for a point in time and station.
More precisely and according to the origin of this data:
GHCN-Daily is a dataset that contains daily observations over global land areas. It contains station-based measurements from land-based stations worldwide, about two-thirds of which are for precipitation measurements only (Menne et al., 2012). GHCN-Daily is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews (Durre et al., 2010). The archive includes the following meteorological elements:
- Daily maximum temperature
- Daily minimum temperature
- Temperature at the time of observation
- Precipitation (i.e., rain, melted snow)
- Snowfall
- Snow depth
- Other elements where available
The sections below give a brief overview of the steps that were involved in bringing this dataset into ClickHouse. If you're interested in reading about each step in more detail, we recommend to take a look at our blog post titled "Exploring massive, real-world data sets: 100+ Years of Weather Records in ClickHouse".
Downloading the data
- A pre-prepared version of the data for ClickHouse, which has been cleansed, re-structured, and enriched. This data covers the years 1900 to 2022.
- Download the original data and convert to the format required by ClickHouse. Users wanting to add their own columns may wish to explore this approach.
Pre-prepared data
More specifically, rows have been removed that did not fail any quality assurance checks by Noaa. The data has also been restructured from a measurement per line to a row per station id and date, i.e.
This is simpler to query and ensures the resulting table is less sparse. Finally, the data has also been enriched with latitude and longitude.
This data is available in the following S3 location. Either download the data to your local filesystem (and insert using the ClickHouse client) or insert directly into ClickHouse (see Inserting from S3).
To download:
Original data
The following details the steps to download and transform the original data in preparation for loading into ClickHouse.
Download
To download the original data:
Sampling the data
Summarizing the format documentation:
Summarizing the format documentation and the columns in order:
- An 11 character station identification code. This itself encodes some useful information
- YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
- ELEMENT = 4 character indicator of element type. Effectively the measurement type. While there are many measurements available, we select the following:
- PRCP - Precipitation (tenths of mm)
- SNOW - Snowfall (mm)
- SNWD - Snow depth (mm)
- TMAX - Maximum temperature (tenths of degrees C)
- TAVG - Average temperature (tenths of a degree C)
- TMIN - Minimum temperature (tenths of degrees C)
- PSUN - Daily percent of possible sunshine (percent)
- AWND - Average daily wind speed (tenths of meters per second)
- WSFG - Peak gust wind speed (tenths of meters per second)
- WT** = Weather Type where ** defines the weather type. Full list of weather types here.
 
- DATA VALUE = 5 character data value for ELEMENT i.e. the value of the measurement.
- M-FLAG = 1 character Measurement Flag. This has 10 possible values. Some of these values indicate questionable data accuracy. We accept data where this is set to "P" - identified as missing presumed zero, as this is only relevant to the PRCP, SNOW and SNWD measurements.
- Q-FLAG is the measurement quality flag with 14 possible values. We are only interested in data with an empty value i.e. it did not fail any quality assurance checks.
- S-FLAG is the source flag for the observation. Not useful for our analysis and ignored.
- OBS-TIME = 4-character time of observation in hour-minute format (i.e. 0700 =7:00 am). Typically not present in older data. We ignore this for our purposes.
A measurement per line would result in a sparse table structure in ClickHouse. We should transform to a row per time and station, with measurements as columns. First, we limit the dataset to those rows without issues i.e. where qFlag is equal to an empty string.
Clean the data
Using ClickHouse local we can filter rows that represent measurements of interest and pass our quality requirements:
With over 2.6 billion rows, this isn't a fast query since it involves parsing all the files. On our 8 core machine, this takes around 160 seconds.
Pivot data
While the measurement per line structure can be used with ClickHouse, it will unnecessarily complicate future queries. Ideally, we need a row per station id and date, where each measurement type and associated value are a column i.e.
Using ClickHouse local and a simple GROUP BY, we can repivot our data to this structure. To limit memory overhead, we do this one file at a time.
This query produces a single 50GB file noaa.csv.
Enriching the data
The data has no indication of location aside from a station id, which includes a prefix country code. Ideally, each station would have a latitude and longitude associated with it. To achieve this, NOAA conveniently provides the details of each station as a separate ghcnd-stations.txt. This file has several columns, of which five are useful to our future analysis: id, latitude, longitude, elevation, and name.
This query takes a few minutes to run and produces a 6.4 GB file, noaa_enriched.parquet.
Create table
Create a MergeTree table in ClickHouse (from the ClickHouse client).
Inserting into ClickHouse
Inserting from local file
Data can be inserted from a local file as follows (from the ClickHouse client):
where <path> represents the full path to the local file on disk.
See here for how to speed this load up.
Inserting from S3
For how to speed this up, see our blog post on tuning large data loads.
Sample queries
Highest temperature ever
Reassuringly consistent with the documented record at Furnace Creek as of 2023.
Best ski resorts
Using a list of ski resorts in the united states and their respective locations, we join these against the top 1000 weather stations with the most in any month in the last 5 yrs. Sorting this join by geoDistance and restricting the results to those where the distance is less than 20km, we select the top result per resort and sort this by total snow. Note we also restrict resorts to those above 1800m, as a broad indicator of good skiing conditions.
Credits
We would like to acknowledge the efforts of the Global Historical Climatology Network for preparing, cleansing, and distributing this data. We appreciate your efforts.
Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E.Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.25]. NOAA National Centers for Environmental Information. http://doi.org/10.7289/V5D21VHZ [17/08/2020]
