Ocean and Land Temperature Anomalies

The Boyz are back

Team Members

Saffian Asghar
Alexis Culpin
Romaric Sallustre
Emilio Espinosa S.

Datasets

NOAA dataset

The dataset is hosted by NOAA's National Centers for Environmental Information (NCEI). It contains temperature anomaly data, representing deviations from a reference temperature over time.

Technical information

Data is collected from 1850 - 2023.
The data is in JSON format.
Columns of interest: year and data (yearly anomaly).

Field	Description
`DATE`	Period of time in years.
`DESCRIPTION`	Description of data set itself.
`DATA`	Anomaly in degrees Celsius.

Data transformation required

Read JSON.
Drop the description column.
Ensure every value within the date and data columns is numeric. Non-numeric values must be dropped.
Make sure the index (date) is an integer type value.
Rename data column to NOAAGlobalTemp and add the minimum and maximum years.
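
The steps above can be sketched on a small in-memory sample. This is a minimal illustration, not the notebook's own cell: the sample JSON, its values, and the name `noaa_sample` are assumptions made for the example, and the standard `json` module is used to load the text.

```python
import json

import pandas as pd

# Hypothetical miniature of the NOAA layout: a "description" block plus
# a "data" block keyed by year. The values are illustrative only.
sample_json = """
{
  "description": {"title": "Sample anomalies", "units": "Degrees Celsius"},
  "data": {"1850": "-0.06", "1851": "-0.08", "1852": "-0.01", "1853": "n/a"}
}
"""

# Read JSON: the outer keys become columns, the inner keys become the index.
noaa_sample = pd.DataFrame(json.loads(sample_json))

# Drop the description column.
noaa_sample = noaa_sample.drop(columns="description")

# Coerce years and values to numbers; anything non-numeric becomes NaN.
years = pd.to_numeric(
    pd.Series(noaa_sample.index, index=noaa_sample.index), errors="coerce"
)
noaa_sample["data"] = pd.to_numeric(noaa_sample["data"], errors="coerce")

# Keep only the rows where both the year and the anomaly are numeric.
noaa_sample = noaa_sample.loc[years.notna() & noaa_sample["data"].notna()]

# Use an integer year index and sort it chronologically.
noaa_sample.index = noaa_sample.index.astype(int)
noaa_sample = noaa_sample.sort_index()

# Rename the data column and append the minimum and maximum years.
label = f"NOAAGlobalTemp ({noaa_sample.index.min()} - {noaa_sample.index.max()})"
noaa_sample = noaa_sample.rename(columns={"data": label})
print(noaa_sample)
```

In this sample the "title" and "units" rows disappear because their index is not a year, and the 1853 row disappears because its value is not numeric.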

Link

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/12/1850-2023/data.json

License

Creative Commons Attribution 4.0 International license (CC-BY-4.0)

Berkeley Earth dataset

The dataset comes from the Berkeley Earth project, an independent climate science organization. The file is hosted on Amazon S3 and contains raw maximum temperature (TMAX) data, reflecting the highest recorded temperatures. This project focuses on the "Annual Anomaly" column.

Technical information

Data collected from 1850 - 2023.
The dataset has missing values in 2023.
The data is in TXT format.
Columns of interest: year and annual anomaly (difference of temperature from a base reference).

Field	Description
`YEAR`	Period of time in years.
`MONTH`	Period of time in months.
`MONTHLY ANOMALY`	Monthly anomaly in degrees Celsius.
`ANNUAL ANOMALY`	Yearly anomaly in degrees Celsius.
`FIVE YEAR ANOMALY`	5-year rolling average anomaly in degrees Celsius.
`TEN YEAR ANOMALY`	10-year rolling average anomaly in degrees Celsius.
`TWENTY YEAR ANOMALY`	20-year rolling average anomaly in degrees Celsius.

Data transformation required

Read a space-delimited text file into a pandas DataFrame, ignoring lines that start with "%".
Group the DataFrame by the 'year' column.
Calculate the mean of the 'anomaly' column for each year.
Reset the DataFrame index.
Set the 'year' column as the new index.
Convert the index values to integers.
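
As a rough sketch, the steps above look like this on a small in-memory excerpt. The sample text, its values, and the name `berkeley_sample` are assumptions made for illustration; the column names follow the ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same layout: "%" comment lines followed by
# whitespace-separated columns. The values are illustrative only.
sample_txt = """\
% Sample header line
% Year Month Monthly Annual Five-year Ten-year Twenty-year
1850 1 -1.0 NaN NaN NaN NaN
1850 2 -2.0 NaN NaN NaN NaN
1851 1 -0.5 NaN NaN NaN NaN
1851 2 0.5 NaN NaN NaN NaN
"""

columns = ["year", "month", "anomaly", "yearAvgAnomaly",
           "5yearAvgAnomaly", "10yearAvgAnomaly", "20yearAvgAnomaly"]

# Read the whitespace-delimited text, ignoring lines that start with "%".
berkeley_sample = pd.read_csv(
    io.StringIO(sample_txt), comment="%", sep=r"\s+", names=columns
)

# Group by year, average the monthly anomaly, and index the result by year.
berkeley_sample = (
    berkeley_sample
    .groupby("year")["anomaly"].mean().reset_index()
    .set_index("year")
)

# Convert the index values to integers.
berkeley_sample.index = berkeley_sample.index.astype(int)
print(berkeley_sample)
```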

Link

https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Raw_TMAX_complete.txt

License

Creative Commons Attribution 4.0 International license (CC-BY-4.0)

HadCRUT5 dataset

The dataset is published by the Hadley Centre for Climate Science and Services at the UK Met Office. It is a tabular time series summary of global monthly temperature anomalies for global climate analysis, and it includes columns with both the lower and upper confidence limits.

Technical information

Data collected from 1850 - 2023.
The data is in CSV format.
Columns of interest: year and anomaly in degrees Celsius (temperature anomaly).

Field	Description
`YEAR_MONTH`	Period of time in "YYYY-MM" format.
`ANOMALY IN DEGREES CELSIUS`	Monthly anomaly in degrees Celsius.
`LOWER CONFIDENCE LIMIT (2.5%)`	Numbers at the lower end of the confidence interval.
`UPPER CONFIDENCE LIMIT (97.5%)`	Numbers at the upper end of the confidence interval.

Data transformation required

Read a CSV file into a pandas DataFrame, parsing the 'Time' column as dates.
Group the DataFrame by the year part of the 'Time' column.
Calculate the mean of the 'Anomaly (deg C)' column for each year.
Reset the DataFrame index.
Set the 'Time' column as the new index.
Convert the index values to integers.
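
A minimal sketch of these steps on an in-memory sample follows. The sample rows and the name `hadcrut_sample` are assumptions made for illustration; the 'Time' and 'Anomaly (deg C)' column names are the ones used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout; the values are illustrative only.
sample_csv = """\
Time,Anomaly (deg C),Lower confidence limit (2.5%),Upper confidence limit (97.5%)
1850-01,-0.5,-0.9,-0.1
1850-02,-0.25,-0.7,0.2
1851-01,-0.125,-0.5,0.25
"""

# Read the CSV, parsing the "YYYY-MM" values in 'Time' as dates.
hadcrut_sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["Time"])

# Group by the year part of 'Time' and average the monthly anomaly.
hadcrut_sample = (
    hadcrut_sample
    .groupby(hadcrut_sample["Time"].dt.year)["Anomaly (deg C)"].mean()
    .reset_index()
    .set_index("Time")
)

# Convert the index values to integers.
hadcrut_sample.index = hadcrut_sample.index.astype(int)
print(hadcrut_sample)
```

After grouping, the 'Time' index holds plain years rather than dates, which is what lets it line up with the year index of the other datasets.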

Link

https://www.metoffice.gov.uk/hadobs/hadcrut5/data/HadCRUT.5.0.2.0/analysis/diagnostics/HadCRUT.5.0.2.0.analysis.summary_series.global.monthly.csv

License

Open Government License (OGL) for Public Sector Information.

Météo-France dataset

The Météo-France dataset contains climatological data from all stations in mainland France and the overseas territories since their opening, for all available parameters. This notebook uses the files for the Ardennes region (department 08), which provide localized weather data sourced from regional weather stations, including temperature, precipitation, and wind speed. The files are distributed in CSV.GZ format, that is, as compressed tabular files.

Technical information

Daily data are available for download, by department and by period batch, in compressed CSV format.
All parameters are provided for all weather stations.
Times are expressed in UTC for mainland France and in FU for overseas territories.
Files are updated annually for historical data prior to 1950, monthly for data from 1950 to year -2, and daily for the last two years.

Field	Description
`NUM_POSTE`	Météo-France station number on 8 digits
`NOM_USUEL`	Common name of the station
`LAT`	Latitude, negative in the south (in degrees and millionths of a degree)
`LON`	Longitude, negative west of GREENWICH (in degrees and millionths of a degree)
`ALTI`	Altitude of the shelter base or rain gauge if no shelter (in meters)
`AAAAMMJJ`	Measurement date (year month day)
`RR`	Amount of precipitation fallen in 24 hours (from 06h UTC on day J to 06h UTC on day J+1). Value for day J is recorded at J+1 (in mm and tenths)
`TN`	Minimum temperature under shelter (in °C and tenths)
`HTN`	Time of TN (hhmm)
`TX`	Maximum temperature under shelter (in °C and tenths)
`HTX`	Time of TX (hhmm)
`TM`	Daily average of hourly temperatures under shelter (in °C and tenths)
`TNTXM`	Daily average (TN+TX)/2 (in °C and tenths)
`TAMPLI`	Daily thermal amplitude: difference between daily TX and TN (TX-TN) (in °C and tenths)
`TNSOL`	Daily minimum temperature 10 cm above ground (in °C and tenths)
`TN50`	Daily minimum temperature 50 cm above ground (in °C and tenths)
`DG`	Duration of frost under shelter (T ≤ 0°C) (in minutes)
`FFM`	Daily average wind force averaged over 10 minutes, at 10 m (in m/s and tenths)
`FF2M`	Daily average wind force averaged over 10 minutes, at 2 m (in m/s and tenths)
`FXY`	Daily maximum of maximum hourly wind force averaged over 10 minutes, at 10 m (in m/s and tenths)
`DXY`	Direction of FXY (in compass points of 360)
`HXY`	Time of FXY (hhmm)
`FXI`	Daily maximum of maximum hourly instantaneous wind force, at 10 m (in m/s and tenths)
`DXI`	Direction of FXI (in compass points of 360)
`HXI`	Time of FXI (hhmm)
`FXI2`	Daily maximum of maximum hourly instantaneous wind force, at 2 m (in m/s and tenths)
`DXI2`	Direction of FXI2 (in compass points of 360)
`HXI2`	Time of FXI2 (hhmm)
`FXI3S`	Daily maximum of maximum hourly wind force averaged over 3 seconds, at 10 m (in m/s and tenths)
`DXI3S`	Direction of FXI3S (in compass points of 360)
`HXI3S`	Time of FXI3S (hhmm)

Quality codes associated with each data point (e.g., T;QT):

9: Filtered data (the data has passed first-level filters/controls)
0: Protected data (the data has been definitively validated by the climatologist)
1: Validated data (the data has been validated by automatic control or by the climatologist)
2: Doubtful data under review (the data has been questioned by automatic control)
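
To make the field and quality-code conventions concrete, here is a small sketch on an in-memory sample. Several things are assumptions made for illustration and are not confirmed by this notebook: the `;` separator, the `Q` + field naming of the quality columns (`QTN`, `QTX`), the station name, and every sample value.

```python
import io

import pandas as pd

# Meaning of each quality code, as listed above.
QUALITY_LABELS = {9: "filtered", 0: "protected", 1: "validated", 2: "doubtful"}

# Hypothetical sample rows; the separator and Q* column names are assumed.
sample_csv = """\
NUM_POSTE;NOM_USUEL;AAAAMMJJ;TN;QTN;TX;QTX
08105001;STATION-A;20230101;2.1;1;8.4;1
08105001;STATION-A;20230102;1.5;1;7.9;2
08105001;STATION-A;20230103;-0.3;0;5.2;0
"""

# Read the station number as text so that its leading zero is preserved.
meteo_sample = pd.read_csv(
    io.StringIO(sample_csv), sep=";", dtype={"NUM_POSTE": str}
)

# AAAAMMJJ is year-month-day without separators.
meteo_sample["date"] = pd.to_datetime(
    meteo_sample["AAAAMMJJ"].astype(str), format="%Y%m%d"
)

# Attach a readable label to the quality code of the maximum temperature.
meteo_sample["TX_quality"] = meteo_sample["QTX"].map(QUALITY_LABELS)

# Keep only the rows whose TX value is protected (0) or validated (1).
trusted = meteo_sample[meteo_sample["QTX"].isin([0, 1])]
print(trusted[["date", "TX", "TX_quality"]])
```

Which codes count as trustworthy is a judgment call; the filter on 0 and 1 here is only one possible choice.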

Link

License

Etalab Open Licence 2.0.

# Import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import urllib.request

# Save the dataset URLs into constants
NOAA_URL = "https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/12/1850-2023/data.json"
BERKLEY_URL = "https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/Raw_TMAX_complete.txt"
HAD_CRUT5_URL = "https://www.metoffice.gov.uk/hadobs/hadcrut5/data/HadCRUT.5.0.2.0/analysis/diagnostics/HadCRUT.5.0.2.0.analysis.summary_series.global.monthly.csv"

# Daily RR (rain), T (temperature) and Vent (wind) data for department 08, over the period 1871 - 1949
ardennes_RR_T_wind_1871_1949_url = "https://object.files.data.gouv.fr/meteofrance/data/synchro_ftp/BASE/QUOT/Q_08_1871-1949_RR-T-Vent.csv.gz"
# Intended: daily RR (rain), T (temperature) and Vent (wind) data for department 08, over the period 1950 - 2022
# NOTE: the URL below points to the 2023-2024 "autres-parametres" (other parameters) file, not to a 1950 - 2022 RR-T-Vent file
ardennes_RR_T_wind_1950_2022_url = "https://object.files.data.gouv.fr/meteofrance/data/synchro_ftp/BASE/QUOT/Q_08_latest-2023-2024_autres-parametres.csv.gz"
# Daily RR (rain), T (temperature) and Vent (wind) data for department 08, over the period 2023 - 2024
ardennes_RR_T_wind_2023_2024_url = "https://object.files.data.gouv.fr/meteofrance/data/synchro_ftp/BASE/QUOT/Q_08_latest-2023-2024_RR-T-Vent.csv.gz"

# field description : https://object.files.data.gouv.fr/meteofrance/data/synchro_ftp/BASE/QUOT/Q_descriptif_champs_RR-T-Vent.csv

Function to get data if it doesn't exist

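from pathlib import Path


def save_data(urls, folder):
    """Download each URL into ./data/<folder>/ unless the file already exists.

    `urls` maps a file name (without extension) to the URL to download.
    """
    subfolder = Path(f"./data/{folder}")
    subfolder.mkdir(parents=True, exist_ok=True)

    for key, url in urls.items():
        # Host name, used only in the log messages
        website_url = url.split("/")[2]
        # Text after the last dot, e.g. "json"; a ".csv.gz" URL yields only "gz"
        file_extension = url.split(".")[-1]
        filepath = subfolder / f"{key}.{file_extension}"

        if not filepath.exists():
            urllib.request.urlretrieve(url, filepath)
            print(f"Data saved for {website_url} at {filepath}")
        else:
            print(f"Data already exists for {website_url} at {filepath}")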
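
One detail worth noting: taking the text after the last dot gives "gz" for the Météo-France URLs above, so a ".csv.gz" file would be saved with a ".gz" extension only. The sketch below shows one possible way to keep the compound extension. The helper name `url_extension` and the set of compression suffixes are assumptions for illustration, not part of the notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Suffixes treated as compression wrappers (an assumed, non-exhaustive set).
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}


def url_extension(url):
    """Return the file extension of a URL, keeping pairs such as '.csv.gz'."""
    suffixes = PurePosixPath(urlparse(url).path).suffixes
    if not suffixes:
        return ""
    # Keep the inner extension too when the last suffix is a compression one.
    if len(suffixes) >= 2 and suffixes[-1] in COMPRESSION_SUFFIXES:
        return "".join(suffixes[-2:])
    return suffixes[-1]


print(url_extension(
    "https://object.files.data.gouv.fr/meteofrance/data/synchro_ftp/BASE/QUOT/"
    "Q_08_1871-1949_RR-T-Vent.csv.gz"
))
print(url_extension(
    "https://berkeley-earth-temperature.s3.us-west-1.amazonaws.com/Global/"
    "Raw_TMAX_complete.txt"
))
```

Only the last suffix is used for uncompressed files because names such as `HadCRUT.5.0.2.0...monthly.csv` contain many dots that are not extensions.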

Global Temperature data

urls = {"noaa_df" : NOAA_URL, "berkley_df" : BERKLEY_URL, "had_crut5_df" : HAD_CRUT5_URL}
# saving in a subfolder called global_temperature
save_data(urls, "global_temperature")

Data saved for www.ncei.noaa.gov at data\global_temperature\noaa_df.json
Data saved for berkeley-earth-temperature.s3.us-west-1.amazonaws.com at data\global_temperature\berkley_df.txt
Data saved for www.metoffice.gov.uk at data\global_temperature\had_crut5_df.csv

subfolder = "./data/global_temperature/"

# Read Json file
df_noaa = pd.read_json("./data/global_temperature/noaa_df.json")

# Dataframe drop unnecessary columns ...
df_noaa = (
    df_noaa
    .drop('description', axis=1)
    .loc[pd.to_numeric(df_noaa.index, errors='coerce').notna()]
)
# Set index as int
df_noaa.index = df_noaa.index.astype(int)
# Rename columns with specific format
df_noaa = df_noaa.rename(columns=lambda x: f"NOAAGlobalTemp ({df_noaa.index.min()} - {df_noaa.index.max()})")
df_noaa.head()

	NOAAGlobalTemp (1850 - 2023)
1850	-0.06
1851	-0.08
1852	-0.01
1853	-0.12
1854	0.02

# read second dataset
had_crut5_df = pd.read_csv('./data/global_temperature/had_crut5_df.csv', parse_dates=['Time'])
# Group by year and calculate average per year.
had_crut5_df = (
    had_crut5_df
    .groupby(had_crut5_df['Time'].dt.year)['Anomaly (deg C)'].mean().reset_index()
    .set_index('Time')
)
# Set index as int
had_crut5_df.index = had_crut5_df.index.astype(int)
had_crut5_df.head()

	Anomaly (deg C)
Time
1850	-0.417711
1851	-0.233350
1852	-0.229399
1853	-0.270354
1854	-0.291521

# Read third dataset
# sep=r'\s+' replaces the deprecated delim_whitespace=True
berkley_df = pd.read_csv(
    './data/global_temperature/berkley_df.txt',
    comment="%",
    sep=r'\s+',
    names=["year", "month", "anomaly", "yearAvgAnomaly", "5yearAvgAnomaly", "10yearAvgAnomaly", "20yearAvgAnomaly"],
)
# Group by year and calculate average per year.
berkley_df = (
    berkley_df
    .groupby(berkley_df['year'])['anomaly'].mean().reset_index()
    .set_index('year')
)
# berkley_df = berkley_df.assign(realTemp = berkley_df['anomaly'] + 14.40)
berkley_df.index = berkley_df.index.astype(int)
berkley_df

	anomaly
year
1850	-1.141667
1851	-0.971583
1852	-1.007917
1853	-0.382333
1854	-0.170500
...	...
2019	1.199167
2020	1.391000
2021	1.160500
2022	1.205750
2023	1.610091

174 rows × 1 columns

# Join the three datasets on their common index (year)
merged_df = (
    df_noaa
    .join(had_crut5_df.rename(columns={'Anomaly (deg C)': 'HadCRUT5_Anomaly'}), how='left')
    .join(berkley_df.rename(columns={'anomaly':'Berkley_anomaly'}), how='left')
)

# Rename the new columns to include each dataset's year range
merged_df = (
    merged_df
    .rename(columns={'HadCRUT5_Anomaly': f"HAD_CRUT5 ({had_crut5_df.index.min()} - {had_crut5_df.index.max()})"})
    .rename(columns={'Berkley_anomaly': f"BerkleyEarth ({berkley_df.index.min()} - {berkley_df.index.max()})"})
)

merged_df

	NOAAGlobalTemp (1850 - 2023)	HAD_CRUT5 (1850 - 2023)	BerkleyEarth (1850 - 2023)
1850	-0.06	-0.417711	-1.141667
1851	-0.08	-0.233350	-0.971583
1852	-0.01	-0.229399	-1.007917
1853	-0.12	-0.270354	-0.382333
1854	0.02	-0.291521	-0.170500
...	...	...	...
2019	1.12	0.891073	1.199167
2020	0.83	0.922921	1.391000
2021	0.90	0.761906	1.160500
2022	0.83	0.801305	1.205750
2023	1.39	1.100057	1.610091

174 rows × 3 columns

merged_df.plot(kind='line', figsize=(8, 4), title='Global Temperature change')

<Axes: title={'center': 'Global Temperature change'}>

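[Figure: line plot titled "Global Temperature change", showing the NOAAGlobalTemp, HAD_CRUT5, and BerkleyEarth anomaly series by year, 1850 - 2023.]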