Manipulation & Handling of Time Series Data
Creating Time Series Data
This section shows how to create a time series dataset from random values in Python. You will need a basic understanding of how to create pandas data frames and date objects in Python.
First, we import the required packages.
# Import required packages
import random
import pandas as pd
from datetime import datetime
After importing the packages, we draw 90 unique random integers from the range 5 to 99 (range(5, 100) excludes its upper bound). We then generate a list of dates whose length matches the number of sampled values, as shown below.
# Sample 90 unique integers from 5 to 99 (random.sample draws without replacement)
randomlist = random.sample(range(5, 100), 90)
# Set the number of days
no_of_days = len(randomlist)
# Start date: today's date (this also carries the current time of day)
start_date = datetime.today()
# Create a list of dates
datelist = pd.date_range(start_date, periods=no_of_days).tolist()
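Because datetime.today() carries the current time of day, every timestamp produced by pd.date_range inherits it. If midnight timestamps are preferred, one option is the normalize argument of pd.date_range. The sketch below illustrates the difference; the fixed start date is a hypothetical stand-in for datetime.today(), chosen only so the result is reproducible.

```python
import pandas as pd
from datetime import datetime

# Hypothetical fixed start (with a time of day) standing in for datetime.today()
start_date = datetime(2020, 11, 1, 14, 30)

# Default behaviour: each generated date keeps the 14:30 time component
with_time = pd.date_range(start_date, periods=3)

# normalize=True snaps the start to midnight before generating the range
at_midnight = pd.date_range(start_date, periods=3, normalize=True)

print(with_time[0])
print(at_midnight[0])
```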
To create the time series data frame, we convert the list of data values to a data frame and set the list of dates as its index. The resulting data frame has a DatetimeIndex, which lets us select and manipulate rows by date.
# Create a pandas data frame
ts_df = pd.DataFrame({'data':randomlist})
# Set date as index
ts_df.index = datelist
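The steps above can also be condensed by passing the dates directly as the index when the data frame is built. The following is a minimal sketch rather than the tutorial's own code: the seed and the fixed start date are assumptions added so that repeated runs give the same data.

```python
import random
import pandas as pd

# Hypothetical seed, used only to make the random sample repeatable
random.seed(42)

# 90 unique integers from 5 to 99
values = random.sample(range(5, 100), 90)

# Hypothetical fixed start date instead of datetime.today()
dates = pd.date_range("2020-11-01", periods=len(values), freq="D")

# Build the data frame with the dates as its index in one step
ts_df = pd.DataFrame({"data": values}, index=dates)

print(ts_df.head())
print(type(ts_df.index).__name__)
```

Fixing the start date also keeps any hard-coded date strings used for slicing later inside the index.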
At this point we can visualize, slice, or subset the time series data as shown below. The dates used for slicing must fall within the generated date range, so adjust them to match the day you run the code. This completes the steps required to create a time series dataset in Python. You can download the entire code from my Google Colab Notebook here. Next, we will explore how to import time series data into Python.
# Plot time series
ts_df.plot()
# Select data for a specific date (the date must fall within the index)
ts_date = ts_df.loc['2020-12-24':'2020-12-24']
# Select data within a given date range (both endpoints are included)
ts_daterange = ts_df.loc['2020-11-24':'2020-12-30']
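To make the slicing behaviour concrete, here is a small sketch on a hypothetical 90-day series with a fixed start date, so that the date strings are guaranteed to fall inside the index. The variable names and values are illustrative only.

```python
import pandas as pd

# Hypothetical 90-day series starting on a fixed date; values are 0..89
idx = pd.date_range("2020-11-01", periods=90, freq="D")
ts_df = pd.DataFrame({"data": range(90)}, index=idx)

# One day, selected with a date-string slice
one_day = ts_df.loc["2020-12-24":"2020-12-24"]

# A date range; label-based slices include both endpoints
date_range_df = ts_df.loc["2020-11-24":"2020-12-30"]

# Every row in a given month, using a partial date string
december = ts_df.loc["2020-12"]

print(len(one_day), len(date_range_df), len(december))
```

A slice whose dates fall outside the index returns an empty data frame rather than raising an error, which is worth checking when a selection unexpectedly contains no rows.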
Importing Time Series Data
In this section we use pandas to import time series data stored in a CSV file. The same steps can be adapted to time series data stored in other file formats. You will need to be familiar with importing files and with the basics of handling time series data in Python. The dataset that we are going to use is available here. It consists of daily river discharge measurements recorded from 2015 to 2019. We will import the data, compute summary statistics, aggregate the series to different time scales, and compute rolling statistics. Below is the script and corresponding commentary. You can download the entire Notebook here.
# Import packages
import pandas as pd
# Read the CSV file (the script expects 'Date' and 'Flow' columns)
flow_csv = pd.read_csv('flow.csv')
# Parse the Date column and set it as the index
flow_csv.index = pd.to_datetime(flow_csv['Date'])
# Resample to daily frequency; days absent from the file appear as NaN rows,
# which is one way to detect missing dates
flow_df = flow_csv[['Flow']].resample('D').mean()
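The way daily resampling exposes missing dates can be seen on a tiny example. The sketch below uses a hypothetical four-row record with two days deliberately left out; the values are made up for illustration and are not taken from the river discharge dataset.

```python
import pandas as pd

# Hypothetical daily record with 2019-01-03 and 2019-01-04 missing
dates = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-05", "2019-01-06"])
flow = pd.DataFrame({"Flow": [10.0, 12.0, 9.0, 11.0]}, index=dates)

# Resampling to daily frequency inserts a NaN row for every absent day
daily = flow.resample("D").mean()

# The dates whose value is NaN are the gaps in the resampled series
missing_dates = daily.index[daily["Flow"].isna()]

print(missing_dates)
```

Note that a NaN in the resampled series can also come from a day that was present in the file but had no recorded value, so this check flags both kinds of gap.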
# Visualize the data
flow_df.plot(rot = 45)
# Summary statistics
flow_df.describe()
# Aggregate data: monthly, quarterly, and annual means
# Note: pandas 2.2+ deprecates the 'M', 'Q', and 'A' aliases in favor of 'ME', 'QE', and 'YE'
# Compute monthly average
flow_month = flow_df.resample('M').mean()
# Compute quarterly average
flow_quarterly = flow_df.resample('Q').mean()
# Compute annual average
flow_yearly = flow_df.resample('A').mean()
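The frequency aliases accepted by resample have changed across pandas releases (pandas 2.2 introduced 'ME', 'QE', and 'YE' for month, quarter, and year end). One way to sidestep the difference is to group by calendar period instead. This is an alternative technique, not the resample approach used in this tutorial, and the data below is hypothetical.

```python
import pandas as pd

# Hypothetical daily flow values covering two full years (2018 and 2019)
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")
flow_df = pd.DataFrame({"Flow": range(len(idx))}, index=idx)

# Group by calendar month, quarter, and year using period labels
flow_month = flow_df.groupby(flow_df.index.to_period("M")).mean()
flow_quarterly = flow_df.groupby(flow_df.index.to_period("Q")).mean()
flow_yearly = flow_df.groupby(flow_df.index.to_period("Y")).mean()

print(flow_month.head())
print(flow_yearly)
```

The averages match what resample would give for the same bins, but the result is indexed by periods (for example 2018-01) rather than by period-end timestamps.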
# Rolling mean over a window of 12 observations (12 days for daily data);
# the first 11 values are NaN because the window is incomplete
flow_roll = flow_df.rolling(12).mean()
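The rolling window can be specified in more than one way. The sketch below, on hypothetical data, compares an integer window (a fixed number of rows), the same window with min_periods relaxed, and an offset window (a fixed span of time).

```python
import pandas as pd

# Hypothetical 20-day series with values 0.0 .. 19.0
idx = pd.date_range("2019-01-01", periods=20, freq="D")
flow_df = pd.DataFrame({"Flow": [float(v) for v in range(20)]}, index=idx)

# Integer window: mean of the last 12 rows; the first 11 results are NaN
roll_rows = flow_df.rolling(12).mean()

# min_periods=1 returns a value as soon as one observation is available
roll_partial = flow_df.rolling(12, min_periods=1).mean()

# Offset window: mean over the trailing 12 calendar days
roll_days = flow_df.rolling("12D").mean()

print(roll_rows.tail(3))
print(roll_partial.head(3))
print(roll_days.head(3))
```

For a gap-free daily series, the integer and offset windows agree once 12 rows are available. They differ when dates are missing from the index, because the offset window always spans 12 days regardless of how many rows fall inside it.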