Manipulation & Handling of Time Series Data

Creating Time Series Data

This section shows how to create a time series dataset from random values in Python. You will need a basic understanding of how to create pandas data frames and date objects in Python.

First, we import the required packages.


# Import required packages

import random

import pandas as pd

from datetime import datetime


After importing the packages, we draw 90 unique random integers from 5 to 99: range(5, 100) excludes 100, and random.sample draws without replacement. We then generate a list of dates whose length matches the sampled data, as shown below.


# Sample 90 unique random integers from 5 to 99

randomlist = random.sample(range(5, 100), 90)

# Set the number of days

no_of_days = len(randomlist)

# Start date - today's date (the value also carries the current time of day)

start_date = datetime.today()

# Create a list of dates

datelist = pd.date_range(start_date, periods=no_of_days).tolist()
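
Because datetime.today() returns the current time as well as the date, every timestamp in the generated list carries that time of day. The sketch below illustrates the difference, using a fixed datetime as a stand-in for datetime.today() (an assumption made so the result is repeatable) and the normalize option of pd.date_range to snap the dates to midnight.

```python
from datetime import datetime

import pandas as pd

# Stand-in for datetime.today(); the fixed value is an assumption for repeatability
now = datetime(2020, 11, 1, 14, 30)

# Dates keep the 14:30 time component
with_time = pd.date_range(now, periods=3)

# normalize=True resets the start to midnight before generating the range
midnight = pd.date_range(now, periods=3, normalize=True)

print(with_time)
print(midnight)
```

Midnight timestamps are usually easier to read and to compare with dates stored in other files, but either form works as an index.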


To create the time series data frame, we convert the list of values to a data frame and set the list of dates as its index. The resulting data frame has a date-based index, which lets us select and manipulate the data by date.


# Create a pandas data frame

ts_df = pd.DataFrame({'data':randomlist})

# Set date as index

ts_df.index = datelist
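
As a variation on the steps above, the index can also be passed directly to the DataFrame constructor. This is a minimal sketch, not a replacement for the code above: the seed value, the fixed start date, and the index name "date" are assumptions added so that the example gives the same result on every run.

```python
import random

import pandas as pd

random.seed(0)  # assumed seed, only to make the sample repeatable

# 90 unique integers from 5 to 99
values = random.sample(range(5, 100), 90)

# Assumed fixed start date; one timestamp per day, one per value
index = pd.date_range("2020-11-01", periods=len(values), freq="D", name="date")

# Build the data frame and attach the dates in a single step
ts_df = pd.DataFrame({"data": values}, index=index)

print(ts_df.head())
print(type(ts_df.index).__name__)
```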


At this point we can visualize, slice, or subset the time series data as shown below. This completes the steps required to create a time series dataset in Python. You can download the entire code from my Google Colab Notebook here. Next, we will explore how to import time series data into Python.

# Plot time series

ts_df.plot()

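# Select data for a specific date
# (example dates: they must fall within the generated index,
# which starts on the day the script is run)

ts_date = ts_df.loc['2020-12-24':'2020-12-24']

# Select data within a given date range

ts_daterange = ts_df.loc['2020-11-24':'2020-12-30']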
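
The selections above only return rows when the requested dates fall inside the index. The following sketch assumes a fixed start date of 2020-11-01 and uses the integers 0 to 89 as data (both are assumptions for illustration), so the same rows are selected whenever it is run. It also shows selecting a whole month by passing a partial date string.

```python
import pandas as pd

# Assumed fixed start date so the selected dates always exist
index = pd.date_range("2020-11-01", periods=90, freq="D")
ts = pd.DataFrame({"data": list(range(90))}, index=index)

# A single day, returned as a one-row data frame
one_day = ts.loc["2020-12-24":"2020-12-24"]

# A date range; both end points are included
date_range = ts.loc["2020-11-24":"2020-12-30"]

# A partial string selects every row in that month
december = ts.loc["2020-12"]

print(one_day)
print(len(date_range), len(december))
```

Unlike integer slicing, label-based slicing on dates includes the end point, which is why the date range above contains 30 December.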


Importing Time Series Data

In this section we will use pandas to import time series data stored in a CSV file. The steps outlined below can be adapted to time series data stored in other file formats. You will need to be familiar with importing files and with the basics of handling time series data in Python. The dataset that we are going to use is available here. It consists of daily river discharge measurements recorded from 2015 to 2019. We will import the data, compute summary statistics, aggregate the series to different time scales, and compute rolling statistics. Below is the script and corresponding commentary. You can download the entire Notebook here.


# Import packages

import pandas as pd

# Read the CSV file (expects 'Date' and 'Flow' columns)

flow_csv = pd.read_csv('flow.csv')

# Set date as index

flow_csv.index = pd.to_datetime(flow_csv['Date'])

# Resample to daily frequency - dates missing from the file show up as NaN rows
# (the result already has a datetime index, so no further conversion is needed)

flow_df = flow_csv[['Flow']].resample('D').mean()

# Visualize the data

flow_df.plot(rot = 45)

# Summary statistics

flow_df.describe()

# Aggregate data - monthly, quarterly, and annually
# Note: pandas 2.2 and later deprecate 'M', 'Q', and 'A' in favor of 'ME', 'QE', and 'YE'

# Compute monthly average

flow_month = flow_df.resample('M').mean()

# Compute quarterly average

flow_quarterly = flow_df.resample('Q').mean()

# Annual average

flow_yearly = flow_df.resample('A').mean()

# Rolling mean over a window of 12 observations (12 days for daily data)

flow_roll = flow_df.rolling(12).mean()
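
The workflow above can be tried without the data file. The sketch below is a small stand-in, not the river discharge dataset: the four CSV rows are made-up values, with 2015-02-01 left out on purpose to show how a gap surfaces after resampling. The column names Date and Flow follow the script above; the window size of 3 and min_periods=2 are assumptions chosen to suit such a short series.

```python
import io

import pandas as pd

# Made-up sample data; 2015-02-01 is deliberately missing
csv_text = """Date,Flow
2015-01-30,10.0
2015-01-31,12.0
2015-02-02,14.0
2015-02-03,18.0
"""

# Parse the dates and set them as the index while reading
flow = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Daily resampling inserts a NaN row for every date absent from the file
daily = flow[["Flow"]].resample("D").mean()
missing_dates = daily.index[daily["Flow"].isna()]

# Monthly mean by grouping on the calendar month (NaN values are skipped)
monthly = daily.groupby(daily.index.to_period("M")).mean()

# Rolling mean that tolerates one missing value per 3-day window
roll = daily["Flow"].rolling(window=3, min_periods=2).mean()

print(missing_dates)
print(monthly)
print(roll)
```

Grouping on to_period("M") is shown here as one alternative to resample for monthly aggregation. With min_periods left at its default, any window containing a missing value would produce NaN.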