Data Cleaning For A Time Series

The art of uncovering the insights and trends in data has been around since ancient times. The ancient Egyptians accurately predicted the flooding of the Nile. Another example from modern time, the invention of computers dramatically enhanced the data analysis. The 1880 Census in the US took over 7 years to arrive at a final report. After the invention of Tabulating Machine, the next census finished in only 18 months. Today Data aggregation has grown so fast and how we think about data has also changed and will change more radically. Being a professional working in the technology space likes to consider this an opportunity. This and the following series of posts will be a documentation of the lessons learned and attempt to push the skill sets to the next level.

Time series is a series of data points sequenced in time that can reveal patterns. For example, analyzing past COVID-19 Time series data is very useful for predicting the future. The prediction uses a combination of statistical and machine learning methods. This series is to pull the confirmed COVID-19 cases in Australia from Johns Hopkins University GitHub repository and initiate the analysis. In future series will look at options to superimpose the findings on a map and finally automating all these processes for recurring usage. A close look at the datasets reveals that the data mapping is only on an axis.

Preparation:

The tools required for this exercise is JupyterLab, which is a web-based interactive development environment for similar tasks. JupyterLab can be installed using or . This document is capturing the installation through pip. Start the installation with JupyterLab followed by the Jupyter Notebooks, which is useful for writing and iterating on programming code for analyzing data. Programming for this analysis uses Python. Another tool to install is Pandas, which is a powerful data analysis and a manipulating tool built on top of Python. Once the installation is successful, launch the notebook.

$ pip3 install jupyterlab
$ pip3 install notebook
$ pip3 install pandas
$ jupyter notebook

For users preferring user interface to the terminal can download Anaconda specific for the operating system, run the installer and follow guidance for similar results.

Incubation:

Starting with importing Pandas under the alias pd is a general practice. Will use a Pandas function to read the file to a DataFrame, as the file in the GitHub repository is a csv file (comma-separated values). The file location is the mandatory parameter for this Pandas function. This function supports optionally, iterating or breaking of the file into chunks and many more. Refer documentation for more details.

Using the head function retrieves the top rows for the object based on position. By default, the function retrieves the first five lines. Head function is very handy especially for quickly testing if the object has the right type of data in it. When passing the number of lines for retrieving as a parameter to this function then it overrides the default value. For negative values, this function returns all rows except the last rows mentioned as values.

import pandas as pd
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
pd.read_csv(confirmed_url)
Image for post
Image for post
Displays the read data

Illumination:

Importing the csv file as DataFrame is for splitting the data between the x-axis and y-axis for creating the time series. The values in Country, Province, Latitude and Longitude field has to be in an axis and confirmed cases value for each day in another axis. Modifying the shape of DataFrame from wide to long format is easier with the melt function in Pandas. The melt function returns a DataFrame with a format where one or more columns are identifier variables defined through id_vars variable. In our example Country, Province, Latitude and Longitude field are the identifier variables. The columns not part of the identifier variables are the measured variables and values. In our example, the value of the confirmed case for each day. It is possible to rename the default variable name using the optional var_name parameter in the melt function. Similarly, the optional value_name parameter also exists for the values.

confirmed_cases = pd.read_csv(confirmed_url)
confirmed_cases.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name = 'Date', value_name = 'Confirmed')
Image for post
Image for post
Global data displayed after melting and renaming the columns

Verification:

The final step is packaging all these together in a Python function for reusing. Since our focus is only for COVID-19, confirmed cases among people of Australia need to include a filter to remove data of other nationalities before melting. Upon completion of typing the function, click the Run button in Jupyter notebook to ensure correct syntax. At this stage, no results will display because of returning the DataFrame. Verify the function by calling the function and save the DataFrame in a variable. Analyse the DataFrame using sample function in Pandas, which returns a random sample of items from an axis of the object and many more things.

def reshaped_data(file_name, status):
loaded_data = pd.read_csv(file_name)
auzzie_data = loaded_data[loaded_data['Country/Region'] ==
'Australia']
return auzzie_data.melt(id_vars=['Province/State',
'Country/Region', 'Lat', 'Long'], var_name = 'Date',
value_name = 'Confirmed')
confirmed_data = reshaped_data(confirmed_url, "Confirmed")
confirmed_data.sample()
Image for post
Image for post
Sample melted data for Australia

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store