Merging Data Sets: Uncovering Temperature Correlations in Chicago Bus Ridership

Using Pandas Built-In SQL-Like Commands to Join Disparate Data Sets

Posted by David Kaleko on Wed 01 February 2017

Lately I've been having fun exploring data sets from the Chicago open data portal. I think it's great that the city is providing access to government data and encouraging people to explore it. One of the first data sets I looked at involved ridership on the public buses. It turns out it's very well documented: you can look at how many people rode buses on each route, every day, for the past several years. I downloaded a simple csv file here and began exploring it. If you're curious yourself, here is a link to similar data that might be useful. There is even an API for stop-by-stop data for each bus that live-updates, which I haven't tried out yet. Here's a blog post I found about it, describing a project done by someone in the Data Science for Social Good fellowship. It's an interesting read if you too care about the public bus system in the City of Chicago!

After exploring the dataset and training a few basic models to try to predict ridership, I began to wonder what other external features might be useful for such a prediction. Maybe I could scrape some information about public gatherings like sporting events to predict overcrowding on routes leading to stadiums? Perhaps some buses have fewer riders than average on public school holidays? I decided to try to determine whether people ride the bus more when it's nice outside. If you're looking for me and it's -20F out (thanks, polar vortex), don't bother checking buses because I'll absolutely be at home.

In this blog post, I will demonstrate how to clean and merge two data sets from different sources: daily historical bus ridership and daily weather data. I'll use timestamps to join the tables together, then search for correlations between the two data sets.

The Question

Does ridership on Chicago public buses correlate with air temperature?

The Data

The ridership data is publicly accessible through the Chicago open data portal, and (obviously) there's a plethora of historical weather data out there to be had. I chose to get mine from the National Oceanic and Atmospheric Administration (NOAA) website. You can request which features interest you (air temperature, precipitation type, precipitation amount, etc.), select the date range and location of interest, and download the data for free. I downloaded my own csv for the Chicago area to include in this analysis.

The Tools

In this quick study, I'll be using numpy, matplotlib, and pandas. Once the two datasets are stored in pandas dataframes, I can take advantage of the SQL-esque join functions that pandas offers.

The Analysis

The analysis in this post is quite simple. I want to compare public bus ridership on different routes with air temperature, so I'll use a metric ($R^2$ from a linear fit, for example) to quantify the amount of correlation. This blog post is mostly meant to demonstrate how to merge two data sets from different sources (ridership and weather). That said, the results are pretty interesting, too!

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let's start by reading in the bus ridership csv file I downloaded, then doing some cleaning to prepare it to be merged with weather data.

In [2]:
myRidershipFile = 'data/CTA_-_Ridership_-_Bus_Routes_-_Daily_Totals_by_Route.csv'
df = pd.read_csv(myRidershipFile,index_col=False)
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 701321 entries, 0 to 701320
Data columns (total 4 columns):
route      701321 non-null object
date       701321 non-null object
daytype    701321 non-null object
rides      701321 non-null int64
dtypes: int64(1), object(3)
memory usage: 26.8+ MB

The 'route' column is self-explanatory: there are 182 routes, and each name is stored as a string (despite many of the route names being numbers). 'daytype' indicates whether the day is a weekday, Saturday, or Sunday/holiday ('W', 'A', or 'U', respectively). The 'date' column is also a string. Since this will ultimately be the field on which we perform our SQL-like join with the weather data, it's best to convert it to actual python datetime objects... we'll use the pandas built-in to_datetime function.
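Before converting, it's easy to double-check those claims about 'route' and 'daytype' with a quick look (a sketch):

print(df['route'].nunique())   # number of distinct routes (182)
print(df['daytype'].unique())  # the three day-type codes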

In [4]:
df['date_dt'] = pd.to_datetime(df['date'],format='%m/%d/%Y')  # dates arrive as mm/dd/yyyy strings
df['year'] = df['date_dt'].dt.year  # handy for the per-year grouping later

Now let's read in the NOAA weather data and clean it up a little. We need to make sure we end up with a column called 'date_dt' that holds datetime objects. Having the same column name and type as in the bus ridership dataframe is crucial for the join we'll be doing later.

In [8]:
myWeatherFile = 'data/832070.csv'
wdf = pd.read_csv(myWeatherFile,index_col=False)
In [9]:
wdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5778 entries, 0 to 5777
Data columns (total 9 columns):
STATION         5778 non-null object
STATION_NAME    5778 non-null object
DATE            5778 non-null int64
PRCP            5778 non-null float64
SNWD            5778 non-null float64
SNOW            5778 non-null float64
TAVG            5778 non-null int64
TMAX            5778 non-null int64
TMIN            5778 non-null int64
dtypes: float64(3), int64(4), object(2)
memory usage: 451.4+ KB

With some quick exploring, I find that the TAVG column (average temperature) for some reason has several years' worth of -999 sentinel values in it, but TMAX and TMIN look better. I'll just define my own average temperature as (TMAX + TMIN)/2. There are a few days where even this temperature is undefined, so we'll throw those out too.
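If you want to see the problem for yourself, a quick check along these lines (a sketch, using the columns from the info() output above) counts the sentinel values in each temperature column:

# Count the -999 sentinel values in each temperature column.
for col in ['TAVG', 'TMAX', 'TMIN']:
    print(col, (wdf[col] == -999).sum())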

In [11]:
wdf['MAXMINAVG'] = (wdf['TMAX'] + wdf['TMIN']) / 2.  # average of the daily max and min
wdf = wdf.query('MAXMINAVG > -999')  # drop days where this average is still the -999 sentinel

Let's make the datetime column in the weather dataframe now. The dates in this data are annoyingly in a different format (yyyymmdd values), but that's no problem for the to_datetime function.

In [12]:
wdf['date_dt'] = pd.to_datetime(wdf['DATE'],format='%Y%m%d')  # DATE column holds yyyymmdd values

Now comes the reason you're still reading: the pandas merge between the ridership dataframe and the weather dataframe. While the weather dataframe has exactly one entry per day, the ridership dataframe has 182 entries per day (since there are that many routes). I want my merged dataframe to also have 182 entries per day, with the weather information for that day duplicated across them. I also want to throw out any ridership days that don't have a matching weather day, and vice versa. The prescribed SQL join for this is an inner join. In SQL lingo, the join looks something like:

SELECT * FROM df INNER JOIN wdf ON df.date_dt = wdf.date_dt

but in pandas it looks like this:

In [13]:
mergedf = df.merge(wdf, on='date_dt', how='inner')
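
It's worth a quick sanity check that the merge behaved as expected (a sketch; the exact numbers depend on the overlap between the two files):

print(mergedf.shape)  # total rows after the inner join
print(mergedf['date_dt'].min(), mergedf['date_dt'].max())  # surviving date range
print(mergedf.groupby('date_dt').size().describe())  # rows per day (roughly one per route)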

With the merge completed, some quick simple plots can be made from the merged dataframe. Check this one out!

In red is the average air temperature each Saturday over a four-year span, peaking at around 80F in the summer months and dipping to around 10F in the winter. In green is the daily ridership on those same Saturdays, specifically on Route 151. This certainly looks like a correlation to me, though I have to admit I chose the route and years with the strongest correlation. They don't all look this good!
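For reference, a plot along those lines can be produced something like this (a sketch; it uses the Saturday day-type code 'A' and the route name '151' as they're stored in the data):

# Saturdays only, for a single route.
sat = mergedf.query("route == '151' and daytype == 'A'")

fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.plot(sat['date_dt'], sat['MAXMINAVG'], color='red')
ax1.set_ylabel('Avg. daily temperature [F]', color='red')

# Second y-axis sharing the same time axis.
ax2 = ax1.twinx()
ax2.plot(sat['date_dt'], sat['rides'], color='green')
ax2.set_ylabel('Saturday rides, Route 151', color='green')

plt.title('Route 151 Saturday ridership and average temperature')
plt.show()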

The best way to see the correlation between two variables directly is to plot one against the other. In this case I want to plot ridership versus average daily temperature. Since I want to do this for all of the routes, and ridership varies widely between routes, I normalize by instead plotting each day's ridership as the number of standard deviations from that route's yearly mean. Here's what the most correlated route (Route 151) and the least correlated route (Route 106) look like.
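A minimal sketch of that normalization (assuming a per-route, per-year z-score, which is my reading of "standard deviation from yearly means"):

# Convert raw ride counts to z-scores within each (route, year) group.
grouped = mergedf.groupby(['route', 'year'])['rides']
mergedf['rides_norm'] = (mergedf['rides'] - grouped.transform('mean')) / grouped.transform('std')

# Scatter plot for a single route: normalized ridership vs. temperature.
r151 = mergedf.query("route == '151'")
plt.scatter(r151['MAXMINAVG'], r151['rides_norm'])
plt.xlabel('Avg. daily temperature [F]')
plt.ylabel('Ridership [std. devs from yearly mean]')
plt.show()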

Curious how I determined the most and least correlated routes? I made this plot for each route, used numpy's polyfit function to do a least-squares polynomial fit of degree 1 (linear), and stored the resulting $R^2$ of the fit. The above two plots show that polyfit line in blue, with the $R^2$ value in the legend. Here's what the distribution of $R^2$ values looks like for all of the routes.
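The per-route loop might look something like this (a sketch, reusing the rides_norm column from above; numpy's polyfit returns only the fit coefficients, so $R^2$ is computed by hand from the residuals):

r_squareds = {}
for route, grp in mergedf.dropna(subset=['rides_norm']).groupby('route'):
    x, y = grp['MAXMINAVG'].values, grp['rides_norm'].values
    slope, intercept = np.polyfit(x, y, 1)  # least-squares fit of degree 1 (linear)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squareds[route] = 1 - ss_res / ss_tot

# Distribution of R^2 across all routes.
plt.hist(list(r_squareds.values()), bins=30)
plt.xlabel('$R^2$ of linear fit')
plt.ylabel('Number of routes')
plt.show()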

And there you have it, a simple yet interesting example of using an SQL-like join command in pandas to merge together data sets from different sources. I hope you found this interesting, and perhaps educational. Thanks for reading!

