Lately I've been having fun exploring data sets from the Chicago open data portal. I think it's great that the city is providing access to government data and encouraging people to explore it. One of the first data sets I looked at involved ridership on the public buses. It turns out it's very well documented... you can look at how many people rode buses on each route every day for the past several years. I went and downloaded a simple csv file here and began exploring it. If you're curious yourself, here is a link to similar data that might be useful. There is even an API for stop-by-stop data for each bus that live-updates which I haven't tried out yet. Here's an interesting blog post I found about it which describes a project done by someone in the Data Science for Social Good fellowship. It's an interesting read, if you too care about the public bus system in the City of Chicago!
After exploring the dataset and training a few basic models to try and predict ridership, I began to wonder what other external features might be useful for such a prediction. Maybe I could scrape some information about public gatherings like sporting events to predict overcrowding on routes leading to stadiums? Perhaps some buses will have fewer riders than average on public school holidays? I decided to try and determine if people ride the bus more when it's nice outside. If you're looking for me and it's -20F out (thanks, polar vortex), don't bother checking buses because I'll absolutely be at home.
In this blog post, I will demonstrate how to clean and merge together two data sets from different sources. Here I will be combining historical bus riderships per day with weather data per day. I'll use timestamps to join the tables together and I'll try to search for correlations within the two data sets.
Does ridership on Chicago public buses correlate with air temperature?
The ridership data is publicly accessible through the Chicago open data portal, and (obviously) there's a plethora of historical weather data out there to be had. I chose to get my data from the National Oceanic and Atmospheric Administration (NOAA) website. You can request which features interest you (air temperature, precipitation type, precipitation amount, etc), select the date ranges and location of interest, and download the data for free. I downloaded my own csv for the Chicago area to include in this analysis.
In this quick study, I'll be using numpy, matplotlib, and pandas. With the two datasets stored in pandas dataframes, there are some nice SQL-esque join functions that pandas offers.
The analysis in this post is quite simple. I want to compare public bus ridership on different routes with air temperature, so I'll develop a metric ($R^2$ from a linear fit, for example) to quantify the amount of correlation. This blog post is mostly meant to demonstrate how to merge two data sets from different sources (ridership and weather). That said, the results are pretty interesting, too!
```python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
Let's start by reading in the bus ridership csv file I downloaded, then doing some cleaning to prepare it to be merged with weather data.
```python
myRidershipFile = 'data/CTA_-_Ridership_-_Bus_Routes_-_Daily_Totals_by_Route.csv'
df = pd.read_csv(myRidershipFile, index_col=False)
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 701321 entries, 0 to 701320
Data columns (total 4 columns):
route      701321 non-null object
date       701321 non-null object
daytype    701321 non-null object
rides      701321 non-null int64
dtypes: int64(1), object(3)
memory usage: 26.8+ MB
```
The 'route' column is self-explanatory: there are 182 routes, and each name is stored as a string (despite many of the route names being numbers). 'daytype' indicates whether the day is a weekday, Saturday, or Sunday/holiday ('W', 'A', or 'U' respectively). The date column is a string too. Since this will ultimately be the field on which we perform our SQL-like join with the weather data, it's probably best to convert this column to an actual Python datetime object... we'll use the built-in pandas to_datetime function.
```python
df['date_dt'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['year'] = df['date_dt'].dt.year
```
Now let's read in the NOAA weather data and clean it up a little. We need to make sure we end up with a column called 'date_dt' that holds datetime objects. Having the same column name and type as in the bus ridership dataframe is crucial for the join we'll be doing later.
```python
myWeatherFile = 'data/832070.csv'
wdf = pd.read_csv(myWeatherFile, index_col=False)
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5778 entries, 0 to 5777
Data columns (total 9 columns):
STATION         5778 non-null object
STATION_NAME    5778 non-null object
DATE            5778 non-null int64
PRCP            5778 non-null float64
SNWD            5778 non-null float64
SNOW            5778 non-null float64
TAVG            5778 non-null int64
TMAX            5778 non-null int64
TMIN            5778 non-null int64
dtypes: float64(3), int64(4), object(2)
memory usage: 451.4+ KB
```
With some quick exploring, I find that the TAVG column (average temperature) for some reason has several years' worth of -999 values in it, but TMAX and TMIN look better. I'll just define my own average temperature as (TMAX + TMIN) / 2. There are a few days where even this temperature is undefined, so we'll throw those out too.
```python
wdf['MAXMINAVG'] = (wdf['TMAX'] + wdf['TMIN']) / 2.
wdf = wdf.query('MAXMINAVG > -999')
```
Let's make the datetime column in the weather dataframe now. The date string in this data is annoyingly in a different format, but that's no problem for the format argument of to_datetime.
```python
wdf['date_dt'] = pd.to_datetime(wdf['DATE'], format='%Y%m%d')
```
Now comes the reason you're still reading! The pandas merge between the ridership dataframe and the weather dataframe. While the weather dataframe has exactly one entry per day, the ridership dataframe has 182 entries per day (since there are that many routes). I want my merged dataframe to also have 182 entries per day, with duplicates of the weather information for that day. Also I want to throw out any ridership days that don't have matched weather days, and vice versa. The prescribed SQL join to do this is an inner join. In SQL lingo this join should look something like:
```sql
SELECT *
FROM df
INNER JOIN wdf
ON df.date_dt = wdf.date_dt
```
but in pandas it looks like this:
```python
mergedf = df.merge(wdf, on='date_dt', how='inner')
```
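To see what that inner join actually does to the row counts, here's a minimal sketch with toy stand-ins for the two dataframes (the column names follow the post, but the values are made up): three routes over three days of ridership, plus one weather day with no ridership match.

```python
import pandas as pd

# Toy stand-ins for the real dataframes: 3 routes x 3 days of rides,
# and 4 days of weather (one day has no ridership match).
df = pd.DataFrame({
    'date_dt': pd.to_datetime(['2016-01-01'] * 3 + ['2016-01-02'] * 3 + ['2016-01-03'] * 3),
    'route': ['3', '151', '106'] * 3,
    'rides': [100, 200, 50, 110, 210, 55, 90, 190, 45],
})
wdf = pd.DataFrame({
    'date_dt': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04']),
    'MAXMINAVG': [30.5, 12.0, 25.0, 40.0],
})

mergedf = df.merge(wdf, on='date_dt', how='inner')

# The inner join keeps one row per (route, day), duplicating that day's
# weather across all routes; the unmatched weather day (Jan 4) is dropped.
print(len(mergedf))                  # 9 rows: 3 routes x 3 matched days
print(mergedf['date_dt'].nunique())  # 3 distinct days survive
```

One row per route per matched day, with the weather columns repeated, which is exactly the shape we want for the per-route analysis below.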
With the merge completed, some quick simple plots can be made from the merged dataframe. Check this one out!
In red is the average air temperature each Saturday over a four year span, peaking at around 80F in the summer months, and at around 10F in the winter. In green is the daily ridership on those same Saturdays, specifically on Route 151. This certainly looks like a correlation to me. Though I have to admit, I chose the route and years that had the strongest correlation. They don't all look this good!
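A two-series, two-scale figure like that can be sketched with matplotlib's twinx. The data below is synthetic (a seasonal temperature cycle plus correlated noise standing in for the real merged dataframe), just to show the dual-axis mechanics.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Synthetic stand-ins: 4 years of weekly "Saturday" temperatures following
# a seasonal cycle, plus a ridership series loosely tied to temperature.
rng = np.random.default_rng(0)
weeks = np.arange(4 * 52)
temp = 45 + 35 * np.sin(2 * np.pi * weeks / 52)          # peaks ~80F, dips ~10F
rides = 4000 + 25 * temp + rng.normal(0, 300, len(weeks))

fig, ax1 = plt.subplots(figsize=(10, 4))
ax1.plot(weeks, temp, color='red')
ax1.set_xlabel('week')
ax1.set_ylabel('avg temperature (F)', color='red')

# Second y-axis sharing the same x-axis for the ridership series.
ax2 = ax1.twinx()
ax2.plot(weeks, rides, color='green')
ax2.set_ylabel('Saturday rides', color='green')

fig.savefig('temp_vs_rides.png')
```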
The best way to see the correlation between two variables directly is to plot one versus the other. In this case I want to plot ridership versus average daily temperature. Since I want to do this for all of the routes, and ridership between routes varies widely, I normalize by instead plotting the number of standard deviations from the yearly mean for each route. Here's what the most correlated route (Route 151) and the least correlated route (Route 106) look like.
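That per-route, per-year normalization can be sketched with a pandas groupby transform. This is a minimal example on made-up data (the `mergedf`, `route`, `year`, and `rides` names follow the post; the values don't):

```python
import numpy as np
import pandas as pd

# Toy merged dataframe: two routes over two years with very different scales.
rng = np.random.default_rng(0)
mergedf = pd.DataFrame({
    'route': np.repeat(['151', '106'], 200),
    'year': np.tile(np.repeat([2015, 2016], 100), 2),
    'rides': np.concatenate([rng.normal(9000, 800, 200),    # busy route
                             rng.normal(1200, 150, 200)]),  # quiet route
})

# Express each day's rides as standard deviations from that route's
# yearly mean, so routes of very different sizes share one scale.
grp = mergedf.groupby(['route', 'year'])['rides']
mergedf['rides_z'] = (mergedf['rides'] - grp.transform('mean')) / grp.transform('std')

# Each (route, year) group is now centered at 0 with unit spread.
print(mergedf.groupby(['route', 'year'])['rides_z'].agg(['mean', 'std']).round(6))
```

After this, a z-score of +1 means the same thing on a busy route as on a quiet one, which makes the scatter plots against temperature comparable across routes.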
Curious how I determined the most and least correlated routes? I made this plot for each route, used numpy's polyfit function to do a least-squares polynomial fit of degree 1 (linear), and stored the resulting $R^2$ of the fit. The above two plots show that polyfit line in blue, with the $R^2$ value in the legend. Here's what the distribution of $R^2$ values looks like for all of the routes.
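The fit-and-score step looks roughly like this. Note that np.polyfit returns the line's coefficients but not $R^2$, so the $1 - SS_{res}/SS_{tot}$ calculation is done by hand; the synthetic temperature and z-score arrays here are stand-ins for one route's real data.

```python
import numpy as np

# Synthetic stand-ins for one route: temperatures and ridership z-scores
# with a linear relationship plus noise.
rng = np.random.default_rng(1)
temp = rng.uniform(-20, 95, 300)
rides_z = 0.04 * (temp - temp.mean()) + rng.normal(0, 0.5, 300)

# Degree-1 least-squares fit, as done per route in the post.
slope, intercept = np.polyfit(temp, rides_z, 1)
pred = slope * temp + intercept

# R^2 = 1 - SS_res / SS_tot: fraction of variance explained by the line.
ss_res = np.sum((rides_z - pred) ** 2)
ss_tot = np.sum((rides_z - rides_z.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

Running this per route and collecting `r_squared` for each gives the distribution in the histogram above.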
And there you have it, a simple yet interesting example of using an SQL-like
join command in pandas to merge together data sets from different sources. I hope you found this interesting, and perhaps educational. Thanks for reading!