Quantifying Difficulty of L.A. Times Daily Crosswords: Are Mondays Harder than Tuesdays?
One of the ways I keep in touch with a good college friend of mine is by occasionally doing crosswords over Skype. Back when we were in school together, we would do the crossword in the school paper (which ran the same crosswords as the L.A. Times) over lunch. We definitely aren't experts, but we could usually manage the Monday, Tuesday, and Wednesday puzzles (crosswords get more difficult throughout the week, ending with Saturday, which is impossible for a simpleton like myself). Sundays are supposed to be easier than Saturdays, but they're larger and therefore tend to take more time.
The Question
For years now, we've had a hunch that the Monday puzzles are actually slightly harder than the Tuesdays, but we haven't had a way to validate it. Unfortunately I haven't kept any data on our own solving ability (solve times, puzzle completion), and even if I had, it would probably be statistics-limited and furthermore fraught with outliers (taking long breaks to chat about other things, for example).
The Data
Since many people nowadays do the crossword online (as my friend and I do), I was hoping to get my hands on a database of user information... but to the best of my knowledge, this is proprietary and not publicly accessible. Therefore, I'll have to find a simpler solution.
Lucky for me, I found this blog, written by a man named Bill Butler, who is essentially a professional crossword solver. Each day he posts the solution to that day's puzzle, along with the time it took him to solve it. As I'll show in this post, he is remarkably consistent, and he solves puzzles several times faster than my friend and I do.
The Analysis
My general approach to determining crossword difficulty as a function of the day of the week is to scrape Bill's past blog posts, extract the solve time and day of the week for each puzzle, wrangle the data into a usable format (a pandas dataframe), and crunch some numbers. I'll note that in my past blog posts, I've been transparent about all of the code I've used; in this case, most of the web scraping is remarkably similar to my post on John Bartholomew's influence on a popular chess website, so I'll hide it for readability.
I've hidden some cells in this notebook where I define functions that take a BeautifulSoup object as input and scrape a specific piece of information from the page. They throw an exception when they fail (for example, on the one drastically different blog post in which Bill doesn't solve a puzzle but instead announces changes to the blog's formatting). Conveniently, each of Bill's blog posts corresponds to exactly one puzzle. For each puzzle, I store Bill Butler's solve time (in seconds), the day of the week for that puzzle, and the author of the puzzle (though the author isn't actually used).
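To give a flavor of what those hidden cells contain, here's a rough sketch of two of the helpers (the exact HTML structure and the "9m 38s" time format are assumptions for illustration, not necessarily how Bill formats his posts):

import re
import requests
from bs4 import BeautifulSoup

def getSoup(myurl):
    # Fetch the page and parse it into a BeautifulSoup object
    return BeautifulSoup(requests.get(myurl).text, 'html.parser')

def getBillsTimeSeconds(soup):
    # Look for a solve time written like "9m 38s" anywhere in the post text
    # (an assumed format; the real helper parses Bill's actual markup)
    match = re.search(r'(\d+)m\s*(\d+)s', soup.get_text())
    if match is None:
        raise ValueError('No solve time found on this page')
    return 60 * int(match.group(1)) + int(match.group(2))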
def scrapeURL(myurl):
    # Parse one blog post and pull out the solve time, day of the week, and author
    soup = getSoup(myurl)
    time = getBillsTimeSeconds(soup)
    day = getDay(soup)
    author = getAuthor(soup)
    return time, day, author
Now I loop over the past three years of Bill's blog posts (he only misses that one day!) and store what I want for each. I don't particularly like this way of storing the data, but it is only temporary. This format makes it easy to cast the results into a pandas dataframe.
days, times, authors = [], [], []
viable_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Loop over web pages, store the day and Bill's solve time in the lists above
ndays = 365*3
for ipage in xrange(1, ndays, 1):
    # Form the URL based on an obvious pattern
    myurl = 'http://www.laxcrossword.com/page/%d' % ipage
    # Try to scrape the URL and catch/report any exceptions.
    try:
        thisTime, thisDay, thisAuthor = scrapeURL(myurl)
    except Exception as e:
        #print 'EXCEPTION: %s' % e
        continue
    # This covers rare failure modes where there are typos in the day of the week,
    # or for some reason my scraper function returned a multi-word day of the week.
    if thisDay not in viable_days or len(thisDay.split(' ')) != 1:
        continue
    # Store the scraped day (string), time (integer), and puzzle author (string)
    days.append(thisDay)
    times.append(thisTime)
    authors.append(thisAuthor)
Now we use a handy pandas utility function, from_items(), which takes (column name, list of values) pairs and builds a dataframe.
mydf = pd.DataFrame.from_items([('Day', days), ('Time', times), ('Author', authors)])
mydf.head()
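If you prefer, the plain DataFrame constructor builds the same frame directly from a dict (passing columns just to pin down the column order):

# Equivalent construction using the plain DataFrame constructor
mydf = pd.DataFrame({'Day': days, 'Time': times, 'Author': authors},
                    columns=['Day', 'Time', 'Author'])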
After looking at some quick histograms, there were a few potential outliers where Bill took much longer than average to solve the puzzle. Later in this post I will be fitting these histograms with gaussian (normal) distributions, so I've decided to throw out these outliers since they degrade the quality of the fit (this removes a few percent of posts). I reject outliers by determining how many standard deviations each point is from the mean and cutting any point more than 2 standard deviations away. While this idea is easy to implement, I'll be upfront in saying I took this line of code directly from a Stack Overflow post. I love you, Stack Overflow!
def reject_outliers(data, m=2):
    # Keep only points within m standard deviations of the mean
    return data[abs(data - np.mean(data)) < m * np.std(data)]
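As a quick sanity check with made-up numbers, the one obvious outlier gets dropped while the rest survive:

# Toy example: the 1200-second entry sits far from the rest and gets rejected
fake_times = np.array([310, 295, 330, 305, 1200, 315])
print reject_outliers(fake_times, m=2)   # -> [310 295 330 305 315]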
Next I draw area-normalized histograms of the solve-time distributions for each day of the week (though I only draw Monday, Tuesday, and Wednesday below so as not to spam you with plots). I use scipy.stats.norm to fit a gaussian (normal) probability density function (PDF) to each, and I store the mean and standard deviation of each fit.
means_errs = {}
bins = np.linspace(200, 800, 30)
colors = {'Monday': 'r', 'Tuesday': 'b', 'Wednesday': 'g'}

# Loop over all seven days of the week
for day in viable_days:
    # Query the dataframe for only entries corresponding to this particular day of the week
    # and store the times in a numpy array
    times = np.array(mydf.query('Day == "%s"' % day)['Time'])
    # Drop the outliers
    times_trunk = reject_outliers(times, m=2)
    print "%s: \tFraction of data removed as outliers = %0.2f%%" % \
        (day, 100.*(1. - float(len(times_trunk))/len(times)))
    # Fit the data to a normal distribution
    mu, std = norm.fit(times_trunk)
    # Store the results of the fit
    means_errs[day] = (mu, std)
    # Only plotting the first three days for blog post readability
    if day not in ['Monday', 'Tuesday', 'Wednesday']: continue
    # Use matplotlib to actually draw the histogram
    myfig = plt.figure(figsize=(8,5))
    myhist = plt.hist(times_trunk, bins=bins, alpha=0.4,
                      label='%s: $\mu$=%0.1f, $\sigma$=%0.1f' % (day, mu, std),
                      normed=True, color=colors[day])
    # Plot the fitted PDF on top of the histogram
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 500)
    p = norm.pdf(x, mu, std)
    dummy = plt.plot(x, p, color=colors[day], linewidth=4)
    dummy = plt.grid(True)
    dummy = plt.xlabel('Bill\'s Solve Times [sec]', fontsize=16)
    dummy = plt.ylabel('Counts [Normalized]', fontsize=16)
    dummy = plt.legend(loc=1)
We can see from the histograms that these data are approximately normal. That makes intuitive sense; Bill is remarkably consistent, and apparently the L.A. Times does a good job at keeping crossword puzzle difficulty consistent for each day of the week as well.
The Result
Now to make the money plot. I will plot the mean of the fitted PDF for each day of the week and assign it an uncertainty equal to one standard deviation (that is, if the distributions were truly normal, Bill's solve time would fall within that band about 68% of the time). I hide this code as well (it's very straightforward matplotlib).
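For the curious, the hidden cell boils down to something like this (a sketch of the idea rather than the exact code), using the means_errs dictionary filled above:

# Sketch of the money plot: fitted mean per day, with a one-standard-deviation error bar
mus  = [means_errs[day][0] for day in viable_days]
stds = [means_errs[day][1] for day in viable_days]
myfig = plt.figure(figsize=(8,5))
dummy = plt.errorbar(range(len(viable_days)), mus, yerr=stds, fmt='o', capsize=4)
dummy = plt.xticks(range(len(viable_days)), viable_days, rotation=45)
dummy = plt.ylabel('Bill\'s Solve Time [sec]', fontsize=16)
dummy = plt.grid(True)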
Well would you look at that. It's clear that while Monday and Tuesday puzzles are obviously the easiest, Mondays are in fact not harder than Tuesdays. Perhaps when my friend and I solve puzzles there are external factors at play. Maybe starting off on a Monday after a long break leaves us "rusty" and thus slows us down... by the time we get to Tuesdays maybe we're warmed up. Who knows!
Also, Sundays are actually larger than Saturdays, so simply comparing solve times isn't sufficient. If you normalize the above plot by the number of squares in each puzzle, Sundays end up being about as difficult as Thursdays.
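As a rough illustration of that normalization (assuming the standard 15x15 daily and 21x21 Sunday grids, and ignoring black squares):

# Rough per-square normalization, assuming standard grid sizes
squares = {day: 15*15 for day in viable_days}
squares['Sunday'] = 21*21
for day in viable_days:
    mu, std = means_errs[day]
    print "%s: \t%0.3f sec/square" % (day, mu / squares[day])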
I hope you enjoyed this post. Feel free to shoot me an e-mail with comments!