## Implementing a Support Vector Machine (SVM) E-Mail Spam Classification Filter¶

If my middle school english teacher taught me one thing, it was that I should start every essay with a dictionary definition of a word and then struggle to connect it with the topic of the essay. So, in the words of our benevolent overlord, Wikipedia,

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier.

In this blog post, I briefly explore how SVMs work, and then show how some out-of-the-box sklearn SVM software can be used to build a spam classification filter for your e-mail! This project was inspired heavily by Andrew Ng's Machine Learning course on Coursera which I am currently completing in python at a reduced pace alongside my Ph.D. thesis work.

### Brief Overview of SVMs¶

Support Vector Machines (SVMs) are supervised learning models. They are similar to logistic regression (which I show implementations of in this previous blog post), but with a modified cost function that is computationally advantageous.

To demonstrate, below I show the shape of one term of the cost function for a logistic regression in blue, while I show the shape of the analogous term of the cost function for an SVM in red. I leave the x- and y- axes unlabeled to keep the plot abstract; the functional shapes are sufficient to demonstrate my point!

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
```

```
xvals = np.linspace(-3,3,100)
lregtermvals = -np.log(1/(1+np.exp(-xvals)))
svmtermvals = [0.625 - (2.5/4.0)*x if x < 1. else 0 for x in xvals]
plt.figure(figsize=(8,5))
plot1 = plt.plot(xvals,
lregtermvals,
label='Logistic Regression Cost Term',
color='blue',linewidth=4)
plot2 = plt.plot(xvals,
svmtermvals,
label='Analog SVM Cost Term',
color='red',linewidth=4)
plt.grid(True)
plt.title('Hypothetical Cost Term Diagram',fontsize=16)
myleg = plt.legend()
```

As you can see, logistic regression cost functions are continuous functions involving logs of exponentials while SVMs use simple piecewise linear cost terms. This provides the computational advantage to SVMs that I mentioned earlier.

I'll also note that this piecewise cost structure is what makes an SVM a *large margin classifier*, something which I won't describe in detail but I'll just note that this makes the decision boundary computed by an SVM more robust. I highly suggest checking out the aforementioned Coursera machine learning course in which Andrew Ng steps through the math behind large margin classifiers (though he marks it as "optional" because he uses vector scalar products in his derivation... cakewalk for a physicist!).

Another similarity SVMs have to logistic regressions is that they can be regularized to help find a balance between underfitting (high bias) and overfitting (high variance) the data. While in logistic regressions the regularization parameter is often called $\lambda$, in SVMs the letter $C$ is often used... note that $C$ is analogous to the *inverse* of $\lambda$. The best way to choose the value of $C$ that finds the middleground between underfitting and overfitting the data is to split up your data sample in to three groups: training, cross validation, and test. More on this below when I start actually implementing an SVM!

#### SVMs: Kernels¶

Lastly, SVMs are so computationally fast that they are able to use *kernels* without running egregiously slowly. In simple terms, a kernel function (as used here) is one that takes as input two data points and returns some measure of similarity between those two data points. Not any function can be considered a kernel though; it has to satisfy Mercer's Theorem, but I won't be going into that here. Two common kernels used are *gaussian kernels* and *linear kernels*.

In the case of the gaussian kernel, the similarity between two points is computed with a gaussian function based on the euclidean distance between those two points (with dimensionality equal to the number of features used to describe those points). One can choose the gaussian width, $\sigma$, to tune how spaced apart points can be to still have them be considered "similar". In the figure below, I draw two sample data sets and have used a gaussian kernel in an SVM to find a decision boundary between the two. The actual code I used to create this plot is shown in my Coursera class homework solutions.

```
from IPython.display import Image
Image("data/SVM_decisionboundary_fig.png")
```

In the case of the linear kernel, no kernel is used at all! As readers of this blog post, it's your lucky day; I'm using a linear kernel for this spam classifier!

### SVMs for Spam Classification¶

Alright enough text already, let's get to the code. In this post I will be using samples of spam and non-spam e-mails supplied by the SpamAssassin Public Corpus which I will subdivide into three groups as I mentioned earlier: a training set, a cross validation set, and a test set. An outline of what we're about to do is:

- Read in the e-mails.
- Preprocess the e-mails to turn their text into a format an SVM can understand.
- Subdivide the e-mails into the three sets (training, CV, test).
- Initialize an SVM, try different values of the $C$ parameter, training the SVM on the training set each time.
- Run each trained SVM on the cross validation set and determine which value of $C$ is most appropriate.
- Run the best trained SVM on the test set and report how well it worked.

#### What Does the Data Look Like?¶

```
import os
data_dir = 'data/svm_spam_data/'
#Let's read in the non-spam e-mails into an array:
nonspam_emails = []
for nonspam_email in os.listdir(data_dir+'easy_ham/'):
nonspam_emails.append(open( data_dir+'easy_ham/' + nonspam_email, 'r' ).read())
#Let's now read in the spam e-mails into an array:
spam_emails = []
for spam_email in os.listdir(data_dir+'spam/'):
spam_emails.append(open( data_dir+'spam/' + spam_email, 'r' ).read())
#Let's have a look at the first non-spam e-mail just to see...
print nonspam_emails[1]
```

#### Processing the E-Mails Into a Format an SVM Can Use¶

Wow the e-mail above is pretty gross. We're going to have to do some serious work to turn it into some sort of feature vector that an SVM can read in.

To start, let's get rid of all of that header stuff. I'm not sure the best way to do it, but looking at a bunch of these sample e-mails, all the header stuff generally ends in a double new line, '\n\n'. Let's double check that all e-mails (non-spam, and spam) have a double newline in it.

```
for nonspam_email in nonspam_emails:
if "\n\n" not in nonspam_email:
print "This non-spam e-mail didn't have a double newline!"
print nonspam_email[:500]
for spam_email in spam_emails:
if "\n\n" not in spam_email:
print "This spam e-mail didn't have a double newline!"
print spam_email[:500]
```

Weirdly the first spam_email seems to be a list of *mv* commands perhaps used when generating this data set... let's just throw that one out, and continue.

```
spam_emails = spam_emails[1:]
```

Since all of the e-mails (spam and non-spam) have a double newline, I'm going to be assuming the first occurence of the double newline designates the end of the header garbage. Now I loop through all e-mails and erase the headers.

```
nonspam_emails = [ email[email.find('\n\n')+4:] for email in nonspam_emails ]
spam_emails = [ email[email.find('\n\n')+4:] for email in spam_emails ]
```

Next I'm going to preprocess each e-mail, meaning I'm going to make everything lowercase, strip out html tags, etc. I'll also change any numbers to the string 'number', this way the SVM can easily treat all numbers the same, since I have no reason to believe '12345' is any more likely to show up in a spam e-mail than '12346'. I also do something similar to any string starting with 'http', strings I think are e-mail addresses, and dollar amounts.

```
import re #regular expression for e-mail processing
def preProcess( email ):
"""
Function to do some pre processing (simplification of e-mails).
Comments throughout implementation describe what it does.
Input = raw e-mail
Output = processed (simplified) email
"""
# Make the entire e-mail lower case
email = email.lower()
# Strip html tags (strings that look like <blah> where 'blah' does not
# contain '<' or '>')... replace with a space
email = re.sub('<[^<>]+>', ' ', email);
#Any numbers get replaced with the string 'number'
email = re.sub('[0-9]+', 'number', email)
#Anything starting with http or https:// replaced with 'httpaddr'
email = re.sub('(http|https)://[^\s]*', 'httpaddr', email)
#Strings with "@" in the middle are considered emails --> 'emailaddr'
email = re.sub('[^\s]+@[^\s]+', 'emailaddr', email);
#The '$' sign gets replaced with 'dollar'
email = re.sub('[$]+', 'dollar', email);
return email
```

Next, I take each e-mail and tokenize it (break it up into 'words' which can be defined as strings separated by a space, or separated by some other delimiters that I use). Each token (or word) is then stemmed, which means I attempt to shorten each word to its root. That way, words like *mortgage* and *mortgaging* are both stemmed to *mortgag* and will therefore be treated the same way by the SVM. To do the actual stemming, I use a stemmer module from the Natural Language Toolkit package in python.

```
import nltk, nltk.stem.porter
def email2TokenList( raw_email ):
"""
Function that takes in preprocessed (simplified) email, tokenizes it,
stems each word, and returns an (ordered) list of tokens in the e-mail
"""
stemmer = nltk.stem.porter.PorterStemmer()
email = preProcess( raw_email )
#Split the e-mail into individual words (tokens) (split by the delimiter ' ')
#but also split by delimiters '@', '$', '/', etc etc
#Splitting by many delimiters is easiest with re.split()
tokens = re.split('[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
#Loop over each word (token) and use a stemmer to shorten it,
tokenlist = []
for token in tokens:
#Remove any non alphanumeric characters
token = re.sub('[^a-zA-Z0-9]', '', token);
#Use the Porter stemmer to stem the word
stemmed = stemmer.stem( token )
#Throw out empty tokens
if not len(token): continue
#Store a list of all stemmed words
tokenlist.append(stemmed)
return tokenlist
```

After an e-mail is handed to this function, *email2TokenList*, what comes out is a list of stemmed words. However, the number of unique stemmed words throughout all of the e-mails is very large... perhaps too large for each unique stemmed word to be used as a feature for the SVM to consider when it attempts to classify an e-mail as spam or non-spam. Also, words that occur very rarely in the training set might only occur in a few e-mails, which could lead to overfitting the data set. To handle these two problems, I'll do the following: loop over all the e-mails (spam, and non-spam) and take the 1000 most common words overall as my vocabulary list.

I accomplish this in a somewhat hacky way. I concatenate all emails to form one giant e-mail, tokenize and stem it, and use a function from within the collections module to count the number of occurrences of each word. I'll make a collections.Counter object, which is a high performance python container with super useful functionalities like the *most_common(N)* function, which returns the *N* most common tokens in the collection. I.E. exactly what I want!

```
giant_nonspam_email = [ ' '.join(nonspam_emails[:]) ]
giant_spam_email = [ ' '.join(spam_emails[:]) ]
giant_total_email = giant_nonspam_email[0]+giant_spam_email[0]
```

```
from collections import Counter
word_counts = Counter(email2TokenList(giant_total_email))
```

```
print 'The total number of unique stemmed tokens is: '
print len(word_counts)
print 'The ten most common stemmed tokens are:'
print [ str(x[0]) for x in word_counts.most_common(10) ]
```

What I do here is get the 1000 most common tokens and assign them each a unique number, 1 through 1000. I store this information in a dictionary, where the key is the stemmed token and the value is the unique number.

```
common_words = [ str(x[0]) for x in word_counts.most_common(1000) ]
vocab_dict = dict((item, i) for i, item in enumerate(common_words))
```

Up next is converting each e-mail into a list of these indices, then further converting that into a feature vector of shape (1000, 1). This vector will contain only 1's and 0's... for example, if the e-mail has only three words whose indices in the vocab_dict are 3, 9, 14, then the 3rd, 9th, and 14th element of the vector will be 1's while the other 997 elements in the vector will be 0's.

```
def email2VocabIndices( raw_email, vocab_dict ):
"""
Function that takes in a raw email and returns a list of indices corresponding
to the location in vocab_dict for each stemmed word in the email.
"""
tokenlist = email2TokenList( raw_email )
index_list = [ vocab_dict[token] for token in tokenlist if token in vocab_dict ]
return index_list
def email2FeatureVector( raw_email, vocab_dict ):
"""
Function that takes as input a raw email, and returns a vector of shape
(n,1) where n is the size of the vocab_dict.
The first element in this vector is 1 if the vocab word with index == 1
is in the raw_email, 0 otherwise.
"""
n = len(vocab_dict)
result = np.zeros((n,1))
vocab_indices = email2VocabIndices( raw_email, vocab_dict )
for idx in vocab_indices:
result[idx] = 1
return result
```

Woohoo! Now we can turn each e-mail into a feature vector and we can hand it to an SVM. Let's *finally* get into the actual SVM portion of this blog post!

#### Training the SVM for Spam Classification¶

What I want to end up handing an SVM is a training set with, say, *n* samples in it. I need to create two things:

- A feature matrix,
*Xtrain*with shape (*n*, 1000 ) [because my vocab dictionary is 1000 words long], which is formed by stacking together a feature vector for each e-mail. This matrix will contain both spam and nonspam e-mails. - A classification vector,
*ytrain*with shape (*n*, 1 ) which contains the labels of each corresponding e-mail in the*Xtrain*matrix (1 for spam, 0 for nonspam).

I've chosen the ratio of training set size to cross validation set size to test set size to be roughly 60/20/20. Let's build my *Xtrain* matrix and *ytrain* vector from the first 60% of the nonspam e-mails and the first 60% of the spam e-mails, then build the *Xcv* and *ycv* objects from the next 20%, and lastly the *Xtest* and *ytest* objects from the remaining 20%.

```
#Build the training X matrix and y vector
n_nonspam_train = int(len(nonspam_emails)*0.6)
n_spam_train = int(len(spam_emails) *0.6)
nonspam_train = [email2FeatureVector( x, vocab_dict )
for x in nonspam_emails[:n_nonspam_train]]
spam_train = [email2FeatureVector( x, vocab_dict )
for x in spam_emails [:n_spam_train ]]
Xtrain = np.concatenate(nonspam_train+spam_train,axis=1).T
ytrain = np.concatenate(
(np.zeros((n_nonspam_train,1)),
np.ones((n_spam_train,1))
), axis=0)
#Build the CV X matrix and y vector
n_nonspam_cv = int(len(nonspam_emails)*0.2)
n_spam_cv = int(len(spam_emails) *0.2)
nonspam_cv = [email2FeatureVector( x, vocab_dict )
for x in nonspam_emails[n_nonspam_train:n_nonspam_train+n_nonspam_cv]]
spam_cv = [email2FeatureVector( x, vocab_dict )
for x in spam_emails [n_spam_train:n_spam_train+n_spam_cv ]]
Xcv = np.concatenate(nonspam_cv+spam_cv,axis=1).T
ycv = np.concatenate(
(np.zeros((n_nonspam_cv,1)),
np.ones((n_spam_cv,1))
), axis=0)
#Build the test X matrix and y vector
n_nonspam_test = len(nonspam_emails) - n_nonspam_train - n_nonspam_cv
n_spam_test = len(spam_emails) - n_spam_train - n_spam_cv
nonspam_test = [email2FeatureVector( x, vocab_dict )
for x in nonspam_emails[-n_nonspam_test:]]
spam_test = [email2FeatureVector( x, vocab_dict )
for x in spam_emails [-n_spam_test: ]]
Xtest = np.concatenate(nonspam_test+spam_test,axis=1).T
ytest = np.concatenate(
(np.zeros((n_nonspam_test,1)),
np.ones((n_spam_test,1))
), axis=0)
```

Now we have our training set, our cross-validation set, and our test set finally in the format we want.

As a reminder, the purpose of the cross-validation set was to use it to choose our parameter, $C$ without baising ourselves when we report our final performance on the test set.

Let's create several instances of sklearn linear kernel SVM's, each with a different $C$ parameter. We'll train each on the training set and evaluate the performance of each on the cross-validation set. Finally, we'll choose the instance with the best performance and run it on the test set to see how well our model works!

```
from sklearn import svm #SVM software
# Run the SVM training (with C = 0.1) using SVM software.
myCs = [ 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0 ]
myErrors = []
myErrors_train = []
for myC in myCs:
# Make an instance of an SVM with C=myC and 'linear' kernel
linear_svm = svm.SVC(C=myC, kernel='linear')
# Now we fit the SVM to our Xtrain matrix, given the labels ytrain
linear_svm.fit( Xtrain, ytrain.flatten() )
# Determine how well this SVM works by computing the
# classification error on the cross-validation set
cv_predictions = linear_svm.predict(Xcv).reshape((ycv.shape[0],1))
cv_error = 100. * float(sum(cv_predictions != ycv))/ycv.shape[0]
myErrors.append( cv_error )
# While we're at it, do the same for the training set error
train_predictions = linear_svm.predict(Xtrain).reshape((ytrain.shape[0],1))
train_error = 100. * float(sum(train_predictions != ytrain))/ytrain.shape[0]
myErrors_train.append( train_error )
```

```
plt.figure(figsize=(8,5))
plt.plot(myCs,myErrors,'ro--',label='CV Set Error')
plt.plot(myCs,myErrors_train,'bo--',label='Training Set Error')
plt.grid(True,'both')
plt.xlabel('$C$ Value',fontsize=16)
plt.ylabel('Classification Error [%]',fontsize=14)
plt.title('Choosing a $C$ Value',fontsize=18)
plt.xscale('log')
myleg = plt.legend()
```

Here we can see that the performance on the training set improves as $C$ increases. With a small $C$ value, we are underfitting the training set and therefore the performance on the CV set is bad. With a large $C$ value, we are likely overfitting the training set as seen by the above blue curve. A good intermediate $C$ value can be extracted from the red curve on the above plot. Any value between 0.001 and 1 are acceptable, but above that point the CV set error begins increasing again due to overfitting. **Let's go with a $C$ value of 0.1.**

Huzzah! Let's see how well we perform on the test set!

```
best_svm = svm.SVC(C=0.1, kernel='linear')
best_svm.fit( Xtrain, ytrain.flatten() )
test_predictions = best_svm.predict(Xtest).reshape((ytest.shape[0],1))
test_acc = 100. * float(sum(test_predictions == ytest))/ytest.shape[0]
print 'Test set accuracy = %0.2f%%' % test_acc
```

### Awesome¶

We've successfully accomplished our goal of training a support vector machine to classify e-mails as spam, or not. Let's recap what we accomplished:

- We found a source of labeled spam, and non-spam e-mails online.
- We processed each e-mail by stripping miscellaneous headers, tokenizing, and stemming the resulting tokens.
- We sifted through all of the e-mails to get a "vocabulary list" of the most common tokens.
- We divided the e-mails into three sets: train, cross-validation, and test.
- We converted the e-mails into feature vectors, and built feature matrices and label vectors in the form that an SVM can understand.
- We constructed SVMs with various parameters and chose the model that worked best on the cross-validation set.

**Finally, we tested our best model on the test set and found it has 98.69% accuracy to identify spam e-mails!**¶

Thanks for coming along for the ride. I hope you enjoyed this blog post!

## Comments !