Modeling EPL Football
The problem(s)


There isn't much readily usable data out there for modeling EPL football. In fact, the only site I found with CSV-formatted files was:

That meant building custom text parsers to pull data out of HTML. There are many things I would rather be doing than building text parsers, squirting lemon juice into a paper cut and smacking my pinky toe on the corner of a table among them. But I digress. Back to the data: I used the football-data site as my main source, and also borrowed the fixtures list from one of those sports news sites.
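For a flavor of what such a parser involves, here is a minimal sketch using only Python's standard library. It is not the parser from this project: the class name `TableRows`, the helper `parse_rows`, and the assumption that the data sits in plain `<tr>`/`<td>` table markup are all illustrative, and real pages tend to need more special cases.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Feeding it a small table such as `<table><tr><td>Arsenal</td><td>Chelsea</td></tr></table>` gives back one row with the two team names, which can then be written out as CSV.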

After many wasted hours, I finally had my data. Now what to do with it? First, to the cleaners! Why is it that no one can spell team names the same way? Is it Manchester United, ManU, Manchester, or ManUTD? And don't even get me started on Tottenham. After countless more hours of manual cleansing, I finally had data that was ready for analysis.
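Part of that cleanup can be scripted with an alias table. The sketch below is an illustration rather than the cleaning code used here: `TEAM_ALIASES` and `canonical_team` are hypothetical names, and the table only covers the spellings mentioned above. A bare "Manchester" is deliberately left unmapped, since it could mean either Manchester club and is better resolved by hand.

```python
def _key(name):
    """Lowercase a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Hypothetical alias table, keyed by the simplified spelling.
# A real one would list every variant found in the sources.
TEAM_ALIASES = {
    "manchesterunited": "Manchester United",
    "manu": "Manchester United",
    "manutd": "Manchester United",
}


def canonical_team(name):
    """Map a raw team name to one canonical spelling.

    Unknown names raise KeyError so they get reviewed and added to the
    table instead of slipping through as a new "team".
    """
    key = _key(name)
    if key not in TEAM_ALIASES:
        raise KeyError(f"unmapped team name: {name!r}")
    return TEAM_ALIASES[key]
```

Failing loudly on unknown spellings keeps the manual step, but limits it to names the table has never seen.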


Unfortunately, all of that effort did not leave me with a very innovative feature set: wins and losses, goals scored for and against, shots on target, corners, and penalties for home and away teams. While the volume and complexity of the data set left much to be desired, it is often good advice to just build a quick and dirty model and see what shakes out before going back to step one of text parser hell to find more features.

In machine learning, there are two primary approaches:

The first:
The traditional approach, known as supervised learning. Here the algorithm learns from labeled examples, and a computer scientist will spend many hours (or years, depending on the problem) defining the features to feed it. While a good algorithm helps, better features and more data are typically the weapons of choice when trying to build a supervised model.

The second:
Unsupervised techniques, which help an algorithm learn its own features from the data. They can be powerful and effective ways of finding interesting and complex structures in the underlying data, and can yield better results than a purely supervised approach.

While some may argue that the future will be owned by unsupervised techniques, I often find that supervised models are a good place to start as a quick and dirty approach. So I built out a list of features: rolling statistics on points scored, games won, and shots on target for both the home and away teams. Then I reached for good old logistic regression to build the first model.
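To make the idea concrete, here is a rough standard-library sketch of rolling features feeding a logistic regression. It is not the model from this project: the field names (`home`, `away`, `home_goals`, `away_goals`, `home_sot`, `away_sot`), the five-match window, the choice of a home-win label, and the plain gradient-descent fit are all assumptions made for illustration. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math
from collections import defaultdict, deque


def _mean(history, idx):
    """Average one field over a team's recent matches."""
    return sum(item[idx] for item in history) / len(history)


def rolling_features(matches, window=5):
    """Build (features, label) rows from matches in chronological order.

    Features are each side's average points and shots on target over its
    previous `window` matches; the label is 1 for a home win, else 0.
    Only matches played before the current one are used.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:
        h, a = history[m["home"]], history[m["away"]]
        if h and a:  # skip matches where a side has no history yet
            feats = [_mean(h, 0), _mean(h, 1), _mean(a, 0), _mean(a, 1)]
            rows.append((feats, int(m["home_goals"] > m["away_goals"])))
        # Update histories only after the features were taken.
        if m["home_goals"] > m["away_goals"]:
            hp, ap = 3, 0
        elif m["home_goals"] < m["away_goals"]:
            hp, ap = 0, 3
        else:
            hp, ap = 1, 1
        h.append((hp, m["home_sot"]))
        a.append((ap, m["away_sot"]))
    return rows


def predict_proba(w, x):
    """Probability of label 1; the last weight is the bias term."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w
```

Typical use would be to split the rows from `rolling_features` into feature lists and labels, pass them to `fit_logistic`, and then call `predict_proba` on the features for an upcoming fixture. Taking the features before updating each team's history is what keeps a match's own result out of its inputs.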

The GitHub repository for the text parsers and model predictions is located: