In this article I present a machine learning model that predicts the winner of a football match with 92% accuracy.
Football is considered one of the most beautiful games in the world because of its uncertainty. The 2015/16 Premier League season is a great example: the odds on eventual champions Leicester City F.C. were 5000/1. But increasingly it seems that even unlikely events are predictable; it is just a matter of data. Our capacity for processing and analysing information is limited, so we cannot spot these outcomes ourselves. This is where machine learning comes into play: it gives us the power to find patterns in huge amounts of data and minimise the uncertainty.
There are plenty of Medium articles and published papers about football prediction, with varying levels of success. Most papers use single-time-step machine learning algorithms such as Logistic Regression, Neural Networks, or Random Forests, with state-of-the-art accuracy around 60%. But this is not how punters place their bets.
Apart from a team's statistical indicators, they also take into account its recent games. A promising approach was shown in A Sure Bet from Stanford University, which used a sequential model (LSTM), but a poor choice of sequence length led to weak results (47% accuracy).
Dataset and Features
The final dataset combines the features listed below from the 2019/20 and 2020/21 Premier League seasons. The sequential data has the shape [samples, time steps, features]. Each sample contains the home team's features, the away team's features, and the odds from the last five matches.
To make the form more accurate, I created a rating column for each team, with the initial value taken from the Premier League Fantasy dataset. The rating is updated after each game using the Elo rating formula. The files with the data preparation and Elo rating calculations are on my Github. The prefix 'H' denotes the home team's parameters and 'A' the away team's.
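As a minimal sketch, the Elo update works like this (the K-factor of 32 and the 400-point scale are common defaults and an assumption here; the exact constants used in the project are on my Github):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of team A against team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return both teams' updated ratings after a game.

    score_a is 1.0 for a team A win, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

For two equally rated teams a win moves 16 points from the loser to the winner, while a draw leaves both ratings unchanged.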
Dataset before normalization: https://medium.com/media/e6e73a3bb00d440314c252eb67e2489a
The data after normalisation, used to train the model, is available on my Github saved as a .npy file.
- Attacking style (AS) — taken from the FIFA 21 dataset
- Defensive style (DS) — taken from the FIFA 21 dataset
- Rating home (SOH) — initial rating taken from the Premier League Fantasy dataset for the current season and updated after each game using the Elo rating equation
- Rating away (SOA) — same as above but for away games
- xG — expected goals (attacking statistic)
- xGa — the team's ability to prevent scoring chances (goalkeeping statistic)
- ppda — passes allowed per defensive action (defensive statistic)
- Mtime — manager's time in charge, calculated from the day he started to the date of the game
- Odds for a home win, a draw, and an away win
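Putting the pieces above together, a hypothetical sketch of how per-match feature rows become the [samples, time steps, features] tensor (window of the last five matches, min-max normalisation; the array shapes here are illustrative, not the project's exact ones):

```python
import numpy as np

def min_max_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each feature column to the [0, 1] range."""
    mins, maxs = x.min(axis=0), x.max(axis=0)
    return (x - mins) / np.where(maxs > mins, maxs - mins, 1.0)

def make_sequences(features: np.ndarray, window: int = 5) -> np.ndarray:
    """Stack sliding windows of `window` consecutive matches."""
    return np.stack([features[i:i + window]
                     for i in range(len(features) - window + 1)])

matches = np.random.rand(38, 11)   # one season: 38 matches, 11 features (illustrative)
X = make_sequences(min_max_normalise(matches))
print(X.shape)                     # (34, 5, 11) -> [samples, time steps, features]
```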
A GRU (Gated Recurrent Unit) is a neural network designed to process sequential data. It uses gates to filter out irrelevant information and keep what is useful. The update gate determines how much previous knowledge should be passed to the next state, while the reset gate decides which past knowledge to forget. If you want to know more about how recurrent neural networks work in general, check out my article.
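The gating described above can be sketched as a single GRU step in plain NumPy (biases omitted for brevity; the weight shapes and initialisation here are illustrative assumptions, not the trained model's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step (biases omitted)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate: how much past to carry forward
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate: how much past to forget
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state from the current input
    return (1 - z) * h_prev + z * h_tilde          # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hid = 11, 4                                # 11 features in, 4 hidden units (illustrative)
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in "zrh"}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in "zrh"}
h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):           # a five-match input sequence
    h = gru_step(x, h, W["z"], U["z"], W["r"], U["r"], W["h"], U["h"])
print(h.shape)                                     # (4,)
```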
Even though the most popular variant of recurrent neural networks is the LSTM, GRUs work better with smaller amounts of training data and, being less complex, converge faster. In our case, during testing the GRU vastly outperformed the LSTM, by around 20 percentage points of accuracy.
To find the optimal architecture I tested several and compared them on the accuracy achieved. I experimented with Dense layers and descending unit counts after the GRU layers, but without any significant improvement. The table below lists the results for each structure.
The architectures built from GRU layers alone worked best, so I eventually narrowed the candidates down to 3- and 4-layer GRU stacks with 256 or 512 units. After a grid search, the training parameters listed below gave a stunning 92% accuracy on the test data.
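As a sketch, the 3-layer variant could be assembled in Keras like this (the optimizer, loss, and 3-way softmax output for home win / draw / away win are assumptions based on the description above, not confirmed hyperparameters):

```python
import tensorflow as tf

def build_model(time_steps: int = 5, n_features: int = 11, units: int = 256):
    """Sketch of the best-performing stack: three GRU layers and a softmax head."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(time_steps, n_features)),
        tf.keras.layers.GRU(units, return_sequences=True),  # pass full sequence down
        tf.keras.layers.GRU(units, return_sequences=True),
        tf.keras.layers.GRU(units),                         # last layer emits one vector
        tf.keras.layers.Dense(3, activation="softmax"),     # home win / draw / away win
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The intermediate GRU layers need `return_sequences=True` so that each layer receives the whole five-step sequence rather than only the final hidden state.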
Although the model starts to overfit slightly from around epoch 50, it generalizes well on new data. Training took around 2 minutes using a Google Colab GPU accelerator, accessed from a MacBook Pro with an M1 chip and 8 GB of RAM.
This is a simple model which uses only a fraction of the available data and of the features that bookies find important. The results are quite satisfying, but I think that with more data it is possible to push them further. The Fbref website provides much more detailed information, but it requires a lot of data processing, so if you are interested in taking this project further, hit me up.
These are features that punters also take into consideration and that should improve the accuracy of predictions:
- Live line-ups
- Live form of the players
- Game schedule — add Champions League and other Tournaments
- Weather conditions
- Relationship between teams
- Sentiment score
In this article I showed that sequential models perform much better than other machine learning techniques on football prediction. Future work could apply this approach to different leagues, together with the features listed above, to reduce the uncertainty even further.
Feel free to contact me for a collaboration or with any questions about the project.