Building An NHL Game Prediction Model, Part 1

Gary Schwaeber
Published in Analytics Vidhya
Jun 18, 2021 · 4 min read


Credit: Chris Liverani, Unsplash

For my data science capstone project at Flatiron School, I built an NHL game prediction model. Over my next couple of blog posts, I will walk through the steps I took and the lessons I learned while building it. Now without further ado….

Motivation

I have always been a huge hockey fan, and more recently also became a big data nerd. But even before jumping into the world of data and pursuing it as a career path, I was always interested in reading the journalists who delve deep into the analytics of the game. Over the past decade, I have seen a growing acceptance of advanced statistics as a more meaningful way to evaluate the game. So when it came to choosing my capstone project, working with NHL data felt natural. I was inspired by Dom Luszczyszyn, a writer for The Athletic who created one of, if not the, best publicly published game prediction models out there, and decided to embark on building my own.

Business Problem

Trying to solve a problem with data is at the heart of any data science project. The problem an NHL game prediction model solves comes down to a simple question: can I bet on hockey games more intelligently than with gut intuition?

Unlike in football or basketball, where betting against the spread is the most popular type of betting, the moneyline is king in the NHL due to its lower-scoring games. With the moneyline, you are simply betting on which team will win, regardless of the margin of victory. There is a different cost to betting on the underdog versus the favorite, and that cost is expressed by the moneyline. When betting the moneyline, the way to gain an edge is to know a truer probability of the game outcome than the implied probability from the moneyline. Over the course of a season, if your internally derived game probabilities are consistently superior to the bookmakers', you can be profitable.

Calculating Implied Probability

Let’s take a look at this example:

Game line courtesy of DraftKings

Tonight the Vegas Golden Knights are playing the Montreal Canadiens. We can see from the moneyline column that Vegas is favored because their moneyline figure is negative. The -175 means you would have to risk $175 to win $100 profit. For the underdog Canadiens, the +148 figure indicates you need to risk only $100 to profit $148.

To calculate the implied probability of Vegas (the favorite) winning, the formula is:

175 / (100+175) = 63.6%.

To calculate the implied probability of Montreal (the underdogs) winning, the formula is:

100 / (148+100) = 40.3%.

You may have noticed that those probabilities add up to more than 100%. That's because the books take a commission (a.k.a. the vig) on both sides of the bet. They require you to risk more money (or receive a smaller payout, depending on how you look at it) to make your 'investment'.
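The two formulas above can be combined into a single helper. This is a minimal sketch (the function name is my own, not from the project's code) that converts an American moneyline to an implied win probability and shows the vig as the amount by which the two sides exceed 100%:

```python
def implied_probability(moneyline):
    """Convert American moneyline odds to an implied win probability."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win $100, e.g. -175 -> 175 / (175 + 100)
        return -moneyline / (-moneyline + 100)
    # Underdog: risk $100 to win `moneyline`, e.g. +148 -> 100 / (148 + 100)
    return 100 / (moneyline + 100)

vegas = implied_probability(-175)    # ~0.636, matching the 63.6% above
montreal = implied_probability(148)  # ~0.403, matching the 40.3% above

# The two sides sum to more than 1 -- the excess is the bookmaker's vig.
print(f"Vegas: {vegas:.1%}, Montreal: {montreal:.1%}, total: {vegas + montreal:.1%}")
```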

Applying The Model Output To A Betting Strategy

With all the above in mind, I needed to create a model that would output the probability of each team in a matchup winning. If the model outputs a win probability for a team that is higher than the implied probability from the moneyline, you would bet on that team. Of course, you can always bring some qualitative analysis into your decision making, but that is the general gist of how you would use the model.
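That decision rule is simple enough to sketch directly. The function below is illustrative (not from the project's repository): it compares the model's probability against the moneyline's implied probability and flags a bet only when the model sees an edge.

```python
def should_bet(model_prob, moneyline):
    """Bet only when the model's win probability beats the market's implied probability."""
    if moneyline < 0:
        implied = -moneyline / (-moneyline + 100)  # favorite
    else:
        implied = 100 / (moneyline + 100)          # underdog
    return model_prob > implied

# If the model gives Montreal a 45% chance at +148 (implied ~40.3%), bet.
print(should_bet(0.45, 148))   # True
# A 60% model probability on Vegas is below the 63.6% implied by -175: no edge.
print(should_bet(0.60, -175))  # False
```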

When I got to the modeling stage of my project, I needed to use machine learning algorithms that could output probabilities. For this reason I tried logistic regression, AdaBoost, gradient boosting, and a neural network to see which one would score best. The scoring metric I used to judge the models was log-loss, which measures how close the predicted probability is to the corresponding actual/true value (0 or 1 in the case of binary classification). The more the predicted probability diverges from the actual value, the higher the log-loss (source).
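To make the metric concrete, here is a small stdlib-only sketch of binary log-loss (in practice you would likely use `sklearn.metrics.log_loss`, which implements the same formula):

```python
import math

def log_loss(y_true, y_pred):
    """Mean negative log-likelihood for binary outcomes."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions score low; confident, wrong ones score high.
print(log_loss([1, 0], [0.9, 0.1]))  # ~0.105
print(log_loss([1, 0], [0.2, 0.8]))  # ~1.609
```

A model that assigns well-calibrated probabilities, even to games it "loses", will beat a model that is overconfident, which is exactly the behavior you want for a betting application.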

Conclusion

In this post I went over my motivation for building an NHL game prediction model and its intended use. In later posts I will discuss how I got the data, cleaned it, engineered and selected features, and went through the modeling process, and I will share some insights from the data and lessons learned along the way. You can check out the GitHub repository for this project here. As of this writing, I am continuously fine-tuning the model and making the code more dynamic so that it's ready to be rolled out for the 2021–2022 season.
