Data Acquisition


Overview

In part one of this series on creating an NHL game prediction model, I discussed my personal motivation and the application of the model to a betting strategy. In this post, I will cover how I acquired the data, cleaned it, and organized it into a usable form for modeling.

I wanted to start by making a top-down model, meaning I would use team-based stats as my features rather than building up a model from the stats of the individual players on each team. The exception to this is the goalie. Goaltenders are considered the most important players on the ice in hockey; the play and quality of a goaltender can almost single-handedly win a game. For many teams, the starter and the backup differ significantly in quality, so who is starting will greatly impact the outcome.

To train the model, I needed both the historical results and statistics that measure various elements of team and goaltending strength at each point in time. To get these point-in-time measurements, I needed game logs of statistics, from which I could calculate statistics on a rolling-game basis at every point in time.

Data Acquisition

Natural Stat Trick (NST) is one of the most popular websites hosting hockey stats. I had prior experience scraping data from NST, which I wrote about in a previous blog post. Essentially, I was able to use the pandas read_html function to scrape NST. I had to do numerous scrapes to get team game logs with 5v5 data, powerplay data, and penalty kill data, as well as to scrape game logs for each individual goalie.

Team

The function for scraping team game logs is below. The key stats coming from this function are Corsi, Fenwick, Goals, Expected Goals, Shots, and High Danger Chances per game.
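
Here is a minimal sketch of that function; the nst_to_sched dictionary is truncated for space, and the exact NST query parameters and table column names are illustrative rather than exact, so verify them against the live page:

```python
import pandas as pd

# Maps the team names NST uses to the abbreviations the NHL API schedule uses.
# Truncated here for space; the real dictionary covers every team.
nst_to_sched = {
    'Boston Bruins': 'BOS',
    'Toronto Maple Leafs': 'TOR',
    # ... remaining teams
}

def scrape_team_logs(from_season, thru_season, sit):
    """Scrape NST team game logs for a season range and a situation
    (e.g. sit='5v5', 'pp', or 'pk')."""
    # The query parameters are illustrative; NST's games page takes season
    # and situation arguments in the URL, which makes the function reusable.
    url = (
        'https://www.naturalstattrick.com/games.php'
        f'?fromseason={from_season}&thruseason={thru_season}'
        f'&stype=2&sit={sit}&loc=B&team=All&rate=n'
    )
    # read_html returns a list of DataFrames, one per HTML table on the page
    df = pd.read_html(url)[0]
    df['Team'] = df['Team'].map(nst_to_sched)
    # Key = team abbreviation + date, used later to join features onto games
    df['key'] = df['Team'] + df['Date']
    return df
```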

The nst_to_sched dictionary allows me to convert from the team name that comes from NST to the team abbreviation that is used when I later scrape the NHL API to get the official schedule and results.

The URL is dynamic, so I can use this same function to scrape different seasons and different situations (5v5, 5v5 score and venue adjusted, powerplay, and penalty kill).

I created a key by concatenating the team abbreviation with the date. The keys are used to join the point-in-time statistics from this data frame to the correct game in the modeling data frame.

Goalie

I scraped game logs for individual goalies. Since the beginning of the 2016 season, approximately 150 different goalies have appeared in a game. Each goalie required one page request, and only two seasons' worth of games could be scraped at a time. Because I was making such a large number of requests, I set a timer between page requests so that the server could handle the load and my IP would not get banned. I got the goalie IDs by scraping the NHL API; the goalie_ids dictionary contains the goalie name as the key and the goalie's ID from the NHL API as the value.
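
A sketch of that loop, assuming NST's player report page accepts the NHL player ID directly; the URL parameters and the sleep duration are illustrative:

```python
import time
import pandas as pd

def scrape_goalie_logs(goalie_ids, from_season, thru_season):
    """Scrape NST game logs for each goalie, two seasons per request.
    goalie_ids maps goalie name -> NHL API player id."""
    logs = []
    for name, player_id in goalie_ids.items():
        # URL structure is illustrative; NST's player report page takes a
        # player id plus a (maximum two-season) window per request.
        url = (
            'https://www.naturalstattrick.com/playerreport.php'
            f'?fromseason={from_season}&thruseason={thru_season}'
            f'&playerid={player_id}&stype=2&sit=all'
        )
        df = pd.read_html(url)[0]
        df['Goalie'] = name
        logs.append(df)
        time.sleep(10)  # pause between requests to avoid an IP ban
    return pd.concat(logs, ignore_index=True)
```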

Elo Ratings

I wanted to include Elo rating as one of the features for modeling. Elo is a general measure of team strength based on game results, score, and quality of opponent. Elo ratings are more commonly associated with board games such as chess but can be adapted for the NHL. Fortunately, after many Google searches, I found someone who was calculating and hosting historical Elo ratings that I could import. Thank you to Neil Paine for providing the data for that feature.
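
I did not calculate these ratings myself, but for intuition, a bare-bones Elo update looks roughly like this; the K-factor, home-ice bonus, and scaling constant are illustrative, not the values behind the imported ratings:

```python
def elo_update(home_elo, away_elo, home_won, k=6.0, home_adv=50.0):
    """One generic Elo update; all constants are illustrative."""
    # Expected probability the home team wins, given the rating gap
    expected_home = 1.0 / (1.0 + 10 ** ((away_elo - (home_elo + home_adv)) / 400.0))
    # Winner gains rating, loser loses the same amount
    delta = k * (float(home_won) - expected_home)
    return home_elo + delta, away_elo - delta
```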

Official Game Results

In looking for official game results from the NHL, I initially found it difficult to work with the NHL API, as I could find no official documentation. I was able to find the Python library hockey_scraper, which allowed me to scrape the schedule for the seasons I wanted to model on. In this code I also create the target variable, whether the home team won, and the keys I use to merge the features from the other data frames onto this one for modeling and analysis later.
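
A sketch of that step; the exact hockey_scraper call and the schedule's column names are assumptions here, so check them against the library's documentation:

```python
import pandas as pd
import hockey_scraper

# Assumption: hockey_scraper exposes a schedule scraper along these lines;
# verify the exact function name and return format in the library docs.
sched_df = pd.DataFrame(hockey_scraper.scrape_schedule('2016-10-01', '2021-07-31'))

# Target variable: 1 if the home team won, 0 otherwise
sched_df['home_win'] = (sched_df['home_score'] > sched_df['away_score']).astype(int)

# Keys for merging point-in-time features onto each game
sched_df['home_key'] = sched_df['home_team'] + sched_df['date']
sched_df['away_key'] = sched_df['away_team'] + sched_df['date']
```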

Feature Engineering

After acquiring the team, goalie, Elo, and schedule data, I now had three data frames holding my features and one holding the schedule, where I would combine all my features for modeling.

In the team and goalie data frames, one row represents one game where the columns are different stats occurring in that game. I needed a way to efficiently calculate sums and averages of various statistics for the prior X number of games for every team on a rolling basis.

With some help from Stack Overflow, I found the basis for code that accomplishes what I wanted. The code sample below sums FF (Fenwick For, a.k.a. unblocked shot attempts) for a team over the prior 40 games.
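
A sketch of that one-liner; the team_df frame and the FF_sum_40 column name are illustrative:

```python
# Rolling 40-game sum of Fenwick For per team, excluding the current game
team_df['FF_sum_40'] = team_df.groupby('Team')['FF'].transform(
    lambda x: x.rolling(40).sum().shift()
)
```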

Going through the code from left to right:

Group by team

Then the transform applies a function while maintaining the same number of rows as its input

The lambda function sums on a 40-game rolling basis

The .shift at the end excludes the current row from the calculation. This is important because I do not want the stats from a game influencing the model's ability to predict that game. I only want to know what the team's stats were prior to playing that game.

Similar code is applied to other team and goalie stats in order to generate the rolling statistics needed for modeling.
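
For example, something along these lines, with an illustrative subset of stat columns:

```python
stats = ['CF', 'FF', 'GF', 'xGF', 'SF', 'HDCF']  # illustrative subset
for stat in stats:
    team_df[f'{stat}_sum_40'] = team_df.groupby('Team')[stat].transform(
        lambda x: x.rolling(40).sum().shift()
    )
```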

Combining It All Together

Now that I have the data for my features, I need to combine the various data frames into one to be used for modeling. I use the pandas merge function and the keys I created by concatenating the team abbreviation with the date to join the tables and bring in the features for the home and away teams.
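
A sketch of those joins; the frame and column names are illustrative:

```python
# Join home-team features, then away-team features, onto the schedule.
# Overlapping feature columns get distinguished by the suffixes.
model_df = sched_df.merge(
    team_feats, left_on='home_key', right_on='key', how='left'
).merge(
    team_feats, left_on='away_key', right_on='key',
    how='left', suffixes=('_home', '_away')
)
# Goalie and Elo features are merged onto the same keys in the same way.
```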

Conclusion

In this post, I walked through the various ways I scraped the data, formatted it, calculated the features, and then combined everything into one data frame for modeling. Getting this code working was one of the more time-consuming parts of the entire project. In the next post, I will take a deeper dive into the features I chose to use, as well as how many rolling games ended up being most predictive. See the project GitHub for all the code and the results of the modeling. Thank you for reading and see you next time.
