A Theoretical Win Probability Model For Goalie Evaluation

Introduction

In the public sphere, many live win probability models already exist for NHL games. Brian Burke, now doing NFL analytics for ESPN, built a theoretical model that treats NHL scoring as a Poisson process. Stephen Pettigrew built a model for the Sloan Analytics Conference that depends on the score differential and penalty situation, as well as home-ice advantage. Christophe Bernier expanded that model to incorporate opening pre-game betting odds as a measure of team strength and to estimate win probability during shootouts. MoneyPuck.com has a live win probability model that includes both score differential and penalty information, in addition to their own pre-game prediction model that accounts for several team-level statistics, including all-situations expected goals, team save percentage, and team shooting percentage, among others. Evolving Hockey has a section devoted to charting NHL games in real time that includes a live win probability model, although I don’t believe they’ve written anything about the model publicly. Micah Blake McCurdy has an expected points model that functions similarly to the idea of an in-game win probability model, and he has a prediction model (called Magnus) that to my knowledge hasn’t been applied to live predictions, but based on the model description, it could likely be adapted to produce predictions in real time.

If you’re looking for the best win probability model strictly in terms of predicting the result of an NHL game in real time, your best bet is probably one of these models. However, simply predicting the outcome of a hockey game is not the only reason why a win probability estimate is of interest. If you’re trying to analyze a coaching decision, you may want to estimate the change in win probability as a result of making a certain decision (such as in football, where win probabilities are used to analyze fourth-down decisions). While some decisions could be analyzed using the models above – such as pulling the goalie for an extra attacker as an approximate power-play situation – other decisions would require other inputs to these win probability models. For example, if you want to estimate the impact of changing up your forward lines, you can’t really use any of the above models to do this, as they don’t incorporate information of sufficient granularity to answer questions about line combinations within a single game. This isn’t to say these models are wrong or not useful because of this – it’s just that they can’t be used to answer certain questions we may have.

One such question that can’t be answered using these models is the one proposed in the last part of this project – when should we pull the starting goalie in favor of the backup goalie, if we want to maximize our chances of winning? Given that the only model above that incorporates goalie information is the MoneyPuck.com model (as far as I know), and that this model only includes team save percentage in its estimates, we need to figure out some other way to compute a live win probability estimate that incorporates live information about our goalies. This is the goal of this part of the project – how can we expand the model proposed in the previous part into a viable win probability model? That is, given that we are modeling a goalie facing shots as a sequence of Bernoulli trials and performing Bayesian inference on the goalie’s save probability, how can we adapt this mathematical model to estimate the probability of winning based on our current beliefs about our goalies? Since we are trying to adapt a theoretical process into a win probability model, it seems natural to attempt to build a theoretical win probability model, so again this will be a fairly mathematics-heavy discussion. However, we will also evaluate the resulting win probability model and explore how it works in practice. Since I know many of you reading this are probably most interested in the application over the theory, I’ll save the actual application of this model to goalie-pulling decisions for a separate post so you don’t have to read through a ton of theory and motivation here if you don’t want to.

The Model

Suppose that there are \(t \in [0, 60]\) minutes remaining in regulation, and that our starting goalie has allowed \(x\) goals on \(n\) shots against. The main result so far in this project is that if we assumed the starting goalie’s save probability \(\theta\) was distributed according to

\[ \theta \sim \text{Beta} \left( \alpha, \beta \right) \]

prior to the game, then after observing the starter allow \(x\) goals on \(n\) shots, we would believe that their save probability is then distributed according to the posterior distribution

\[ \theta \mid x \sim \text{Beta} \left( \alpha + n - x, \beta + x \right) \]

Since this posterior distribution essentially represents our best guess as to how \(\theta\) is distributed given the current information, any projections we want to make about how many goals our goalie allows in the remaining \(t\) minutes should be based on this posterior distribution. If we knew how many shots our opponent was going to take in the remaining \(t\) minutes, then the number of goals our goalie would allow in those \(t\) minutes would follow a binomial distribution. More concretely, if we knew that our opponent was going to take \(m\) shots in the remaining \(t\) minutes, the number of goals \(X(t)\) that our goalie allows in the remaining \(t\) minutes would be distributed according to

\[ X(t) \mid \theta \sim \text{Binomial} \left( m, 1 - \theta \right) \]

where \(\theta\) is distributed according to the posterior distribution above (since the number of goals our goalie allows in the rest of the game is dependent on the time remaining, we explicitly denote this dependence on \(t\)). Then, with those two distributions on \(\theta\) given \(x\) and \(X(t)\) given \(\theta\), we could find the probability distribution of the number of goals our goalie will allow in the remaining \(t\) minutes, which could easily be used to compute a win probability by also considering the opponent’s goalie and the current score. However, this would require us to know \(m\), which is impossible to know beforehand. Therefore, we’ll represent the number of shots our opponent will take in the remaining \(t\) minutes as a random variable \(Z(t)\) – but what is the distribution of \(Z(t)\)?
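Before tackling that question, here is a minimal sketch of the two pieces we already have, written in Python with scipy and using purely hypothetical numbers: the Beta posterior update, and the binomial distribution of goals allowed for a known remaining shot count \(m\).

```python
# A minimal sketch of the model so far (hypothetical numbers throughout).
from scipy.stats import beta, binom

alpha_prior, beta_prior = 90.0, 10.0  # hypothetical Beta prior on the save probability
n, x = 20, 3                          # the starter has allowed x goals on n shots so far

# Posterior on the save probability theta after observing x goals on n shots
posterior = beta(alpha_prior + n - x, beta_prior + x)
print(posterior.mean())               # updated point estimate of theta

# If we somehow knew the opponent would take exactly m more shots, the goals allowed
# in the remaining time would be Binomial(m, 1 - theta) for any given theta
m, theta = 15, posterior.mean()       # plugging in one value of theta just for illustration
print(binom.pmf(range(4), m, 1 - theta))  # P(0), P(1), P(2), P(3) more goals allowed
```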

Finding the Distribution of \(Z(t)\)

Since the number of shots our opponent will take in the remaining \(t\) minutes is an integer-valued random variable, a reasonable first guess for the distribution of \(Z(t)\) would be a Poisson distribution – after all, Brian Burke’s theoretical model is based on the assumption that goal scoring roughly follows a Poisson process, so maybe we think shots also follow a Poisson process, albeit with a different rate parameter. While the actual definition is more technical, for the purposes of this project, a Poisson process with rate \(\lambda\) is a counting process in continuous time where the number of events (the things we’re counting) that occur in a time interval of length \(s\) follows a Poisson distribution with mean \(\lambda s\), independently of when in the process we are counting. That is, if we take any interval of length \(s\), regardless of the starting point, the number of events that occur during that interval is a \(\text{Poisson} \left( \lambda s \right)\) random variable.

In general, it’s difficult to determine if an observed process follows a Poisson process. However, we can reasonably conclude that shots on goal roughly follow a Poisson process if two things hold: single-game shot totals roughly follow a Poisson distribution, and shots on goal are roughly equally likely to occur at any time during a game. Using the same MoneyPuck.com dataset as in the previous part, we have data for all shot attempts starting in the 2007-2008 season, with 124 features for each shot. Again, for now we’ll restrict our attention to regular-season games between the 2014-2015 and 2020-2021 seasons, and we’ll only look at shots and goals that occur in regulation time.

To determine if shot totals follow a Poisson distribution, we can fit a Poisson distribution to observed shot counts using maximum-likelihood estimation and visually assess whether the fit looks good. We can count the number of shots taken by each team in each game in our dataset, which gives the following distribution of single-game regulation shots-on-goal totals, overlaid with a Poisson distribution (the black line) computed using the maximum likelihood estimate of \(\lambda\), which in this case is \(\hat{\lambda} \approx 30.175\):

Unfortunately, the maximum-likelihood Poisson distribution does not seem to match the observed distribution of single-game shot totals. Extreme shot totals are more common than the Poisson distribution would indicate, and median shot totals are less common than the Poisson distribution would indicate. While the fit isn’t terrible, it clearly doesn’t look right, so we’ll need to consider some other distributions that shot totals might follow.

While the Poisson distribution is the most common count distribution for modeling, since it has one parameter and is defined for all non-negative integers, there is another distribution that can be used to model counts. The negative binomial distribution is defined for all non-negative integers, and is characterized by two parameters \(r > 0\) and \(p \in [0, 1]\). The probability mass function is given by

\[ \Pr \left( Y = k \mid r, p \right) = \binom{k + r - 1}{k} p^{r} \left( 1 - p \right)^{k} \]

This distribution is most commonly used to count the number of observed failures in a sequence of Bernoulli trials with success probability \(p\) before the \(r\)th success, but it can be easily generalized to non-integer values of \(r\) using the gamma function, and it can work well as a count distribution even when the random process we wish to describe isn’t defined by a sequence of Bernoulli trials (we’ll see why shortly). With this in mind, let’s compute the maximum likelihood estimates for \(r\) and \(p\) for a negative binomial distribution using our observed shot totals – these are given by \(\hat{r} \approx 68.657\) and \(\hat{p} \approx 0.695\):

With the exception of a few irregularities around the mode of the distribution, this fit is remarkably good, so it is probably reasonable to say that single-game shot totals follow a negative binomial distribution. To further visualize the difference between these two approximate distributions, we can overlay the observed distribution of shot totals with both the approximate Poisson and negative binomial distributions:
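For reference, here is a minimal sketch of how both of these maximum-likelihood fits could be computed, assuming a NumPy array of single-game regulation shot totals (the file name is a hypothetical placeholder). The Poisson MLE is just the sample mean, while the negative binomial fit numerically maximizes the likelihood over \(r\) and \(p\).

```python
# A minimal sketch of the two maximum-likelihood fits; the data file is hypothetical.
import numpy as np
from scipy.stats import nbinom
from scipy.optimize import minimize

shot_totals = np.loadtxt("shot_totals.csv")  # one regulation shot total per team-game

# Overdispersion check: the sample variance exceeding the mean is exactly what the
# single-parameter Poisson fit above cannot capture
print(shot_totals.mean(), shot_totals.var())

# Poisson MLE: the maximum-likelihood rate is just the sample mean
lambda_hat = shot_totals.mean()

# Negative binomial MLE: minimize the negative log-likelihood over (r, p); scipy's
# nbinom(r, p) uses the same "failures before the r-th success" form as the pmf above
def neg_log_lik(params):
    r, p = params
    return -nbinom.logpmf(shot_totals, r, p).sum()

fit = minimize(neg_log_lik, x0=[30.0, 0.5], bounds=[(1e-6, None), (1e-6, 1 - 1e-6)])
r_hat, p_hat = fit.x
print(lambda_hat, r_hat, p_hat)  # on the data described above: roughly 30.175, 68.657, 0.695
```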

While learning that shot totals approximately follow a negative binomial distribution is a promising result, it only applies to the number of shots a team takes over an entire game; what we actually want is the distribution of \(Z(t)\), the number of shots a team will take in the last \(t\) minutes of regulation. Since shots taken don’t follow a Poisson process (as the number of shots in a game doesn’t follow a Poisson distribution), we can’t use any of its helpful properties – most importantly, we can’t conclude that the number of shots taken in the final \(t\) minutes follows a Poisson distribution with mean \(\lambda t\), since our estimate for \(\lambda\) doesn’t appropriately fit the distribution of shots on goal. However, we can use a clever fact about the negative binomial distribution – it can be alternatively characterized by a compound Poisson distribution. That is, if \(Y \sim \text{NegativeBinomial} \left( r, p \right)\), then

\[ Y \overset{d}{=} \sum_{k = 1}^{N} Z_{k} \]

where \(N \sim \text{Poisson} \left( \lambda \right)\) for \(\lambda = -r \log \left( p \right)\), and \(Z_{k}\) are independent and identically distributed \(\text{Logarithmic} \left( 1 - p \right)\) random variables that are independent from \(N\) (the \(d\) over the equals sign means that \(Y\) and the random sum of \(Z_{k}\) have the same distribution). Essentially, we therefore have that while single-game shot totals don’t exactly follow a Poisson distribution, they roughly follow a generalized form of the Poisson distribution – in a regular Poisson distribution, the things we’re counting are events, but in a compound Poisson distribution, each event has a random weight \(Z_{k}\), and the thing we’re “counting” is the sum of the weights on the events.
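This identity is easy to check by simulation. Here is a quick sketch using the fitted values reported above; scipy's `logser` implements the logarithmic distribution.

```python
# A quick simulation check of the compound Poisson identity, using r ≈ 68.657, p ≈ 0.695.
import numpy as np
from scipy.stats import poisson, logser, nbinom

rng = np.random.default_rng(0)
r, p = 68.657, 0.695
lam = -r * np.log(p)                   # Poisson rate of the compound representation

n_games = 20_000
N = poisson.rvs(lam, size=n_games, random_state=rng)   # number of "events" per simulated game
totals = np.array([logser.rvs(1 - p, size=n, random_state=rng).sum() if n > 0 else 0
                   for n in N])                         # sum of the logarithmic weights

# The simulated totals should match the NegativeBinomial(r, p) mean and variance
print(totals.mean(), nbinom.mean(r, p))   # both roughly 30.1
print(totals.var(),  nbinom.var(r, p))    # both roughly 43.4
```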

Returning to our problem, note that if we use the maximum-likelihood estimates for \(r\) and \(p\) computed above, the compound Poisson distribution for shot totals is dictated by a Poisson distribution with mean

\[ \hat{\lambda} = -\hat{r} \log \left( \hat{p} \right) \approx 25.011 \]

and the weights on each event follow a \(\text{Logarithmic} \left( 0.305 \right)\) distribution. These two distributions are plotted below:

Based on the data we have available, there is no natural interpretation for what exactly this compound distribution represents, and attempting to extract the underlying Poisson and logarithmic distributions from a negative binomial distribution that exactly counts shots on goal is difficult to do. That is, while we have a compound Poisson distribution representation of shot totals, it isn’t clear how to distill this strict counting process into a different underlying counting process with random weights, so any claims we make about the underlying distributions are not rigorous. Because of this, please feel free to yell at me on Twitter for what I’m about to do.

Nevertheless, by inspection, a reasonable guess might be that the underlying Poisson distribution is counting what I’m going to call shot sequences – any sequence of shots in succession that includes an initial shot on goal and all rebounds that that shot immediately generates. From this, the weight \(Z_{k}\) would then naturally represent the number of shots in the \(k\)th shot sequence. Essentially, if we look at raw shot totals, we might guess that the totals are composed of shot sequences during the game, with each one resulting in a random number of shots \(Z_{k}\). Again, by inspection, teams averaging \(\hat{\lambda} \approx 25.011\) shot sequences per game seems reasonable, and from the logarithmic distribution, roughly 84% of these sequences don’t produce any rebounds, roughly 13% produce one rebound, and roughly 3% produce two rebounds (the probability of a sequence producing more than two rebounds is negligible). Again, I want to emphasize that this model of shot sequences following a Poisson distribution and each sequence producing a random number of rebounds (independently of the other sequences) was not obtained rigorously, but by guesswork relating to the breakdown of the compound Poisson distribution characterization of our distribution of single-game shot totals. However, given the information we have and the fact that shots on goal almost surely do not follow a negative binomial distribution based on Bernoulli trials, this seems like a reasonable way to justify why single-game shot totals appear to follow a negative binomial distribution.
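Those rebound percentages come directly from the logarithmic probability mass function; here is a quick check with scipy's `logser`.

```python
# The quoted rebound breakdown, straight from the Logarithmic(0.305) weight distribution
from scipy.stats import logser

q = 0.305                                  # 1 - p from the fitted negative binomial
for k in (1, 2, 3):
    print(f"{k - 1} rebounds: {logser.pmf(k, q):.3f}")
# 0 rebounds: ~0.838, 1 rebound: ~0.128, 2 rebounds: ~0.026
```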

Assuming you haven’t already angrily closed out this window, we can now begin to figure out how to generalize our negative binomial distribution on single-game shot totals to only the final \(t\) minutes of regulation. The compound Poisson process is a generalization of the Poisson process that counts the sum of weights on arrivals – instead of just the number of arrivals – in a given time interval. That is, if \(N(t)\) follows a Poisson process, and \(Z_{k}\) are i.i.d. random variables that are independent of \(N(t)\), then

\[ Z(t) = \sum_{k = 1}^{N(t)} Z_{k} \]

follows a compound Poisson process. In the context of our problem, if shot sequences follow a Poisson process, then shots on goal would follow a compound Poisson process, and we’d have a way to compute the distribution of \(Z(t)\). To determine whether shot sequences follow a Poisson process, I’m going to use the same modeling techniques as before – smoke-and-mirrors and hand-waving. The following histogram displays the distribution of times when shots on goal occur:

With the exception of the first and last minutes of each period, shots on goal appear to be roughly equally likely to occur at any time, so it’s not entirely unreasonable to claim that shot sequences roughly follow a Poisson process. Once you’re done yelling at me through the computer, we can finally figure out the distribution of shots on goal in the final \(t\) minutes of a game.

Suppose that the number of shots on goal taken by one team in a single game follows a \(\text{NegativeBinomial} \left( r, p \right)\) distribution. By the equivalent characterization of this distribution as a compound Poisson distribution, we’re assuming that shot sequences follow a Poisson process with rate \(\lambda = -r \log \left( p \right)\) for the entire game, and the number of shots that occur in the \(k\)th shot sequence is a \(\text{Logarithmic} \left( 1 - p \right)\) random variable that is independent of the other sequences. Because we’re assuming shot sequences follow a Poisson process, if there are \(t\) minutes left in the game, the number of shot sequences \(N(t)\) that occur in the final \(t\) minutes follows a \(\text{Poisson} \left( \lambda t / 60 \right)\) distribution (since \(t\) is in minutes but \(\lambda\) represents a rate per game). However, this means that the number of shots taken in the final \(t\) minutes of the game is given by

\[ Z(t) = \sum_{k = 1}^{N(t)} Z_{k} \]

where \(Z_{k}\) are i.i.d. \(\text{Logarithmic} \left( 1 - p \right)\) random variables and \(N(t) \sim \text{Poisson} \left( \lambda t / 60 \right)\). Note then that

\[ N(t) \sim \text{Poisson} \left( -\left( r t / 60 \right) \log \left( p \right) \right) \]

and thus \(Z(t)\) is an alternate characterization of the \(\text{NegativeBinomial} \left( r t / 60, p \right)\) distribution from the previous discussion. With all of this in mind, under all of the assumptions necessary to get to this point, we have that if the total number of shots taken in a single game is a \(\text{NegativeBinomial} \left( r, p \right)\) random variable, then the total number of shots taken in the final \(t\) minutes of a game is a \(\text{NegativeBinomial} \left( r t / 60, p \right)\) random variable. Using this information, we can now find the distribution of \(X(t)\), the number of goals our goalie will allow in the final \(t\) minutes of regulation.
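To make that concrete, here is a small sketch of the implied shot distribution over the final \(t\) minutes, using the league-wide estimates from earlier and a hypothetical value of \(t\).

```python
# Shots against in the final t minutes: Z(t) ~ NegativeBinomial(r * t / 60, p)
from scipy.stats import nbinom

r, p = 68.657, 0.695                     # league-wide estimates from earlier
t = 10                                   # hypothetical: 10 minutes left in regulation
shape = r * t / 60

print(nbinom.mean(shape, p))             # expected shots against in those 10 minutes (~5)
print(nbinom.pmf(range(11), shape, p))   # P(Z(10) = 0), ..., P(Z(10) = 10)
```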

Finding the Distribution of \(X(t)\)

In total, we have three related probability distributions governing the number of goals our goalie will allow in the final \(t\) minutes of regulation:

  1. A posterior distribution on \(\theta\), the goalie’s save probability given the number of goals \(x\) they’ve allowed on the \(n\) shots they’ve already faced
  2. A distribution on \(Z(t)\), the number of shots our opponent will take in the final \(t\) minutes of the game
  3. A distribution on \(X(t)\) given \(\theta\) and \(Z(t)\), the number of goals our goalie will allow given their save probability and the number of shots they’ll face in the final \(t\) minutes

These three distributions are given as follows:

\[\begin{align} \theta & \sim \text{Beta} \left( \alpha + n - x, \beta + x \right) \\ Z(t) & \sim \text{NegativeBinomial} \left( r t / 60, p \right) \\ X(t) \mid \theta, Z(t) = m & \sim \text{Binomial} \left( m, 1 - \theta \right) \end{align}\]

Using the law of total probability, we can find that

\[ \Pr \left[ X(t) = k \mid \theta \right] = \binom{k + \left( r t / 60 \right) - 1}{k} \left[ p^{*}(\theta, p) \right]^{r t / 60} \left[ 1 - p^{*}(\theta, p) \right]^{k} \]

where

\[ p^{*}(\theta, p) = \frac{p}{1 - \theta \left( 1 - p \right)} \in (0, 1) \]

for \(\theta, p \in (0, 1)\). This is exactly the negative binomial probability mass function with shape \(r t / 60\) and probability \(p^{*}(\theta, p)\) (it will be helpful to note the dependence of this probability on both \(\theta\) and \(p\), which is why I’ve written it in this notation), so

\[ X(t) \mid \theta \sim \text{NegativeBinomial} \left( r t / 60, p^{*}(\theta, p) \right) \]
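As a quick sanity check on this result, we can simulate the thinning directly: draw the opponent's remaining shots from \(\text{NegativeBinomial} \left( r t / 60, p \right)\), let each shot beat the goalie independently with probability \(1 - \theta\), and compare to \(\text{NegativeBinomial} \left( r t / 60, p^{*} \right)\). The parameter values below are hypothetical.

```python
# Simulation check of the thinned negative binomial result (hypothetical parameters).
import numpy as np
from scipy.stats import nbinom, binom

rng = np.random.default_rng(1)
r, p, t, theta = 68.657, 0.695, 20.0, 0.91
shape = r * t / 60
p_star = p / (1 - theta * (1 - p))

shots = nbinom.rvs(shape, p, size=100_000, random_state=rng)   # opponent's remaining shots
goals = binom.rvs(shots, 1 - theta, random_state=rng)          # thin each shot independently

print(goals.mean(), nbinom.mean(shape, p_star))                # should agree closely
print(np.bincount(goals, minlength=4)[:4] / goals.size,        # empirical P(0), ..., P(3)
      nbinom.pmf(range(4), shape, p_star))                     # vs. the claimed pmf
```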

Using the law of total probability again, we have that

\[ \Pr \left[ X(t) = k \right] = \int_{0}^{1} \Pr \left[ X(t) = k \mid \theta \right] \pi \left( \theta \mid x \right) \ d \theta \]

Substituting in the probability mass function of \(X(t)\) given \(\theta\) and the posterior density of \(\theta\) given \(x\), we can write this integral in terms of only the parameters defining the two distributions, \(\theta\), and \(k\) (because it’s exceptionally long, I haven’t written it out here). While complicated, this is an integral that we can solve numerically, which gives us a way to compute the probability of our goalie allowing exactly \(k\) goals in the final \(t\) minutes of regulation. It will be helpful to summarize the variables we need to compute this integral, and where they come from in our simpler distributions:

  • \(\alpha\) and \(\beta\) represent our pre-game beliefs about the distribution of \(\theta\), the goalie’s save probability
  • \(x\) is the number of goals our goalie has allowed in this game, and \(n\) is the number of shots our goalie has faced in this game
  • \(r\) and \(p\) are the parameters representing the opponent’s distribution of total shots taken in a single game
  • \(p^{*}(\theta, p)\) is a function of \(\theta\) and \(p\) that returns a new probability parameter (the function is given above)
  • \(t\) is the number of minutes remaining in regulation

We are integrating over \(\theta\), and \(\Pr \left[ X(t) = k \right]\) should be a function of \(k\), so we have identified every variable in this integral. With this information, we can now begin to work on computing a win probability.
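Before moving on to the win probability itself, here is a minimal numerical sketch of that integral using scipy's `quad`; every parameter value in the example call is hypothetical.

```python
# Numerically integrating the NB(r t / 60, p*) pmf against the Beta posterior density.
from scipy.stats import nbinom, beta
from scipy.integrate import quad

def prob_goals_allowed(k, t, a_prior, b_prior, n, x, r, p):
    """P(X(t) = k): the goalie allows exactly k goals in the final t minutes."""
    a_post, b_post = a_prior + n - x, b_prior + x   # Beta posterior parameters
    shape = r * t / 60                              # negative binomial shape for t minutes

    def integrand(theta):
        p_star = p / (1 - theta * (1 - p))
        return nbinom.pmf(k, shape, p_star) * beta.pdf(theta, a_post, b_post)

    value, _ = quad(integrand, 0, 1)
    return value

# Hypothetical example: 3 goals allowed on 25 shots so far, 20 minutes remaining
pmf = [prob_goals_allowed(k, t=20, a_prior=90, b_prior=10, n=25, x=3, r=68.657, p=0.695)
       for k in range(6)]
print(pmf, sum(pmf))   # P(X(20) = 0), ..., P(X(20) = 5); the sum should be close to 1
```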

Win Probability

It shouldn’t be difficult to see that we can use the exact same process we just performed to find the distribution of goals allowed by the opponent’s goalie in the remaining \(t\) minutes of the game – all we need to do is assume a prior distribution on the opponent’s starter’s save probability, record how many shots on goal we’ve taken and how many of them were goals, and find the values of \(r\) and \(p\) that represent our team’s distribution of shots on goal. We’ll then have a distribution on both the number of goals our goalie will allow, and on the number of goals our opponent’s goalie will allow. Taking the current score differential into account, we can then consider all possible combinations of goals scored by our team and by our opponent where we wind up ahead on the scoreboard, and by summing up the probabilities of each of these combinations occurring, we’ll find the probability that we win the game. We also need to consider combinations where the score is tied at the end of regulation, and assuming that overtime is essentially a coin flip, we can weight the probability of each of these combinations occurring by 1/2, and then add those to the win probability.
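A minimal sketch of that bookkeeping is below: given the current score and truncated pmfs for the number of goals each goalie will allow in the remaining time (computed as in the previous section), we sum the probabilities of every combination that leaves us ahead, plus half the probability of a tie. The example scores and pmfs are hypothetical.

```python
# Win probability from the two remaining-goals pmfs and the current score (hypothetical values).
def win_probability(our_score, opp_score, pmf_our_goalie, pmf_opp_goalie):
    """pmf_our_goalie[k] = P(our goalie allows k more goals)   -> opponent's future goals
       pmf_opp_goalie[k] = P(their goalie allows k more goals) -> our future goals"""
    win, tie = 0.0, 0.0
    for a, p_a in enumerate(pmf_our_goalie):         # a = goals the opponent adds
        for b, p_b in enumerate(pmf_opp_goalie):     # b = goals we add
            if our_score + b > opp_score + a:
                win += p_a * p_b
            elif our_score + b == opp_score + a:
                tie += p_a * p_b
    return win + 0.5 * tie                           # treat overtime as a coin flip

# Hypothetical example: we lead 2-1, with pmfs truncated at 5 remaining goals apiece
pmf_ours = [0.45, 0.33, 0.14, 0.05, 0.02, 0.01]
pmf_theirs = [0.40, 0.34, 0.16, 0.06, 0.03, 0.01]
print(win_probability(2, 1, pmf_ours, pmf_theirs))
```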

While we now have a win probability model that depends on a Bayesian analysis of goalies, before using this model to optimize goalie play in-game, we need to make sure the model works as intended – after all, while we’re willing to make some concessions in prediction accuracy compared to the existing win probability models, if the model doesn’t still perform reasonably well, then for our purposes it’s just as useless as the existing models. Therefore, we need to evaluate the model, and once we’ve done that, it will be helpful to explore some example games to see how it works.

Model Evaluation

Before evaluating the model at all, I’m going to generalize the setup we’ve been using to this point. As I am not a hockey coach, the concepts of “our team” and “our opponent” don’t have any standardized meaning – therefore, I’ll instead consider the two teams as simply the home team and the away team, and for win probability estimates I’ll compute the win probability of the home team.

To evaluate the model, I’m going to use all games from the 2018-2019 regular season as a testing set. We’ll randomly sample five time points from each period of each game and compute the home team’s win probability at each one, for a total of 19,065 observations. However, since we have a theoretical model, we first need to choose our priors on the goalies’ save probabilities, and we need to estimate each team’s distribution of regulation shot totals. Because this model is theoretical, there are several ways in which we can estimate these parameters, both in terms of the method of estimation and the prior data we want to use, and thus it is completely infeasible to optimize our estimation over a full range of parameters. I’m only going to try a few combinations of parameters now for demonstration purposes, but I’ll also probably try some more methods later, and I’ll write about any that perform particularly well, either here or on Twitter.

Now, since we are estimating probabilities, we can use standard diagnostics like log-loss, ROC curves, and AUC (all of which I will still provide), but the easiest way to see whether our win probability model is acceptable is to bin our estimated probabilities and then compute the proportion of observations in each bin that resulted in the home team winning. To do this, we can round each probability to the nearest multiple of 0.05, and then create a scatterplot comparing the binned probabilities and the observed win probability of each bin. If the points are well-approximated by the line \(y = x\), then we have a reasonable win probability model (this method is adapted from a Michael Lopez article). Remember – we’re not trying to beat existing models; we’re just trying to create an acceptable model that can help answer our specific question about goalie-pulling.
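For reference, the binning step itself is only a few lines of pandas; the dataframe, file name, and column names below are hypothetical placeholders, with one row per sampled time point.

```python
# Calibration check: bin predictions to the nearest 0.05 and compare to observed win rates.
import pandas as pd
from sklearn.metrics import log_loss, roc_auc_score

df = pd.read_csv("win_prob_samples.csv")   # hypothetical: columns win_prob and home_won

df["bin"] = (df["win_prob"] / 0.05).round() * 0.05
calibration = df.groupby("bin")["home_won"].agg(["mean", "size"])
print(calibration)                          # a good model has observed mean ≈ bin value

# The summary metrics reported for each model below
print(log_loss(df["home_won"], df["win_prob"]))
print(roc_auc_score(df["home_won"], df["win_prob"]))
```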

Model 1: Uninformative Priors/General Teams

For the first model, let’s first assume uniform priors on both goalies’ save probabilities. As discussed previously, this prior is almost certainly wrong, but it may be the case that an uninformative prior is the most flexible in allowing us to identify bad starts before they completely tank a team’s chances. For the distributions on shot totals, we’ll use all league-wide games from the previous season to compute the maximum likelihood estimates of \(r\) and \(p\), so this model won’t factor in team strength. The binned vs. observed win probability graphs are shown below:

Clearly, this model is not particularly good. While the second and third period probability estimates look okay, there is something visibly incorrect about how we’re predicting games in the first period. After some thought, a reasonable conclusion would be that this model is wildly oversensitive to early shots on goal – since our prior assumes little information, the first few shots on goal would have large effects on the contributions of the save probability posteriors to the win probability computation. Therefore, we can safely conclude that this prior choice makes for a bad win probability model that won’t help us much. Nevertheless, the ROC curves (broken out by period) for our predicted win probabilities are shown below:

This model gives a log-loss of 0.71 for the first period, 0.555 for the second period, and 0.327 for the third period. The AUC for the three ROC curves are 0.616 for the first period, 0.8 for the second period, and 0.931 for the third period. Heuristically, this model performs poorly in the first period, moderately in the second period, and excellently in the third period. This should make sense – there is so much uncertainty in the first period, especially with this choice of prior, but as the game progresses, we know quite a bit about the two teams within each specific game, and our predictive power increases.

Model 2: Informative Priors/General Teams

For the second model, we’ll estimate the shot distributions the same way, using the full slate of league-wide games from the previous season, and again we won’t be factoring in team strength just yet. For the priors on the goalies’ save probabilities, let’s now use all league-wide goalie starts from the previous season and compute the maximum-likelihood estimates of \(\alpha\) and \(\beta\), which I’ll use as the prior for both goalies. The binned vs. observed win probability graphs are shown below:

This model clearly performs better than the one using uniform priors, especially in the first period – however, there are still some irregularities. Home teams that we predict to win around 40-45% or 55-60% of the time seem to win closer to 50% of the time, and in the third period there are some points very far from the line \(y = x\). Some of this could be chalked up to the fact that we aren’t factoring in power plays and empty net situations yet, and for the third period, the far-off points are small, indicating that there are maybe only one or two games where we have third-period win probabilities in those bins. A deeper cause could be that in close games, our intuition about which teams have a better chance to win may not line up with what the model suggests. For example: if a team opens up the game outshooting their opponent 8-1 but the score is still 0-0, while we’d probably believe that the team leading in shots would have a better chance to win, this modeling technique would believe that the goalie who’s made 8 saves is playing better than the goalie who’s only made one save. This could be a modeling flaw, but in any case, having more confidence in the goalies before the game vastly improves the model over assuming uninformative priors. The ROC curves for this model are shown below:

This model gives a log-loss of 0.65 for the first period, 0.529 for the second period, and 0.32 for the third period. The AUC for the three ROC curves are 0.645 for the first period, 0.808 for the second period, and 0.932 for the third period. We get better performance from this model than the first model in all aspects (as the log-loss values are lower for each period and the AUC values are higher for each period), but the increases in model accuracy are not particularly large. Nevertheless, this model seems far more appropriate than the first model.

Model 3: Informative Priors/Team-Specific

For the third model, we’re again going to assume informative priors on both goalies, using the maximum-likelihood estimates of \(\alpha\) and \(\beta\) based on all league-wide games from the previous season. However, we’re now going to obtain team-specific estimates of \(r\) and \(p\) that also account for home-ice advantage. To do this, we’re again going to use all games from the previous season, but now we’re going to fit a negative binomial regression using the team and a binary home/away variable as predictors. This regression will assume a constant value of \(r\), so all teams will have the same estimated negative binomial shape parameter \(\hat{r}\), but the estimates of \(\hat{p}\) will account for team strength and home-ice advantage. It should also be noted that these estimates are based solely on the previous season, so any significant changes in team strength from 2017-2018 to 2018-2019 will not be reflected as our evaluation progresses through the season. The binned vs. observed win probability graphs are shown below:

The improvement from the previous model doesn’t appear significant, although this model does look to perform a bit better. To improve the team-specific estimates, instead of using all of the previous season’s games for the entire season, we could use the previous \(k\) league-wide games to train our negative binomial regression model (so we’d be factoring games from the current season as we progress), for some reasonably large number of games \(k\) (this model looks fairly good for now, so I’m not going to change the prior data in this article). The ROC curves for this model are shown below:

This model gives a log-loss of 0.646 for the first period, 0.527 for the second period, and 0.319 for the third period. The AUC for the three ROC curves are 0.657 for the first period, 0.81 for the second period, and 0.933 for the third period. In every metric, this model performs better than the previous two models, although the improvements are all marginal at best. Regardless, for the remainder of this article, we’ll use this particular model to explore how this win probability framework works in practice.
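For completeness, here is a rough sketch of how the team-specific estimates described above might be obtained with statsmodels' negative binomial regression – this is my assumption about the tooling, not necessarily how the estimates were actually computed. In the NB2 parameterization, the dispersion \(\alpha\) relates to our shape parameter via \(r = 1 / \alpha\), and each team's \(\hat{p}\) follows from its predicted mean. The data file and column names are hypothetical.

```python
# Team-specific shot distributions via negative binomial regression (a sketch; the
# data file and column names are hypothetical placeholders).
import pandas as pd
import statsmodels.formula.api as smf

games = pd.read_csv("team_game_shots.csv")   # one row per team-game from the prior season

model = smf.negativebinomial("shots ~ C(team) + is_home", data=games).fit()

r_hat = 1.0 / model.params["alpha"]          # constant shape parameter shared by all teams
mu = model.predict(games)                    # predicted mean shots for each team-game
games["p_hat"] = r_hat / (r_hat + mu)        # mean r(1 - p) / p = mu  =>  p = r / (r + mu)

# A team's full-game shot total is then NB(r_hat, p_hat), and its shots over the final
# t minutes are NB(r_hat * t / 60, p_hat), exactly as in the league-wide case.
```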

Example Win Probability Graphs

While evaluating the win probability model is one thing, we also want to make sure it looks appropriate for specific games. I’ve cherry-picked a few games from the 2018-2019 season, and we’ll use the third model above to visualize the in-game win probability estimates over the course of each of those games.

October 4, 2018: Washington 6 at Pittsburgh 7 (OT)

I picked this game for a few reasons: first of all, it was early in the season, so the goalie priors and team-specific shot distribution parameter estimates from the previous season should be fairly good (compared to late-season games). Additionally, this game was exceptionally high-scoring, while also being a fairly close game throughout, and remarkably, neither goalie was pulled – despite 5 goals being scored in the first 8 minutes, both Braden Holtby and Matt Murray played the entire game. Holtby finished with 34 saves on 41 shots for an observed save percentage of 0.829, and Murray finished with 30 saves on 36 shots for an observed save percentage of 0.833. While we’re not looking into goalie-pull recommendations just yet, this game would likely be the perfect case study for when pulling the starter can provide a big boost in win probability. Using the third model above, we have the following win probability graph:

It should be apparent when the goals occur, especially since this game isn’t lopsided and doesn’t have any garbage-time goals, but in future updates to this project I’ll add indicators at the points when goals are scored. The smaller changes in win probability are due to saves, as every save improves our beliefs about the goalie, and thus the chances of that goalie’s team winning improve slightly. Over time, saves have less of an effect on the win probability, since fewer shots occur with less time remaining, and by the end of the game we are fairly confident in both goalies’ save probabilities. The win probability converges to 0.50 as the game ends, since we are only considering regulation time and are treating overtime as essentially a coin flip. Overall, this graph looks fairly appropriate, although some sort of smoothing may also be useful for the particularly jagged segments – I’ll go over how we might do this in a later post.

January 12, 2019: NY Rangers 2 at NY Islanders 1

Unlike the first game, this game was extremely low-scoring, so the win probability graph should be more stable early on, and it would also probably be an example of a game where we’d see a decrease in win probability by pulling either starter. The graph is shown below:

Unlike the first game where most of the first several shots were goals, this game began with a series of saves, so we can see that each early save had a relatively large impact on win probability leading up to the first goal. Overall, this game isn’t particularly interesting, especially since Alexandar Georgiev faced only 25 shots and Robin Lehner faced only 26 shots, but this is a typical low-scoring NHL game, so win probability graphs like this are probably fairly common. We also want our final decision rule to apply to games where pulling the starter would be a bad idea, so games like this should serve to verify that this model wouldn’t always suggest pulling the starter.

April 6, 2019: Vegas 2 at Los Angeles 5

Unlike the first two games, which were both one-goal games on opposite extremes of goal-scoring, the score of this game is imbalanced, but seems to be a typical NHL final score. A 4-2 game with a late empty-net goal feels fairly common in the modern NHL, so this is another good example of a game we might see fairly often. Furthermore, it was one of the last games in the season, so the prior data used to estimate the team-specific shot distribution parameters is almost certainly out of date, and we can get some insight into how much that matters in practice. The graph is shown below:

At the beginning of the game, we can see the effects of the fact that our estimated shot distributions aren’t highly accurate – Vegas had clinched a playoff spot, while Los Angeles was solidly at the bottom of the Western Conference, but at the very beginning of the game, this model only gives Vegas a slight advantage over Los Angeles. Nevertheless, as the game progresses, it seems like the current game state begins to have a larger impact on the win probability than the two teams’ estimated shot distributions, which is what we’d want from a good win probability model. Marc-Andre Fleury and Jonathan Quick both faced 31 shots, which again is right around the median shot total for an NHL team. Los Angeles did score an empty-net goal, which isn’t accurately accounted for yet in this model, but it was late in the game and Los Angeles was already up 2 goals, so accounting for the empty-net situation here probably wouldn’t make much difference.

Model Limitations

I’ve mentioned it multiple times, but this model is not perfect, and it is almost certainly not better than the models mentioned at the beginning of this post. I haven’t accounted for any differences in strength-state yet, whether via power plays or empty-net situations. This could be fixed by computing different estimates for the shot distribution parameters of each team based on their current number of skaters, and breaking up the remaining \(t\) minutes of regulation into separate intervals, with each interval representing a constant strength-state. It would be more difficult to compute separate estimates for the shot distribution parameters for each possible strength-state, but assuming we had them, we could easily account for uneven strength-states.
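As a rough sketch of how that could work computationally: conditional on \(\theta\), goals allowed in each constant-strength interval would have its own negative binomial distribution, and the intervals' pmfs can be convolved to get the total. Everything below, including the per-strength-state parameters, is hypothetical.

```python
# Combining goals allowed across constant-strength intervals for a fixed theta (a sketch).
import numpy as np
from scipy.stats import nbinom

def goals_pmf_given_theta(intervals, theta, max_goals=15):
    """intervals: list of (minutes, r, p) segments covering the remaining regulation time."""
    total = np.zeros(max_goals + 1)
    total[0] = 1.0                                        # start from "0 goals" with probability 1
    for minutes, r, p in intervals:
        p_star = p / (1 - theta * (1 - p))
        pmf = nbinom.pmf(np.arange(max_goals + 1), r * minutes / 60, p_star)
        total = np.convolve(total, pmf)[: max_goals + 1]  # pmf of the running sum (truncated)
    return total

# Hypothetical: 2 minutes shorthanded (higher shot rate against), then 8 minutes at evens
intervals = [(2, 90.0, 0.60), (8, 68.657, 0.695)]
print(goals_pmf_given_theta(intervals, theta=0.91)[:4])   # P(0), P(1), P(2), P(3) goals
```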

Another potential flaw has also been mentioned, but I’ll repeat it here – our intuition about which team has a better chance to win may not always line up with what this win probability model indicates. Every shot a goalie saves increases their team’s chance of winning according to this model, so a goalie that makes several saves early would cause this model to favor that goalie’s team a bit more going forward. However, that goalie can only make those saves if their team gives up those shots, and we’d probably expect that the team getting caved early would have a lower chance to win. Some of this imbalance could be made up for by the two teams’ estimated shot distributions, as the team bleeding shots early would probably be the worse team on average, but the disconnect between our intuition and this particular model still exists.

Finally, it’s worth mentioning again that the process of creating this model was absolutely not mathematically rigorous. The final model seems to work in practice and is based on a theoretical idea about how hockey games work, but the steps involved in deriving this theoretical model are very speculative, so you should take everything going forward with a grain of salt. If the work I’ve done to create this model feels weird or wrong to you, that’s a good thing – it means you see room for improvement in the framework I’ve established to evaluate goalies in-game, which can help coaches make better decisions.

Conclusion

While it isn’t perfect, we now have a win probability model that is dependent on Bayesian estimates of goalies’ save probabilities. It certainly doesn’t outperform win probability models created and trained in other ways, but it is unique in that it depends on our live, subjective beliefs about goalies within a single game, which means it can be used to inform the process of determining when the backup goalie might give a team a better chance to win than the starter. While we haven’t explored how to use this model to inform these decisions just yet, the general framework is now established, and all that remains is to apply the model to NHL games, and later figure out if this improved decision rule actually provides a benefit to NHL teams in the long run.