LAEGAP Methodology

August 28, 2013 5 minute read

This piece details my specific methodology for calculating LAEGAP. If you hate math or would just like a basic explanation, head over to my original post, which also includes some of my insights into the resulting data. All data is from the 2012-13 season.

Fixing Recorder Bias

The NHL is infamous for a lack of consistency in its statistics across arenas that is almost embarrassing compared to other professional sports leagues in North America. Hits, takeaways, giveaways, blocked shots, and shot locations are all recorded differently in different arenas. Before we use the shot location data provided by the NHL, we will first attempt to fix these inconsistencies.

The NHL records shot location data on an (x, y) plane. The X axis spans from -99 to 99 and the Y axis spans from -42 to 42. A standard NHL rink is 200 feet by 85 feet, so each X value represents slightly more than 1 foot (200 feet across 199 integer values) and each Y value represents exactly 1 foot (85 feet across 85 integer values).

Past studies have attempted to fix the recording bias by using distance as a reference, or have simply ignored it and attempted to ease it by regressing each point locally. Because shot location data is recorded as x, y values rather than as distance and angle from the net (1st study), measuring the bias separately for positive and negative x and y values works better than measuring the bias in distance and angle from the net. We will use regression to improve our data, but we won’t rely on it to fix the recording bias.

I measured each team’s average positive and negative x, y values in away games, using no more than two games recorded in the same arena. I chose to limit the number of games per arena so that frequent visits to division rivals with a heavy bias would not skew the team average.

I then recorded each arena’s average positive and negative x, y values for its visiting teams. This is compared with the arena’s expected coordinate average, which is the average of the visiting teams’ coordinates weighted by their share of visits, to arrive at each arena’s coordinate bias. Each arena’s coordinate bias is used later to adjust the data.

team_name pos_x pos_y neg_x neg_y avg_dis
Philadelphia Flyers 1.57895 -0.526316 -1 0 -1.26014
Winnipeg Jets -0.117647 2.17647 -1.70588 0.176471 -0.272135
Los Angeles Kings 1.47619 -1.52381 0 0.714286 -1.19223
Boston Bruins -0.105263 0.315789 -1.31579 -0.947368 -0.233984
Montréal Canadiens -0.619048 -1.52381 -1.71429 3.04762 -1.61362
New York Islanders -4.31579 3.36842 2.47368 -1.94737 4.23246
Tampa Bay Lightning 1.52632 -0.736842 -4.84211 -0.105263 -2.96685
Florida Panthers 2.85 -0.55 0.1 0 -1.35492
St. Louis Blues -3.09091 1.31818 0.954545 -0.227273 2.14256
Nashville Predators -1.45 -0.35 1.7 1.75 0.778731
Dallas Stars 1.57143 0.428571 -0.190476 -0.285714 -0.626039
Minnesota Wild 2.2381 -1.47619 -2.33333 0.714286 -2.51173
Vancouver Canucks 2.52632 -0.473684 -3.63158 -2.26316 -2.28123
Buffalo Sabres 1.40909 1.31818 -0.181818 0.0909091 -0.459828
New York Rangers -4 2 4.55 -1.7 4.62993
Calgary Flames -2.42857 -0.333333 -3.19048 -0.333333 -0.296126
Phoenix Coyotes -2.27273 0.409091 -2.40909 0.818182 -0.134597
Toronto Maple Leafs 2.31579 -1.36842 -0.315789 0.0526316 -1.49276
Ottawa Senators -1.47619 -2.42857 0.0952381 2.04762 -0.452255
Columbus Blue Jackets -0.25 -1.6 0.5 3.7 -1.11734
Washington Capitals 1.4 1.15 -0.25 -0.9 -0.284193
Carolina Hurricanes -2.22727 -0.636364 3.45455 0.136364 2.2673
New Jersey Devils -1.18182 0.227273 -0.318182 1.04545 0.176491
Detroit Red Wings 1.2 -2.4 2.55 -0.45 0.103993
Chicago Blackhawks -3.40909 0.227273 -1.45455 -0.363636 1.04302
Colorado Avalanche -1.15789 -1.63158 1.57895 -0.315789 0.878376
Edmonton Oilers 0 1.65 1.1 -2.4 1.42726
Pittsburgh Penguins -0.571429 -0.52381 -0.380952 0.190476 -0.0799169
San Jose Sharks 3.59091 0.272727 -0.227273 0.590909 -1.83297
Anaheim Ducks -2 1.42105 2.15789 -1.26316 2.46048

As expected, MSG, notorious for its inaccuracy, records each shot on average 4.6 feet closer to the net than it is. Meanwhile, Consol Energy Center does the best job in the league of recording shot location data accurately, according to the aggregate average of the entire eastern conference and some of the western conference.
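To make the bias calculation in this section concrete, here is a minimal Python sketch for a single coordinate column (say pos_x). It is an illustration, not the code used to produce the table above: the function name `arena_bias`, the dictionary layout, the sample numbers, and the sign convention (observed minus expected) are all assumptions.

```python
def arena_bias(team_avg, visits, arena_observed):
    """Return observed-minus-expected coordinate average for each arena.

    team_avg:       {team: that team's away-game coordinate average}
    visits:         {arena: {team: number of visits to that arena}}
    arena_observed: {arena: coordinate average of visiting teams' shots there}
    """
    bias = {}
    for arena, counts in visits.items():
        total_visits = sum(counts.values())
        # Expected average: team averages weighted by each team's share of visits.
        expected = sum(team_avg[team] * n for team, n in counts.items()) / total_visits
        # Sign convention is an assumption; the table may use the opposite.
        bias[arena] = arena_observed[arena] - expected
    return bias


# Hypothetical numbers, for illustration only.
team_avg = {"Team A": 60.0, "Team B": 64.0}
visits = {"Arena X": {"Team A": 1, "Team B": 3}}
arena_observed = {"Arena X": 67.0}

bias = arena_bias(team_avg, visits, arena_observed)
print(bias)  # expected average is (60*1 + 64*3) / 4 = 63.0
```

The same function would be run once per column (pos_x, pos_y, neg_x, neg_y).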

Location Specific Shooting Percentage

With the 634 games of data available from the NHL, I recorded every even-strength, non-empty-net shot and goal, adjusted its location for arena bias, and calculated the shooting percentage (goals/(shots+goals)) at each location. All x, y values are flipped so that they fall in the east half of the rink. Occasionally a shot taken from a player’s own half is flipped when it wasn’t supposed to be, and vice versa, but because scoring from that far out relies almost entirely on luck, I consider these few data points insignificant.

Because there is not much data in low-scoring areas, the shooting percentage varies greatly from point to point at those locations. For example, at (30, 9), five feet inside the blue line, the shooting percentage is 50% (1/(1+1)). Its neighbor, (30, 11), has a shooting percentage of 0% (0/(0+1)). Clearly, the 50% shooting percentage is not sustainable. To fix this, I regressed each point with its neighbors within a radius of 5 feet of the original point, using an exponential weighting function.

$$w = 0.75^{d}$$

where w is the weight and d is the distance in feet from the original point.

This means that each point has 75% of the weight of a point that is 1 foot closer to the original. Here is the heat map of the shooting percentage across one half of the rink, with red being the highest shooting percentage and blue the lowest. Missing squares indicate that no data is available at that location.

heatmap

As we expected, the closer you are to the net, the more likely you are to score.
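The counting and neighbor-weighting steps in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation: it assumes shots are mirrored through center ice, that one coordinate unit is treated as one foot, and that neighbors are pooled by weighting their goals and attempts (rather than averaging their percentages). The function names are hypothetical.

```python
import math
from collections import defaultdict


def flip_to_one_half(x, y):
    """Mirror a shot through center ice so every shot has x >= 0 (assumed flip rule)."""
    return (-x, -y) if x < 0 else (x, y)


def raw_counts(events):
    """events: iterable of (x, y, is_goal). Returns {(x, y): [goals, attempts]}."""
    counts = defaultdict(lambda: [0, 0])
    for x, y, is_goal in events:
        cell = counts[flip_to_one_half(x, y)]
        cell[0] += int(is_goal)  # goals
        cell[1] += 1             # attempts = shots + goals
    return counts


def smoothed_pct(counts, point, radius=5.0, decay=0.75):
    """Shooting percentage at `point`, pooling neighbors within `radius`.

    Each neighbor at distance d contributes with weight w = decay ** d.
    Returns None when no data falls inside the radius.
    """
    px, py = point
    goals = attempts = 0.0
    for (x, y), (g, n) in counts.items():
        d = math.hypot(x - px, y - py)
        if d <= radius:
            w = decay ** d
            goals += w * g
            attempts += w * n
    return goals / attempts if attempts else None


# The two example points from the text: (30, 9) has 1 goal and 1 shot,
# (30, 11) has 0 goals and 1 shot. One event is entered on the other half
# of the rink to show the flip.
events = [(30, 9, True), (-30, -9, False), (30, 11, False)]
counts = raw_counts(events)
print(smoothed_pct(counts, (30, 9)))
```

If those two points were the only data within range, the raw 50% at (30, 9) would be pulled down to about 39% (1 / 2.5625), since the neighbor two feet away carries a weight of 0.75² = 0.5625.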

Expected Goals For/Against

I created Location Adjusted Expected GoAls Percentage (LAEGAP) to address Corsi’s failure to account for shot quality. Corsi itself was invented to address the small sample size of +/-. It is natural, then, that LAEGAP resembles a form of +/-, which is essentially what EGF/EGA is.

Expected Goals For (EGF) is an approximation of the number of goals a player’s team would score while the player is on the ice if every shot taken had the league-average shooting percentage (chance of scoring) for its location. Expected Goals Against (EGA) is an approximation of the number of goals the opposing team would score while the player is on the ice, under the same conditions. Essentially, each is a measure of the number of shots weighted by their quality.
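In code, the definition above amounts to summing the league-average shooting percentage at every on-ice shot location. A minimal Python sketch, where `pct_at` (a lookup from location to shooting percentage), the coordinates, and the percentages are all made-up placeholders:

```python
def expected_goals(shot_locations, pct_at):
    """Sum the league-average shooting percentage at each shot location."""
    return sum(pct_at[loc] for loc in shot_locations)


# Hypothetical lookup: location -> league-average shooting percentage.
pct_at = {(80, 0): 0.10, (40, 20): 0.05}

# Hypothetical on-ice shots for one player.
shots_for = [(80, 0), (80, 0), (40, 20)]
shots_against = [(40, 20), (40, 20)]

egf = expected_goals(shots_for, pct_at)      # 0.10 + 0.10 + 0.05
ega = expected_goals(shots_against, pct_at)  # 0.05 + 0.05
print(egf, ega)
```

Three shots for from two dangerous spots and one weak one are worth about 0.25 expected goals, while two weak shots against are worth about 0.10.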

Now that we have EGF, a measure of on-ice offensive events, and EGA, a measure of on-ice defensive events (or lack thereof), we can calculate the difference, because, ultimately, hockey is a game you win by scoring more than your opponent. Simply subtracting EGA from EGF is not good enough, because it gives high-event players an advantage or a disadvantage, depending on whether they are positive-possession players. For example, if our imaginary player, John, was on the ice for one shot for at a 10% shooting percentage location and one shot against at a 5% shooting percentage location every shift, for 100 shifts, he would have an EGF of 10 and an EGA of 5, a difference of 5 goals. Now imagine a second player, Ethan, who had the same percentages every shift but played 1000 shifts. His expected goal difference (expected +/-) would be 50 goals, even though he and John performed equally, possession-wise, every time they were on the ice.

The solution is simple: calculate EGF as a percentage of total expected goal events (EGF + EGA). Now both John and Ethan have the same EG%: 66.7% (10/(10+5) = 100/(100+50)). This raises another problem. What if a third imaginary player, Jacob, had one shift that logged a shot for at 5% before he suffered a season-ending pinky toe strain? He would have an EG% of 100% (0.05/(0.05+0)). Clearly that’s not sustainable once the sample size increases (more shifts), so how do we distinguish small-sample error from actual performance?

We will use a variation of the binomial proportion confidence interval, the Wilson score interval, introduced by mathematician Edwin Bidwell Wilson in 1927. LAEGAP is the lower bound of the Wilson score interval, with n being the total EG events (EGF + EGA), p being the proportion of EGF in n, and a confidence level of 95%.

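$$\frac{1}{1 + \frac{z^2}{n}} \left[ p + \frac{z^2}{2n} \pm z \sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}} \right]$$

where z ≈ 1.96 for a confidence level of 95%.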

By taking the lower bound of the interval, low-event but positive-possession (>0.50 EG%) players receive a penalty and, likewise, high-event but negative-possession players receive an advantage, but this is preferable to the alternative.
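As a sketch of the calculation, here is the lower bound of the standard Wilson score interval in Python, applied to the three imaginary players from earlier. The function name `wilson_lower_bound` and the choice of z = 1.96 for 95% confidence are assumptions made for this example; the formula is the textbook one, with n = EGF + EGA and p = EGF / n.

```python
import math


def wilson_lower_bound(egf, ega, z=1.96):
    """Lower bound of the Wilson score interval for p = EGF / (EGF + EGA)."""
    n = egf + ega
    if n <= 0:
        return 0.0  # no events at all: nothing to estimate
    p = egf / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


john = wilson_lower_bound(10, 5)       # 100 shifts, EG% 66.7%
ethan = wilson_lower_bound(100, 50)    # 1000 shifts, EG% 66.7%
jacob = wilson_lower_bound(0.05, 0)    # one shift, EG% 100%
print(round(john, 3), round(ethan, 3), round(jacob, 3))

# EGF and EGA taken from the first row of the table below (Dan Boyle).
print(round(wilson_lower_bound(37.9177, 23.3686), 4))
```

Ethan's larger sample earns him a higher score than John (roughly 0.59 versus 0.42) despite the identical EG%, and Jacob's single shift drops to roughly 0.01. Plugging in the EGF and EGA listed for Dan Boyle gives about 0.4935, in line with his LAEGAP in the table.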

Data

Below are the 30 players with the best LAEGAP:

name position team games LAEGAP EGF EGA shots for shots against shot diff
Dan Boyle Defenseman San Jose Sharks 41 0.493526 37.9177 23.3686 437 312 125
Brendan Gallagher Right Wing Montréal Canadiens 41 0.492882 27.3263 15.1572 311 186 125
David Clarkson Right Wing New Jersey Devils 43 0.491641 29.4998 16.9234 340 209 131
Jonathan Toews Center Chicago Blackhawks 43 0.487849 35.4267 21.9762 388 272 116
Jake Muzzin Defenseman Los Angeles Kings 39 0.485027 28.0701 16.3438 356 200 156
Max Pacioretty Left Wing Montréal Canadiens 41 0.484933 27.5553 15.9449 336 203 133
Marian Hossa Right Wing Chicago Blackhawks 36 0.475947 27.4272 16.5591 294 207 87
Joe Thornton Center San Jose Sharks 43 0.472452 30.4976 19.4209 332 248 84
Patrick Marleau Left Wing San Jose Sharks 43 0.472194 30.7604 19.6678 360 248 112
Brandon Saad Left Wing Chicago Blackhawks 42 0.468371 30.411 19.7317 343 246 97
Lubomir Visnovsky Defenseman New York Islanders 29 0.465838 27.1036 17.1213 314 205 109
Patrik Elias Left Wing New Jersey Devils 43 0.465605 27.2111 17.2328 298 210 88
Justin Williams Right Wing Los Angeles Kings 42 0.463848 29.9752 19.7833 373 241 132
Tyler Seguin Center Boston Bruins 42 0.462289 31.7253 21.4821 396 275 121
Zach Parise Left Wing Minnesota Wild 43 0.461388 31.1538 21.0652 377 281 96
Logan Couture Center San Jose Sharks 43 0.459966 29.7073 19.9178 339 254 85
Mark Fayne Defenseman New Jersey Devils 27 0.459596 17.7727 9.748 193 136 57
Evgeni Malkin Center Pittsburgh Penguins 26 0.459346 21.0059 12.4287 239 159 80
Mikko Koivu Center Minnesota Wild 43 0.456033 30.4973 21.0142 375 281 94
Henrik Sedin Center Vancouver Canucks 40 0.452746 29.5615 20.4875 350 268 82
Ryan Getzlaf Center Anaheim Ducks 38 0.450954 28.5721 19.7529 306 234 72
Anton Stralman Defenseman New York Rangers 41 0.45036 30.8343 21.9066 375 274 101
Marc-Edouard Vlasic Defenseman San Jose Sharks 43 0.447783 33.4906 24.6981 395 327 68
Patrice Bergeron Center Boston Bruins 36 0.446412 25.8487 17.658 346 228 118
Andy Greene Defenseman New Jersey Devils 43 0.445498 28.1978 19.9277 344 275 69
Matt Irwin Defenseman San Jose Sharks 35 0.444544 27.63 19.4851 337 251 86
Ryan McDonagh Defenseman New York Rangers 41 0.444263 37.3756 28.9067 475 369 106
Derek Stepan Center New York Rangers 41 0.442697 31.42 23.2883 377 267 110
Alexandre Burrows Right Wing Vancouver Canucks 40 0.441986 25.1964 17.432 298 244 54
P.K. Subban Defenseman Montréal Canadiens 40 0.441122 28.8511 20.9794 360 259 101
games: Games played
EGF: Expected Goals For
EGA: Expected Goals Against
shots for: Number of shots on goal for the player's team while the player is on the ice
shots against: Number of shots on goal against the player's team while the player is on the ice
shot diff: Shot differential = shots for – shots against. Pretty much the same as Corsi, except missed and blocked shots don't count.
LAEGAP: Location Adjusted Expected Goals Percentage

For the record, Crosby is 37th.