Introducing HBox: Predicting Wins Based on Historical Data and Trends

Apr 15, 2015; Los Angeles, CA, USA; Magic Johnson attends ceremonies to commemorate Jackie Robinson Day before the game between the Los Angeles Dodgers and the Seattle Mariners at Dodger Stadium. Mandatory Credit: Jayne Kamin-Oncea-USA TODAY Sports

A few years into the modern “databall era”, plus/minus analysis has become the basis for a new generation of metrics and stats. It’s not just APM or RAPM; more recent metrics like basketball-reference’s BPM[1. Box score Plus/Minus, from the family of “Statistical Plus/Minus” metrics like our DRE, which assign weights to box score contributions as a way of estimating a player’s RAPM value without requiring the computing power or sample size of actually building and running such a model.] also use plus/minus data to construct the weights for their variables. That can work really well, as they’re an improvement over old metrics like PER, but it can be dangerous if all our stats rely on the same component. If there’s a systematic error in plus/minus data, it is being propagated to a host of public metrics. Thus, I thought I’d build something that uses an entirely different method, with the added benefit that it can use decades of NBA data, not just the seasons since play-by-play logs became available.

The origin of HBox[1. “Historical Boxscore” in my clever naming convention.] is that I feared applying modern data to the 70’s and 80’s, when the game was played differently on offense and defense due to rule changes and dramatically different styles and schemes. What was the value of spacing in an era before three-pointers were a major weapon? How important were drive-and-dish guards before the hand-checking rules were changed? Which old-school players will be penalized when judged by the standards of 21st century basketball?[2. Ed. I’ve long thought Original Isiah Thomas gets short shrift historically because of his pre-modern game in a semi-modern league. One would think a player with Zeke’s overall skill level would have found a way to become a passable or better three-point shooter if he came up in an era when deep shooting was an important trait for guards. But I digress.] HBox is a box-score metric on the per-100-possession scale (i.e. +5 means a player is, basically, adding five points of point differential per 100 possessions to an average team) based on team-level results.

Methodology

With some ingenuity, I figured out a way to use regular season data going back to 1978 in predicting team-level performance with individual stats. Essentially, all I’m doing is reducing the error in predicting a team’s rating, and consequently team wins, from player performance in adjacent seasons. While there’s a lot of noise there, with enough seasons and by giving more weight to teams with less roster continuity there’s enough data for useful results.
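As a rough illustration of the continuity weighting, here is a minimal sketch. The exact weighting scheme isn't spelled out above, so both `roster_continuity` and `observation_weight` (including the `floor` parameter) are hypothetical stand-ins: continuity is taken to be the share of a team's current minutes played by returning players, and teams with more turnover simply get a larger weight.

```python
def roster_continuity(prev_minutes, curr_minutes):
    """Share of this season's minutes played by players who were
    also on the team last season (0 = all new, 1 = all returning).

    Both arguments map player name -> minutes played for one team.
    """
    total = sum(curr_minutes.values())
    if total == 0:
        return 0.0
    returning = sum(mp for player, mp in curr_minutes.items()
                    if player in prev_minutes)
    return returning / total


def observation_weight(continuity, floor=0.1):
    """Hypothetical weight for a team-season in the regression.

    Less continuity -> more weight, because a reshuffled roster tells
    us more about how individual stats translate to team results.
    The floor keeps stable teams from being ignored entirely.
    """
    return floor + (1.0 - floor) * (1.0 - continuity)


# Toy example: only player "A" returns, and he plays 1000 of 4000 minutes.
prev = {"A": 2000, "B": 1500}
curr = {"A": 1000, "C": 3000}
continuity = roster_continuity(prev, curr)
weight = observation_weight(continuity)
```

Any monotone decreasing function of continuity would serve the same purpose; the linear form here is just the simplest choice.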

The point of this exercise is to derive a historical metric usable for any season in which turnovers were recorded. You can actually project this back to 1974 with a turnover estimate[3. Team turnovers were recorded starting during the 1974 season in the NBA, but individual turnovers were not until 1978. The ABA recorded individual turnovers in every season, going back to 1968, but, as in the NBA, steals and blocks weren’t recorded until 1974.], but any further back and you’re in the statistical dark without steals, blocks, or even rebounds split into offensive and defensive boards.

The final model was fit with the glmnet function in R, which combines a ridge penalty (like you see in RAPM) with a lasso penalty, mixed through an alpha parameter, that can drop weak variables entirely. As the last step, ratings are adjusted so they sum to each team’s actual rating, like what was done here, which greatly improved predictive power. This is, however, an exploratory version of the metric as I work out the final issues and continue tweaking.
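The team adjustment step can be sketched as follows. This is only an illustration under stated assumptions: ratings are per 100 possessions, a team's rating is the minutes-weighted sum of its players' ratings times five (five players on the floor), and the gap to the actual team rating is spread evenly across the roster. The function names and the even-split rule are assumptions, not necessarily what the final model does.

```python
def team_rating_from_players(players):
    """Implied team rating from a list of (rating_per_100, minutes).

    Each player's rating is weighted by his share of team minutes,
    then multiplied by 5 because five players share the floor.
    """
    team_minutes = sum(mp for _, mp in players)
    return 5.0 * sum(r * mp for r, mp in players) / team_minutes


def adjust_to_team(players, actual_team_rating):
    """Shift every rating by the same constant so the implied team
    rating matches the actual one (assumed even split of the gap)."""
    gap = actual_team_rating - team_rating_from_players(players)
    shift = gap / 5.0  # a uniform shift of s moves the team rating by 5*s
    return [(r + shift, mp) for r, mp in players]


# Toy roster: raw ratings imply a +4.0 team, but the team was actually +6.0.
roster = [(4.0, 3000), (0.0, 3000), (-1.0, 3000), (1.0, 3000), (0.0, 3000)]
adjusted = adjust_to_team(roster, actual_team_rating=6.0)
```

After the adjustment every player on the toy roster gains 0.4 points per 100 possessions, and the implied team rating equals the actual one.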

Variables

Since I’m in the beta version of HBox, essentially, I don’t want to provide full results, but there were a few discoveries that I found interesting. The first finding is that the majority of the stats relevant to the modern era show up as impactful in the past as well. High-assist, high-usage players are the most valuable in today’s drive-and-dish game, just as they were in the 80’s. Spacing[4. This is measured by three-point attempts per 100 possessions, but position was also a factor. Centers get the biggest boost for position and point guards the least.] was valuable and measurable on offense, even years before the seven-seconds-or-less Suns put pace-and-space on the map.

One variable that doesn’t show up, however, is offensive rebounding. Even with a low threshold in variable selection, offensive rebounding has very little impact. I used an interaction term between assists and rebounding, meaning offensive rebounds do matter to some extent, but I repeated the analysis without that term and offensive rebounding still didn’t show up. In fact, for the model to include it I either had to force the model not to drop it or exclude seasons 2001 to 2015, but the coefficient was so small it was nearly negligible. For example, during Rodman’s best season rebounding on that end of the court, the boost he’d receive to his rating would only be 0.2 points per 100 possessions. That’s a small effect, and Rodman’s at the extreme end of rebounding. Meanwhile, even without the assist-rebounding interaction term, during Rodman’s best season with defensive rebounding, he’d receive a boost of 2.3 points per 100 possessions compared to the average player.

Basically, according to those results, offensive rebounders don’t lead to wins and are entirely replaceable, which runs contrary to previous metrics like Win Shares or even basketball-reference’s new BPM. Most systems assume defensive rebounds are a lot less valuable. The traditional view is that an offensive rebound is an added possession, which is very valuable, and the credit is entirely given to the rebounder. But this is a radically different view of rebounding value. Actually, this effect is well-known at the team-level, as the best teams often rebound less on offense. This is probably why offensive boards don’t show up in my model, as the model relies on team results.

This might shed some light on why previous models may overrate offensive rebounding. Obviously, an offensive rebound can’t be worth an entire possession on the individual level, because offensive rebounds are at least partially replaceable. A team which loses its best offensive rebounder doesn’t “lose” every one of his rebounds because the player taking his place will at least grab a few of those. Plus/minus data seemingly should deal with this issue.

The following graph shows the value of offensive and defensive rebounding in concert with assists. Over a realistic range of offensive rebound percentage with the defensive rebound rate set at league average, there is only a modest change in value. But defensive rebounding has a huge influence. And interestingly, the model only finds substantial value in offensive rebounding with high assist players.

Rebounding value

However, there is a distinction between what makes a lineup successful[5. The basis of plus/minus models.] and what makes a team successful. Perhaps teams do better with more offensive rebounding, but it’s more replaceable when year-to-year changes are considered, because so much of offensive rebounding depends on what position one plays and how close to the basket one tends to be. Another theory is that truly impactful offensive rebounding might be correlated with other variables, like defensive rebounding or other stats, which then absorb its credit.

Diving into the philosophy and tactics of basketball, the problem with offensive rebounders is that they occupy valuable physical space and can clog the lane, and offensive rebounding is associated with having fewer skills. If you’re a high-level offensive player, you’ll either have the ball or be showing your skills on the perimeter or in an active play, not fighting off-ball for the miss. There are exceptions, of course, including certain MVPs, and this might just be an issue with the method I’ve chosen, but it’s something to consider.

Another surprise is that shot usage has a negative sign — being a high usage player hurts your rating in that respect. Of course, I have interaction terms[6. An interaction term is when multiple variables are multiplied against each other so that each one has an effect on the other variable. For instance, shot volume*assists is a popular interaction term where the players helped most are high volume shot creators who also assist teammates, as opposed to the black holes who shoot often while rarely passing.], so it’s not straightforward. Overall, a player with healthy rates of assists will see an increase in value when the shot rate rises. In other words, a player like Michael Jordan will receive a benefit for shooting more often because he always had an above average rate of assists, but that benefit would be larger with more assists. I tried different terms and combinations, but the best results led to the negative coefficient.
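To see how a negative usage coefficient can still reward high-volume scorers, here is a minimal sketch of a usage*assists interaction. The coefficient values (`b_usage`, `b_interaction`) are invented purely for illustration and are not the fitted HBox weights; rates are assumed to be per 100 possessions.

```python
def usage_contribution(shot_rate, assist_rate,
                       b_usage=-0.10, b_interaction=0.02):
    """Rating contribution from shot usage, including the interaction.

    b_usage < 0 penalizes shooting on its own, while b_interaction > 0
    rewards shooting more the more a player also passes.
    """
    return b_usage * shot_rate + b_interaction * shot_rate * assist_rate


def marginal_value_of_shot(assist_rate, b_usage=-0.10, b_interaction=0.02):
    """Change in rating from one extra shot per 100 possessions.

    This is the derivative of usage_contribution with respect to
    shot_rate: b_usage + b_interaction * assist_rate.
    """
    return b_usage + b_interaction * assist_rate


# With these made-up weights the break-even point is 5 assists per 100:
# below it extra shots hurt, above it they help, and they help more
# as the assist rate climbs.
low_passer = marginal_value_of_shot(assist_rate=2.0)    # negative
high_passer = marginal_value_of_shot(assist_rate=8.0)   # positive
```

The sign of the standalone usage coefficient therefore says little by itself; what matters is the combined slope at a given player's assist rate.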

Finally, the only other surprise was that steals were not as valuable as you see in other modern models. Again, my metric is new and it’s not necessarily the Word of God, and I’m using a conservative method that shrinks coefficients, but it is an interesting result. Plus/minus models love steals, but why is that? They end a possession and often lead to a more efficient possession on offense thanks to fast breaks, but is there something about how lineup data is structured that causes the players credited with steals to be overrated?[7. Ed. Perhaps in the same way certain models overcredit individuals for defensive rebounds, the error is wholly crediting what is in many instances a team defensive event to only one player in its entirety?] I’m not entirely sure, but there might be some truth here.

Results

For a look at what the metric produces, I included a table showing the top three players in every season since 1978 by a rough guide of value[8. This is shown in the VAL column, which is simply rating*MP/3000.], with every MVP noted. It reflects popular consensus fairly well for an advanced metric: 18 of the 38 possible MVP winners had the highest “VAL” for their season, with another six behind by a slim margin. It’s not supposed to replicate what MVP voters think, of course, but it passes the laugh test: generally, the best players have higher ratings. As I was producing the results and testing new things, I always checked 1978 and how Kareem was rated versus Walton. That remains one of the most interesting MVP races. It was a high-scoring legend versus a team-first defensive newcomer, and I think it’s a good sign that HBox values Walton so highly.

Player                Season  MP    Rating  VAL   Rank  MVP
Kareem Abdul-Jabbar   1978    2265  6.85    5.17  1
Bill Walton           1978    1929  7.67    4.93  2     Yes
Artis Gilmore         1978    3067  4.08    4.18  3
Kareem Abdul-Jabbar   1979    3157  6.97    7.33  1
Sam Lacey             1979    2627  3.85    3.37  2
Artis Gilmore         1979    3265  3.01    3.27  3
Moses Malone          1979    3390  2.76    3.11  4     Yes
Kareem Abdul-Jabbar   1980    3143  6.06    6.35  1     Yes
Larry Bird            1980    2955  4.00    3.94  2
Magic Johnson         1980    2795  4.15    3.87  3
Kareem Abdul-Jabbar   1981    2976  4.58    4.54  1
Julius Erving         1981    2874  4.62    4.43  2     Yes
Larry Bird            1981    3239  3.61    3.90  3
Magic Johnson         1982    2991  5.50    5.49  1
Larry Bird            1982    2923  4.92    4.80  2
Jack Sikma            1982    3049  3.98    4.05  3
Moses Malone          1982    3398  1.94    2.20  12    Yes
Magic Johnson         1983    2907  5.94    5.75  1
Larry Bird            1983    2982  5.22    5.19  2
Julius Erving         1983    2421  3.86    3.12  3
Moses Malone          1983    2922  3.09    3.00  5     Yes
Larry Bird            1984    3028  5.41    5.46  1     Yes
Magic Johnson         1984    2567  6.30    5.39  2
Jeff Ruland           1984    3082  3.40    3.49  3
Larry Bird            1985    3161  5.99    6.31  1     Yes
Magic Johnson         1985    2781  6.20    5.75  2
Michael Jordan        1985    3144  4.10    4.29  3
Larry Bird            1986    3113  5.98    6.20  1     Yes
Magic Johnson         1986    2578  6.25    5.37  2
Charles Barkley       1986    2952  5.14    5.06  3
Magic Johnson         1987    2904  7.86    7.61  1     Yes
Larry Bird            1987    3005  6.37    6.39  2
Charles Barkley       1987    2740  5.71    5.22  3
Michael Jordan        1988    3311  6.02    6.64  1
Larry Bird            1988    2965  6.05    5.98  2
Magic Johnson         1988    2637  5.95    5.23  3
Magic Johnson         1989    2886  8.66    8.33  1     Yes
Michael Jordan        1989    3255  7.30    7.92  2
Charles Barkley       1989    3088  5.42    5.58  3
Magic Johnson         1990    2937  8.66    8.48  1     Yes
Michael Jordan        1990    3197  5.77    6.14  2
Charles Barkley       1990    3085  5.40    5.55  3
Magic Johnson         1991    2933  8.90    8.70  1
Michael Jordan        1991    3034  6.40    6.47  2     Yes
David Robinson        1991    3095  5.82    6.00  3
Michael Jordan        1992    3102  5.61    5.80  1     Yes
David Robinson        1992    2564  6.53    5.58  2
John Stockton         1992    3002  5.12    5.12  3
Hakeem Olajuwon       1993    3242  6.12    6.62  1
Charles Barkley       1993    2859  6.30    6.00  2     Yes
David Robinson        1993    3211  5.41    5.79  3
David Robinson        1994    3241  7.92    8.56  1
Hakeem Olajuwon       1994    3277  4.94    5.40  2     Yes
Shaquille O’Neal      1994    3224  4.50    4.84  3
David Robinson        1995    3074  5.66    5.80  1     Yes
Karl Malone           1995    3126  4.77    4.97  2
John Stockton         1995    2867  5.15    4.92  3
David Robinson        1996    3019  6.03    6.07  1
Michael Jordan        1996    3090  5.35    5.51  2     Yes
Karl Malone           1996    3113  4.86    5.04  3
Grant Hill            1997    3147  6.77    7.11  1
Karl Malone           1997    2998  5.81    5.81  2     Yes
Michael Jordan        1997    3106  4.31    4.46  3
Karl Malone           1998    3030  5.57    5.62  1
David Robinson        1998    2457  5.61    4.59  2
Grant Hill            1998    3294  4.05    4.45  3
Michael Jordan        1998    3181  2.54    2.69  11    Yes
Karl Malone           1999    1832  4.99    3.05  1     Yes
Grant Hill            1999    1852  4.85    3.00  2
Jason Kidd            1999    2060  4.30    2.96  3
Shaquille O’Neal      2000    3163  6.28    6.62  1     Yes
Karl Malone           2000    2947  5.05    4.96  2
Kevin Garnett         2000    3243  4.57    4.94  3
Shaquille O’Neal      2001    2924  6.04    5.89  1
Kevin Garnett         2001    3202  4.56    4.87  2
Karl Malone           2001    2895  4.92    4.75  3
Allen Iverson         2001    2979  2.16    2.14  24    Yes
Tim Duncan            2002    3329  5.68    6.30  1     Yes
Kevin Garnett         2002    3175  5.27    5.58  2
Dirk Nowitzki         2002    2891  4.32    4.17  3
Kevin Garnett         2003    3321  6.30    6.97  1
Tim Duncan            2003    3181  6.15    6.52  2     Yes
Dirk Nowitzki         2003    3117  5.04    5.24  3
Kevin Garnett         2004    3231  7.36    7.93  1     Yes
Tim Duncan            2004    2527  5.97    5.03  2
Andrei Kirilenko      2004    2895  4.62    4.45  3
Kevin Garnett         2005    3121  6.94    7.22  1
LeBron James          2005    3388  4.77    5.39  2
Dirk Nowitzki         2005    3020  4.81    4.84  3
Steve Nash            2005    2573  2.92    2.50  16    Yes
LeBron James          2006    3361  5.39    6.04  1
Chauncey Billups      2006    2925  5.25    5.11  2
Kevin Garnett         2006    2957  5.03    4.95  3
Steve Nash            2006    2796  3.76    3.50  12    Yes
Tim Duncan            2007    2726  6.20    5.63  1
Dirk Nowitzki         2007    2820  4.91    4.61  2     Yes
LeBron James          2007    3190  4.32    4.59  3
LeBron James          2008    3027  6.56    6.62  1
Chris Paul            2008    3006  5.44    5.45  2
Chauncey Billups      2008    2522  5.46    4.59  3
Kobe Bryant           2008    3192  4.13    4.40  4     Yes
LeBron James          2009    3054  9.47    9.64  1     Yes
Chris Paul            2009    3002  6.91    6.91  2
Dwyane Wade           2009    3048  6.49    6.59  3
LeBron James          2010    2966  9.18    9.08  1     Yes
Dwyane Wade           2010    2792  5.33    4.96  2
Dwight Howard         2010    2843  4.88    4.62  3
LeBron James          2011    3063  6.31    6.45  1
Chris Paul            2011    2880  4.43    4.25  2
Dwight Howard         2011    2935  4.10    4.01  3
Derrick Rose          2011    3026  3.92    3.95  4     Yes
LeBron James          2012    2326  7.03    5.45  1     Yes
Kevin Durant          2012    2546  5.10    4.33  2
Chris Paul            2012    2181  4.55    3.31  3
LeBron James          2013    2877  8.26    7.92  1     Yes
Kevin Durant          2013    3119  7.06    7.34  2
James Harden          2013    2985  4.31    4.29  3
Kevin Durant          2014    3122  6.78    7.06  1     Yes
LeBron James          2014    2902  6.22    6.01  2
Kevin Love            2014    2797  6.26    5.84  3
James Harden          2015    2981  6.52    6.48  1
Stephen Curry         2015    2613  6.29    5.48  2     Yes
Russell Westbrook     2015    2302  6.67    5.12  3
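
The VAL column follows directly from the formula in the footnote, rating*MP/3000. A small sketch using two rows from the 1978 season shows how minutes can flip the order; the function name `val` and the `scale` parameter are just illustrative labels.

```python
def val(rating, minutes, scale=3000):
    """Rough season value: per-100 rating scaled by minutes played.

    Dividing by 3000 means a player logging roughly a full season's
    minutes has a VAL close to his rating.
    """
    return rating * minutes / scale


# 1978: Walton has the higher rating, but Kareem played more minutes.
kareem_1978 = round(val(6.85, 2265), 2)  # 5.17
walton_1978 = round(val(7.67, 1929), 2)  # 4.93
```

This is why the VAL ranking can differ from the raw rating ranking: Walton's 7.67 rating was the best of 1978, but his 1,929 minutes drop him to second in total value.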

The metric doesn’t agree with the choices of Iverson and Nash for MVP, but they were fairly controversial anyway. Moses Malone captured three MVPs in reality, but the numbers don’t back him — he has a tiny rate of assists for a high scorer, and assists are an influential part of the model, plus he doesn’t have the defensive box-score power that many other big men MVPs have. If you’re wondering why a beloved player doesn’t show up, it might have to do with the limitations of the box score.

If you looked at the table closely, you probably noticed something surprising: Magic Johnson and Michael Jordan have apparently switched places in the pantheon. Jordan doesn’t look like some insurmountable legend statistically, but “just another” superstar. Yet it’s Magic Johnson with some of the highest rated seasons according to the metric. The numbers value versatile box-score stuffers, and Magic and his 138 triple doubles certainly qualify. There’s also the previously discussed penalty to shot usage, which stifles Jordan’s stats.

No metric is perfect, of course, and this one certainly rewards a specific type of player, one who racks up a lot of assists, rebounds, and other counting stats, while perhaps not accounting for poor defense[9. Always a weakness of box score-based analysis.]. This includes Magic as well as Charles Barkley, Kevin Love, and a few others. The worst offender is probably Troy Murphy, who had an all-star rating one season due to his outside shooting and defensive rebounding, but his porous defense isn’t easy to identify with box score stats.

Yet we should not dismiss any odd result and rely only on conventional wisdom; otherwise we wouldn’t learn anything. In 1994, the highest rated players were David Robinson, Shaquille O’Neal, Scottie Pippen, Karl Malone, John Stockton, Shawn Kemp, Charles Barkley, Hakeem Olajuwon, and … Oliver Miller, who actually had the third highest rating that season. Oliver Miller is a semi-obscure player now and only known for his weight issues and being arrested, but back in the mid-90’s he had the steal and block numbers of Olajuwon with sweet passing and rebounding with a highly efficient season in 1994.[10. Think Boris Diaw, but bigger including a huge wingspan, stronger and possibly even more skilled.] Miller’s weight problems didn’t surface until joining a terrible Detroit squad the next season, reportedly showing up to camp out of shape. He then bounced around the NBA and leagues abroad, his weight ballooning and his interest declining. The list of players with at least 4 blocks and 6 assists per 100 possessions is just him, Kirilenko, Kareem, and David Robinson. It suggested high potential and a fascinating career, but we really only got one peak season.

Finally, the table below shows the top ratings for the 2015 season. There are really no surprises here, and the less renowned players, like the two Greens, are backed up by great numbers from other metrics. For what it’s worth, the correlation coefficient between the ratings below and the ratings from ESPN’s Real Plus/Minus is 0.75. Basically, the results are very similar, with the exceptions of Marc Gasol, Pau Gasol, Love, and Gobert, and those guys are no slouches, depending on your personal opinion of Pau’s impact in Chicago.

Player                Season  MP    Rating  VAL
James Harden          2015    2981  6.52    6.48
Stephen Curry         2015    2613  6.29    5.48
Russell Westbrook     2015    2302  6.67    5.12
LeBron James          2015    2493  6.10    5.07
Chris Paul            2015    2857  5.06    4.82
Marc Gasol            2015    2687  3.92    3.51
Anthony Davis         2015    2455  4.27    3.49
Tim Duncan            2015    2227  4.70    3.49
Draymond Green        2015    2490  3.96    3.29
John Wall             2015    2837  3.33    3.15
Damian Lillard        2015    2925  3.02    2.94
DeAndre Jordan        2015    2820  2.82    2.65
DeMarcus Cousins      2015    2013  3.94    2.64
Pau Gasol             2015    2681  2.71    2.42
Kevin Love            2015    2532  2.83    2.39
Rudy Gobert           2015    2158  3.30    2.37
Blake Griffin         2015    2356  2.95    2.32
Paul Millsap          2015    2390  2.84    2.26
Kawhi Leonard         2015    2033  3.29    2.23
Gordon Hayward        2015    2618  2.41    2.10
Danny Green           2015    2312  2.65    2.04

Conclusion

Due to the nature of HBox, which is alone in its own family of metrics right now[11. I do want to stress the metric’s uniqueness. Statistical plus/minus metrics are quite popular now, and this one was derived completely independently.], certain actions might be over- or undervalued. It might not rate offensive rebounding highly because offensive rebounding is negatively correlated with team wins. But every method has its strengths and weaknesses, and it’s good to use a wide range of tools to minimize glaring systematic errors. Nonetheless, the results are remarkably close to the ones given by the popular plus/minus models. And most importantly, pre-2001 data was used to build the model. The numbers don’t explain everything, and the box score is a small slice of what happens on the basketball court, but if we’re going to use statistical metrics we may as well use the best options available and attack the problem from all angles.