Introducing HBox: Predicting Wins Based on Historical Data and Trends
By Justin
A few years into the modern “databall era”, plus/minus analysis has been the basis for a new generation of metrics and stats. It’s not just metrics like APM or RAPM; more recent ones like basketball-reference’s BPM[1. Box score Plus/Minus, from the family of “Statistical Plus/Minus” metrics like our DRE, which assign weights to box score contributions as a way of estimating a player’s RAPM value without requiring the computing power or sample size of actually building and running such a model.] also use plus/minus data to construct the weights for their variables. That can work really well, as they’re an improvement over old metrics like PER, but it can be dangerous if all our stats rely on the same component. If there’s a systematic error in plus/minus data, it is being propagated to a host of public metrics. Thus, I thought I’d build something that uses an entirely different method, with the added benefit that it can use decades of NBA data, not just the seasons since play-by-play logs became available.
The origin of HBox[1. “Historical Boxscore” in my clever naming convention.] is that I feared applying modern data to the 70’s and 80’s, when the game was played differently on offense and defense due to rule changes and dramatically different styles and schemes. What was the value of spacing in an era before three-pointers were a major weapon? How important were drive-and-dish guards before the hand-checking rules were changed? Which old-school players will be penalized when judged by the standards of 21st century basketball?[2. Ed. I’ve long thought Original Isiah Thomas gets short shrift historically because of his pre-modern game in a semi-modern league. One would think a player with Zeke’s overall skill level would have found a way to become a passable or better three point shooter if he came up in an era when deep shooting was an important trait for guards. But I digress.] HBox is a box-score metric on the 100 possession scale (i.e. +5 means a player is, basically, adding five points per 100 possessions in point differential to the average team) based on team-level results.
Methodology
With some ingenuity, I figured out a way to use regular season data going back to 1978 in predicting team-level performance with individual stats. Essentially, all I’m doing is reducing the error in predicting a team’s rating, and consequently team wins, from player performance in adjacent seasons. While there’s a lot of noise there, with enough seasons and by giving more weight to teams with less roster continuity there’s enough data for useful results.
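To make the fitting target concrete, here is a minimal sketch of the kind of error being reduced. It is not the actual HBox code: the function names, the court-time weighting, and the `1 - continuity` weight are all assumptions made for illustration, and the numbers are invented.

```python
def predicted_team_rating(players):
    """Minutes-weighted sum of per-100-possession player ratings.

    players: list of (rating, minutes) tuples for one team-season, where
    each rating would come from the player's stats in an adjacent season.
    Five players share the floor, so court-time shares sum to 5.
    """
    team_minutes = sum(minutes for _, minutes in players)
    return sum(rating * minutes / (team_minutes / 5) for rating, minutes in players)


def weighted_squared_error(teams):
    """Sum of weight * (actual - predicted)^2 over team-seasons."""
    total = 0.0
    for team in teams:
        # Hypothetical weighting: the more the roster turned over, the more
        # the team-season counts.
        weight = 1.0 - team["continuity"]
        error = team["actual_rating"] - predicted_team_rating(team["players"])
        total += weight * error ** 2
    return total


# Invented team-seasons, five players at 2000 minutes each.
teams = [
    {"players": [(3.0, 2000), (1.0, 2000), (0.0, 2000), (-1.0, 2000), (-1.0, 2000)],
     "actual_rating": 3.0, "continuity": 0.5},
    {"players": [(0.0, 2000)] * 5,
     "actual_rating": -2.0, "continuity": 0.75},
]

first_prediction = predicted_team_rating(teams[0]["players"])
total_error = weighted_squared_error(teams)
print(first_prediction, total_error)
```

Fitting the metric then amounts to choosing box-score weights (which determine each player's rating) so that this total is as small as possible across all team-seasons.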
The point of this exercise is to derive an historical metric usable for any season in which turnovers were recorded. You can actually project this back to 1974 with a turnover estimate[3. Team turnovers were recorded starting during the 1974 season in the NBA, but individual turnovers were not until 1978. The ABA recorded individual turnovers in every season, going back to 1968, but, as in the NBA, steals and blocks weren’t recorded until 1974.], but any further back and you’re in the statistical dark without steals, blocks, or even rebounds split into offensive and defensive boards.
The final model was completed using the glmnet function in R, which fits an elastic net: a ridge penalty (like you see in RAPM) blended, via the alpha parameter, with a lasso penalty that drops insignificant variables. As the last step, ratings are adjusted so they sum to every team’s particular rating like what was done here, which greatly improved the prediction power. This is, however, an exploratory version of the metric as I work out the final issues and continue tweaking.
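For readers curious how the penalty can both shrink coefficients and drop variables, here is a small Python sketch of the coordinate-wise elastic-net update for a single standardized predictor, as I understand it from the glmnet literature. It is an illustration only, not the R code used for HBox, and the values of `lam`, `alpha`, and the raw coefficients are made up.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; return exactly 0.0 if |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_update(z, lam, alpha):
    """One coordinate update for a standardized predictor.

    z     : unpenalized coefficient of the predictor on the partial residual
    lam   : overall penalty strength
    alpha : mixing weight (0.0 = pure ridge, 1.0 = pure lasso)
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Made-up raw coefficients: one weak variable, one strong variable.
weak, strong = 0.05, 2.0
lam = 0.2

ridge_weak = elastic_net_update(weak, lam, alpha=0.0)      # shrunk, still nonzero
mixed_weak = elastic_net_update(weak, lam, alpha=0.5)      # dropped to exactly 0.0
mixed_strong = elastic_net_update(strong, lam, alpha=0.5)  # shrunk, but kept

print(ridge_weak, mixed_weak, mixed_strong)
```

With alpha at zero, every variable survives with a smaller coefficient; once alpha is above zero, weak variables can be set to exactly zero, which is the sense in which a stat like offensive rebounding can fail to "show up" in a model of this type.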
Variables
Since I’m in the beta version of HBox, essentially, I don’t want to provide full results, but there were a few discoveries that I found interesting. The first finding is that the majority of the stats relevant to the modern era show up as impactful in the past as well. High assist, high usage players are the most valuable in today’s drive-and-dish game, just as they were in the 80’s. Spacing[4. This is measured by three-point attempts per 100 possessions, but position was also a factor. Centers get the biggest boost for position and point guards the least.] was valuable and measurable on offense, even years before the seven-seconds-or-less Suns put pace-and-space on the map.
One variable that doesn’t show up, however, is offensive rebounding. Even with a low threshold in variable selection, offensive rebounding has very little impact. I used an interaction term between assists and rebounding, meaning offensive rebounds do matter to some extent, but I repeated the analysis without that term and offensive rebounding still didn’t show up. In fact, for the model to include it I either had to force the model not to drop it or exclude seasons 2001 to 2015, but the coefficient was so small it was nearly negligible. For example, during Rodman’s best season rebounding on that end of the court, the boost he’d receive to his rating would only be 0.2 points per 100 possessions. That’s a small effect, and Rodman’s at the extreme end of rebounding. Meanwhile, even without the assist-rebounding interaction term, during Rodman’s best season with defensive rebounding, he’d receive a boost of 2.3 points per 100 possessions compared to the average player.
Basically, according to those results, offensive rebounders don’t lead to wins and are entirely replaceable, which runs contrary to previous metrics like Win Shares or even basketball-reference’s new BPM. Most systems assume defensive rebounds are a lot less valuable. The traditional view is that an offensive rebound is an added possession, which is very valuable, and the credit is entirely given to the rebounder. But this is a radically different view of rebounding value. Actually, this effect is well-known at the team-level, as the best teams often rebound less on offense. This is probably why offensive boards don’t show up in my model, as the model relies on team results.
This might shed some light on why previous models possibly overrate offensive rebounding. Obviously, an offensive rebound can’t be worth an entire possession on the individual level, because offensive rebounds are at least partially replaceable. A team which loses its best offensive rebounder doesn’t “lose” every one of his rebounds, because the player taking his place will at least grab a few of those. Plus/minus data seemingly should deal with this issue.
The following graph shows the value of offensive and defensive rebounding in concert with assists. Over a realistic range of offensive rebound percentage with the defensive rebound rate set at league average, there is only a modest change in value. But defensive rebounding has a huge influence. And interestingly, the model only finds substantial value in offensive rebounding with high assist players.
However, there is a distinction between what makes a lineup successful,[5. The basis of plus/minus models.] and what makes a team successful. Perhaps teams do better with more offensive rebounding, but it’s more replaceable when year-to-year changes are considered, because so much of offensive rebounding is about what position one plays and how close to the basket one tends to be. Another theory is that truly impactful offensive rebounding might be correlated with other variables, like defensive rebounding or other stats.
Diving into the philosophy and tactics of basketball, the problem with offensive rebounders is that they occupy a valuable physical space that can clog the lane, and offensive rebounding is associated with having fewer skills. If you’re a high-level basketball player on offense, you’ll either have the ball or be showing your skills on the perimeter or in an active play, not fighting for the miss off-ball. There are exceptions, of course, including certain MVPs, and this might just be an issue with the method I’ve chosen, but it’s something to consider.
Another surprise is that shot usage has a negative sign — being a high usage player hurts your rating in that respect. Of course, I have interaction terms[6. An interaction term is when multiple variables are multiplied against each other so that each one has an effect on the other variable. For instance, shot volume*assists is a popular interaction term where the players helped most are high volume shot creators who also assist teammates, as opposed to the black holes who shoot often while rarely passing.], so it’s not straightforward. Overall, a player with healthy rates of assists will see an increase in value when the shot rate rises. In other words, a player like Michael Jordan will receive a benefit for shooting more often because he always had an above average rate of assists, but that benefit would be larger with more assists. I tried different terms and combinations, but the best results led to the negative coefficient.
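A toy example may help show how a negative shot-usage coefficient can coexist with a benefit for high-usage playmakers. The coefficients below are hypothetical, chosen only to reproduce the sign pattern described above; they are not the HBox weights, and the inputs are loosely "per 100 possessions" rates.

```python
# Hypothetical coefficients, for illustration only (not the HBox values).
B_SHOTS = -0.10            # main effect of shot usage: negative
B_ASSISTS = 0.20           # main effect of assists
B_SHOTS_X_ASSISTS = 0.02   # interaction: shot volume * assists


def rating_contribution(shots, assists):
    """Combined contribution of shots and assists, including the interaction."""
    return (B_SHOTS * shots
            + B_ASSISTS * assists
            + B_SHOTS_X_ASSISTS * shots * assists)


def marginal_value_of_shot(assists):
    """Change in rating from one extra shot, at a given assist rate."""
    return B_SHOTS + B_SHOTS_X_ASSISTS * assists


black_hole = marginal_value_of_shot(2)    # rarely passes: extra shots hurt
playmaker = marginal_value_of_shot(8)     # healthy assist rate: extra shots help
elite_passer = marginal_value_of_shot(12) # more assists: extra shots help more

print(black_hole, playmaker, elite_passer)
```

Under these made-up numbers the break-even point is 5 assists: below it an additional shot lowers the rating, above it an additional shot raises the rating, and the gain grows with the assist rate.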
Finally, the only other surprise was that steals were not as valuable as you see in other modern models. Again, my metric is new and it’s not necessarily the Word of God, and I’m using a conservative method that shrinks coefficients, but it is an interesting result. Plus/minus models love steals, but why is that? They end a possession and lead to a more efficient possession on offense, often due to fast breaks, but is there something about how lineup data is structured that causes the players credited with steals to be overrated?[7. Ed. Perhaps in the same way certain models overcredit individuals for defensive rebounds, the error is wholly crediting what is in many instances a team defensive event to only one player in its entirety?] I’m not entirely sure, but there might be some truth here.
Results
For a look at what the metric produces, I included a table showing the top three players every season since 1978 using a rough guide of value[8. This is shown in the VAL column, which is simply rating*MP/3000.], with every MVP noted. It reflects popular consensus fairly well for an advanced metric: 18 out of the possible 38 MVP winners had the highest “VAL” for the respective season, with another six behind by a slim margin. It’s not supposed to replicate what MVP voters think, of course, but it passes the laugh test: generally, the best players have higher ratings. As I was producing the results and testing new things, I always checked 1978 and how Kareem was rated versus Walton. That remains one of the most interesting MVP races. It was a high-scoring legend versus a team-first defensive newcomer, and I think it’s a good sign that HBox values Walton so highly.
Player | Season | MP | Rating | VAL | Rank | MVP |
Kareem Abdul-Jabbar | 1978 | 2265 | 6.85 | 5.17 | 1 | |
Bill Walton | 1978 | 1929 | 7.67 | 4.93 | 2 | 1 |
Artis Gilmore | 1978 | 3067 | 4.08 | 4.18 | 3 | |
Kareem Abdul-Jabbar | 1979 | 3157 | 6.97 | 7.33 | 1 | |
Sam Lacey | 1979 | 2627 | 3.85 | 3.37 | 2 | |
Artis Gilmore | 1979 | 3265 | 3.01 | 3.27 | 3 | |
Moses Malone | 1979 | 3390 | 2.76 | 3.11 | 4 | 1 |
Kareem Abdul-Jabbar | 1980 | 3143 | 6.06 | 6.35 | 1 | 1 |
Larry Bird | 1980 | 2955 | 4.00 | 3.94 | 2 | |
Magic Johnson | 1980 | 2795 | 4.15 | 3.87 | 3 | |
Kareem Abdul-Jabbar | 1981 | 2976 | 4.58 | 4.54 | 1 | |
Julius Erving | 1981 | 2874 | 4.62 | 4.43 | 2 | 1 |
Larry Bird | 1981 | 3239 | 3.61 | 3.90 | 3 | |
Magic Johnson | 1982 | 2991 | 5.50 | 5.49 | 1 | |
Larry Bird | 1982 | 2923 | 4.92 | 4.80 | 2 | |
Jack Sikma | 1982 | 3049 | 3.98 | 4.05 | 3 | |
Moses Malone | 1982 | 3398 | 1.94 | 2.20 | 12 | 1 |
Magic Johnson | 1983 | 2907 | 5.94 | 5.75 | 1 | |
Larry Bird | 1983 | 2982 | 5.22 | 5.19 | 2 | |
Julius Erving | 1983 | 2421 | 3.86 | 3.12 | 3 | |
Moses Malone | 1983 | 2922 | 3.09 | 3.00 | 5 | 1 |
Larry Bird | 1984 | 3028 | 5.41 | 5.46 | 1 | 1 |
Magic Johnson | 1984 | 2567 | 6.30 | 5.39 | 2 | |
Jeff Ruland | 1984 | 3082 | 3.40 | 3.49 | 3 | |
Larry Bird | 1985 | 3161 | 5.99 | 6.31 | 1 | 1 |
Magic Johnson | 1985 | 2781 | 6.20 | 5.75 | 2 | |
Michael Jordan | 1985 | 3144 | 4.10 | 4.29 | 3 | |
Larry Bird | 1986 | 3113 | 5.98 | 6.20 | 1 | 1 |
Magic Johnson | 1986 | 2578 | 6.25 | 5.37 | 2 | |
Charles Barkley | 1986 | 2952 | 5.14 | 5.06 | 3 | |
Magic Johnson | 1987 | 2904 | 7.86 | 7.61 | 1 | 1 |
Larry Bird | 1987 | 3005 | 6.37 | 6.39 | 2 | |
Charles Barkley | 1987 | 2740 | 5.71 | 5.22 | 3 | |
Michael Jordan | 1988 | 3311 | 6.02 | 6.64 | 1 | |
Larry Bird | 1988 | 2965 | 6.05 | 5.98 | 2 | |
Magic Johnson | 1988 | 2637 | 5.95 | 5.23 | 3 | |
Magic Johnson | 1989 | 2886 | 8.66 | 8.33 | 1 | 1 |
Michael Jordan | 1989 | 3255 | 7.30 | 7.92 | 2 | |
Charles Barkley | 1989 | 3088 | 5.42 | 5.58 | 3 | |
Magic Johnson | 1990 | 2937 | 8.66 | 8.48 | 1 | 1 |
Michael Jordan | 1990 | 3197 | 5.77 | 6.14 | 2 | |
Charles Barkley | 1990 | 3085 | 5.40 | 5.55 | 3 | |
Magic Johnson | 1991 | 2933 | 8.90 | 8.70 | 1 | |
Michael Jordan | 1991 | 3034 | 6.40 | 6.47 | 2 | 1 |
David Robinson | 1991 | 3095 | 5.82 | 6.00 | 3 | |
Michael Jordan | 1992 | 3102 | 5.61 | 5.80 | 1 | 1 |
David Robinson | 1992 | 2564 | 6.53 | 5.58 | 2 | |
John Stockton | 1992 | 3002 | 5.12 | 5.12 | 3 | |
Hakeem Olajuwon | 1993 | 3242 | 6.12 | 6.62 | 1 | |
Charles Barkley | 1993 | 2859 | 6.30 | 6.00 | 2 | 1 |
David Robinson | 1993 | 3211 | 5.41 | 5.79 | 3 | |
David Robinson | 1994 | 3241 | 7.92 | 8.56 | 1 | |
Hakeem Olajuwon | 1994 | 3277 | 4.94 | 5.40 | 2 | 1 |
Shaquille O’Neal | 1994 | 3224 | 4.50 | 4.84 | 3 | |
David Robinson | 1995 | 3074 | 5.66 | 5.80 | 1 | 1 |
Karl Malone | 1995 | 3126 | 4.77 | 4.97 | 2 | |
John Stockton | 1995 | 2867 | 5.15 | 4.92 | 3 | |
David Robinson | 1996 | 3019 | 6.03 | 6.07 | 1 | |
Michael Jordan | 1996 | 3090 | 5.35 | 5.51 | 2 | 1 |
Karl Malone | 1996 | 3113 | 4.86 | 5.04 | 3 | |
Grant Hill | 1997 | 3147 | 6.77 | 7.11 | 1 | |
Karl Malone | 1997 | 2998 | 5.81 | 5.81 | 2 | 1 |
Michael Jordan | 1997 | 3106 | 4.31 | 4.46 | 3 | |
Karl Malone | 1998 | 3030 | 5.57 | 5.62 | 1 | |
David Robinson | 1998 | 2457 | 5.61 | 4.59 | 2 | |
Grant Hill | 1998 | 3294 | 4.05 | 4.45 | 3 | |
Michael Jordan | 1998 | 3181 | 2.54 | 2.69 | 11 | 1 |
Karl Malone | 1999 | 1832 | 4.99 | 3.05 | 1 | 1 |
Grant Hill | 1999 | 1852 | 4.85 | 3.00 | 2 | |
Jason Kidd | 1999 | 2060 | 4.30 | 2.96 | 3 | |
Shaquille O’Neal | 2000 | 3163 | 6.28 | 6.62 | 1 | 1 |
Karl Malone | 2000 | 2947 | 5.05 | 4.96 | 2 | |
Kevin Garnett | 2000 | 3243 | 4.57 | 4.94 | 3 | |
Shaquille O’Neal | 2001 | 2924 | 6.04 | 5.89 | 1 | |
Kevin Garnett | 2001 | 3202 | 4.56 | 4.87 | 2 | |
Karl Malone | 2001 | 2895 | 4.92 | 4.75 | 3 | |
Allen Iverson | 2001 | 2979 | 2.16 | 2.14 | 24 | 1 |
Tim Duncan | 2002 | 3329 | 5.68 | 6.30 | 1 | 1 |
Kevin Garnett | 2002 | 3175 | 5.27 | 5.58 | 2 | |
Dirk Nowitzki | 2002 | 2891 | 4.32 | 4.17 | 3 | |
Kevin Garnett | 2003 | 3321 | 6.30 | 6.97 | 1 | |
Tim Duncan | 2003 | 3181 | 6.15 | 6.52 | 2 | 1 |
Dirk Nowitzki | 2003 | 3117 | 5.04 | 5.24 | 3 | |
Kevin Garnett | 2004 | 3231 | 7.36 | 7.93 | 1 | 1 |
Tim Duncan | 2004 | 2527 | 5.97 | 5.03 | 2 | |
Andrei Kirilenko | 2004 | 2895 | 4.62 | 4.45 | 3 | |
Kevin Garnett | 2005 | 3121 | 6.94 | 7.22 | 1 | |
LeBron James | 2005 | 3388 | 4.77 | 5.39 | 2 | |
Dirk Nowitzki | 2005 | 3020 | 4.81 | 4.84 | 3 | |
Steve Nash | 2005 | 2573 | 2.92 | 2.50 | 16 | 1 |
LeBron James | 2006 | 3361 | 5.39 | 6.04 | 1 | |
Chauncey Billups | 2006 | 2925 | 5.25 | 5.11 | 2 | |
Kevin Garnett | 2006 | 2957 | 5.03 | 4.95 | 3 | |
Steve Nash | 2006 | 2796 | 3.76 | 3.50 | 12 | 1 |
Tim Duncan | 2007 | 2726 | 6.20 | 5.63 | 1 | |
Dirk Nowitzki | 2007 | 2820 | 4.91 | 4.61 | 2 | 1 |
LeBron James | 2007 | 3190 | 4.32 | 4.59 | 3 | |
LeBron James | 2008 | 3027 | 6.56 | 6.62 | 1 | |
Chris Paul | 2008 | 3006 | 5.44 | 5.45 | 2 | |
Chauncey Billups | 2008 | 2522 | 5.46 | 4.59 | 3 | |
Kobe Bryant | 2008 | 3192 | 4.13 | 4.40 | 4 | 1 |
LeBron James | 2009 | 3054 | 9.47 | 9.64 | 1 | 1 |
Chris Paul | 2009 | 3002 | 6.91 | 6.91 | 2 | |
Dwyane Wade | 2009 | 3048 | 6.49 | 6.59 | 3 | |
LeBron James | 2010 | 2966 | 9.18 | 9.08 | 1 | 1 |
Dwyane Wade | 2010 | 2792 | 5.33 | 4.96 | 2 | |
Dwight Howard | 2010 | 2843 | 4.88 | 4.62 | 3 | |
LeBron James | 2011 | 3063 | 6.31 | 6.45 | 1 | |
Chris Paul | 2011 | 2880 | 4.43 | 4.25 | 2 | |
Dwight Howard | 2011 | 2935 | 4.10 | 4.01 | 3 | |
Derrick Rose | 2011 | 3026 | 3.92 | 3.95 | 4 | 1 |
LeBron James | 2012 | 2326 | 7.03 | 5.45 | 1 | 1 |
Kevin Durant | 2012 | 2546 | 5.10 | 4.33 | 2 | |
Chris Paul | 2012 | 2181 | 4.55 | 3.31 | 3 | |
LeBron James | 2013 | 2877 | 8.26 | 7.92 | 1 | 1 |
Kevin Durant | 2013 | 3119 | 7.06 | 7.34 | 2 | |
James Harden | 2013 | 2985 | 4.31 | 4.29 | 3 | |
Kevin Durant | 2014 | 3122 | 6.78 | 7.06 | 1 | 1 |
LeBron James | 2014 | 2902 | 6.22 | 6.01 | 2 | |
Kevin Love | 2014 | 2797 | 6.26 | 5.84 | 3 | |
James Harden | 2015 | 2981 | 6.52 | 6.48 | 1 | |
Stephen Curry | 2015 | 2613 | 6.29 | 5.48 | 2 | 1 |
Russell Westbrook | 2015 | 2302 | 6.67 | 5.12 | 3 | |
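As a quick check on the VAL column, the formula from the footnote (rating*MP/3000) can be applied to a few rows of the table above. The function name and the `baseline_minutes` parameter are my own labels for this sketch; the inputs are taken from the table, whose ratings are already rounded to two decimals.

```python
def val(rating, minutes, baseline_minutes=3000):
    """Rough value: per-100-possession rating scaled by minutes played."""
    return rating * minutes / baseline_minutes


# (player, season, MP, rating, VAL as published in the table)
rows = [
    ("Kareem Abdul-Jabbar", 1978, 2265, 6.85, 5.17),
    ("Bill Walton", 1978, 1929, 7.67, 4.93),
    ("LeBron James", 2009, 3054, 9.47, 9.64),
]

for player, season, minutes, rating, published in rows:
    print(f"{player} {season}: computed {val(rating, minutes):.2f}, table {published:.2f}")
```

This is why Kareem edges out Walton in 1978 despite the lower rating: Walton's per-possession rating is higher, but Kareem played several hundred more minutes.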
The metric doesn’t agree with the choices of Iverson and Nash for MVP, but they were fairly controversial anyway. Moses Malone captured three MVPs in reality, but the numbers don’t back him — he has a tiny rate of assists for a high scorer, and assists are an influential part of the model, plus he doesn’t have the defensive box-score power that many other big men MVPs have. If you’re wondering why a beloved player doesn’t show up, it might have to do with the limitations of the box score.
If you looked at the table closely, you probably noticed something surprising: Magic Johnson and Michael Jordan have apparently switched places in the pantheon. Jordan doesn’t look like some insurmountable legend statistically, but “just another” superstar. Yet it’s Magic Johnson with some of the highest rated seasons according to the metric. The numbers value versatile box-score stuffers, and Magic and his 138 triple doubles certainly qualify. There’s also the previously discussed penalty to shot usage, which stifles Jordan’s stats.
No metric is perfect, of course, and this one certainly rewards a specific type of player, one who racks up a lot of assists, rebounds, and other counting stats, while perhaps not accounting for poor defense[9. Always a weakness of box score-based analysis.]. This includes Magic as well as Charles Barkley, Kevin Love, and a few others. The worst offender is probably Troy Murphy, who had an all-star rating one season due to his outside shooting and defensive rebounding, but his porous defense isn’t easy to identify with box score stats.
Yet we should not dismiss any odd result and rely only on conventional wisdom; otherwise we wouldn’t learn anything. In 1994, the highest rated players were David Robinson, Shaquille O’Neal, Scottie Pippen, Karl Malone, John Stockton, Shawn Kemp, Charles Barkley, Hakeem Olajuwon, and … Oliver Miller, who actually had the third highest rating that season. Oliver Miller is a semi-obscure player now, known only for his weight issues and being arrested, but back in the mid-90’s he had the steal and block numbers of Olajuwon, plus sweet passing and rebounding, in a highly efficient 1994 season.[10. Think Boris Diaw, but bigger (including a huge wingspan), stronger, and possibly even more skilled.] Miller’s weight problems didn’t surface until he joined a terrible Detroit squad the next season, reportedly showing up to camp out of shape. He then bounced around the NBA and leagues abroad, his weight ballooning and his interest declining. The list of players with at least 4 blocks and 6 assists per 100 possessions is just him, Kirilenko, Kareem, and David Robinson. It suggested high potential and a fascinating career, but we really only got one peak season.
Finally, the table below shows the top ratings for the 2015 season. There are really no surprises here, and less renowned players like the two Greens are backed up by great numbers from other metrics. For what it’s worth, the correlation coefficient between the ratings below and the ratings from ESPN’s Real Plus/Minus is 0.75. Basically, the results are very similar, with the exceptions of Marc Gasol, Pau Gasol, Love, and Gobert, and those guys are no slouches, depending on your personal opinion of Pau’s impact in Chicago.
Player | Season | MP | Rating | VAL |
James Harden | 2015 | 2981 | 6.52 | 6.48 |
Stephen Curry | 2015 | 2613 | 6.29 | 5.48 |
Russell Westbrook | 2015 | 2302 | 6.67 | 5.12 |
LeBron James | 2015 | 2493 | 6.10 | 5.07 |
Chris Paul | 2015 | 2857 | 5.06 | 4.82 |
Marc Gasol | 2015 | 2687 | 3.92 | 3.51 |
Anthony Davis | 2015 | 2455 | 4.27 | 3.49 |
Tim Duncan | 2015 | 2227 | 4.70 | 3.49 |
Draymond Green | 2015 | 2490 | 3.96 | 3.29 |
John Wall | 2015 | 2837 | 3.33 | 3.15 |
Damian Lillard | 2015 | 2925 | 3.02 | 2.94 |
DeAndre Jordan | 2015 | 2820 | 2.82 | 2.65 |
DeMarcus Cousins | 2015 | 2013 | 3.94 | 2.64 |
Pau Gasol | 2015 | 2681 | 2.71 | 2.42 |
Kevin Love | 2015 | 2532 | 2.83 | 2.39 |
Rudy Gobert | 2015 | 2158 | 3.30 | 2.37 |
Blake Griffin | 2015 | 2356 | 2.95 | 2.32 |
Paul Millsap | 2015 | 2390 | 2.84 | 2.26 |
Kawhi Leonard | 2015 | 2033 | 3.29 | 2.23 |
Gordon Hayward | 2015 | 2618 | 2.41 | 2.10 |
Danny Green | 2015 | 2312 | 2.65 | 2.04 |
Conclusion
Due to the nature of HBox, which is alone in its own family of metrics right now[11. I do want to stress the metric’s uniqueness. Statistical plus/minus metrics are quite popular now, and this one was derived completely independently.], certain actions might be over- or undervalued. It might not rate offensive rebounding highly since it’s negatively correlated with team wins. But every method has its strengths and weaknesses, and it’s good to use a wide range of tools to minimize glaring systematic errors. Nonetheless, the results are remarkably close to the ones given by the popular plus/minus models. And most importantly, pre-2001 data was used to build the model. The numbers don’t explain everything, and the box score is a small slice of what happens on the basketball court, but if we’re going to use statistical metrics we may as well use the best options available and attack the problem from all angles.