Deconstructing a New SportVU Metric: Part 1

Mar 9, 2015; Phoenix, AZ, USA; Golden State Warriors guard Stephen Curry (30) reacts against the Phoenix Suns at US Airways Center. The Warriors defeated the Suns 98-80. Mandatory Credit: Mark J. Rebilas-USA TODAY Sports

I’ve had my head buried in stats for a few days now, attempting the impossible task of explaining player value with countable statistics. I have to admit this is a perplexing endeavor, and defense in particular is incredibly frustrating. But I build NBA metrics for the ancillary benefits: by studying the best ways of linking value and different statistics, we can figure out which stats we should be paying attention to and what the consequences are.

One of the most powerful stats, for example, doesn’t even have a name, and few NBA fans, even die-hards, ever consider it. We know the debate over usage is over: the effects exist, and lineups generally do better when they have more high-usage players (i.e. players who shoot a lot). But there’s a variation of this that’s even better at explaining how players help teams win: the product of usage rate and assist rate, or similar measures, is a lot better than plain ol’ usage. There’s no official name for this (if you were to name it after a player, I would propose LeBron because he is, consistently, the king of usage%*assist%), but it’s built into Basketball-Reference’s BPM, a new player metric that makes PER and Win Shares look awfully outdated. The “LeBron” is used in many statistical plus/minus metrics because it performs well and tracks pretty well with who’s the dominant force on offense, punishing guys who don’t shoot, like Rondo, and others who don’t pass, like Amare Stoudemire on the seven-seconds-or-less Suns.[1. By the way, if you’re wondering whether the “LeBron” is only relevant in the current NBA because of rule changes and style differences like how prevalent the drive-and-kick game is, I experimented with a historical metric using seasons 1978 to 2000 and found it still had power even in the time of Kareem and Bird.]

Let me frame this in reality. I don’t like blackbox statistics either. If I see an important variable, I want to at least have a sane theory as to why that variable has value. The old usage debates ended when the evidence piled up, but this wasn’t boring numbers-based rhetoric. It was argued that high-usage players often bailed out their team on offense, taking the tough shots that the role players couldn’t. But it goes deeper than that. They’re generally guarded by the opposing team’s best defender, they draw a lot of attention, and at their best they can draw hard double teams, creating open looks for their teammates. However, those effects are mitigated if you, well, don’t (or can’t) pass the ball. It’s the big difference between Carmelo and LeBron. Carmelo has the spiffy footwork and touch that impresses fans and professionals alike, but he doesn’t operate in such a way that he draws the attention to create better shots elsewhere; by contrast, LeBron barrels into the lane, draws help, and sets up a corner three-pointer several times a game. That, generally speaking, has more value.

And now that we’ve entered the databall era with SportVU tracking and other sources, we should be looking for the next hidden stats and trying to figure out how to equate our disparate information, so we know a rough estimate of the tradeoff between different types of players. Otherwise we could miss the forest for the trees and rank J.J. Hickson over Kosta Koufos because we’re impressed with Hickson’s rebounding and field-goal percentage numbers.

The point here isn’t to “solve” basketball. Rather it’s about exploring the NBA with new data, asking questions, and being skeptical every step of the way. That will hopefully lead to a sliver of illumination.

Methodology

Essentially, I’m building a statistical plus-minus model using one-and-a-half years of data and testing it against the second half of the 2015 season to help form conclusions about which variables are important and how well they predict new data. If you want to read about statistical plus/minus models, there are a few articles you can find online. It’s what Basketball-Reference uses now, it’s what FiveThirtyEight often uses, and it’s an important component of ESPN’s Real Plus-Minus. It’s a generation beyond something like PER because the variable weights are actually tested, not just theorized.

When you build a statistical plus-minus model, you need a benchmark. Your dependent variable is some estimate of player value so you can figure out which stats can “predict” that value estimate. A common error is using something like Real Plus-Minus (RPM), which already has stats like steals built into it. You are not proving anything if your model likes steals – of course it would; it’s part of RPM. Thus, I used what’s known as NPI-RAPM, or non-prior informed ridge regression adjusted plus-minus. In simpler terms, it’s not using any information from other seasons or from any box score statistics. It’s just a pure plus/minus model over 1.5 seasons. In the final model, I used one provided by NylonCalculus’ own Layne Vashro, but I also tested it using one from EvanZ.
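For readers who haven't seen the benchmark built, here is a minimal sketch of the idea behind ridge-regression adjusted plus-minus. It is not the RAPM from Layne Vashro or EvanZ used above. The stint layout (+1 for one side's players, -1 for the other's), the possession weights, the toy 2-on-2 numbers, and the function name `rapm` are all assumptions for illustration.

```python
import numpy as np

def rapm(X, y, poss, lam):
    """Possession-weighted ridge regression with no prior information.

    X    : stints x players matrix (+1 on court for side A, -1 for side B, 0 off court)
    y    : side A's point margin per 100 possessions in each stint
    poss : possessions in each stint, used as regression weights
    lam  : ridge penalty; larger values shrink every player toward zero (average)
    """
    W = np.diag(poss)
    A = X.T @ W @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ W @ y)

# Hypothetical toy data: four players, 2-on-2 stints (real stints are 5-on-5).
X = np.array([
    [ 1,  1, -1, -1],
    [ 1, -1,  1, -1],
    [ 1, -1, -1,  1],
    [-1,  1,  1, -1],
], dtype=float)
y = np.array([8.0, 2.0, 4.0, -3.0])
poss = np.array([40.0, 25.0, 30.0, 20.0])

ratings = rapm(X, y, poss, lam=10.0)
print(ratings)
```

Because no box score stats or prior seasons enter the fit, any variable that later "predicts" these ratings is doing so on its own merits.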

For all my independent variables, I collected nearly everything from stats.NBA.com and NBAminer.com. Most of these variables aren’t useful, but a large number of non-box score stats were tested and utilized, including a stat for drawing an offensive foul and a few variations of stats I derived from the SportVU shot logs and rebound logs. You can see similar rim protection numbers here, but I have something different for every shot on the floor and I can break down contested rebounds by offense and defense.

After I have all my data, I can dredge up these estimates and figure out which variables matter. To do this, I used the glmnet package in R, varying alpha and lambda with the caret package. You don’t need a college degree in math to understand these at a basic level: it’s a regression model that teases out the relative values of different variables. The glmnet package allows you to penalize, or shrink, these coefficients with lambda so you don’t get overfitting issues with bogus results, and the alpha allows you to drop insignificant variables. This type of modeling works by slicing up your data into pieces and testing how well the variables do in predicting the results on a piece that was held out of the analysis. This helps build a model that does better with new information (i.e. a new season).
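The actual work was done with glmnet and caret in R. As a rough illustration of the same idea, here is a small elastic net fit by coordinate descent in Python, with k-fold cross-validation over a grid of lambda and alpha. The synthetic data, the grid values, and the function names are assumptions made up for the sketch; the penalty follows the glmnet convention, where alpha = 1 is the lasso and alpha = 0 is ridge.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t; anything within t of zero becomes exactly zero.
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).

    Assumes X and y are already centered (no intercept is fit).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)              # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]   # remove feature j's current contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            resid -= X[:, j] * beta[j]   # add the updated contribution back
    return beta

def cv_error(X, y, lam, alpha, k=5):
    """Mean squared error on held-out folds, averaged over k folds."""
    idx = np.arange(len(y))
    errs = []
    for hold in np.array_split(idx, k):
        train = np.setdiff1d(idx, hold)
        beta = elastic_net(X[train], y[train], lam, alpha)
        errs.append(np.mean((y[hold] - X[hold] @ beta) ** 2))
    return float(np.mean(errs))

# Synthetic stand-in for the player data: two real effects, four noise variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=200)

grid = [(lam, alpha) for lam in (0.01, 0.1, 1.0) for alpha in (0.0, 0.5, 1.0)]
best_lam, best_alpha = min(grid, key=lambda g: cv_error(X, y, *g))
print(best_lam, best_alpha)
```

The held-out error is what picks the penalty, which is the same logic as testing against a season the model has never seen.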

The problem, however, is that I only have one-and-a-half seasons for testing. It’s not enough to make sound judgments. But I have a similar model using 14 seasons with many of the same variables. With some tweaks, I integrated that information into my new models. Thus, the new variables, like the SportVU ones, had to offer a significant improvement and enhance an already stable model.

With those models built, I can now track how well they predict the second half of the season by using player minute totals for every game and their estimated value via the models. I can also tune the coefficients to see how certain stats perform with new data and compare them to the standard benchmarks of BPM and RPM, which I had saved at the All-Star break.
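One way to picture that game-by-game check is a minute-weighted sum of player values. This is a sketch under assumptions, not the exact procedure used here: values are taken to be per-100-possession ratings where 0 is average, unknown players default to 0, and the `home_edge` term and all names are hypothetical.

```python
def team_rating(minutes, values, game_minutes=48.0):
    """Minute-weighted sum of player values, scaled to five players on the floor.

    minutes : dict of player -> minutes played in the game
    values  : dict of player -> estimated value per 100 possessions
    """
    total = sum(mins * values.get(player, 0.0) for player, mins in minutes.items())
    return total / game_minutes

def predicted_margin(home_minutes, away_minutes, values, home_edge=0.0):
    """Predicted home margin per 100 possessions; home_edge is an assumed constant."""
    return team_rating(home_minutes, values) - team_rating(away_minutes, values) + home_edge

# Hypothetical example: five starters play all 48 minutes on each side.
values = {"h1": 2.0, "h2": 1.0, "h3": 0.0, "h4": -1.0, "h5": 0.5,
          "a1": 0.0, "a2": 0.0, "a3": 0.0, "a4": 0.0, "a5": 0.0}
home = {p: 48.0 for p in ("h1", "h2", "h3", "h4", "h5")}
away = {p: 48.0 for p in ("a1", "a2", "a3", "a4", "a5")}
print(predicted_margin(home, away, values))
```

Swapping in BPM or RPM values for `values` gives the benchmark comparison on the same games.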

Feb 25, 2015; Atlanta, GA, USA; Atlanta Hawks guard Kyle Korver (26) attempts a three-point basket against Dallas Mavericks forward Dirk Nowitzki (41) in the third quarter of their game at Philips Arena. The Hawks won 104-87. Mandatory Credit: Jason Getz-USA TODAY Sports

Breaking Down the Metric: Offense

For clarification, I put everything in terms of 100 possessions, as is the standard with both player and team stats. I also won’t divulge the complete details of every variable. Here are the offensive components of the metric.

Efficiency

I used a term I call “Points Over Average.” All I’m doing is calculating how many points a player gives a team via efficiency over the league average. This is like true-shooting percentage adjusted for the number of shots you take and the league average.
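A plausible way to compute something like this is sketched below. The exact formula isn't spelled out here, so treat this as an assumption: it uses the standard true-shooting attempt estimate (FGA + 0.44 × FTA) and compares a player's points with what a league-average shooter would score on the same attempts.

```python
def points_over_average(pts, fga, fta, poss, league_ts):
    """Points scored above a league-average shooter, per 100 possessions.

    pts, fga, fta : player's points, field-goal attempts, free-throw attempts
    poss          : possessions the player was on the floor for
    league_ts     : league-average true-shooting percentage as a fraction (e.g. 0.53)
    """
    tsa = fga + 0.44 * fta                 # true-shooting attempts
    expected_pts = 2.0 * tsa * league_ts   # what an average shooter would score
    return 100.0 * (pts - expected_pts) / poss

# Hypothetical line: 24 points on 20 shots, no free throws, 100 possessions.
print(points_over_average(pts=24, fga=20, fta=0, poss=100, league_ts=0.5))
```

Volume and efficiency both matter in this form: a slightly above-average shooter with many attempts can match an elite shooter with few.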

There are two ways to think about efficiency. One is that it directly tells you player value by stating what happened. Another is that shooting percentage is a bit volatile, and for predictions a healthy dose of regression to the mean works well. Thus, I don’t completely credit a player for his efficiency. But it’s an important consideration and will make or break an efficient phenom like Brandan Wright or a player in a shooting slump like Batum.

You can see the leaders so far this season in this efficiency measure. As I said, it’s like a combination of usage and true-shooting percentage—balancing the playing field for high volume scorers and extremely efficient role players.

Table: Points over league average leaders, 1000 MP min.

Player	USG%	TS%	Points over avg. per 100 poss.
Stephen Curry	28.8	63.1	5.4
Kyle Korver	14.3	70.7	4.7
James Harden	60.8	60.8	4.5
Brandan Wright	13.8	68.1	4.3
Anthony Davis	27.7	60.5	4.3
Tyson Chandler	12.8	70.8	4.2
Jonas Valanciunas	18.8	61.9	3.3
J.J. Redick	20.3	60.9	3.1
James Johnson	16.7	62.7	3.0
DeAndre Jordan	13.0	65.1	3.0
Klay Thompson	27.5	58.7	3.0
LeBron James	33.3	57.7	2.8
Rudy Gobert	13.6	63.6	2.7
Isaiah Thomas	27.2	58.0	2.5
Kyrie Irving	25.4	57.7	2.3
Jimmy Butler	21.6	58.2	2.2
Tyler Zeller	19.1	58.9	2.2
Amir Johnson	15.5	60.6	2.2
Wesley Matthews	19.8	58.5	2.1
Dwight Howard	23.0	58.2	2.1

*Through the games of March 9th, via basketball-reference

Playmaking/passing

For the pre-SportVU era, I used a variation of the “LeBron” stat (usg%*ast%) that’s the most important component of my model. With player tracking data, we might have a marginal improvement. It’s no state secret that a player tracking model is doing well right now in the annual APBR win prediction contest. And I’ve stolen a variable from it, which is doing well with my method too, though both models draw on most of the same time frame (the 2014 season).

Passing efficiency: (Assists + free-throw assists + secondary assists)/Passes
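As a quick sketch, the base formula translates directly into code. The function and argument names are mine, and the zero-pass guard is an added assumption.

```python
def passing_efficiency(assists, ft_assists, secondary_assists, passes):
    """(Assists + free-throw assists + secondary assists) / passes."""
    if passes == 0:
        return 0.0  # assumption: a player with no passes gets no credit
    return (assists + ft_assists + secondary_assists) / passes

# Hypothetical per-game line: 8 assists, 1 free-throw assist, 1 secondary assist, 50 passes.
print(passing_efficiency(8, 1, 1, 50))
```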

I tweaked this variable further. For whatever reason, giving a bonus to a three-point assist didn’t help the results. But I made another small change that significantly improved the results across the board in every model.

Passing efficiency is a significant upgrade over the conventional assists per game and assist rate. For what it’s worth, assist rate by itself was not a useful variable for my 14 season model. The closest thing to a “LeBron” stat now is the passing efficiency score paired with a shot volume measure, which really helps Westbrook’s estimate given his Wilt/Iverson-esque gunning.

Finally, I have another variable that gives an additional boost to shot volume with a tweak that helps players who create the offense on their own. I know that’s oblique, but the real fun here is the SportVU stuff.

Turnovers

I’m taking another stat from this model and, of course, adjusting it a bit. The basic form is turnovers per touch, which helps guys who pick up a lot of turnovers because they have the ball a lot and punishes players with stone hands who turn the ball over whenever they do manage to grab the ball. I adjusted this to turnovers per frontcourt touch because I wanted to ignore inconsequential touches after a defensive rebound where a player throws it immediately back to the point guard. That variable performed extremely well in every iteration of the model. I still included turnovers per 100 possessions, however; the results were best when the two were used in conjunction. (I may rethink this because raw turnovers will just penalize ballhandlers who do most of the work.)
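The two turnover rates can be sketched side by side. The names and the sample numbers are hypothetical, and the adjustments mentioned above are not reproduced.

```python
def turnovers_per_frontcourt_touch(turnovers, frontcourt_touches):
    """Turnover rate relative to how often the player handles the ball in the frontcourt."""
    if frontcourt_touches == 0:
        return 0.0  # assumption: no touches means no rate to compute
    return turnovers / frontcourt_touches

def turnovers_per_100(turnovers, possessions):
    """Raw turnover volume, scaled to 100 possessions."""
    return 100.0 * turnovers / possessions

# Hypothetical game: 3 turnovers on 60 frontcourt touches over 75 possessions.
print(turnovers_per_frontcourt_touch(3, 60), turnovers_per_100(3, 75))
```

A ballhandler with many touches can look fine on the first rate while still being charged by the second, which is the tension noted in the parenthetical above.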

Spacing

The power of spacing should be obvious now, except to Byron Scott, Charles Barkley, and a few old-school holdovers. It’s not just about the efficiency of those three-pointers; it’s about opening up the lane and drawing defenders away from the basket, as well as having more opportunities for offensive rebounds because the field-goal percentage is lower. You can see my in-depth analysis here. Even BPM gives credit to spacing, although I have an additional factor that gives a large boost for position. (You can use something like position in units of 1 through 5 multiplied by three-point attempts per possession; just be careful with position because many sources are inaccurate.)
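The parenthetical suggestion can be written out as a one-liner. This is only the simple form described above, with position coded 1 (point guard) through 5 (center); the name `position_weighted_threes` is hypothetical and the actual variable in the model may differ.

```python
def position_weighted_threes(position, three_pa, possessions):
    """Position (1-5) multiplied by three-point attempts per possession."""
    if not 1 <= position <= 5:
        raise ValueError("position must be between 1 and 5")
    return (position * three_pa) / possessions

# Hypothetical: a center and a shooting guard, each taking 3 threes per 100 possessions.
print(position_weighted_threes(5, 3, 100), position_weighted_threes(2, 3, 100))
```

The same shot volume counts for more from a big man, since pulling a center away from the rim opens the lane more than pulling a guard's defender out.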

I tried a couple of variations, but the most stable and explanatory factors were just based on three-point attempts and position. In fact, looking at the 1.5-year RAPM data, three-pointers were extremely important and players like Pero Antic rated a lot better than you’d expect. I thought I had a fancy new spacing metric thanks to the SportVU shot logs, but it hasn’t proven itself yet. I looked at the distance from the defender on catch-and-shoot three-point attempts released within 1.5 seconds. That should give you a proxy for the “gravity” an outside shooter has, but it wasn’t significant. I might need more data or the measure might need more tweaking.
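A sketch of that "gravity" proxy from a shot log might look like the following. The record fields (`is_three`, `dribbles`, `touch_time`, `defender_dist`) are hypothetical stand-ins rather than the real SportVU column names, and treating zero dribbles as catch-and-shoot is my assumption.

```python
def avg_defender_distance(shots, max_touch_time=1.5):
    """Average closest-defender distance on quick catch-and-shoot threes.

    shots : list of dicts, one per shot attempt
    Returns None when the player has no qualifying attempts.
    """
    dists = [
        s["defender_dist"]
        for s in shots
        if s["is_three"] and s["dribbles"] == 0 and s["touch_time"] <= max_touch_time
    ]
    return sum(dists) / len(dists) if dists else None

# Hypothetical shot log: only the first two attempts qualify.
shot_log = [
    {"is_three": True,  "dribbles": 0, "touch_time": 0.9, "defender_dist": 4.0},
    {"is_three": True,  "dribbles": 0, "touch_time": 1.2, "defender_dist": 6.0},
    {"is_three": True,  "dribbles": 3, "touch_time": 4.0, "defender_dist": 2.0},
    {"is_three": False, "dribbles": 0, "touch_time": 0.8, "defender_dist": 3.0},
]
print(avg_defender_distance(shot_log))
```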

The last spacing variable is a position-adjusted measure of long two-pointers (16 feet to the 3PT line). The effect is a bit weak and didn’t correlate too highly with the RAPM data, but it helps out high-volume, outside-shooting big men like Dirk and Aldridge decently well. And with more experimentation I might be able to incorporate the average defender distance on those shots too, which will hopefully be a decent proxy for defensive attention on the perimeter.

Offensive rebounding

Based on some initial analysis of rebounding, I found that contested rebounds were an order of magnitude more valuable than uncontested ones. However, this wasn’t true when applying contested/uncontested rebounds to this dataset. Perhaps it’s because guys who crash the boards are often unskilled and clog the lane; I’m not sure. I’ll need to work more on this, but somehow uncontested offensive rebounds were the only rebound type that was valuable for offense. But this is the advantage of my methodology: I used the results from my previous multi-season model for offensive rebounding. It’s a lot less valuable than you’d see in similar metrics, but offensive rebounding is still fairly important.

Miscellaneous

Even on offense, steals are important, and they’re a pretty big factor in most statistical plus-minus models. Steals are probably the best example of the dichotomy of value in these metrics. There are the direct value stats like efficiency. They’re valuable because they’re related to a player directly adding points for his team. Steals have this effect too, even on offense, where steals lead to fast break opportunities. Then there are the value stats that are correlative. This is like using demographic information such as height. Steals are like this too on offense. It’s often been argued that steals correlate well with superstars because they’re related to offensive awareness and quickness, for example.

The second miscellaneous variable is minutes per game. Basketball-Reference’s article on BPM has a small section on why this factor is important, but I think some of those stated points are misleading. First of all, BPM was built from a 14-year RAPM model where, mathematically, the extremely low minute players are bunched around the slightly below average range[1. This may not be an issue with BPM because of how big the dataset was and how many players were filtered out, but it’s something to keep in mind if you do your own analysis.]. This is one of the advantages of RAPM over the more basic adjusted plus-minus models. Secondly, it’s a good probabilistic bet that low MPG players aren’t very good. Also, a modest boost for high MPG guys may cover a bit of the game not covered by the stats, like pick and roll skills. I will note, however, that the more information I supplied the less relevant MPG became, and in some iterations it vanished completely.

Part Two of this post, coming soon, will cover the defensive components and share some numeric results from the model. Stay tuned!