Calculating RAPM
The general concepts behind ESPN’s new stat, RPM, have been pretty well covered. Here’s their introductory post and here’s [dead link] Kevin Ferrigan with some more on how it works. But I’ve seen a lot of people wondering about the actual mechanics of how it’s calculated, so I’ll delve into that and show how to calculate the bare-bones form of RPM. View it as a version of this breakdown of APM by Eli Witus, now of the Houston Rockets, that is updated for RAPM and easier to implement.
We’ll calculate RAPM for the 2010-11 season, including the playoffs, the last non-lockout season for which the data is publicly available on basketballvalue. I’ve already cleaned up the data, and you can download it in zipped form here.
The basis of the APM approach is a massive regression. A regression is a statistical method for explaining how one or more variables (the independent variables) affect a single other variable (the dependent variable). The APM regression tries to explain the point margin (the dependent variable) over each stint, a stretch of possessions with no substitutions, by who is on the court (the independent variables). In our dataset there are about 37,000 of these stints, each of which shows up as a row in the Excel sheet you downloaded above. The regression also weights each stint by its number of possessions.
Now we have to set up the independent variables. Each of the 400-plus players who appear in our dataset is an independent variable, and there’s a column for each one. Download this sheet, which has a list of player IDs. Transpose that list along the top row of the initial sheet (it should fill J1:RC1), then fill this formula into cells J2:RC37264:
=IF(ISNUMBER(FIND(" "&J$1&" ",$B2)),1,IF(ISNUMBER(FIND(" "&J$1&" ",$C2)),-1,0))
This step is pretty processor-intensive, so it’s best to close other running applications and fill in the formula in chunks. The formula looks at the two units on the floor and puts a one in the cell if that column’s player is on the home unit, a negative one if he’s on the away unit, and a zero if he’s not on the floor at all.
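For readers who would rather build the player columns outside a spreadsheet, the same 1 / -1 / 0 logic can be sketched in a few lines of Python. This is only an illustration, not part of the original workflow: the function name `stint_row` and the sample IDs are made up, and it assumes each unit is stored as a space-separated string of player IDs, which is what the formula’s `FIND` on a space-padded ID suggests.

```python
def stint_row(player_ids, home_unit, away_unit):
    """Return one row of indicators: 1 = home unit, -1 = away unit, 0 = off the floor."""
    # Splitting on whitespace matches whole IDs only, which is what the
    # spreadsheet formula achieves by padding the ID with spaces.
    home = set(home_unit.split())
    away = set(away_unit.split())
    row = []
    for pid in player_ids:
        if pid in home:
            row.append(1)
        elif pid in away:
            row.append(-1)
        else:
            row.append(0)
    return row


# Hypothetical IDs and shortened units, for illustration only
# (a real unit would list five players).
players = ["12", "101", "102", "345"]
print(stint_row(players, " 101 345 ", " 12 "))  # [-1, 1, 0, 1]
```

Because whole IDs are compared, a player with ID 10 is never confused with one whose ID is 101, which is the same protection the padded spaces give the spreadsheet formula.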
Now, we’re going to actually calculate RAPM. I do it in the free, open-source statistical program R. It takes a bit of time to get used to, especially for those like me who have trouble thinking of stats outside of the context of a spreadsheet. But R is an incredibly powerful tool, and even though you don’t need to know much about it to calculate RAPM, I’d encourage everybody to at least try it out. You can download R at the above link, and it should be pretty easy to get running.
You’ll need to install one package to calculate RAPM, the glmnet package. Installing it is as simple as entering the command:
install.packages("glmnet")
You’ll also need to get the data we’ve put together above into a format that’s easier to import. Open a new Excel file and copy and paste the possession, rebound rate, and margin columns. Copy all the player columns and use the paste values option to put their values into the new document. Save this file as a CSV. Now we can run the regression. Here’s my R code, with annotations behind pound signs:
That last command spits out all the player IDs along with their coefficients in predicting point margin, AKA RAPM. You can then just copy and paste this into the player ID spreadsheet and match up the RAPMs with names.
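To make the regression itself less of a black box, here is a minimal sketch of a possession-weighted ridge regression in plain Python. It is not the R code referred to above and it does not use glmnet: the function names, the toy stints, and the penalty value of 50 are all assumptions made for illustration, and the penalty is not meant to be on the same scale as glmnet’s lambda. The intercept, which plays the role of home-court advantage, is left unpenalized.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_ridge(rows, margins, possessions, penalty):
    """Minimize sum(w * (y - b0 - x.b)**2) + penalty * sum(b**2).

    Returns (intercept, coefficients); the intercept is not penalized.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for x, y, w in zip(design, margins, possessions):
        for i in range(k):
            xtwy[i] += w * x[i] * y
            for j in range(k):
                xtwx[i][j] += w * x[i] * x[j]
    for i in range(1, k):  # skip index 0 so the intercept is not shrunk
        xtwx[i][i] += penalty
    solution = solve_linear_system(xtwx, xtwy)
    return solution[0], solution[1:]


# Toy data: three hypothetical players, five stints (1 = home, -1 = away).
toy_rows = [[1, -1, 0], [0, 1, -1], [-1, 0, 1], [1, 1, -1], [1, 0, 0]]
toy_margins = [4.0, -0.5, -0.5, 1.5, 3.0]
toy_possessions = [10, 20, 30, 40, 50]

home_edge, ratings = weighted_ridge(toy_rows, toy_margins, toy_possessions, 50.0)
print("intercept:", round(home_edge, 3))
print("player coefficients:", [round(r, 3) for r in ratings])
```

With a penalty of zero this reduces to the plain weighted regression behind APM; raising the penalty pulls every player coefficient toward zero, which is the "regularized" part of RAPM. In practice the penalty is commonly chosen by cross-validation rather than picked by hand.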
Here were my top-20 for the 2010-11 season and playoffs:
Name | Poss | RAPM |
Chris Paul | 10737 | 12.03245132 |
Dirk Nowitzki | 9586 | 11.86931091 |
Jeff Foster | 3635 | 11.1418261 |
Steve Nash | 9896 | 10.97194387 |
Dwight Howard | 11145 | 9.31183253 |
Pau Gasol | 11537 | 8.69799588 |
Kevin Durant | 11922 | 8.5788819 |
LeBron James | 11686 | 8.26608807 |
Roy Hibbert | 8891 | 8.15484021 |
Paul Millsap | 9995 | 8.01879397 |
Manu Ginobili | 9541 | 7.91254714 |
Kevin Garnett | 8426 | 7.47747804 |
Tim Duncan | 8381 | 7.10338862 |
Ryan Anderson | 5449 | 6.95622277 |
Nick Collison | 5771 | 6.6692413 |
Jason Collins | 2183 | 6.62688638 |
Chase Budinger | 6828 | 6.24481974 |
Dwyane Wade | 10791 | 6.24125335 |
Jason Terry | 9754 | 6.22546122 |
Kyle Lowry | 10141 | 6.1217469 |
There’s a Jeff Foster here, a Chase Budinger there, but subjectively everything looks pretty good.
Remember, this is just the bare-bones RAPM framework. There are endless possibilities as to what can be incorporated. Coaches, arenas, number of rest days, and point differential at the time of the possession can all be added. And the “x” in xRAPM comes from a boxscore prior. The boxscore gives us a decent amount of knowledge about how good players are, so instead of the ridge penalty regressing everybody to zero, it regresses them to our prior knowledge of their abilities.
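The idea of regressing to a prior instead of to zero is easiest to see with a single player. The sketch below is a one-coefficient toy, not how xRAPM is actually implemented: the function name `shrink_toward_prior`, the numbers, and the penalty of 30 are all made up for illustration, and there is no intercept.

```python
def shrink_toward_prior(xs, ys, weights, penalty, prior=0.0):
    """One-coefficient ridge estimate pulled toward `prior` instead of zero.

    Minimizes sum(w * (y - x * b)**2) + penalty * (b - prior)**2.
    """
    numerator = sum(w * x * y for x, y, w in zip(xs, ys, weights)) + penalty * prior
    denominator = sum(w * x * x for x, w in zip(xs, weights)) + penalty
    return numerator / denominator


# One hypothetical player: two home stints, one away stint, 10 possessions each.
on_court = [1, 1, -1]
margins = [4.0, 2.0, -3.0]
possessions = [10, 10, 10]

print(shrink_toward_prior(on_court, margins, possessions, penalty=0.0))   # 3.0
print(shrink_toward_prior(on_court, margins, possessions, penalty=30.0))  # 1.5
print(shrink_toward_prior(on_court, margins, possessions, penalty=30.0, prior=2.0))  # 2.5
```

The unpenalized estimate is 3.0. With the penalty and no prior it is halved to 1.5, while a prior of 2.0 only pulls it down to 2.5. The more possessions a player has, the larger the data terms are relative to the penalty and the less the prior matters. With many players, the same effect can be had by subtracting the prior’s predicted margin from each stint, running the usual ridge regression on what is left, and adding the prior back to each coefficient.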
The RAPM framework can be used for other stats too, like rebounding. You can sub the RebMarg vector in where Marg is used to calculate a rebounding plus-minus. There, somebody like Nene can get his due. While he ranked only 90th in the league in TRB%, the ridge regression estimates that he makes a 7.5% impact on TRB% margin, which ranks 17th in the league. Jeremias Engelmann, the guy behind RPM, has posted all sorts of cool stuff like this on his site.
I hope this explainer can become a resource for those interested in calculating their own RAPM and for those just curious about the actual mechanics behind it. If you have any questions, or if you try to repeat what I did and something goes wrong, let me know.