Nylon Calculus Discusses One-Number Metrics

Flickr | frankieleon

Seth Partnow (@SethPartnow): So, friend of the blog Neil Greenberg of the Washington Post tweeted this out on Monday:

Bill James in '86 Abstract on trying to summarize everything a player can do in one number. Applies to NHL especially pic.twitter.com/Ybu0601oxL
— Neil Greenberg (@ngreenberg) March 23, 2015

Though Neil was specifically referencing the rise of hockey analytics, this obviously has relevance to basketball. We’ve had some fun internally with the back-and-forth over the utility of “YAPMs[1. “Yet Another Plus/Minus Model” courtesy of TNC guest contributor Hannes Becker.],” but I think that Bill James quote encapsulates a lot of my thoughts on the matter. Without putting further words in his mouth, I’d venture to say he’d agree that one-number metrics can be a decent starting point, and I certainly agree with that.

Where I get off the train is that seeking perfection in those models without a full understanding of what we are trying to measure is both unproductive and a bit of a fool’s errand. It’s why I gravitate more towards Andrew’s work with PT-PM, because it’s not just about showing “who’s better, who’s best” but also about figuring out which things that happen on the court matter and which are largely ephemera.

I know a lot of you disagree to an extent, and I’d like to hear why.

Krishna Narsu (@Knarsu3): My biggest issue with the APMs/RAPMs etc. has always been the “why” of the metric. So Khris Middleton has a 3.70 DRPM. Cool, so we can be reasonably confident he’s a good defensive player. But anyone want to tell me how or why? And that’s where RAPM/APM etc. metrics stop. But with the SPM[2. Statistical plus/minus, usually derived by regressing some set of more traditional statistics onto an APM model to estimate the value of each input.] metrics, like Andrew’s, we can break it down into components and see “oh Khris Middleton is good defensively because he’s contesting 3s at a high rate” (I have no idea if that’s true, just using it as an example). So I’m definitely in favor of metrics where we can look at more than just the one number and see why that metric is telling us this player is good offensively or defensively etc.

Of course, as you pointed out, Seth, no all-in-one metric will ever be even close to ideal because there’s so much we’re missing in the metric no matter how much we try to add in. But to not try is also foolish. It’s still a good starting point and we always need a starting point. For example, knowing that Khris Middleton’s DRPM is 3.70 might not tell us much but it does tell us something—we can say with decent confidence that Middleton’s a good defensive player based on his high DRPM. That’s a useful bit of information, even if it’s not much.

Nathan Walker (@bbstats): The only objection I have to that Bill James statement is a philosophical one. Specifically where he says that the “best” evaluation is a subjective one. I’m not disagreeing with his conclusion, more his premise.

I think it’s a fallacy to separate subjective and objective thinking. Instead, I think that subjectivity is the *lens* by which we should interpret “objective” information. While some might describe the eye test as “equally important” to statistical analysis, in reality we are always processing information the same way. Even if you start with “objective” information, your evaluation will always be through the lens of your own perception. I do think that with the weakness of the human mind, stats can do things much better than us. Ultimately, most of our own “instinctive judgments” (e.g. “Anthony Davis’ length and athleticism is unparalleled, he’s going to make a huge jump this season!”) can be boiled down into two categories:

1) Judgments that are statistical in nature and are therefore less accurate without data or data analysis (e.g. the “you can’t watch every game, stats can” argument)

2) Judgments that have not yet been measured in any meaningful way (e.g. before this year most of us typically measured “great passers” by their assist totals only!)

Furthermore, I think that data, and one-number systems can very well estimate a player’s current value to their team in the NBA. I do not think that data can tell us “How will Player X do when he’s put in a scenario he’s never been in before” – for example, what would happen when the Thunder played Russell Westbrook at point guard right out of college.

On the other hand, #2 points out something that human intuition does very well: seeing patterns. When I observe x, even though it is difficult to measure, y is the typical result. Under those circumstances, subjectivity is king. However, statistics can do the same thing.

Back to point one: one thing that plus-minus tends to measure, in opposition to the eye test, is players who do things that are extremely difficult to observe. Not turning the ball over is a great example of this – I think this is one of the reasons Dirk was typically top 5 in RAPM stats in the 2010s. We can see when a player turns the ball over. Less so, we can see the rate at which they do so. Finally, and least of all, we are barely able to take notice of a player’s low turnover rate. So, to some degree there is a “blind faith” in plus-minus data. Players who look particularly bad can be making up for it in other, smaller ways, and vice versa. This is where the benefit of out-of-sample testing and prediction methods comes into play, and why I think RPM is the low-hanging fruit we should often be reaching for — if in the past a player’s data x, y, and z make their team so much better, after enough samples we have to shrug and say “this is what the model says.”

At any rate – to *purely* rate players by their “one number rating” is silly. But at the *very* least, a one-number rating can be a hand-rail that guides us up the stairs to understanding player value.

Andrew Johnson (@CountingBaskets): I certainly wouldn’t say I disagree with the initial point, it’s Bill James after all! It’s incredibly dangerous to think that any one number has all of the information one needs to evaluate a player or all of their contributions.

However, like Nathan, I think James might be underestimating how hard it is to balance all the different measures we have in our heads at the same time, to say nothing of the things we can’t (or don’t yet) measure. People just aren’t very good at this; just watch someone try to do long division in their head, their eyes practically bug out. So, one number metrics are a bit of a convenient and helpful accounting method.

Further, subjective reputations can be overly sticky[4. Ed: as we see in All-Defense voting every season.], which is why I think less analytically inclined teams have a tendency to go after guys that used to be a name but aren’t very good anymore. We don’t necessarily need a one number metric to point some of those guys out, but…

On the other hand, an issue with one number metrics the James quote doesn’t touch on, probably because it is less of an issue in baseball, is the importance of context and role, which is even more important with the more black box metrics. For example, Timo Mozgov is a player contributing more value to Cleveland than the simple addition of his prior statistics from Denver might have indicated, but that may well not transfer to yet another team.

Partnow: If I can flip this around on Andrew, is the problem then as much looking for the platonic best player as opposed to the player who can contribute in areas of X, Y and Z to a new team in the most efficient (in both on-court and contractual terms) manner? My complaint about the top-down metrics is that time spent on them is time NOT spent finding out who’s a good rebounder? Playmaker? Screener?

Now some of those things we don’t have the input data to really judge on more than ad hoc, eye-test basis. But we’re also exposed to exponentially more and better data than we were even a few months ago.

Johnson: I think there’s definitely some truth to that, Seth. But, I think teams and analytics need to go both bottom up and top down, if that makes sense. That said, I think it’s hard to find evidence that NBA teams outside of central Texas are good at exploiting context in a consistent way to find value.

Yes, we’re getting all this new data, so we’re going through the steps of describing it and categorizing it. Then testing how stable the measures are and trying to put them into context.

In fact, one of the things that we could look at doing with one number metrics is using them as a tool to compare the expected performance to actual lineup and team performance to see if there is something they’re systematically missing in the new data. Spacing is the one area where this has been done a number of times.

Michael Murray (@MichaelMurrays): I don’t want to respond for Andrew, but the distinction between what is and what is not a one-number metric is something that’s really interesting to me (YAPM vs “who’s a good rebounder”). For something to be beyond scrutiny, to me, it needs to have a discrete outcome. A shot goes in or not, is worth a certain number of points, and a player makes a certain combination of these things over time. When we start combining these three things into new metrics (eFG%, TS%) are they one-number metrics? There’s definitely a sort of uncanny valley where it feels like a metric tries to mean too much, but where is that? So the resulting argument for pursuing these metrics is that in the same way a shot going in is a discrete outcome, so is a basketball game, and given perfect information I should be able to represent both of those accurately. You not only have this sort of intuitive justification but you have a whole lot of incentive. At their core, teams are usually trying to maximize their team’s winning ability relative to their costs, and a single metric is ideal. Also, you have sportsbooks which don’t offer lines for efficiency in screening. Finally, you have the public space, where articles are written out of interest and edification. But, it’s hard to write articles that drive down to the minutiae of basketball and there can be a high barrier of entry to read them.

Austin Clemens (@AustinClemens): I like the phrase ‘platonic best player’. I can’t find the exchange right now but Talking Practice tweeted something about how he doesn’t like to use RPM for players who change teams — does anyone else remember that? Basically he was taking the logical endpoint of Seth’s argument; because RPM can fluctuate so much when someone changes teams, he doesn’t think it’s a useful prior for players who are in a new role. That seems a bit far-fetched to me… like saying that once someone changes teams you can’t use their old FG% to predict their future FG%. You know what would be a cool thing to develop is like a measure of how much a player’s situation has changed from one year to the next — so it’s high either when they change teams or when the team changes dramatically around them – and then you could see if particular stats are more/less consistent for high values of change vs low values of change. Does that seem dumb?

Walker: TP has said he/they generate two ratings for a player who has been on two teams for just that reason.

Partnow: As a general matter, it would be fascinating to look into “predicted role” for a guy who’s in a different context. However, how much work has really been done to describe “role?” We have some usage stats, I’ve messed around with point guard styles a little, but I’m not even sure we have the statistical vocabulary to describe a player’s role in a manner lending itself to either external or internal comparison.

Jacob Rosen (@WFNYJacob): I’ve always liked Jon Nichols'[5. Now of the Cavaliers.] position-adjusted classification as something I think does a good job of defining “role”.

Nick Restifo (@Itsastat): RAPM, APM, and the other one number metrics are decently predictive. They appear to do a good job at predicting how good a team will be as the sum of its parts and how valuable a player is to his team’s success. What we still can’t get at with these numbers is fit, or how valuable players will be when asked to change their role. RAPM and other metrics aren’t nearly as predictive when players change teams. Luckily, I think the first question is far more important. If player X is valuable to his current team’s success, he is likely to be valuable to another team’s success as well, whether it be in a similar or different role. But players can definitely be used optimally or poorly.

It is also important to remember that RAPM, APM, FFAPM, etc. are estimates of a player’s plus-minus, and that they do not reflect an exact count of plus-minus in any way. They don’t tell us how or why a player is valuable, which also speaks to their inability to evaluate changes of roles well. From the lens of RAPM, the Pistons signing Josh Smith was an objectively good move. RAPM would not have predicted the Pistons would waive Smith and eat his salary for years just to get rid of him.

When digesting these numbers, it is most important, like any number, to be aware of what it is and what it isn’t. In a static environment, numbers like RAPM and APM are an impressively accurate measure of player value. The NBA is not a static environment, however. Players/teams/coaches change, and one number metrics do not yet excel in measuring this. One number metrics like RAPM, APM, etc. are the best we have with regards to estimating player value, and are at their most usable when you need one number to encapsulate a player’s worth (like in many forms of statistical analysis and modeling). If you wish to evaluate how valuable a player actually is, why they are that valuable, or how they will do on a new team, you need to consider other methods of evaluation, like metrics and film, in addition to one number metrics.

Ian Levy (@HickoryHigh): I think Nick’s last paragraph hit on something important for me. All statistics are a shorthand to some degree, a way of taking many data points and turning them into something descriptive or predictive. Even something like points per game just gives an impression of a player’s scoring quantity, it doesn’t describe the entirety of a player’s scoring quantity. Some games they score more, some games they score less. All of these one-number-metrics are similar in intent to something basic like field goal percentage—take a mountain of information and distill it into something smaller, digestible, understandable.

Thinking of these things in the same way has helped me feel more comfortable using all-in-one metrics. I’ve tried to make a point recently of characterizing them as “an estimate” of a player’s overall value or contributions. Thinking of them as estimations and shorthand for big, big things has helped me keep them in the proper context. They tell a little bit about some things that I’m interested in. But knowing that they’re estimations keeps me from using them to draw conclusions that are inappropriately concrete.

I think that reflection of them as estimations could help a lot with how they are perceived. People who generally don’t trust them may feel more comfortable if they were presented in this softer and more accurate way. It also makes it more difficult to wield them as an argument-ending baton if they are acknowledged to be approximations.

I think most people who dislike and/or distrust these metrics are uncomfortable with approximations being presented as facts. For example, Nick Collison, who was an “APM superstar” for several years in several different plus-minus models. I think people were so turned off by the implication that he was among the league’s best players that it colored the perception of the whole frame of analysis. But I don’t think anyone, even the biggest plus-minus critics, would have argued that he was a bad player. He set screens, played good defense, crashed the boards, made smart passes, didn’t make mistakes. Those are things that are nearly universally regarded as good (if unglamorous) things. The problem came with the implication that Collison was the “sixth-best” player in the league, or wherever a given model slotted him.

The point being, I think these models are useful and accessible for identifying who’s good and who’s bad and the ballpark of how good or bad they are. They become problematic and undermine their accessibility when they are used with careless specificity—“Nick Collison is better than 453 other NBA players, because this number said so.”

Clemens: Ian’s point is actually a point I was trying to make in my RPM glossary entry. Even FG% is an estimate because we don’t know what the player’s FG% would be if we had him take an infinite number of shots. That’s the definition of a statistic — an estimate of the true population value we get from looking at a sample of all instances. This is why I consider RPM to be more ‘honest’ than things like PER and WS, because those are point predictions based on a weighting formula. They don’t really acknowledge the uncertainty that must be inherent to any statistic. RPM does and sure, it’s hard to get standard errors for ridge regressions and so on, but they are the best one-number estimates we have. Bill Simmons’s footnote about RPM[4. in part three of his trade value column] was largely misguided. Using PER as a contrast to RPM, which he described as some kind of engineered thing where nerds just screwed with numbers until they looked right, gets it largely backwards. PER was created by fiddling with the numbers until they looked right — I don’t know exactly how Hollinger created it but I assume he tinkered with the input weights until he got some reasonable answers. By contrast, RPM is really a theoretical construct. It’s a method for decomposing the value from a play into constituent parts that can be attributed to particular players. The method came first, and it creates numbers that look right because it is based on strong theoretical principle.
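To make the "even FG% is an estimate" point concrete, here is a minimal sketch of the sampling uncertainty around a shooting percentage. It assumes each shot is an independent trial with the same make probability and uses the normal approximation to the binomial; the shooter and the shot counts are made up for illustration, and `fg_pct_interval` is a hypothetical helper name.

```python
import math

def fg_pct_interval(makes, attempts, z=1.96):
    """Return (fg_pct, std_error, (low, high)) via the normal approximation.

    Assumes independent shots with a constant make probability, which real
    shot selection does not strictly satisfy.
    """
    p = makes / attempts
    se = math.sqrt(p * (1 - p) / attempts)
    return p, se, (p - z * se, p + z * se)

# Hypothetical shooter hitting 45% over two different sample sizes.
small = fg_pct_interval(45, 100)
large = fg_pct_interval(450, 1000)

for label, (p, se, (low, high)) in (("100 shots", small), ("1000 shots", large)):
    print(f"{label}: FG% {p:.3f}, SE {se:.3f}, ~95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions, the same 45% is far less certain on 100 attempts than on 1,000, which is the sense in which a box score percentage is a sample estimate and not a fixed fact about the player.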

Walker: In Simmons’ defense every good metric goes through “screwing with numbers until they look right.” One of the reasons RPM exists is because Daniel Myers suggested adding box score information to Jerry Engelmann’s RAPM. But how did we know RAPM needed help? It was obviously flawed, as we noted by the eye test. Kevin Durant and Kevin Love, for example, posted extremely low RAPMs in the early 2010s, and adding box score and other information significantly improved on RAPM’s ability to predict offensive and defensive ratings.

Restifo: There are some oddities too, that I’m not sure have been mentioned yet. In some RAPMs you can improve your DRAPM by taking more shots, for example. This is more of an artifact of the mathematics of estimating a player’s defense, rather than a reflection of the reality of defensive skill. Defense as a whole is still notoriously hard to evaluate precisely.

Levy: Shifting topics slightly, I know Seth talks a lot about the difference between what a player “is” and what a player “did” which I think is really germane to the all-in-one metric conversation. These metrics show what a player did, which is heavily influenced by context and is one of the reasons they are so inappropriate for player ranking. For example, Patty Mills ranked really high in SPM last season, above John Wall. That’s what he did, and that’s because he played for the Spurs in a constrained role where it was easier for him to be productive and efficient. That doesn’t mean he is better than John Wall. When people want to talk about “better” they’re really talking about something that we can’t measure well because it involves “did” and “could”. These metrics are pretty good with did. Not as good at could.

Layne Vashro (@VJL_BBall): I want to say more on this, but one quick note.

Most metric interpretation issues would be solved by more consistent use of error reporting or confidence intervals. One of the biggest strengths of quantitative methods over more subjective evaluations is that we know almost exactly how wrong we likely are (at least in general). Despite that, you never see range estimates attached to metrics (this criticism includes myself). I think it would do everyone a big favor for us to phrase things in terms that capture our lack of certainty. “It is 95% likely that Nick Collison, in his current context, is worth…”… “I am 95% certain that Aaron Gordon’s peak NBA value will fall between X1 and X2”…

In my opinion, this is one of the easiest important changes we can all make.
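As a rough illustration of what attaching a range to a number could look like, the sketch below computes a percentile bootstrap interval for a player's average per-game net rating. Everything here is an assumption for illustration: the 82 game values are randomly generated, `bootstrap_interval` is a hypothetical helper, and a real plus-minus model would need to resample possessions or stints and refit the model each time.

```python
import random
import statistics

def bootstrap_interval(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the games with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    low = means[int(tail * n_boot)]
    high = means[int((1 - tail) * n_boot) - 1]
    return statistics.fmean(values), low, high

# Made-up per-game net ratings for one season (not real data).
data_rng = random.Random(1)
games = [data_rng.gauss(2.0, 12.0) for _ in range(82)]

point, low, high = bootstrap_interval(games)
print(f"estimate {point:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

Reporting the interval alongside the point estimate is the whole change being proposed; the width of the range does the work of saying how much the single number can be trusted.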

Clemens: I made this point in my RPM glossary entry too and then a bunch of people schooled me on how hard it is to get standard errors for ridge regression. J.E. has done it via bootstrapping but I have read some academic literature that suggests that no matter how you do it it is probably wrong and that ultimately you just shouldn’t try. So that’s kind of depressing. So the bootstrapped SEs are like estimates of estimates.

Two (three?) possible solutions though:

1) Stick to OLS (no ridge) and iteratively drop players who cause collinearity problems. This means moving more players into your comparison case, which sucks, but you should be able to retain good estimates of players you really care about, like Roy Hibbert, by dropping players you only kind of care about, like Ian Mahinmi. You’d have to do something like run the regression, then automatically check VIFs or collinearity between sets of players, drop a player that is problematic in a group who has the fewest possessions, repeat…

2) Maybe Bayesian ridge regression can dodge this? Bayesian ridge regression as I understand it is just OLS with a prior centered at 0 and a parameter for the SD of that prior distribution. If you want to also have SPM as a prior, you could either mix it afterwards (.25*SPM + .75*BPM or whatever, totally a-theoretical ‘messing with numbers’!) or you could blend the prior with your SPM and then use that as a prior (so the SPM regressed towards 0 basically).

3) Lasso? I don’t know anything about lasso. I assume if it was a good solution someone would have done it.
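For option 2, here is a minimal sketch of the textbook Bayesian reading of ridge regression, run on synthetic stint data. With a zero-centered normal prior on each player coefficient and a known noise variance, the posterior mean equals the ridge estimate and the posterior covariance gives a spread for each player. The data, the penalty `lam`, the assumed `noise_var`, and the `bayesian_ridge` helper are all illustrative assumptions, not anyone's published model.

```python
import numpy as np

def bayesian_ridge(X, y, lam, noise_var):
    """Posterior mean and SD of coefficients under a N(0, noise_var / lam) prior.

    The posterior mean is identical to the ridge estimate with penalty `lam`.
    `noise_var` is treated as known here; in practice it must be estimated.
    """
    precision = X.T @ X + lam * np.eye(X.shape[1])
    mean = np.linalg.solve(precision, X.T @ y)
    cov = noise_var * np.linalg.inv(precision)
    return mean, np.sqrt(np.diag(cov))

# Synthetic stints: +1 for three "home" players, -1 for three "away" players.
rng = np.random.default_rng(0)
n_players, n_stints = 8, 400
true_value = rng.normal(0, 2, n_players)
X = np.zeros((n_stints, n_players))
for row in X:
    on_court = rng.choice(n_players, size=6, replace=False)
    row[on_court[:3]] = 1.0
    row[on_court[3:]] = -1.0
y = X @ true_value + rng.normal(0, 10, n_stints)  # noisy stint margins

mean, sd = bayesian_ridge(X, y, lam=5.0, noise_var=100.0)
for i in range(n_players):
    print(f"player {i}: {mean[i]:+.2f} +/- {1.96 * sd[i]:.2f}")
```

These intervals are conditional on the prior and on the assumed noise variance, so they do not settle the objections raised about frequentist standard errors for ridge; they only show that the Bayesian framing produces a spread as a by-product of the fit.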

Partnow: It seems mildly churlish for the reason not to include error terms or confidence intervals for a stat like RAPM to be the inexactitude of the confidence interval. We can’t tell you exactly how wrong we are, so we’ll present it as if we’re not wrong at all? Even an estimate of confidence intervals would be useful.

For example, friend of the blog Talking Practice shared a small selection of his/their 2013/2014 data with me[1. For a post which died a silent death in the drafts, but which was survived by its child, this discussion.] which had Danny Green estimated to be 0.5 pts/100 more effective/valuable/better than Klay Thompson last season. The first thing to note is that Thompson obviously got better from last season to this. One-number metrics aren’t necessary to see this, but most/all I’ve seen reflect it, which is nice. The second thing is that using bootstrapping techniques, we were able to determine this relative rating meant Green was about 60% likely to have been better/more valuable/more effective than Thompson last year. That’s FAR less definite (considering the baseline is 50/50 for any two players) than a simple rank ordering of players would suggest.
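A back-of-the-envelope sketch shows how soft a 60% edge is. The figures in the discussion came from bootstrapping; the version below instead assumes the uncertainty in the Green-minus-Thompson gap is roughly normal, which is a simplification introduced here, and backs out the standard error that a 0.5-point gap and a 60% probability would imply. The helper names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def prob_better(diff, se_diff):
    """P(true difference > 0), assuming a normal estimate with this SE."""
    return STD_NORMAL.cdf(diff / se_diff)

def implied_se(diff, prob):
    """SE of the difference implied by an observed gap and a win probability."""
    return diff / STD_NORMAL.inv_cdf(prob)

gap = 0.5                      # points per 100, from the discussion
se = implied_se(gap, 0.60)     # about 2 points per 100 under this assumption
low, high = gap - 1.96 * se, gap + 1.96 * se

print(f"implied SE of the gap: {se:.2f}")
print(f"~95% interval for the gap: [{low:+.2f}, {high:+.2f}]")
print(f"check: P(Green better) = {prob_better(gap, se):.2f}")
```

Under that assumption the plausible range for the gap runs several points in either direction, so the ordering of the two players could easily flip.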

Yet, how is RPM most commonly, almost exclusively, used? Khris Middleton is a top 10 player by RPM. Steph Curry is better than James Harden because RPM. The way these numbers are most often presented implies a level of exactitude and certainty that is simply unwarranted, which then gets translated into “objective rankings” when employed by the more general public, which doesn’t understand the methodology well enough to know these ratings are estimates, and what that fact means. Further, the complexity of the method tends to provide an unwarranted veneer of objectivity. Any given method, from PER on up, has a ton of assumptions baked into the formula. At least with PER (or better versions of BPM models like Kevin’s DRE) those assumptions are transparent. That transparency is largely lacking from most public APM models and I’m not completely sure many of the creators could adequately verbalize the assumptions a given model necessarily includes.

There’s not much marketability in uncertainty, so it’s understandable why the range of the estimate and the falseness of any appearance of pure objectivity don’t get talked about much. But those issues are there and are vital to the proper understanding of how much weight to give and what the best use case for a given metric should be.

Clemens: When I get done with the freelance project I’m working on now, I am hoping to introduce an RAPM with confidence intervals and a frontend where you can enter two players and it will spit out something in plain English like “Player a is 75% likely to be better than player b defensively” but now that we have YAPM and Layne’s thing I dunno maybe we should just figure out some way to consolidate all that stuff. I also want to have a full Python or Stata or R tutorial for creating it. I know Evan Zamir just did this and it has been done in the past but I feel like it could use a new presentation.

Kevin Ferrigan (@NBACouchside): I think the regression — whether ridge or otherwise — methodology should just be applied to more things. I like Layne’s Four Factors RAPM, but I’d love to see things like how opponent shot distributions change with one player on the floor — specifically threes surrendered and shots within 3 feet. Those two areas are the most important, obviously, and we’ve hammered the point on NC now that 3P% against is largely random / luck, so forcing teams out of those shots in the first place is a good idea. FG% inside 3 feet, alongside FGA inside 3 feet is also probably a good thing to do, because shot deterrence does seem to be more of a thing there than on 3s.

Walker: Regarding confidence intervals I have two things to add:

1) The inability to discern the true difference between two observed values is the fault of rankings in general, not one-number stats specifically. Every ranking system suffers from this, and will always suffer from this. An easy example is that there is almost no discernible difference in W-L between a #15 college team and a #25 college team.

2) I think it’s a little odd that we are saying in the same breath that “One Number Systems Can’t Really Rate Players Because Context Matters” then we go on to say “I Wish We Knew The Confidence Intervals So We Could Appropriately Rank A Player” (which I feel is an undertone here). If RPM doesn’t “rank” players (which I’m starting to agree with), then confidence intervals don’t seem to add much to the conversation. I do agree however that RPM does rank “value to team,” by which confidence intervals would be of some aid.

Jay Cipoletti (@Hoopalytics): I’m the blasphemer here in that I really don’t pay attention to one-number ratings at all. I’m not involved in recruiting/talent acquisition, so determining who is better really doesn’t matter to me.

If I’m looking at a team, I want to know HOW they do WHAT they do well. Inevitably that drills down to individual players, but ratings have nothing to do with it. I can’t recall a one-number metric ever coming up in conversation with a coach.

Four factors numbers, shot distribution, FG% by zones — those all tell you who to look at and what to look for on film. I’ll admit indifference has led to ignorance regarding the metrics in question, so it is quite possible I’m missing an entire world of insight. If I’m sitting in a film session with a coach, how can I use one-number metrics to pinpoint specific things to look for?

Or for this specific two-year-old question I have been unable to answer, is there a measure in use that captures this play: Thomas Bryant is a 5-star 6’10 kid at Huntington Prep that moves like he is 6’5. On one play in February ’13, he kept an offensive rebound alive for two tips, allowing his guards to retreat. Then he jammed the outlet pass. Then he sprinted the floor to be the third defender back, preventing a layup from being attempted on the right block. After the ball was kicked out, the possession ended with a missed three-pointer that a teammate rebounded. He essentially made three potential stops on one possession but never touched the ball. I was at the game with my high school coach and he asked how you measure that. I had no idea. I still don’t. I’m an eager student if there is a way.

The underlying problem I have, with trying to capture that play and player ratings in general, is that they are expressed in a ball-centric language. I think of the game as being basket-centric, with 14 bodies in motion governed by different sets of rules. The ball is a unique body on the level of a celestial sphere, but its movements are wholly a function of the other bodies and in relation to the baskets on either end.

I just have this sense that 1 player metrics are the Catholic Church circa 1542…it is also entirely possible my willful ignorance renders all the above moot.

Clemens: I don’t think anybody would say that one-number metrics are particularly useful for coaching. And it comes with caveats for recruitment. But to answer your question about Thomas Bryant, yes, RAPM does that. RAPM is kind of the ultimate holistic measurement. It picks up literally everything that happens on the court and boils it down to one number. The problem being, of course, that you have no idea what happened on the court that resulted in that number. But my guess is that if Bryant frequently made those kinds of plays he would have a good RAPM. A pure RAPM is completely divorced from the box score, so even if Bryant is a relatively weak box score player, RAPM might still like him.

So I guess the one place where it might help a coach is to help confirm what the eye test is telling you about the Nick Collisons of the world and maybe even to separate a Nick Collison kind of player from a hustle player who does not bring value — like I don’t know Scalabrine or something. But that might already be obvious to coaches.

Walker: In an attempt to be a more reasonable human, I think I’ve begun to take a more Haberstroh-ian / Hollinger-ian approach to RPM and RAPM: using it to highlight, reward, explain where team and player success comes from that might be difficult to do otherwise.

Example: In D.Rose’s MVP season, the Bulls were something like 10th on offense and 1st on defense. One of the primary logical reasons that RAPM advocates and basketball-stat-twitter (or at least my own narrow vision of it) were against Rose for MVP was that his contributions appeared to be mostly offensive. Of course defense is arguably more a “team” skill than a player skill – but for the voters, the best player on the best team was clearly Rose. RAPM favored Dirk, LeBron, CP3 and Dwight I believe. On the Bulls the strength by RAPM was split between Deng and Noah (Rose gets lumped back into that conversation once you throw in his box-score stats, which 2011 pre-RPM shows). So from at least this one-number metric, we are enabled to better see a player’s defense and therefore MUCH better understand their overall contributions.

Partnow: Thanks guys, this was great, at least if the readers were able to wade through the YAPM alphabet soup.