Nylon Calculus: More on the NBA Draft and why the tallest men can’t dribble
When I watch potential NBA prospects I generally have a simple model in the back of my mind that weighs three factors — height, athleticism, and skill. Earlier this year I wrote a piece detailing this mental model to highlight how much of an outlier an NBA player is compared to the general population.
At the end of the post, I included a bit from a data simulation showing that even if the underlying factors aren’t correlated at all, when you skim off the top prospects they will appear to be negatively correlated. Here I want to expand on that a bit after reading an article called “The selection-distortion effect: How selection changes correlations in surprising ways,” which detailed the same phenomenon in the relationship between GRE test scores and success in graduate school.
What’s interesting here from a statistical perspective is that the selection effect is different than just “restriction of range,” another important concept that can alter the relationship between a factor and the target variable. In that case, an easy basketball example is looking at the impact of height on rebounding among players 6-foot-10 and above — you will see less effect than if shorter players are also included in the sample.
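Restriction of range can be sketched with a few lines of simulation. The rebounding score below is purely hypothetical: the linear dependence on height, the 0.6 slope, the noise level and the 1.5 standard deviation cutoff standing in for "6-foot-10 and above" are all assumptions chosen for illustration, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Standardized height, plus a HYPOTHETICAL rebounding score that depends
# linearly on height. The slope (0.6) and noise scale (0.8) are assumptions.
height = rng.standard_normal(n)
rebounding = 0.6 * height + 0.8 * rng.standard_normal(n)

# Correlation across the full range of heights.
r_full = np.corrcoef(height, rebounding)[0, 1]

# Restrict the range: keep only very tall players (assumed cutoff: +1.5 SD).
tall = height > 1.5
r_tall = np.corrcoef(height[tall], rebounding[tall])[0, 1]

print(f"all players:  r = {r_full:.2f}")
print(f"tall players: r = {r_tall:.2f} (n = {tall.sum()})")
```

The underlying slope is identical in both groups. Only the spread of heights shrinks, which is what pulls the observed correlation down.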
Here we’re getting a different selection effect. To simplify, if we use two of my factors, say height and skill, in a model where they are independent and are used for selection of the best basketball prospects, we change the relationship between the two variables as we filter the data.
This is most easily seen in a graph. Below I have simulated 20,000 prospects where height and skill are represented by standardized scores and the two are independent of each other:
The correlation is basically zero and the charted prospects look like a big blob of dots with outliers at random. If, however, I select only the prospects whose combined height and skill is above average, the shape of the blob changes. The blob is turned on a 45-degree angle and you can see the selection line. More crucially, the relationship among selected players is now noticeably negative, with a -0.44 correlation.
Now, if I go further and filter for only the top 100 prospects out of my initial 20,000, then I get a much stronger negative relationship between height and skill, a -0.83 correlation. And we’ve also truncated the negative outliers on both factors to a degree, with the lowest height, for example, being 0.5 standard deviations above average.
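For readers who want to try this themselves, here is a minimal sketch of this kind of simulation in Python. It is not the code behind the charts. The random seed, the use of a simple sum of height and skill as the total score, and the "sum above zero" rule for the above-average filter are all assumptions, so the exact correlations will differ somewhat from the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Two independent standardized factors for each simulated prospect.
height = rng.standard_normal(n)
skill = rng.standard_normal(n)
total = height + skill  # assumed scoring rule: equal-weight sum


def corr(mask):
    """Height-skill correlation among the prospects picked out by mask."""
    return np.corrcoef(height[mask], skill[mask])[0, 1]


everyone = np.ones(n, dtype=bool)

# Assumed "above average" filter: combined score above zero.
above_avg = total > 0

# Top 100 prospects by combined score.
top100 = np.zeros(n, dtype=bool)
top100[np.argsort(total)[-100:]] = True

r_all = corr(everyone)
r_above = corr(above_avg)
r_top = corr(top100)

print(f"all 20,000:          r = {r_all:.2f}")
print(f"above-average total: r = {r_above:.2f}")
print(f"top 100:             r = {r_top:.2f}")
print(f"shortest top-100 prospect: {height[top100].min():.2f} SD")
```

Changing the seed moves the numbers around a little, but the pattern of a near-zero correlation turning more negative as the filter tightens should hold.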
To be clear, this is just a simplified theoretical model to demonstrate how the selection process changes the relationship between factors. To me, the model seems broadly consistent with what we see in the draft prospect selection process and with a stylized distribution of skills and height among the NBA player pool. But there are other selection models we could use as a filter and get slightly different results. Below I have the result of two separate filters: one for the top fifty most skilled players without regard to height (say, point guards), the other for the top fifty by height without regard to skill.
Again, there is a negative correlation between the two. But the shape looks much more like a corner, and there are more short prospects, with many more below-average-height dots in the “point guard” filtered group than in the top 100 total score group above. Of course, there are models between these two extremes. One example would be a model where height counts for half as much as skill for a point guard, and the reverse for a center.
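The two-filter version, and the in-between weighted version, can be sketched the same way. As before, this is an illustration under assumptions rather than the original code: the seed, the helper name top_k, and the choice to pool the two filtered groups before computing one correlation are mine.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
height = rng.standard_normal(n)
skill = rng.standard_normal(n)


def top_k(score, k):
    """Indices of the k prospects with the highest score (hypothetical helper)."""
    return np.argsort(score)[-k:]


# Two separate filters: skill only ("point guards"), height only ("centers").
guards = top_k(skill, 50)
centers = top_k(height, 50)
either = np.union1d(guards, centers)
r_corner = np.corrcoef(height[either], skill[either])[0, 1]

# Compare how many below-average-height prospects each approach lets in.
top_total = top_k(height + skill, 100)
short_guards = int((height[guards] < 0).sum())
short_top_total = int((height[top_total] < 0).sum())

# In-between model: the secondary factor counts half as much as the primary.
guards_w = top_k(skill + 0.5 * height, 50)
centers_w = top_k(height + 0.5 * skill, 50)
blended = np.union1d(guards_w, centers_w)
r_blended = np.corrcoef(height[blended], skill[blended])[0, 1]

print(f"separate filters: r = {r_corner:.2f}")
print(f"weighted filters: r = {r_blended:.2f}")
print(f"below-average height: {short_guards} of 50 guards, "
      f"{short_top_total} of top 100 by total")
```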
Then there’s the way selection changes the relationship between the factors and the target variable. In the general population using the two-factor skill and height model, height’s correlation to total score is .7, or an R^2 of .49. This makes sense, as height is half the score and the simulation is set up to have no relationship between the two factors. However, if we select only the top 100 prospects in the Skill + Height simulation model, height has a .31 correlation (an R^2 of .1) to the total score and skill has a correlation of .32, even though the two variables together still explain 100 percent of the variation in the total score by design.
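A short sketch makes the same point numerically. The seed and the equal-weight total are assumptions, and with only 100 selected prospects the single-factor correlations bounce around from run to run, so expect values in the neighborhood of the ones above rather than an exact match.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
height = rng.standard_normal(n)
skill = rng.standard_normal(n)
total = height + skill  # assumed scoring rule: equal-weight sum

# Whole population: each factor is half of the total score.
r_height_all = np.corrcoef(height, total)[0, 1]

# Top 100 by total: each factor alone now says much less about the total.
top = np.argsort(total)[-100:]
r_height_top = np.corrcoef(height[top], total[top])[0, 1]
r_skill_top = np.corrcoef(skill[top], total[top])[0, 1]

# Regress total on both factors within the top 100. Because the total is
# exactly height + skill, the fit is perfect and R^2 is 1 by construction.
X = np.column_stack([np.ones(top.size), height[top], skill[top]])
coef, *_ = np.linalg.lstsq(X, total[top], rcond=None)
resid = total[top] - X @ coef
r2_both = 1 - resid.var() / total[top].var()

print(f"population: r(height, total) = {r_height_all:.2f}")
print(f"top 100:    r(height, total) = {r_height_top:.2f}")
print(f"top 100:    r(skill, total)  = {r_skill_top:.2f}")
print(f"top 100:    R^2 using both   = {r2_both:.3f}")
```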
To make the model slightly more realistic, I will add my third variable, athleticism. Again the three variables are simulated as completely independent among the 20,000 prospects, but when I filter for the top prospects by the sum of Skill + Athleticism + Height, there is a negative relationship between any two of them, though not as strong as in the two-factor model.
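The three-factor case extends the same sketch. The seed, the top-100 cutoff for the three-factor filter and the helper name corr_among_top are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Columns: skill, athleticism, height, all independent standardized scores.
names = ["skill", "athleticism", "height"]
factors = rng.standard_normal((n, 3))


def corr_among_top(cols, k=100):
    """Correlation matrix of the chosen factors among the top k by their sum."""
    sub = factors[:, cols]
    top = np.argsort(sub.sum(axis=1))[-k:]
    return np.corrcoef(sub[top], rowvar=False)


c3 = corr_among_top([0, 1, 2])  # select on all three factors
c2 = corr_among_top([0, 2])     # select on skill + height only

# Pull out the three pairwise correlations from the upper triangle.
rows, cols = np.triu_indices(3, k=1)
pairwise3 = c3[rows, cols]

for i, j, r in zip(rows, cols, pairwise3):
    print(f"{names[i]} vs {names[j]}: r = {r:.2f}")
print(f"two-factor skill vs height: r = {c2[0, 1]:.2f}")
```

With three factors sharing the selection burden, a prospect who is weak on one has two other ways to make up for it, which is why each pairwise correlation comes out less negative than in the two-factor run.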
In the real world there are a couple of things these simulations can tell us. The first concerns draft models. The relationships observed in models built off of top prospects cannot be generalized to lower-level basketball players. For me, I always consider my models valid only for the top 100 prospects; results for players outside of that range should be treated very cautiously. Conversely, scouts who use heuristics to separate lower-level and top-level prospects need to adjust their thinking when they then try to discern some separation among top prospects, as the same relationships, for athleticism for example, may not hold.
In both cases, restriction of range is an additional issue when looking at top prospects.
The other takeaway is maybe more theoretical: we don’t necessarily need just-so theories about why big men struggle at the free throw line or in other skill areas. It’s very plausibly a selection process issue: if they couldn’t hit a free throw or pass or shoot and weren’t so tall, they probably wouldn’t be in the league.