Jun 26, 2014; Brooklyn, NY, USA; Andrew Wiggins (Kansas) walks off stage after being selected as the number one overall pick to the Cleveland Cavaliers in the 2014 NBA Draft at the Barclays Center. Mandatory Credit: Brad Penner-USA TODAY Sports
Freelance Friday is a project that lets us share our platform with the multitude of talented writers and basketball analysts who aren’t part of our regular staff of contributors. As part of that series we’re proud to present this guest post from Michael Murray. Michael is a college kid with dreams of the association while he should be doing his homework. He writes about laundromats, Santana, and sports using data at shittydata.wordpress.com. You can follow him on Twitter at @michaelmurrays.
College basketball season is right around the corner. Official team practices begin October 3rd and we’re less than half a month away from the first games of the season. As if more basketball being played isn’t exciting enough, with new players come new rumors – Who showed up to campus looking “elite”? Who’s got an “NBA body”? How is nobody talking about the incoming phenom your alma mater managed to snap up that the blue bloods’ money couldn’t buy? Clearly all of the freshmen and the JUCO transfer your team picked up are NBA-destined, but can we empirically predict their success based on their first year of college ball?
To get a handle on this we can try to build a regression model that predicts making it to the association using first-year playing stats. I define ‘NBA Success’ as being drafted in either of the two rounds of the NBA draft in any of a player’s years of eligibility. This variable takes on the value of 1 or 0 because you either get drafted or you don’t. The parameters for the data analyzed need to be as simple as possible because the maximum sample size is every college basketball player ever and I am just one man. For the sake of simplicity this model looks only at the most recent class of players to exhaust all eligibility (freshmen during the 2010-2011 season) and also focuses only on forwards (n=78).
The first regression I run simply includes the gamut of in-game statistics (games played, minutes per game, field goal percentage, free throw percentage, 3 point percentage, true shooting percentage, rebounds per game, assists per game, steals per game, blocks per game, assist to turnover ratio, points per weighted shot, effective field goal percentage, points per 40 minutes, points per game, efficiency, and efficiency per possession). The bread and butter of predictive analysis is the linear regression, but a standard linear regression is best suited to predicting a continuous dependent variable, and my outcome is binary, so there are better methods. I use a probit model, which estimates the probability that a player will fall into the category of 1 (drafted) rather than 0 (undrafted) given their characteristics or stats. It looks something like this:
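In its standard textbook form a probit model can be written as below. The notation is assumed for illustration and is not taken from the original post: $x_{i1}, \dots, x_{ik}$ are player $i$'s freshman stats, $\beta_0, \dots, \beta_k$ are the coefficients being estimated, and $\Phi$ is the standard normal cumulative distribution function.

```latex
\[
\Pr(\text{NBA}_i = 1 \mid x_i)
  = \Phi\!\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}\right),
\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
\]
```

Because $\Phi$ squeezes any value into the range 0 to 1, the output can be read as a probability of being drafted, which a plain linear regression cannot guarantee.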
Looking simply at the coefficients of the first iteration of the model, the most influential stats are PPWS, A/TO, PPG, BPG, and RPG. BPG ranking higher than RPG here is likely a result of other measures of defense near the hoop not being included in the model and being reflected in blocks instead. Unfortunately, though the model fit isn’t horrible, none of the variables are statistically significant.
The first thing that jumps out while assessing this problem is that a whole lot of my variables are different ways of measuring similar things. Having multiple ways of measuring shooting efficiency doesn’t tell me much. These similar variables end up highly correlated with each other, creating a problem of multicollinearity, which inflates the standard errors and makes it hard for any single variable to register as significant. Below are the square roots of the variance inflation factors. A value greater than 2 signals that I need to look at which other variables in the model that variable could be correlated with.
Variable | sqrt(VIF) | Variable | sqrt(VIF) |
G | 1.245319 | BPG | 1.946223 |
MPG | 12.725175 | A/TO | 2.816939 |
FG% | 5.264167 | PPWS | 7.109452 |
FT% | 1.606988 | eFG% | 4.841158 |
3PT% | 1.501046 | Pts40 | 10.351906 |
True% | 7.982021 | PPG | 17.698638 |
RPG | 4.009675 | Eff | 11.162589 |
APG | 2.237264 | Eff/Pos | 4.792102 |
SPG | 1.35632 | | |
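The post does not say which software produced these numbers, so the following is only a minimal sketch of how square-root VIFs can be computed. It relies on the identity that a predictor's VIF equals the matching diagonal entry of the inverse of the predictors' correlation matrix. The function names and the toy stat lines are hypothetical, not the post's data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sqrt_vif(columns):
    """Return {name: sqrt(VIF)} for a dict mapping stat name -> list of values."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: math.sqrt(inverse[i][i]) for i, name in enumerate(names)}


# Made-up stat lines for five hypothetical players (NOT the post's sample).
toy = {
    "PPG": [5.0, 7.0, 9.5, 11.0, 6.0],
    "Pts40": [10.0, 14.0, 18.0, 22.0, 12.0],
    "FT%": [70.0, 55.0, 80.0, 62.0, 75.0],
}
for name, value in sqrt_vif(toy).items():
    flag = "check for collinearity" if value > 2 else "ok"
    print(f"{name}: {value:.2f} ({flag})")
```

In the toy data PPG and Pts40 move almost in lockstep, so both should be flagged, while FT% should not.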
Once I know the problem variables I can also check what they’re correlated with (the higher the number the higher the correlation, green is bad).
For my next iteration of the model I remove all variables that have a correlation of 0.6 or greater with another variable. This leaves me with G, FT%, 3PT%, RPG, SPG, BPG, A/TO, PPWS, Pts40, and Eff/Pos. This model turns out better. The model fit isn’t worse and some of the variables are statistically significant this time.
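The post does not spell out how it decided which member of a correlated pair to keep, so the sketch below makes an assumption: variables are visited in a chosen priority order, and a variable is kept only if its absolute correlation with every variable already kept is below the 0.6 cutoff. The function names and toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.6):
    """Greedy filter: keep a variable only if |r| < threshold with all kept ones.

    `columns` maps stat name -> list of values; dict order sets the priority,
    so earlier variables win when two are highly correlated (an assumption,
    not the post's documented rule).
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up stat lines for five hypothetical players (NOT the post's sample).
toy = {
    "Pts40": [10.0, 14.0, 18.0, 22.0, 12.0],
    "PPG": [5.0, 7.0, 9.5, 11.0, 6.0],
    "FT%": [70.0, 55.0, 80.0, 62.0, 75.0],
}
print(drop_correlated(toy))
```

Listing Pts40 before PPG means the filter keeps Pts40 and discards PPG, which mirrors the variables that survive in the post. A different priority order would keep a different subset.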
Variable | Estimate | Std. Error | z value | Pr(>|z|) |
Intercept | -13.97326 | 6.13823 | -2.276 | 0.02282 |
G | 0.24654 | 0.0954 | 2.584 | 0.00976 |
FT% | -0.0164 | 0.03233 | -0.507 | 0.61187 |
3PT% | 0.0136 | 0.0245 | 0.555 | 0.57884 |
RPG | 0.3409 | 0.25254 | 1.35 | 0.17706 |
SPG | -2.01251 | 1.63047 | -1.234 | 0.21709 |
BPG | 0.29948 | 0.54737 | 0.547 | 0.58429 |
A/TO | 1.78193 | 1.47306 | 1.21 | 0.2264 |
PPWS | 0.47249 | 4.17492 | 0.113 | 0.90989 |
Pts40 | 0.30013 | 0.14678 | 2.045 | 0.04088 |
Eff/Pos | -8.60936 | 13.73724 | -0.627 | 0.53085 |
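For readers unfamiliar with this kind of output, the last two columns follow from the first two: the z value is the estimate divided by its standard error, and Pr(>|z|) is the two-sided tail probability of that z under a standard normal distribution. A minimal sketch using three rows copied from the table (the function names are my own):

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def z_and_p(estimate, std_error):
    """Return the z statistic and its two-sided p-value."""
    z = estimate / std_error
    p = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p


# (estimate, standard error) pairs taken from the table above.
rows = {
    "G": (0.24654, 0.0954),
    "Pts40": (0.30013, 0.14678),
    "PPWS": (0.47249, 4.17492),
}
for name, (estimate, std_error) in rows.items():
    z, p = z_and_p(estimate, std_error)
    verdict = "significant at 5%" if p < 0.05 else "not significant"
    print(f"{name}: z = {z:.3f}, p = {p:.5f} ({verdict})")
```

The results should come out close to the table's values, with small differences possible because the table's inputs are rounded. Only G and Pts40 clear the conventional 5% threshold.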
With collinearity minimized, what does it mean for games played and points per 40 minutes to be significant?
Games played could mean a couple of things. First, it could mean that players on teams that do well in the NCAA tournament have a better chance of making the NBA, either due to the signaling involved in increased exposure, or the human capital represented by the player’s role in the team’s success. It could also mean that players who are recruited onto good teams are good. It could mean many more things than can be parsed. All this model tells us is that the more games a freshman plays in their freshman year, the more likely they are to be drafted into the NBA.
Points per 40 minutes is a little less complicated. Scoring is good, get buckets. Extrapolating up to 40 minutes makes extra sense for freshmen whose minutes in their first season may have a lot of variance. Even those that start from day 1 may take some time to get a lot of time on the court.
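To make the Pts40 coefficient concrete, here is a rough sketch that plugs the reduced model's coefficients into the probit formula for an invented player. Everything about the player is hypothetical, and the units are assumptions: percentages are taken to be on a 0 to 100 scale and Eff/Pos is given an arbitrary plausible value, since the post does not document either.

```python
import math

# Coefficients copied from the reduced model's table.
INTERCEPT = -13.97326
COEFFICIENTS = {
    "G": 0.24654, "FT%": -0.0164, "3PT%": 0.0136, "RPG": 0.3409,
    "SPG": -2.01251, "BPG": 0.29948, "A/TO": 1.78193, "PPWS": 0.47249,
    "Pts40": 0.30013, "Eff/Pos": -8.60936,
}


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pts_per_40(ppg, mpg):
    """Scale points per game up to a 40-minute game."""
    return ppg / mpg * 40.0


def draft_probability(stats):
    """Probit prediction: Phi(intercept + sum of coefficient * stat)."""
    index = INTERCEPT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)
    return normal_cdf(index)


# Invented freshman forward; units for FT%, 3PT% and Eff/Pos are ASSUMED.
player = {
    "G": 33, "FT%": 70.0, "3PT%": 30.0, "RPG": 6.0, "SPG": 0.8,
    "BPG": 0.8, "A/TO": 0.8, "PPWS": 1.1,
    "Pts40": pts_per_40(ppg=7.5, mpg=20.0), "Eff/Pos": 0.25,
}
baseline = draft_probability(player)

# Same player, but scoring five more points per 40 minutes.
improved = dict(player, Pts40=player["Pts40"] + 5.0)
print(f"baseline: {baseline:.3f}, with +5 Pts40: {draft_probability(improved):.3f}")
```

Because the probit curve is S-shaped, the same five-point bump moves the probability by different amounts depending on where the player starts, so the coefficient should not be read as a fixed percentage-point effect.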
This may feel like a long way to go to find out that playing a lot of games and scoring lots of points are predictors of positive basketball outcomes, but there’s still more that could be done with a model like this.
First is sample size. 78 players is not really that many when you consider how small this makes the portion that make it to the NBA (NBA=1). A more robust model would use freshmen from many years. Second, this model is limited by position, so it can’t be applied to guards and centers. It also makes no distinction between small forwards and power forwards as a result of my data source.
Further, a more robust version of this model would include some kind of measure of athletic talent and size, which are considered “unteachable” and are not measured by in-game stats. I tried to do this by normalizing the Rivals150 rankings from when the players in my sample were seniors in high school, but it caused model error.
Finally, the great lesson of this exercise is that freshman year performance is not terribly predictive of whether or not a player will make it into the NBA. For every one-and-done, there is a late-bloomer.