Why Shots Are Made (Or Missed) In The NBA

Dec 12, 2014; Atlanta, GA, USA; Atlanta Hawks guard Kyle Korver (26) shoots over Orlando Magic guard Evan Fournier (10) during the second half at Philips Arena. The Hawks defeated the Magic 87-81. Mandatory Credit: Dale Zanine-USA TODAY Sports

At the top of the NBA season, the always fantastic stats side of NBA.com began publishing individual logs for every shot and rebound in the NBA, based on the data gleaned from SportVU, the camera system that tracks the ball and each player’s movement on the court. These logs exist not only for the current 2014-2015 season, but the 2013-2014 season as well. Each log records who took the shot, where the shot took place, how far into the shot clock the shot was taken, how many dribbles were taken by the shooter prior to shooting, how long they had been in possession of the ball prior to shooting, and even how far away the closest defender was to the shooter. In their raw form, 202,690 shot logs exist for the 2013-2014 NBA season, and 55,542 for the 2014-2015 season through December 11th. These logs pertain to field goals only.

The quantity of these shot logs begs for analysis, and many have already answered the call. A few articles have already been written here at Nylon Calculus, and Justin Willard’s awesome stuff at Analyticsgame.com also comes to mind. The work of understanding, however, is never over. Why are shots made and missed in the NBA? What are the most important indicators of shot success?

In an attempt to shed further light on these questions, I took to modeling a given shot’s result and its expected points in the 2013-2014 season. By observing what statistical models deem valuable as input, we can determine and quantify exactly which aspects of a shot’s context are important, and which aspects are the most important.

First, a disclaimer: the shot logs the NBA has provided are far from perfect. In their unfiltered form, there are some 43-foot two-pointers, some 203,569 point shots, and some -23 second touch times. For the purposes of analysis, I simply omitted records that didn’t adhere to common sense. My filtering criteria were: ‘Touch Time’ >= 0; 0 <= ‘Shot Clock’ <= 24; ‘Points’ < 4; ‘Points Type’ = 2 and ‘Shot Dist’ < 23.75; ‘Points Type’ = 3 and ‘Shot Dist’ > 22; ‘Cdef Dis’ <= 12.261; and ‘Shot Dist’ < 39.37. These filtering rules make the data look a little more appropriate and remove some crazy outliers like full-court shots (who cares), but there could still be some hidden errors and outliers that I missed.
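As a rough illustration, the common-sense filter above could be sketched in Python along these lines. The snake_case field names and the sample records are hypothetical stand-ins for the columns in the logs, and the sketch assumes the distance rules are meant to apply per shot type (two-pointers inside 23.75 feet, three-pointers beyond 22 feet).

```python
def is_plausible(shot):
    """Return True if a shot record passes the common-sense filters.

    `shot` is a dict with hypothetical keys standing in for the
    'Touch Time', 'Shot Clock', 'Points', 'Points Type', 'Shot Dist'
    and 'Cdef Dis' columns.
    """
    if shot["touch_time"] < 0:
        return False
    if not 0 <= shot["shot_clock"] <= 24:
        return False
    if shot["points"] >= 4:
        return False
    # Assumed reading: the distance rule depends on the shot type.
    if shot["points_type"] == 2 and not shot["shot_dist"] < 23.75:
        return False
    if shot["points_type"] == 3 and not shot["shot_dist"] > 22:
        return False
    if shot["cdef_dist"] > 12.261:
        return False
    return shot["shot_dist"] < 39.37


# Made-up records: one ordinary jumper and two of the oddities described above.
shots = [
    {"touch_time": 1.2, "shot_clock": 10.0, "points": 2,
     "points_type": 2, "shot_dist": 15.0, "cdef_dist": 4.0},
    {"touch_time": 0.9, "shot_clock": 3.0, "points": 0,
     "points_type": 2, "shot_dist": 43.0, "cdef_dist": 6.0},
    {"touch_time": -23.0, "shot_clock": 12.0, "points": 0,
     "points_type": 3, "shot_dist": 24.5, "cdef_dist": 5.0},
]
kept = [s for s in shots if is_plausible(s)]
print(len(kept), "of", len(shots), "records kept")
```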

Second, if you’re going to stat, always try to stat responsibly. Several of the variables on the shot logs suffer from multicollinearity issues. In layman’s terms, multicollinearity is when the variables used to predict a target vary strongly with each other. This is an issue because strongly correlated predictors lead both to unstable models (models that may change significantly every time they are calculated/run) and to double-counting of the information contained in the correlated predictors. In the shot logs, for example, the dribbles and touch time variables are extremely correlated (0.927) with each other. This correlation is somewhat obvious: the more you dribble, the longer you have the ball. Including both of these variables in a model predicting shot success or expected points would double-count this information, so in lieu of principal component analysis, which is less interpretable, I simply didn’t use the touch time variable. I chose dribbles over touch time because I think we should be more interested in shots that occur after no dribbles than shots that occur at the extremes of touch time.
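For readers who want to check a pair of predictors themselves, a Pearson correlation coefficient can be computed with a few lines of Python. This is only a sketch: the `pearson` helper and the toy dribble and touch-time values are made up for illustration and are not taken from the shot logs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values only: more dribbles tend to come with longer touch times.
dribbles = [0, 1, 2, 3, 5, 8]
touch_time = [0.8, 1.9, 2.7, 4.1, 5.6, 9.0]
print(round(pearson(dribbles, touch_time), 3))
```

A value close to 1 or -1 for two candidate predictors is the kind of warning sign described above.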

The other multicollinearity issue in the shot logs is the relationship between the shot distance and the closest defender distance. The farther a shot is from the hoop, the farther away the closest defender is likely to be. This collinear relationship isn’t nearly as extreme as the one between dribbles and touch time, but in order to keep both shot distance and closest defender distance, it needed to be addressed. To remedy the multicollinearity, I simply divided closest defender distance by the shot distance to create a “proportion of openness” statistic. This metric captures how far away the closest defender was from the shooter, not in raw feet, but in relation to how far away the shot was from the hoop. The proportion of openness has a correlation coefficient with shot distance of -0.345, in comparison to the correlation between closest defender distance and shot distance, which is 0.535. An absolute value of 0.345 for the correlation between the two is still concerning (the other pairs of predictors were all at 0.2 or lower), but the interpretability of the variables remains, which is desirable.
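The ratio itself is simple enough to sketch in a couple of lines. The function name and the handling of a zero shot distance below are assumptions made for this illustration, not details taken from the analysis.

```python
def proportion_of_openness(defender_dist, shot_dist):
    """Closest defender distance divided by shot distance (both in feet)."""
    if shot_dist <= 0:
        # Assumption: a shot recorded at zero feet has no defined ratio.
        raise ValueError("shot distance must be positive")
    return defender_dist / shot_dist


# A defender 4 feet away means very different things at 20 feet and at 2 feet.
print(proportion_of_openness(4.0, 20.0))  # defender is 20% of the shot distance away
print(proportion_of_openness(4.0, 2.0))   # above 100%: defender farther than the hoop
```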

Third, I normalized all the predictor variables using z-score transformation, so that each one has a mean of zero and a standard deviation of one. I won’t get into the gory details here, but this essentially means that variables with large values, ranges, and variance won’t dominate the smaller variables in modeling, variables which could have just as much predictive power as the larger-valued ones.
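A z-score transformation subtracts a variable’s mean and divides by its standard deviation. A minimal sketch follows, with made-up dribble counts; it uses the population standard deviation, which is an assumption, since the source does not say which form was used.

```python
import statistics


def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD, not sample SD
    return [(v - mean) / sd for v in values]


# Made-up dribble counts for six shots.
dribbles = [0, 0, 1, 2, 2, 7]
scaled = z_scores(dribbles)
print([round(z, 2) for z in scaled])
```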

Now to the fun stuff. Using only four variables (shot clock, dribbles, shot distance, and proportion of openness), I ran several models of differing techniques on the entirety of the 2013-2014 season’s shot logs to predict whether a field goal will be made and the expected points of a given field goal attempt.

The results of one such model are below. This model is a logistic regression model that predicts field goal success. Logistic regression is a modeling method that can predict a binary target (like whether or not a field goal is made), using both categorical and numerical predictors.

The most relevant piece of information from this table is the Exp(B) statistic, which represents the multiplicative change in the odds of a field goal being made per one-unit increase in each of the predictors, all else being equal. Since these predictors have all been transformed using z-score transformation, one unit is one standard deviation for every predictor, so the Exp(B) numbers can be compared to each other evenly.
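Exp(B) in a logistic regression is conventionally read as an odds ratio, so its effect on a make probability depends on where that probability starts. The sketch below shows the arithmetic; the helper name and the 46% baseline make rate are illustrative assumptions, not figures from the model output.

```python
def apply_odds_ratio(base_prob, odds_ratio):
    """Scale the odds implied by base_prob by odds_ratio; return the new probability."""
    odds = base_prob / (1.0 - base_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Illustrative baseline only: a shot that goes in 46% of the time.
baseline = 0.46
print(round(apply_odds_ratio(baseline, 1.39), 3))   # odds ratio above 1 raises the probability
print(round(apply_odds_ratio(baseline, 0.778), 3))  # odds ratio below 1 lowers it
```

An odds ratio of 1 leaves the probability unchanged, which is why values are judged by how far they sit from 1 in either direction.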

Translated out of nerd-speak, this model states that:

  • Being open (Exp(B)=1.39) is the most important aspect of shot success (I’ll comment further on this momentarily), followed closely by shot distance (Exp(B)=0.778). Shot clock time and dribbling are a tier below in importance.
  • A shot open enough that the closest defender is farther from the shooter than the shooter is from the hoop is about 34.82% more likely to go in than one that is tightly guarded, all else being equal.
  • For every foot of increased distance from the hoop, a shot is 3.09% less likely to go in, all else being equal. For every 8.74 feet (shot distance’s standard deviation), a shot is 28.53% less likely to go in, all else being equal.
  • For every second off the shot clock, a shot is 2.10% less likely to go in, all else being equal. For every 5.79 seconds (shot clock’s standard deviation), a shot is 12.79% less likely to go in, all else being equal.
  • For every dribble a shooter takes prior to shooting, a shot is 2.56% less likely to go in, all else being equal. For every 3.38 dribbles (dribbles’ standard deviation), a shot is 8.92% less likely to go in, all else being equal.

When applied to the shot logs of the 2014-2015 season through December 11th, this logistic regression model had an accuracy of 61.33%. While this number is surely not earth-shattering, especially when considering that a model that picked only misses would yield an accuracy of just about 54%, it’s important to remember the model is built off of only four independent variables, and in no way takes into account players, their play styles, injuries, or the teams playing.
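The comparison against a predict-all-misses baseline can be sketched as follows. The outcomes and the hypothetical model predictions are made up for illustration; only the idea of the comparison comes from the text.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the actual outcomes (1 = make, 0 = miss)."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Made-up outcomes: 4 makes and 6 misses.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# Baseline: always predict a miss, so accuracy equals the share of misses.
all_misses = [0] * len(actual)
baseline = accuracy(all_misses, actual)

# A hypothetical model's predictions for the same ten shots.
model_preds = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
print(baseline, accuracy(model_preds, actual))
```

A model is only adding information to the extent that its accuracy clears that baseline.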

Another model I trained on the 2013-2014 shot-log data was a standard classification and regression tree. Classification and regression trees are very interpretable, because their results are visuals that can easily be understood without any background in math, statistics, science or basketball. To gloss over the nuts and bolts, decision trees work by finding splits among the records in a dataset that maximize the difference of a target variable between the records on each “side” of the division. For the model visualized below, points was the target variable. In each decision tree node (illustrated by the boxes), you can see the expected points of a shot (average points in those records) that meet the criteria of that node. For the reader’s benefit, I went through and changed the results to reflect the values of the original variables, rather than the transformed values.
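To make the idea of a split concrete, here is a small sketch of how a regression tree might choose one threshold on one predictor. It uses the standard CART criterion of minimizing the squared error around each side’s average, which may differ from the tool used for the tree in this piece; the function names and toy data are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on xs that best separates ys into two groups.

    Returns (cost, threshold, left_mean, right_mean), or None if no split exists.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best


# Toy data: short shots that all score 2 points, long shots that all miss.
shot_dist = [1, 2, 3, 10, 11, 12]
points = [2, 2, 2, 0, 0, 0]
print(best_split(shot_dist, points))
```

The two averages returned for each side play the role of the expected points shown in each node; a full tree repeats this search within each side, across all predictors.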

Like the previous logistic regression model that predicted field goals made, a decision tree predicting expected points ranks the predictors in the same order of importance: proportion of openness, shot distance, shot clock time, and dribbles. The astute may have noticed that there were several splits in the decision tree marking proportion of openness values above 100%. These records represent the thousands and thousands of shots last year where the defender was further away from the shooter than the actual shot was from the hoop, so this value is valid. This model also pretty much figures out the distance of the three-point line by itself, with two early shot distance splits at 22.05 feet (the corner three is 22 feet even) and 23.50 feet (the normal three is 23.75 feet). This result is reassuring to see, because it is evidence that the model is reflective of the real world, and not just fitting to noise in the data, as the decision tree technique has been known to do. Following the decision tree from node to node, it is interesting to see how simple criteria can dramatically change the expected points of a shot. Among shots that are less than 77.67% open (this accounts for 84.67% of last season’s shots), dribbling once before shooting results in an expected points value loss of 0.18 points, from 1.03 points to 0.85 points.

While the decision tree above predicted expected points, I also generated a decision tree model that predicted field goals made. Like the logistic regression model, it beats a predict-all-misses model, and it can predict field goals in the current NBA season with an accuracy of 61.72%.

At this point, the biggest problem with learning from the shot logs is the complicated relationship between shot distance and closest defender distance, and what this means for how “open” a player is. The simple metric I used in this short study is flawed, but I found different statistics that captured the concept of “openness” to have very similar effects and importance. The problem is more than just the multicollinearity between shot distance and the closest defender, and how a larger value of one will lead to a larger value of the other. A myriad of questions exists here: How do we define openness in a statistic that is relatively uncorrelated with shot distance? Is a shot taken one foot away from the hoop with a defender two feet away as open as a shot taken 19 feet away from the hoop with a defender 2 feet away? Isn’t having the closest defender 6 feet away from the shot pretty much the same as having him 12 feet away? How do we deal with outliers within this statistic, once defined? The proportion of openness metric I used was significantly influenced by its outliers. Principal component analysis is an obvious answer to the complicated relationship between shot distance and “openness”, but once an analysis like that is performed, interpretability is lost, and discerning the impact of each independent variable becomes difficult.

There is plenty of work to be done with regard to learning more about shooting, and the public availability of the NBA’s shot logs will only expedite the process. It will be interesting to see if, in the coming weeks and months, someone can shed some light on the complicated relationship between shot distance and being “open”. As it stands now, how open a shooter is seems to play the most important role in predicting a field goal’s result, followed closely by shot distance. The amount of time left on the shot clock and the number of dribbles a shooter took prior to shooting also play a statistically significant role.