Freelance Friday: Getting to the X’s and O’s — A Blueprint for the Use of Tracking Data



Freelance Friday is a project that lets us share our platform with the multitude of talented writers and basketball analysts who aren’t part of our regular staff of contributors. As part of that series we’re proud to present this guest post from Johannes Becker. Johannes is interested in basketball and statistics and is a PhD student in bioinformatics. You can follow him on Twitter @SportsTribution and on his blog, SportsTribution.blogspot.ch

I’ve been writing this article for the last six months. It’s one of those things where I go to bed and the writing only happens in my head. I have been troubled by the ways in which player tracking data is being used in analytics. There are specific reasons why this has troubled me, and specific ways I think it can be remedied.

I will use basketball in what follows, but most of the why, as well as the how, translates freely to all other team sports.


Data Analysis and Some Thoughts on the State of the Art

There is a lot of great stuff going on right now — completely new analyses of point guard play, Austin Clement’s shot charts are mind-blowing, and the APBR community is very lively. But there is also a lot of basketball analysis that leaves me shrugging my shoulders. This is usually not the fault of the person doing the analysis; I often find my own stuff not that telling. The problem for a lot of people who do this for fun is the available data and time.

If you are sitting safely, take three minutes and watch this ‘Open Court’ clip. Even though it is cringe-worthy (nobody ever used PIE!), there are still two important messages. The first is Charles Barkley saying (not verbatim; it hurts too much to watch it again), “I guess the person that invented PIE could not draw up a play that gets me open for the final shot.” The second is Reggie Miller, who says that he wants a scout telling him information (like player tendencies) that none of these statistics can give him.

I would imagine that they are right. For a player, qualitative information (LaMarcus Aldridge likes to take one dribble to his right before shooting a turnaround) is more important than quantitative information (LMA shot x% this year on pull-up shots). But I think that both points of critique can actually be tackled by statistics, just not with the approaches currently being discussed publicly.

One of the best sources to learn something about the NBA right now is @j_069’s YouTube channel. The combination of set play sequences (I love the Warriors sliding doors play for the name alone) and great break-beats can tell you more about basketball than any regression model. It would be great if those videos were automatically quantifiable.

To put it differently: statistics right now are great at explaining to us that corner threes are one of the best shots, a high percentage combined with three points. But nobody ever mentions that the geometry of the court might make it complicated to get the ball there. People make fun of Rudy Gay and his inefficiency. But people ignore that he does not necessarily put himself in that position (okay, maybe he does); rather, the coach/team allows/forces him to do what he does.

If you think of basketball plays, you have to grade them in terms of efficiency and complexity. An isolation play is generally simple. A pick-and-roll is maybe a bit more complicated, but often more efficient. This Spurs play is maybe as complicated as it gets — but it seems to net you a good corner-three opportunity. Even isolations and basic pick-and-rolls are not simply comparable, both in expected efficiency and in complexity. Good luck running an isolation play with three teammates on the floor who can’t shoot from the outside. And often a Spurs pick-and-roll is simply disguising some other play.

In my opinion, there is too much focus on player evaluation and not enough focus on basketball evaluation. Even though stats like Adjusted Plus-Minus go to great lengths to eliminate this aspect, the worth of a player will always be highly context-specific. However, a lot of players are more interchangeable than we think. A recent article here at Nylon Calculus estimated that it takes, on average, around 700 shots to be 50% certain that one player is a better three-point shooter than another. I may be describing that result imprecisely, but to me it means that most NBA wings will become pretty decent three-point shooters if you just give them enough free corner threes (see: Green, Danny; Ariza, Trevor).

Data analysis always means information reduction. But our data space is of high dimension: (number of players in the NBA) x (number of five-man lineups on a team) x (the two spatial dimensions of a basketball court) x (time). For reasons I mentioned earlier, a lot of studies focus on the first part (players in the league); adjusted stats try to account for the five-man lineups, and x-player rotation data focuses precisely on this part. Those studies can either survive without tracking data or use preprocessed tracking data like the public SportVU data at NBA.com. But all these studies ignore time. In general, time seems to be the ugly duckling of sports analytics. Some studies highlight the influence of the shot clock, or look at the relationship between playing time and efficiency. But no publicly available story addresses the fact that it takes time for a play to develop, to bend the defense according to your wishes.

What I would like analytics to be used for is a series of questions that range from, “What makes an isolation play efficient (besides the quality of the one-on-one guys)?” to, “How often does Golden State use the sliding doors? What is its expected outcome in points? What is the most effective way to defend it?”

This, in my opinion, should be the beginning of the analysis. At the end we can start to ask questions like, “Which players are the right personnel for a specific play?”, the question that most analysis tries to answer at the moment while ignoring the X’s and O’s. The approach I will lay out below will not yield a lot of direct results during the first months, and I am aware that a lot of people (myself included) simply lack the resources. But in the end it will provide answers to more useful questions than “Is Kobe correctly ranked at number 25?”. It will answer questions about the way that the game is played.


A Different Way to Approach Tracking Data

In my opinion, the most common approaches to tracking data right now have a limited horizon. The reason is that they have to broadly ignore spatial and temporal interactions, which are what make team sports so fascinating. These approaches can certainly discover things, but they come with a bunch of caveats. Seth Partnow described it in a recent article, which I would summarize as, ‘We know what a good outcome for a team looks like, but we have no idea how to get there.’

In the following, I want to describe an approach that directly uses tracking data to get more into the how’s, what’s and which’s, like:

  • How are plays designed that give you a high percentage shot?
  • What plays are used effectively and how often by which team?
  • Which plays are effective for shortened shot clocks?

It is going to be far from a finished product. Instead, I will highlight techniques and possible pitfalls to give a broad blueprint, the “Becker Blueprint” [note: As I am not expecting to ever earn money with this, I will try to at least become famous 😉 ]. Once again, I am fully aware that most bloggers and journalists do not have the data and time to use my proposals. However, I think that approaches like this one will be part of the next big step in sports data analysis. I am also aware that the following can sometimes be confusing (my head is not the most organized space in the world). If you have a question, feel free to ask and I will try to explain my brain 🙂 .

Cluster analysis of raw tracking data

The general idea of the “Becker Blueprint” is the following: instead of labeling each play individually (“pick-and-roll” or “isolation” or “something fancy that the Spurs are doing”), all plays are compared for similarity. Each pair of plays gets a distance value, which is small for plays whose tracking data is similar and big for plays that look completely different. This leads to agglomerations of similar plays, which can then be detected by cluster analysis. A user can look at each of these clusters and give it a fitting, more detailed name. This is advantageous because it doesn’t require manual annotation of each play. Even more importantly, it can be much more precise in its groupings of plays. For example, a Spurs pick-and-roll is not simply a pick-and-roll; it is often preparation for something that happens three seconds later.

This approach requires careful consideration of several aspects.

Data storage

To start on the blueprint, it is indispensable to have access to both the tracking data and the raw video footage. Raw video footage seems to me to be sometimes undervalued in analysis, but I implore every analytics person who wants to make meaningful claims to at least watch some portion of their data on video.

Raw video footage is, of course, something of a storage obstacle, but I would guess that those of us who have access to a large amount of tracking data can figure those obstacles out as well. In general, I would say every play needs a unique ID that links raw tracking data to raw video footage and available box score data[1. classical box score data, plus evaluated tracking data, plus which players are on the court]. The box score part is important because it is human-readable. You cannot watch every play, but you need to know how every play ended (turnover, rebound, corner three, etc.) for further analysis. To avoid bottlenecks, it is important that the raw tracking data is stored in a way that is quickly accessible for the further data processing.
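
To make the linkage a bit more concrete, here is a minimal sketch of the kind of record I have in mind, one per play. The field names are hypothetical, not any real SportVU schema, and a real system would obviously need more than this.

```python
# A minimal sketch of a per-play record linking tracking data, video and box score.
# All field names are hypothetical placeholders, not a real SportVU schema.
from dataclasses import dataclass
from typing import List

@dataclass
class PlayRecord:
    play_id: str                 # unique ID shared by all three data sources
    tracking_file: str           # path to the raw x/y coordinates for this play
    video_file: str              # path to the raw footage clip
    offense_on_court: List[str]  # five offensive player IDs
    defense_on_court: List[str]  # five defensive player IDs
    outcome: str                 # human-readable result: "corner three", "turnover", ...
    points: int = 0              # points scored on the possession
```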

Distance metric

Finding a good distance metric is probably the most novel and tricky step of the whole blueprint. You want your metric to place related plays close together and unrelated plays far apart. I know, this sounds very obvious, but the problem is that in reality “related” can be a little bit subjective. An example:

  1. Top of the key pick-and-roll, while the other three players are positioned around the three-point line. The ball handler drives. Help defense forces the ball handler to kick out the ball for a corner three.
  2. Top of the key pick-and-roll, while the other three players are positioned around the three-point line. The ball handler drives. No help defense and the ball handler gets a layup.
  3. Top of the key pick-and-roll, two players are positioned around the three-point line and one is on the weak side low post. The ball handler drives. Help defense forces the ball handler to kick out the ball for a corner three.

In my opinion, you want #1 and #2 to be more closely related than #1 and #3, even though the outcomes of #1 and #3 are more similar. As I said, there is not one obvious answer, and the whole thing will need several tries until the user is convinced that the metric represents a good distance between plays.

My approach would be to start with a small number of manually selected plays that are well separated: different set plays, typical pick-and-roll and isolation situations, fast breaks, short shot-clock situations, etc. Take maybe five to ten plays for each set, leading to around 100 plays. Add the same number of random plays to see what happens with data that has no user bias.

I would define the distance between plays A and B, vaguely, as the minimal distance between the ball and the offensive players of play A and play B, integrated over time. It might make sense to include the defense, but I would estimate that it adds a lot of noise and slows the computation down, so it might be advantageous to look at the defense separately. The task is basically to find the matching between players A1-A5 and B1-B5 under which the plays look the most similar. As an example, if LeBron is the screen setter in play A and the ball handler in play B, a minimal distance would not necessarily link the two LeBrons. You basically have to minimize over the 120 (= 5!) possible player pairings (luckily there is only one ball). I would weight the distance between two players by their distance from the ball: players that are far away from the ball should have a lower influence on the similarity score. Just like the defense, those players will be evaluated later on.
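
As a rough illustration of this idea, here is a sketch in Python. It assumes both plays have already been resampled to the same number of frames and that the coordinates come as NumPy arrays; the ball-distance weighting is just one plausible choice, not a tested recipe.

```python
# A minimal sketch of the play-to-play distance: try all 5! player pairings,
# weight each player by his distance to the ball, and sum over frames.
import numpy as np
from itertools import permutations

def play_distance(ball_a, off_a, ball_b, off_b):
    """ball_*: (T, 2) ball coordinates; off_*: (T, 5, 2) offensive player coordinates."""
    best = np.inf
    # Try all 5! = 120 ways of matching the offensive players of play A to those of play B.
    for perm in permutations(range(5)):
        off_b_perm = off_b[:, list(perm), :]
        player_dist = np.linalg.norm(off_a - off_b_perm, axis=2)           # (T, 5)
        # Players far away from the ball should influence the similarity score less.
        dist_to_ball = np.linalg.norm(off_a - ball_a[:, None, :], axis=2)  # (T, 5)
        weights = 1.0 / (1.0 + dist_to_ball)
        frame_dist = (weights * player_dist).sum(axis=1) \
                     + np.linalg.norm(ball_a - ball_b, axis=1)             # (T,)
        best = min(best, frame_dist.sum())  # "integrate" over time by summing frames
    return best
```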

More complicated is the previously mentioned issue of measuring the distance between plays that end differently. Another example is plays where, somewhere along the way, the ball handler stops the ball, and this stop can take different amounts of time. My idea for solving this problem is based on something used in biology called sequence alignment. My vague idea would be to find the minimal distance between the last eight seconds of two plays while allowing for gaps, which are then penalized by a gap penalty. So if the last eight seconds of two plays are almost identical, the comparison gets a direct, ungapped score. But if one play results in a layup and the other in a kickout for a three-pointer, and therefore takes a second longer, the plays would still be well aligned, but there would be an additional gap penalty.
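
A toy version of what I mean, in the spirit of Needleman-Wunsch-style alignment: the per-frame cost matrix would come from a frame-wise version of the distance above, and the gap penalty is a tuning parameter I have simply made up.

```python
# Align two plays frame by frame with gaps, dynamic-programming style.
import numpy as np

def align_plays(frame_cost, gap_penalty=5.0):
    """frame_cost: (Ta, Tb) matrix of distances between frame i of play A and frame j of play B.
    Returns the minimal alignment cost, allowing gaps in either play."""
    ta, tb = frame_cost.shape
    dp = np.full((ta + 1, tb + 1), np.inf)
    dp[0, :] = gap_penalty * np.arange(tb + 1)   # skipping leading frames of play B costs gaps
    dp[:, 0] = gap_penalty * np.arange(ta + 1)   # same for play A
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            dp[i, j] = min(
                dp[i - 1, j - 1] + frame_cost[i - 1, j - 1],  # match frame i with frame j
                dp[i - 1, j] + gap_penalty,                   # gap in play B
                dp[i, j - 1] + gap_penalty,                   # gap in play A
            )
    return dp[ta, tb]
```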

A big part of finding the minimal distance will come down to computational running time. As mentioned previously, there are 120 combinations of players. The gaps lead to a very high number of theoretically possible alignments. And the underlying distance function will have several local minima, making it necessary to look at several possible combinations. It could be sufficient to first minimize the distance between the ball positions in plays A and B and then figure out the minimal combination of players.

Cluster analysis

Once the distance metric is somewhat satisfying for the selected subset of plays, it is time to start looking at bigger chunks of data. A team has around 100 offensive plays per game and 82 games per season, so we are at close to 10,000 plays per team, or 300,000 plays for the whole NBA. If we measured the distance between all pairs of N = 10,000 plays, we would need to calculate around 50 million comparisons[2. (N*(N-1)/2)]. If the distance metric takes one second per play comparison, we are talking about years of computation.
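
For anyone who wants to check the arithmetic, a quick back-of-the-envelope calculation (the one-second-per-comparison figure is, of course, just an assumption):

```python
# Back-of-the-envelope check of the numbers above.
N = 10_000
pairs = N * (N - 1) // 2                       # pairwise comparisons for one team's season
seconds_per_comparison = 1.0                   # assumed cost of one play-to-play distance
years = pairs * seconds_per_comparison / (60 * 60 * 24 * 365)
print(pairs, round(years, 1))                  # 49995000 pairs, roughly 1.6 years
```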

One way to avoid this would be to recursively build a ‘comparison set’ consisting of fewer than 500 plays. This set would try to span the whole space of common plays with as few plays as possible. In addition, the user can implement ‘break functions’: if play A is compared with the comparison set and we find it to be similar to one of the first plays in that set, we can automatically exclude other plays, for which we simply insert a maximum distance. For example, if I find a play to be similar to a side pick-and-roll, I can be sure that it is not similar to a post-up play.
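
A rough sketch of what such a comparison with break functions could look like. Here `play_distance` is the metric sketched earlier (each play being a (ball, offense) pair of arrays), and both the threshold and the incompatibility map are things the user would have to hand-tune.

```python
# Compare a new play against the comparison set, skipping references that a
# 'break function' has ruled out. MAX_DIST stands in for "not similar at all".
MAX_DIST = 1e6

def distance_vector(play, comparison_set, incompatible, threshold=50.0):
    """comparison_set: list of reference plays; incompatible: dict mapping a
    reference index to the set of indices that can be skipped once it matches."""
    skipped = set()
    dists = [MAX_DIST] * len(comparison_set)
    for i, ref in enumerate(comparison_set):
        if i in skipped:
            continue                               # keep the maximum distance
        d = play_distance(*play, *ref)             # the metric from the distance section
        dists[i] = d
        if d < threshold:
            skipped |= incompatible.get(i, set())  # e.g. a side PnR rules out post-up references
    return dists
```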

As with the data storage, I guess the people who have access to the complete player tracking data set also have access to some pretty powerful computer clusters, so I’m sure they can figure these things out.

Whatever you do, you end up with a distance vector for each play, with a length of either N (if you compare all plays with each other) or the length of your comparison set. You can now use this vector for cluster analysis. There are two main ways to do clustering: supervised and unsupervised. In our case, we would use an unsupervised or semi-supervised approach, as we want to remain open-minded about the possible outcomes. I use the term “semi-supervised” because, if we use a comparison set, we can at least partially control the outcome. But we always have to keep in mind that we do not know everything about basketball and should keep our results open to surprises.

These surprises are basically both the advantage and the disadvantage of unsupervised clustering. You can imagine supervised clustering (or classification) as having a fixed set of magnets that forces your plays into specific groups. Unsupervised clustering does not do this; instead it tries to find already existing density clouds. This sounds easy, but the problem is that your resulting clouds are heavily dependent on the techniques you use.

In general, I have two pieces of advice on what could work quite well. First, you should try to avoid anything like k-means clustering, as you have no idea how many clusters to expect. You will just get into a quagmire of statistical model criteria like AIC and BIC, which you strongly want to avoid[3. Short side note: AIC and BIC are pretty much useless for datasets with more than a few hundred data points]. My weapon of choice at the moment is hierarchical clustering, with which I have a very personal love-hate relationship. The cool thing is that you can understandably represent a high-dimensional problem in one figure. In the case of a comparison set, where you cluster an (N plays) x (M comparisons) matrix, you will find subsets in the M dimensions to which you can directly attach names (fast breaks or simple pick-and-rolls, for example). You can relatively easily detect possible cutoffs for the number of clusters. The biggest problem with hierarchical clustering is that the end result can be highly dependent on the distance metric and linkage rules. It can easily happen that you slightly change some of your clustering rules and the whole thing looks different [Note: on the other hand, this probably holds true for all unsupervised clustering algorithms].
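
In SciPy, the basic version of this is only a few lines. The linkage method, the cluster cutoff and the file name here are all placeholders that would need to be revisited, not recommendations.

```python
# Hierarchical clustering of the play distance vectors with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

D = np.load("play_distance_vectors.npy")           # hypothetical (N plays) x (M comparisons) matrix

Z = linkage(D, method="ward")                      # cluster plays by their distance profiles
labels = fcluster(Z, t=40, criterion="maxclust")   # cut into at most 40 clusters (arbitrary choice)
dendrogram(Z, no_labels=True)                      # the one figure that shows the whole structure
```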

This brings us to my second piece of advice: you now have to find a second distance metric for your unsupervised clustering. In theory this could be done with standard approaches like Euclidean distance. But in my opinion, the distance metric of your clustering approach should not treat all of the M dimensions the same. Instead, you should focus on those dimensions for which at least one of the two plays is similar to the reference. For example, if you want to figure out the distance between play A and play B (two similar set plays), it does not matter whether play A has a high distance to a fast-break reference and play B has a very high one. So every dimension m_i should be weighted by something like 1/(1 + min(distance(A, m_i), distance(B, m_i))).
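
In code, the weighting could look something like the sketch below, and the result can be fed into the same hierarchical clustering via a condensed distance matrix. This is just one way to write the formula down, not a tested recipe.

```python
# Weighted distance between the comparison-set vectors of two plays: dimensions
# where at least one play is close to the reference count more.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def weighted_play_distance(vec_a, vec_b):
    weights = 1.0 / (1.0 + np.minimum(vec_a, vec_b))
    return np.sqrt(np.sum(weights * (vec_a - vec_b) ** 2))

D = np.load("play_distance_vectors.npy")              # same hypothetical matrix as above
condensed = pdist(D, metric=weighted_play_distance)   # pairwise weighted distances between plays
Z = linkage(condensed, method="average")              # hierarchical clustering on top of it
```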

Combination of cluster analysis and feature extraction

Okay, in theory, we are now at the point where we can say things like “The Warriors use the Elevator Doors set on 9% of plays.” This is the moment when we can come full circle to things that are already done nowadays, by using feature extraction. Feature extraction is basically a way to give the raw data context. For example, whether the play resulted in points or a turnover are very obvious features. Others can be more complex, ranging from a more precise description of the resulting shot (an uncontested catch-and-shoot corner three, for example) to concepts like gravity (the area that the defense spreads over) or even information about the style of defense (hedging after a pick-and-roll, for example). The defensive stats especially will be interesting, as play outcomes are ultimately also a result of defensive styles (if you ICE the pick-and-roll, it will result in different outcomes than if the defender hedges).
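
As a final sketch, once each play carries a cluster label plus its box-score features, simple grouped summaries are enough to produce statements like the Warriors example above. The file and column names here are hypothetical and would come from the box-score part of the play record described earlier.

```python
# Combine cluster labels with outcome features into a per-team, per-cluster summary.
import pandas as pd

plays = pd.read_csv("plays_with_clusters.csv")   # hypothetical export: one row per play

summary = (
    plays.groupby(["team", "cluster"])
         .agg(n_plays=("play_id", "size"),
              points_per_play=("points", "mean"),
              turnover_rate=("is_turnover", "mean"),
              corner_three_rate=("is_corner_three", "mean"))
         .reset_index()
)
# Share of a team's plays that fall into each cluster, e.g. "cluster 17 on 9% of plays".
summary["usage_share"] = summary["n_plays"] / summary.groupby("team")["n_plays"].transform("sum")
print(summary.sort_values("points_per_play", ascending=False).head())
```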

As you can see, at this point the world is your oyster and you can go back to analysis styles that are more common right now.

Conclusion

I hope that this was more enlightening than confusing, and that someone will use this stuff and, hopefully, turn it into something publicly accessible one day. It’s possible that teams are already doing similar analyses behind closed doors; we just don’t know about it. As I said, this is easily translatable to other sports (compare the distances of player positions after a snap in American football, or the last eight seconds before a change of possession in hockey or soccer).