Deep Dives: SportVU X’s and O’s – A Proof-of-Concept

NC Deep Dives

In my first ever article here at The Nylon Calculus, I described a blueprint of what I would do with tracking data. I wasn’t aware of it at the time, but the needed tracking data is somewhat freely available to people who are smart enough to know how to extract it. Half a year later, I was aware of its availability, but still not smart enough[1. Or too smart, as I was able to imagine how much all the coding would drive me crazy…]. Luckily, there are a few great minds at this site. In this case it was Justin[2. As a GM, I would be looking for a person who is top notch at handling data and transforming data into knowledge, and who understands basketball so well that this knowledge is actually about basketball and not about the data itself. Inner GM, meet Justin!], who was kind enough to provide me with the SportVU-based “movement” tracking data of all Finals games from

Introduction to the analysis

As discussed in the aforementioned blueprint, the main goal is to create and refine a method to find clusters of similar basketball plays. More importantly, the idea was to use an unsupervised or semi-supervised method, so that my only inputs into the system would be the raw data and a rule for measuring the distance between plays. There are existing approaches for play analysis that I would guess are largely supervised. There are definitely good arguments for going a more supervised route, but I felt that for me the unsupervised version is more interesting, as it is more holistic and allows for more surprises.
What follows is the product of one to two weeks of work[3. distributed over more than a month because of way too many more important activities]. The time consisted of learning how to handle the data (10%), writing code to get the data into the shape that I wanted (25%), writing code that compares different plays (25%) and finding a way to present and analyse the results (40%). I am telling you all of this to emphasize that one very time intensive part of my work was to watch actual basketball[4. Just in case Sir Charles is reading all of this]. I found flaws in the code that slightly influence parts of the analysis. But those flaws are comparable to using a slightly crude measuring tape. The numbers might not be 100% perfect, but the general direction is fine.

Getting the data into shape

As an example of what I did first, we will look at this play from Game 3 of the Finals. Using magic, you can extract the underlying data of this play here. Of course, the first problem with these events is that not all of them start and stop neatly with a play. Just look at the following event for Iman Shumpert’s free throws and how it stops in the middle of the next play. To cut events somewhat correctly, I used shot clock resets and the “real time”, both of which are additional information available in the data.
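To give a feel for the cutting step, here is a minimal sketch of splitting a stream of frames wherever the shot clock resets. It is not the code used for this article: the function name, the plain list of shot clock readings and the `min_jump` threshold are all assumptions made for illustration, and the “real time” information is ignored entirely.

```python
import numpy as np


def split_on_shot_clock_resets(shot_clock, min_jump=1.0):
    """Return (start, end) frame-index pairs, one per segment.

    Assumption: a reset shows up as the shot clock jumping UP by more
    than `min_jump` seconds between two consecutive frames. The
    threshold is a hypothetical value, not taken from the article.
    """
    shot_clock = np.asarray(shot_clock, dtype=float)
    jumps = np.diff(shot_clock)
    # Index of the first frame AFTER each upward jump.
    reset_idx = np.flatnonzero(jumps > min_jump) + 1
    bounds = np.concatenate(([0], reset_idx, [len(shot_clock)]))
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]
```

Each returned pair is a half-open index range, so `frames[start:end]` would give the frames of one segment. A reset that lowers the clock would not be caught by this rule.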

With the data in place, I’m focusing for now on the most important aspect of the game, the ball. To make plays comparable, I rotated plays so that every event is played on the same basket. I also deleted all plays that took less than 7 seconds[3. Sparing you a lot of Leandro Barbosa fast breaks.]. Afterwards I interpolated the data, using only one time point every 0.2 seconds. I would not expect this to affect my further analysis, and it allowed me to keep my program rather “brute force”[5. The code I wrote for this is some of my ugliest work in a long time. I still feel a bit dirty about it…]. For the final comparison, the plays were cut so that all of them start when the ball first comes within 35 feet of the “offensive” baseline, as this strikes me as a better measure of when, in X-and-O terms, a “play” starts.
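The manipulations described above can be sketched roughly as follows. This is an illustration under assumptions, not the original code: I assume ball coordinates in feet on a 94 by 50 foot court with `x` running along the length, a hypothetical `attacks_right` flag telling us which basket the offense shoots at, and plain linear interpolation for the resampling.

```python
import numpy as np

COURT_LENGTH = 94.0  # feet, NBA court
COURT_WIDTH = 50.0   # feet, NBA court


def normalize_play(t, x, y, attacks_right,
                   step=0.2, min_duration=7.0, start_dist=35.0):
    """Return an (n, 2) array of ball positions, or None if the play is dropped.

    t, x, y       -- time stamps (s, increasing) and ball coordinates (ft)
    attacks_right -- hypothetical flag: True if the offense attacks the
                     basket at x = COURT_LENGTH
    """
    t, x, y = (np.asarray(a, dtype=float) for a in (t, x, y))

    # Drop plays shorter than 7 seconds.
    if t[-1] - t[0] < min_duration:
        return None

    # Rotate by 180 degrees so every play attacks the basket at x = 0.
    if attacks_right:
        x, y = COURT_LENGTH - x, COURT_WIDTH - y

    # Resample to one point every 0.2 seconds (linear interpolation).
    grid = np.arange(t[0], t[-1] + 1e-9, step)
    xs = np.interp(grid, t, x)
    ys = np.interp(grid, t, y)

    # Start the play when the ball first comes within 35 ft of the baseline.
    inside = np.flatnonzero(xs < start_dist)
    if inside.size == 0:
        return None
    first = inside[0]
    return np.column_stack([xs[first:], ys[first:]])
```

Returning `None` for dropped plays keeps the filtering in one place, so a caller can simply skip those events.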

The following figure summarizes all these manipulations:

Summary Data Formatting

Comparing two plays using a distance measure

To quickly describe the way I defined a distance between two plays: 1. Take all data points for the two events from the moment the ball is first closer than 35 feet to the baseline. 2. Try all the different ways of overlapping the two plays, taking the Euclidean (aka ordinary) distance between the matched data points. Points for which there is no match in the other play get a penalty distance of 14 feet. 3. For each combination, take the average distance (Euclidean distances plus penalties). The minimal average over these combinations is your play distance. Let me show you two example plays to visualize this:

Example for the distance of two plays

The numbers ~340 and ~573 give me the event numbers to search for the respective videos[6. You often have to go one or two plays forwards or backwards to find the actual videos]. The two plays are left-side post-ups, one for Draymond Green and one for LeBron James. Two interesting side notes: Firstly, using only the ball is of course not perfect for a final measurement. You completely miss that the Draymond Green post-up has an entry pass. But it’s a good start for deciding how to overlap the two plays, as the ball dictates what’s going on. Secondly, SportVU would probably define the LeBron James post-up as a drive (starts outside of 20 feet from the basket, ends inside of 10 feet of the basket), though in qualitative terms it looks much more like a “post up” than a “drive.”
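For the technically inclined, the three steps of the distance measure can be sketched like this. It is my reading of the description rather than the original code: I assume each play is an array of ball positions on the same 0.2 second grid, that “all combinations” means every time shift with at least one overlapping point, and that the average runs over matched and unmatched points alike. The function name is made up for the example.

```python
import numpy as np

PENALTY_FT = 14.0  # penalty for a point without a partner in the other play


def play_distance(a, b, penalty=PENALTY_FT):
    """Minimal average distance over all time shifts of play b against play a.

    a, b -- arrays of shape (n, 2) and (m, 2) with ball positions in feet
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    best = np.inf
    # shift = index in a that b[0] is aligned with (negative: b starts earlier)
    for shift in range(-(m - 1), n):
        a_start = max(shift, 0)
        b_start = max(-shift, 0)
        overlap = min(n - a_start, m - b_start)
        diffs = a[a_start:a_start + overlap] - b[b_start:b_start + overlap]
        matched = np.linalg.norm(diffs, axis=1)  # Euclidean distances
        unmatched = (n - overlap) + (m - overlap)
        mean = (matched.sum() + penalty * unmatched) / (overlap + unmatched)
        best = min(best, mean)
    return float(best)
```

Under this reading the measure is symmetric, and two identical plays have a distance of zero. The brute-force loop over shifts is fine for plays of a few dozen points each.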

First results – does it actually work?

The shortest answer here is yes and no. If you want a long answer that the editorial overlords understandably labeled as “somewhat visually numbing”, you can find a coarse overview of several “plays” that I identified here. To put it short, there are a lot of things that do not work in this version, which we’ll call build 0.00001. Dribbling outside of the arc gets way too much weight in the comparison. As I’m tracking just the ball and not the players for the sake of simplicity, we still don’t know if a play is an isolation or a pick and roll. Another problem is that slight delays in timing[2. For example, in one play the ball is held for a moment, while in another play the team runs an identical set but holds the ball for longer before the next pass is made.] make it hard to match plays that look very similar to a human observer. In technical terms, the algorithms I use at the moment create a space that is not very dense and would therefore need a lot of plays to capture every possible aspect of the game. One solution is of course to include more than 6 games. Another possible solution would be to simplify the space[7. I have an idea for that but don’t want to open another can of worms right now…].

That said, there are parts where it actually works pretty well.

The most prominent examples are isolation plays, as the pattern of these plays is very particular. The following six plays are taken from the center of a bigger cluster that I found using hierarchical clustering[3. Once again, you can check the gory details here].

Left side isolation plays

The plays are:

So, without telling the algorithm in any way, shape or form what an isolation play is, the tracking metric can automatically detect a left side isolation play. A slight extension of my current algorithm could spit out video for every LeBron isolation of the last season[8. Plus a few garbage plays that look like LeBron isolations…], so that you would no longer have to search through all 10,000 Cavs plays.

Of course, this might be the simplest play in basketball history, but with more data and a bit of brain energy, I am pretty sure that this is extendable to most existing plays.
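If you want to try the clustering step yourself, a minimal sketch using SciPy could look like the following. The article does not say which linkage rule or cut-off was used, so average linkage and the `threshold_ft` parameter are assumptions, as is the function name. The input is a square matrix of pairwise play distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_plays(dist_matrix, threshold_ft):
    """Group plays by hierarchical clustering on pairwise play distances.

    dist_matrix  -- symmetric (k, k) array with zeros on the diagonal
    threshold_ft -- hypothetical cut-off: merges above this distance
                    are not allowed within one cluster
    Returns an array of k cluster labels (integers starting at 1).
    """
    d = np.asarray(dist_matrix, dtype=float)
    condensed = squareform(d, checks=True)  # square matrix to condensed vector
    tree = linkage(condensed, method="average")  # assumed linkage rule
    return fcluster(tree, t=threshold_ft, criterion="distance")
```

Plays that share a label form one candidate “play type”. Looking at the plays closest to the center of a large cluster is then a manual step, like the video checks described above.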

A click-bait worthy conclusion

To put my blueprint of a method into perspective with everything that is openly available, I would say it is like comparing a 4-year-old playing chess with Garry Kasparov playing checkers. Most metrics condense everything into one or a few numbers. On the one hand, this makes a lot of things directly understandable and you can actually add a lot of value. On the other hand, there is an upper ceiling that is very noticeable to everyone working in the field. In comparison, my construction site of a method is still quite far from being useful. And if you want to write an article about basketball, it’s like taking a sledgehammer to crack a nut. Then again, I have only just started to learn the rules. I know how to move the pawns and the bishops. But my castle always remains stuck in the corner and I still don’t understand how to use the knight properly[9. I actually don’t know how to use the knight properly. Haven’t played chess in 15 years and had trouble beating a 5-year-old recently. That’s why I wrote 4-year-old…]. And don’t get me started on this thing where you can exchange a pawn for another piece…

But in the long run, I am certain that my method can bring your analysis further, because it doesn’t try to reduce a complex game like basketball to checkers. It still remains chess.