NBA Hackathon: How to build an expansion team in 21 tweets

Early in August, the NBA announced its inaugural Hackathon, an event meant to inspire creative thinking toward solutions for challenging and important problems in basketball analytics. With the event restricted to students, Senthil (Rice) and Chris (Stanford) took advantage of a perfect storm of events to take their talents to NYC and join forces to enter the competition as Team Nylon Calculus. As something of a preview of today’s event, Senthil and Chris explain the main application question and their respective approaches to answering it, as worked out in an email thread leading up to the event.

Chris Pickard: Before Senthil and I jump into our discussion of the application question, I want to explain the format of today’s event, since “Hackathon” events seem to be far less common outside of Silicon Valley than I thought, something I found out when explaining the event to family and friends. While the full event details can be found here, the CliffsNotes version is as follows:

The actual competition portion of the event begins at 9AM today when the NBA event organizers will present several problems or questions related to basketball analytics. So far we have received two hints about these prompts.

Each team will choose one problem and have just over eight hours to come up with a solution and organize it into a presentation format.

While all public data is acceptable, the NBA has been generous enough to grant us limited access to other non-public datasets to help us build solutions.

From there, a panel of judges will select finalists and these finalists will then present and compete for the top-three prizes.

The event was restricted to ~30-40 teams with no more than four students per team and each team member had to fill out an application that included a central “essay” style question. To paraphrase, the question asked:

Suppose you are a GM of a new hypothetical expanded team in the NBA, explain your methodology and process in drafting your team. What information and data would be important and how would you incorporate it into your decisions?

The rules in the prompt were much different from those the team formerly known as the Bobcats was held to in 2004. To paraphrase, the prompt stated that the current five highest-paid players and any players on rookie-scale contracts were off limits, and that a hard salary cap would be in effect. So it seems reasonable that you could try to raid the Warriors and build a contender pretty quickly, but that doesn’t take away from the purpose of this exercise: How do you differentiate players and their value within the context of building a team? The catch? Explain it in ~21 tweets’ worth of characters (3000 character limit). What follows is the back-and-forth between Chris and Senthil as they attempt to configure their solutions.

Senthil Natarajan: Drafting an expansion team is an exercise in exploring playing philosophies, with a few initial considerations to account for. First, since rookie-scale contracts are off the table, we can assume that we are looking for players who can help the team win right away. This also means we’re left primarily with players who don’t have a lot of room for growth left. Second, we can use the expansion draft rules about eligible players to pare down the pool of available players.

To balance playing style and effectiveness, we can initially break out 16 useful statistical features for each player in our pool (with all volume stats being per 100 possessions): 2PT attempts, 3PT attempts, FT attempts, OREB, DREB, AST, STL, BLK, PTS, 2PT%, 3PT%, eFG%, FT%, USG%, defensive box plus-minus, and standing reach. This could be made more granular by breaking shot selection down further. To account for noisy fluctuations, we should also regress each player’s statistics toward their positional averages.
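The regression step is not spelled out above, so here is only a minimal sketch of one common way to do it: shrink each per-100-possession stat toward the positional average, with less shrinkage for players who have logged more possessions. The function name, the `prior_possessions` constant and the sample numbers are all assumptions made for illustration.

```python
def regress_to_positional_mean(player_value, possessions, position_mean,
                               prior_possessions=1000.0):
    """Shrink a per-100-possession stat toward the positional average.

    prior_possessions is a hypothetical tuning constant: the larger it is,
    the harder a small-sample player is pulled toward the positional mean.
    """
    weight = possessions / (possessions + prior_possessions)
    return weight * player_value + (1.0 - weight) * position_mean


# Made-up numbers: a center blocking 9.0 shots per 100 possessions in only
# 500 possessions, against a positional average of 3.0.
small_sample = regress_to_positional_mean(9.0, 500, 3.0)
# The same rate over 9,000 possessions is trusted far more.
large_sample = regress_to_positional_mean(9.0, 9000, 3.0)
print(small_sample, large_sample)
```

The small-sample player lands much closer to the positional average than the large-sample player, which is the point of regressing noisy stats.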

After extracting the stats for each player, we can use a two-phase process to establish our team-building philosophy. First we can use a clustering scheme (K-Means or Gaussian Mixture Modeling both come to mind) to establish all the different types of players in the league (true “positions”), where each player is a data point with each of their statistical features as an individual dimension. We then take advantage of the “copycat league” concept to select a pre-existing lineup in the league which we want to model our team after (e.g. the twin towers lineup of the 2015 Thunder), and find how each player in that comparison lineup is classified by our clustering model. This tells us which five types of players we need to create our starting lineup, and from that we can determine which types of complementary players we need for the bench.
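As a rough illustration of the two phases, the sketch below runs a bare-bones K-Means (Lloyd's algorithm) on a toy pool and then reads off the cluster labels of a template lineup. Everything here is hypothetical: the player names, the two features (3PT attempts and DREB per 100 possessions), the choice of two clusters and the starting centroids. A real run would use all 16 features, standardized, and a library implementation.

```python
import math


def kmeans(points, initial_centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each player goes to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical pool: [3PT attempts per 100, DREB per 100].
pool = {
    "Guard A": [9.0, 4.0], "Guard B": [8.0, 5.0], "Guard C": [10.0, 3.0],
    "Big A": [0.5, 14.0], "Big B": [1.0, 13.0], "Big C": [0.0, 12.0],
}
names = list(pool)
_, labels = kmeans([pool[n] for n in names], [pool["Guard A"], pool["Big A"]])
player_type = dict(zip(names, labels))

# Phase two: look up which player types the template lineup is made of.
template_lineup = ["Guard B", "Big C"]
needed_types = [player_type[n] for n in template_lineup]
print(player_type, needed_types)
```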

The next step is to use a holistic metric based on a possession-based player tracking model like EPV to evaluate every player in our pool, which will rate them based on a probabilistic view of their expected contributions to any single possession. Then, we can look at each playoff team from the most recent season and use a simple regression to find the baseline for what fraction of our team’s total rating should come from the starters (how strong the starting lineup is relative to the rest of the team).

This finally comes down to an optimization problem. First, as a baseline, we select the highest rated player in the pool for each player type that we need for our team, then use a multidimensional gradient descent to step down from that baseline, keeping the total rating of the team as high as possible while checking against the salary cap condition at each step of the descent. The gradient descent method is advantageous for identifying the lowest drop-off in skill level at each step, which returns the team with the best possible “value” at each position.
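To make the idea concrete, here is a small greedy stand-in for that descent, not the multidimensional gradient descent itself: start with the top-rated player at each needed type, and while the lineup is over the cap, make the single cheaper swap that costs the least rating. The pool, ratings, salaries and cap are invented, and the sketch assumes each needed type appears only once in the lineup.

```python
def build_lineup(pool, needed_types, cap):
    """Greedy descent from the best-rated lineup until it fits the cap."""
    by_type = {
        t: sorted((p for p in pool if p["type"] == t),
                  key=lambda p: p["rating"], reverse=True)
        for t in needed_types
    }
    pick = {t: 0 for t in needed_types}  # index 0 = best player of each type

    def total_salary():
        return sum(by_type[t][pick[t]]["salary"] for t in needed_types)

    while total_salary() > cap:
        best_move = None  # (rating drop, type, new index)
        for t in needed_types:
            current = by_type[t][pick[t]]
            for j in range(pick[t] + 1, len(by_type[t])):
                candidate = by_type[t][j]
                if candidate["salary"] < current["salary"]:
                    drop = current["rating"] - candidate["rating"]
                    if best_move is None or drop < best_move[0]:
                        best_move = (drop, t, j)
        if best_move is None:
            raise ValueError("no lineup fits under the cap")
        _, t, j = best_move
        pick[t] = j  # take the swap with the lowest drop-off
    return [by_type[t][pick[t]] for t in needed_types]


# Hypothetical pool; salaries in millions of dollars.
pool = [
    {"name": "S1", "type": "shooter", "rating": 10, "salary": 30},
    {"name": "S2", "type": "shooter", "rating": 9, "salary": 15},
    {"name": "S3", "type": "shooter", "rating": 6, "salary": 5},
    {"name": "B1", "type": "big", "rating": 9, "salary": 25},
    {"name": "B2", "type": "big", "rating": 5, "salary": 10},
]
lineup = build_lineup(pool, ["shooter", "big"], cap=45)
print([p["name"] for p in lineup])
```

In this toy pool the best lineup costs 55 against a cap of 45, and swapping S1 for S2 gives up only one rating point, so that is the step the descent takes.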

To run this model, a Python framework gives us a robust way of sorting the data, manipulating the variables, and visualizing the results in an easily interpretable and searchable format.

Chris Pickard: So when I think of building a team from scratch, I see two main problems that need to be addressed: First, the team composition needs to be defined (i.e. guard- or post-oriented, star or system reliant, etc.). Second, within the context of the expansion draft scenario, eligible players need to be made measurable in order to identify those who best fit the targeted team composition, ideally guided by the head coach’s vision.

So, I look at the crux of building an expansion team as the ability to correctly measure, sort and target players that best fit my desired team composition. While the last statement is obvious, the process of doing that is not. My solution is to use a model that I have used in previous posts to gauge player impact on a possession level. Why at the possession level? Possession-efficient teams score more and allow fewer points than their opponents, which leads to wins. A player contributes to team possession efficiency by how often he puts himself or his teammates, directly or indirectly, in position to score points, and that contribution is an indicator of his value to his team.

This can be measured by viewing a possession outcome as a binary event tree in which each event occurs with an associated probability and leads to known point outcomes ranging from 0 to 4. A player’s expected points contributed are a product of the skills of his teammates and opponents and the likelihood that an event occurs while he is on the court. This descriptive model can be further sharpened by accounting for the location of shot events within the tree. Using PBP logs that include on-court players, the event type within the context of the tree and shot locations, a binary Rasch model can be fit to the data. The model’s strength is that it adjusts results for teammate and opponent strength, with the final output being expected points per possession assuming a player has average teammates and opponents.
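A stripped-down sketch of the event-tree idea might look like the following. The tree shape and every probability in it are made up, and the way player abilities feed the Rasch probability is an assumption here; only the logistic form of the Rasch model itself is standard.

```python
import math


def rasch_probability(ability, difficulty):
    """Standard binary Rasch form: P(event) = logistic(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_points(node):
    """Expected points of a binary event tree with point-valued leaves."""
    if "points" in node:
        return node["points"]
    p = node["p_yes"]
    return p * expected_points(node["yes"]) + (1 - p) * expected_points(node["no"])


# Hypothetical possession: turnover or not, then a three or a two, then
# make or miss. Probabilities are invented for illustration.
possession = {
    "p_yes": 0.15,  # turnover?
    "yes": {"points": 0},
    "no": {
        "p_yes": 0.40,  # three-point attempt?
        "yes": {"p_yes": 0.35, "yes": {"points": 3}, "no": {"points": 0}},
        "no": {"p_yes": 0.50, "yes": {"points": 2}, "no": {"points": 0}},
    },
}
points_per_possession = expected_points(possession)

# In a fitted model, a branch probability could come from the Rasch form,
# e.g. an average offense against an average defense on the logit scale.
even_matchup = rasch_probability(ability=0.0, difficulty=0.0)
print(points_per_possession, even_matchup)
```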

The proposed model is robust and adaptable enough to strategically target players to draft for an expansion team because it measures players under the same conditions, which allows equal comparison, and because it incorporates on-court actions that are not accounted for in traditional statistics. In addition, the results output both offensive and defensive total and shot location specific values that allow for primary and role players (e.g. three-point shooters) to be identified. Most importantly, the model’s output allows for a player’s value to be estimated in terms of wins above replacement using Daryl Morey’s expected wins equation.
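The equation is only named above, so the sketch below assumes it refers to the basketball version of Pythagorean expectation, using the 13.91 exponent commonly attributed to Morey. The point totals are invented, and the step from expected wins to wins above replacement is not shown because the replacement baseline is not specified here.

```python
def expected_win_pct(points_for, points_against, exponent=13.91):
    """Pythagorean expectation; the exponent value is an assumption."""
    scored = points_for ** exponent
    allowed = points_against ** exponent
    return scored / (scored + allowed)


# Hypothetical team scoring 105 and allowing 102 points per 100 possessions.
win_pct = expected_win_pct(105.0, 102.0)
expected_wins = 82 * win_pct
print(round(win_pct, 3), round(expected_wins, 1))
```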

By treating each player as an item whose value is their WAR and whose cost is their contract, optimizing the roster becomes an example of a “Knapsack Problem” in combinatorial optimization. Using this approach allows for the best roster to be built in terms of wins while keeping the roster under the salary cap. Filtering players by desired features helps further steer the team toward the desired composition.
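For readers who have not met the Knapsack Problem, a minimal 0/1 knapsack sketch over a toy pool is below. The names, WAR values, salaries (in whole millions) and cap are hypothetical, and the sketch ignores roster-size limits and positional needs, which a real version would have to add.

```python
def best_roster(players, cap):
    """0/1 knapsack by dynamic programming over integer salary units.

    players: list of (name, war, salary) tuples with integer salaries.
    Returns (total WAR, list of chosen names).
    """
    # best[c] holds the best (WAR, names) found using at most c salary units.
    best = [(0.0, [])] * (cap + 1)
    for name, war, salary in players:
        # Walk the cap downward so each player is used at most once.
        for c in range(cap, salary - 1, -1):
            candidate_war = best[c - salary][0] + war
            if candidate_war > best[c][0]:
                best[c] = (candidate_war, best[c - salary][1] + [name])
    return best[cap]


# Hypothetical pool: (name, WAR, salary in millions).
candidates = [("A", 10.0, 30), ("B", 7.0, 20), ("C", 6.0, 25), ("D", 4.0, 10)]
total_war, roster = best_roster(candidates, cap=50)
print(total_war, roster)
```

With a cap of 50, players B, C and D together would also be worth 17 WAR but cost 55, so the solver settles on A and B instead.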

CP: I really like your model in terms of creating a mold based on desired team composition. I think like you mentioned in your Ringer post, the league is very much a copycat league. So, I really like what your model does in terms of mirroring typical team profiles and you could use it to mirror a lineup or you could do it to mirror a specific role. Do you think that is possible?

I think the salary question is tough. I think the big thing is identifying which players are worth paying for and which are not. So in this respect, it would be best if we could find the spread in talent across a particular role. My theory is that certain positions will have larger drop-offs in talent from the best to, say, the 20th-best based on how we measure it, so we would prioritize those players over, say, a position where there isn’t as large a spread in talent.
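One rough way to put a number on that spread, sketched with invented ratings and a shorter depth than 20 so the toy data fits:

```python
def talent_dropoff(ratings, rank=20):
    """Gap between the best rating and the rank-th best (or the last one
    available if the role has fewer players). Assumes a non-empty list."""
    ordered = sorted(ratings, reverse=True)
    depth = min(rank, len(ordered))
    return ordered[0] - ordered[depth - 1]


# Hypothetical ratings by role.
roles = {
    "rim protector": [9.0, 5.0, 3.0, 2.0],
    "spot-up shooter": [7.0, 6.5, 6.0, 5.5],
}
dropoffs = {role: talent_dropoff(r, rank=3) for role, r in roles.items()}
# Roles with the steepest drop-off are the ones to pay for first.
priority = sorted(dropoffs, key=dropoffs.get, reverse=True)
print(dropoffs, priority)
```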

SN: That’s great feedback. From my experience with it, I will say that the advantage of the gradient descent optimization is that it does help identify the best “value” at each position based on finding a replacement that results in the lowest drop-off in total rating at each step of the descent, which is great.

I’ve reworked the proposed model a little bit now to make it clearer and more cohesive, which should also help with the salary cap constraint. It’s still focused around identifying playing styles and creating a prospective team profile, but now everything starts with the clustering-based “true” positions. Now, as per what you mentioned, we’re identifying individual roles at the beginning. I’m sure different people will have different statistical features they’d rather use than my specific sixteen; with a little more research or perhaps some work in feature selection, I could potentially identify what statistics to actually incorporate in my model.

CP: What’s great about our approaches is that we both took different angles to answer the same question; that in itself is neat. More specifically, when combined, I think we create a pretty good overall methodology for drafting an expansion team. I think the strengths in my approach center around valuing players’ on-court actions from a more overarching perspective by taking into account the likelihood of certain events occurring during a possession while a player is on the court, but I will admit I come up short when addressing the salary cap issue. Even though I can find best value players using WAR from my model, it doesn’t help me maximize my limited resources with respect to my cap space.

SN: I think one of the biggest strengths of your model is the ability to quantify the skill sets of individual players. As for the salary point, the way I have looked at it, “value” players are not the same as best players, and just fielding a team made of value players may not be fielding the best team possible. I think that’s the one fallacy to avoid. Overall though, I agree completely – our models actually could both complement each other really well. Exciting stuff!