Guest Tech Post: EvanZ’s Battle RAPM (a How To)

Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) and forward Draymond Green (23) celebrate during the fourth quarter of game six of the NBA Finals against the Cleveland Cavaliers at Quicken Loans Arena. Warriors won 105-97. Mandatory Credit: David Richard-USA TODAY Sports
Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) and forward Draymond Green (23) celebrate during the fourth quarter of game six of the NBA Finals against the Cleveland Cavaliers at Quicken Loans Arena. Warriors won 105-97. Mandatory Credit: David Richard-USA TODAY Sports /
facebooktwitterreddit
Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) and forward Draymond Green (23) celebrate during the fourth quarter of game six of the NBA Finals against the Cleveland Cavaliers at Quicken Loans Arena. Warriors won 105-97. Mandatory Credit: David Richard-USA TODAY Sports
Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) and forward Draymond Green (23) celebrate during the fourth quarter of game six of the NBA Finals against the Cleveland Cavaliers at Quicken Loans Arena. Warriors won 105-97. Mandatory Credit: David Richard-USA TODAY Sports /

One of the goals of Nylon Calculus is to bring not only the results but the techniques of basketball statistical analysis to the forefront. To this end we asked Evan Zamir to give us a hand with something. For those of you who don’t know, when Evan isn’t homering it up for his beloved Warriors on Twitter, he is the operator and proprietor of the essential nbawowy! website which allows for examination of the performance of various lineup combinations over any give length of time. Though we often debate the merits of various all-in-one metrics here at Nylon Calculus, there is little question the RAPM-style models are widely considered to provide the best approximation of individual player value. Along with that statement comes the question “what is RAPM and how do I do one?” Well Evan is here to shed some light on the latter with a ‘how to’ for a relatively simple RAPM model. This will be by necessity one of the denser and more technical posts at Nylon Calculus, so we won’t be offended if you take a pass on this one. But if you are interested in building out your own valuation model, Evan’s got the goods on how to get started. – Ed

Using Apache Spark to Build A Player Valuation Model for the NBA

Run far, far away from this article if the following assumptions do not apply to you:

  1. You are a basketball fan;
  2. You know (or at least appreciate) a little bit of the mathy type things, since you are spending time on a blog in which fully 1/2 of the title words (last I checked) are Calculus;
  3. You have heard of and are intrigued by some of the following buzz words: Analytics, Big Data, Machine Learning, Daryl Morey, Cronuts (#amirite);
  4. You have either built your own statistical models for the NBA, used the results of someone else’s published models on occasion, or enjoyed at least one article in your life involving such models;
  5. You secretly wish to meet Zach Lowe and impress him with one fact he may not know about basketball.
  6. You are interested in learning and applying new technologies in a fun way for great good.

Motivation

There are only a few fundamentally different types of NBA player valuation models publicly available. Some types of models are ad hoc and rely on box score weights chosen out of thin air in a heuristic fashion (or perhaps, handed down from god, it would be hard to tell the difference). Actually, these models are not at all dissimilar from the rating systems now used on daily fantasy sites. Other models essentially eschew the box score stats entirely, and instead focus on predicting how players actually effect game outcomes. These are the so-called “+/-” models of player valuation, and they tend to have much better predictive value than the heuristic models. The newest generation of models, such as ESPN’s RPM metric, are hybrids of the two approaches as they combine box score statistics and +/-. And these models are not limited to traditional box score stats. For example, Andrew Johnson (Ed – Our guy!) has built a model called Player Tracking +/- (PT-PM), which regresses SportVU data onto long term +/- models. Over at the APBR forum where they run an annual prediction contest, Andrew’s PT-PM model ranked 1st out of 25 or so models this season (yes, that’s a lot of models).

And folks, trust me when I tell you these models are only going to get more complex, as the access to more and newer types of data grows each year (well, hopefully!). In the old days (like 5-6 years ago), one could build +/- models using Microsoft Excel (really!). As the models grow more sophisticated, and especially, as the size of the data becomes increasingly unwieldy to work with on a single computer, new techniques will be required to crunch the numbers for us. Luckily, some of these tools are already here, and in this article, I’m going to give a brief tutorial on how one of them — the Apache Spark framework — can be used to implement a basic +/- model. The only limitation to extending this approach will likely be your imagination and your access to data and a computing cluster (of course, you can start with your own computer and scale from there).

Apache Spark: lightning-fast cluster computing

A little over a decade ago, Google introduce to the world a framework for parallel computing called MapReduce (MR). Shortly thereafter an Apache project called Hadoop was announced that implemented an open-source version of MapReduce that could be used by anyone (and eventually it was used by everyone!) for “free” (you have to supply the developers and computers to run it, of course). Hadoop MR essentially enabled small teams of developers running off-the-shelf commodity hardware (either on premise or in the cloud) to process terabytes or even petabytes of data in sophisticated ETL processes. This has enabled the “Big Data” revolution.

MR hasn’t been all roses though. For the most part, programming MR jobs was really hard for a very long time. You had to be an expert Java developer or invest in learning a domain-specific language like Pig. There are newer solutions that enable one to essentially run SQL queries on a Hadoop cluster (i.e. Hive), but even with that, jobs can be very, very slow. Essentially, MR jobs have always been run as a batch process. The desire, of course, is to be able to run interactive queries or ETL jobs on vast amounts of data in real-time (or at least, near real-time).

Apache Spark purports to be a solution to this problem, and is seen by many (including myself) as a replacement for MR. Spark improves on MR in at least two very big areas: 1) Much of the processing of Spark can be done in-memory as opposed to on disk; 2) Spark provides very easy-to-use API’s for Java, Scala, Python, and even R. The first improvement makes Spark much more efficient and faster than MR for many, many use cases (reportedly 10-100X faster). The second improvement makes Spark much easier to use as a developer of data products. In fact, as a data scientist, I am already incorporating Spark into my Python machine learning and ETL workflow. Spark is absolutely not vaporware. It’s useful today.

The Code

The following code is essentially a Spark recipe written in Python to create a ridge-regressed +/- model for NBA players. It can be run on a single computer or on a cluster. The data comes from my website nbawowy!, which provides on/off data for any arbitrary combination of players on or off the court (neat). The data set here is only a couple hundred megabytes of JSON data, so of course, Spark is not necessary for doing this job. But it does serve as a useful example, and as said above, the same code could just as easily be run on a cluster of computers (as is, this only takes a couple of minutes to run on a 2015 MacBook Pro).

First we need to import a few Spark libraries and the Python json module for decoding the matchups.

In [118]:

import json

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint, RidgeRegressionWithSGD

Are you still with me? Good. Now we are going to load the matchup data. The function

textFile

is a Spark function that converts a text file consisting of lines (or “rows”) of data to a Spark data structure called an

RDD

(for

Resilient Distribute Dataset

). From the Spark documentation:

"Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel."

The RDD is the basic Spark abstraction and a Spark program involves manipulating RDDs using basic operations that will be familiar to developers who have written MR jobs in the past, including

map

,

reduce

, and

filter

. For example, in the follwing code, we

map

over the RDD, converting each JSON string to a Python dictionary (or dict). We then

filter

the RDD keeping all rows that have matchups from the last two seasons (the entire data set goes back to 2013). We then

cache

the RDD so that we can keep it in memory and use it more efficiently in our Spark job.

One more thing.

sc

is the SparkContext, which is the Python class that makes all of this work. It is automatically generated in the context of the iPython notebook which I am using to do this analysis. In a non-interactive Python script, you would have to explicitly load the SparkContext (it’s not hard, don’t worry, you can do it!)

In [119]:

matchups = (sc
            .textFile('matchups/matchups.json')
            .map(lambda string: json.loads(str(string)))
            .filter(lambda matchup: int(matchup['season']) in [2014, 2015])
            .cache())

Here we’re just going to peek inside the RDD and look at a single row of data using the first operator. Everything inside the print statement consists of pure Python operations. The nice thing about Spark is how it makes it so easy to move data back and forth between the Spark RDD abstraction and the driver program running our script.

What you see here is the “atomic unit” of data that will go into our RAPM calculation, namely the players on the court during a stint (the period of time from one substitution to another), the number of possessions, and the point margin between the two teams.

In [120]:

example = matchups.first()
print('home: {}naway: {}nseason: {}nhome unit: {}naway unit: {}nnpossessions: {}nhome scored: {}naway scored: {}'
      .format(example['home'],                                 
              example['away'],   
              example['season'],                                
              example[example['home']]['on'],
              example[example['away']]['on'],
              (example[example['home']]['stats']['poss']+example[example['home']]['stats']['poss'])/2.,
              example[example['home']]['stats']['pts'],
              example[example['away']]['stats']['pts']))

home: Pacers away: Magic season: 2014 home unit: [u'George Hill', u'Lance Stephenson', u'Paul George', u'David West', u'Roy Hibbert'] away unit: [u'Jameer Nelson', u'Arron Afflalo', u'Maurice Harkless', u'Jason Maxiell', u'Nikola Vucevic'] possessions: 14.0 home scored: 14 away scored: 7

Below we create a unique list of players during the last two seasons by doing a couple more Spark operations involving

flatMap

which basically “flattens” a list of lists into a single list containing all the elements of the lists. So for example, imagine that a

map

operation results in a list

[[a, b, c], [d,e]]

. The

flatMap

version would give

[a,b,c,d,e]

. See?

flatMap

is your friend!

Oh, I didn’t explain what lambda is, but you may have guessed by now that it is what programmers call an anonymous function. It’s a function that you hardly use at all, so you don’t even have to give it a name. Cool, right? Functional programming FTW and stuff.

There’s a line in here where I use the

broadcast

function. This is a little bit more advanced, but it basically enables the driver program (think of this running on the “Master” node) to ship data to each of the client or worker nodes. The sister operator of

broadcast

is

accumulator

which enables the worker nodes to write data to a single data structure which can be used by the driver program. I’m not using accumulators in this script.

In [121]:

players = (matchups
           .flatMap(lambda stint: (stint[stint[u'home']]['on'], stint[stint[u'away']]['on']))
           .flatMap(lambda players: [str(player) for player in players])
           .distinct())
num_players_broadcast = sc.broadcast(players.count())
print("There were {} total players in 2014-15.".format(players.count()))

There were 580 total players in 2014-15.

Later on we will want to know how many possessions each player played during the last two seasons. The operations below do that calculation. The only new operator introduced here is

reduceByKey

which basically adds up all the possession totals for each player.

collectAsMap

then transforms the RDD into a Python dictionary (or map or hash or whatever you like to call it in other languages). The print statement only includes the first few players for example.

In [175]:

players_poss = (matchups
           .map(lambda stint: (stint[stint[u'home']]['on'] + stint[stint[u'away']]['on'],
                               (stint[stint[u'home']]['stats']['poss'] + stint[stint[u'away']]['stats']['poss'])/2.))
           .flatMap(lambda (players, poss): [(str(player), poss) for player in players])
           .reduceByKey(lambda a,b: a+b)
           .collectAsMap())
print([(player, poss) for (player, poss) in players_poss.iteritems()][0:10])

[('Pau Gasol', 9690.5), ('Jodie Meeks', 8078.0), ('Joakim Noah', 10277.5), ('Derrick Williams', 6481.0), ('Randy Foye', 7294.5), ('Robbie Hummel', 2736.5), ('Terrence Ross', 8619.0), ('Mike Miller', 5001.0), ('Jorge Gutierrez', 837.5), ('Draymond Green', 10709.5)]

In the following code we assign a unique ID or index to each player using another Spark function called

zipWithIndex

. We will use this dictionary later to create sparse vectors for the regression analysis.

In [144]:

players_index = players.zipWithIndex()
players_dict = players_index.collectAsMap()
players_dict_broadcast = sc.broadcast(players_dict)
print([(player, index) for (player, index) in players_dict_broadcast.value.iteritems()][0:10])

[('Pau Gasol', 489), ('Jodie Meeks', 90), ('Joakim Noah', 3), ('Derrick Williams', 202), ('Randy Foye', 93), ('Robbie Hummel', 5), ('Jeffery Taylor', 236), ('Mike Miller', 203), ('Jorge Gutierrez', 557), ('Draymond Green', 308)]

Here we create a Python function called

createLabeledPointFromMatchup

which will be used in the next step in a

map

step to transform our matchup data into something that can be used by the Spark machine learning library (

MLlib

). A

LabeledPoint

is a Spark-specific Python class that contains a label (in this case the point margin of a single stint), along with a

SparseVector

containing the indices of the players on the court for each team during the stint. The idea of a sparse vector is that it save a ton of space. Imagine you were doing this analysis in Excel. You might have each player in the NBA represented by a different column. That’s 580 columns, that we can essentially reduce to 10 columns. Wouldn’t you do that, if you could? Well, you can. Yay.

In [151]:

def createLabeledPointFromMatchup(m):
    global players_dict_broadcast
    global num_players_broadcast
    home = m['home']
    away = m['away']
    home_unit = m[home]['on']
    away_unit = m[away]['on']
    home_poss = m[home]['stats']['poss']
    away_poss = m[away]['stats']['poss']
    avg_poss = (home_poss+away_poss)/2.
    if avg_poss <= 0:
        avg_poss = 1
    players_dict = {players_dict_broadcast.value[player]:avg_poss for player in home_unit}
    players_dict.update({players_dict_broadcast.value[player]:-avg_poss for player in away_unit})
    home_pts = m[home]['stats']['pts']
    away_pts = m[away]['stats']['pts']
    return LabeledPoint(100*(home_pts-away_pts)/avg_poss, SparseVector(num_players_broadcast.value, players_dict))

We’re getting closer to actually building the model. In this step we

map

over the matchup RDD with our Python function defined above, and then we split the data into a training and test set. The weights mean that 80% of the data goes into the training set, with the remaining 20% used for testing.

In [152]:

parsedData = (matchups
              .map(createLabeledPointFromMatchup)
              .cache())
parsedTrainData, parsedTestData = parsedData.randomSplit(weights=[0.8, 0.2], seed=42)
parsedTrainData.take(1)

Out[152]:

[LabeledPoint(51.8518518519, (580,[16,85,103,244,282,349,456,488,500,521],[ 13.5,13.5,-13.5,-13.5,-13.5,-13.5,-13.5,13.5,13.5,13.5]))]

Now we define another Python function to calculate the squared error between our “label” (the actual point margin) and the prediction we will get from the model.

In [135]:

import math
def calcSqrLoss(label, prediction):
    return (prediction - label) ** 2

Now all that remains is to train the model. Note how much effort we spent just to get the data ready for this step? Yep, welcome to the wonderful world of machine learning! In the following code snippet we loop through a few different (admittedly hand-picked) “hyperparameters” that will be used to find the optimal parameter for regularization of the model. I know, lotsa fancy words there. Basically, we want to the model to work really well on the training set, but not so well that it doesn’t perform well on the test set too. The regularization helps us do that. Actually, it helps to know that regularization is the “R” in RAPM and it is the thing that makes the second generation of +/- models more predictive than the first generation.

RidgeRegressionWithSGD is the workhorse function that creates the model by using (being trained on) the training data set. If you are curious (you are!), the SGD stands for stochastic gradient descent which is the name of the algorithm being used to find the optimal weights for the model that minimize the error between the predictions and the actual point margins in the training set.

In [157]:

bestModelRMS = 200
bestModel = None
bestParam = None
print('Starting grid search...')
for param in [0, 5e-3, 1e-1, 1.5e-1, 2e-1, 2.5e-1, 3e-1]:
    model = RidgeRegressionWithSGD.train(parsedTrainData, regParam=param, step=0.5, iterations=200)
    labelsAndPreds = parsedTestData.map(lambda lp: (lp.label, model.predict(lp.features)))
    rmsErr = math.sqrt(labelsAndPreds.map(lambda (l, p): calcSqrLoss(l, p)).mean())
    print((rmsErr, param))
    if rmsErr < bestModelRMS:
        bestModelRMS = rmsErr
        bestModel = model
        bestParam = param
print("Finished!")
print("Best model RMS = {}nbest regularization param = {}.".format(bestModelRMS, bestParam))

Starting grid search... (134.90988221498682, 0) (134.90658068951853, 0.005) (134.87701129077732, 0.1) (134.8739237228102, 0.15) (134.8739858893256, 0.2) (134.8756014670794, 0.25) (134.87797940494727, 0.3) Finished! Best model RMS = 134.873923723 best regularization param = 0.15.

Once we decide on the best regularization parameter, all that remains is to print out the players, their RAPM ratings, and how many possessions they played in the last two seasons. Here I’m filtering out players that played fewer than 8000 possessions and just showing the top 25 players.

In [176]:

model_weights = model.weights
model_dict = {name: (model_weights[index], int(players_poss[name])) for (name, index) in players_dict.iteritems()
             if players_poss[name] >= 8000}
sorted_players = sorted(model_dict.items(), key=lambda x: -x[1][0])
for player in sorted_players[0:24]:
    print('{},{:.3f},{}'.format(player[0], player[1][0], player[1][1]))

Chris Paul,1.162,11837 LeBron James,1.074,13264 Stephen Curry,1.067,13357 Russell Westbrook,0.990,9028 James Harden,0.985,13393 Kyle Korver,0.985,10931 Kyle Lowry,0.953,10999 Draymond Green,0.917,10709 Kevin Durant,0.848,9736 Dirk Nowitzki,0.839,10432 Zach Randolph,0.826,10529 Kawhi Leonard,0.822,9657 Danny Green,0.813,9113 LaMarcus Aldridge,0.799,11057 Andre Iguodala,0.796,9961 Anthony Davis,0.789,9552 George Hill,0.774,8283 Monta Ellis,0.725,12210 Marcin Gortat,0.713,11289 Klay Thompson,0.686,12735 Dwight Howard,0.673,8776 Kevin Love,0.660,10803 Carmelo Anthony,0.644,8252 Deron Williams,0.619,9145

The resulting rankings don’t look too bad, right? The regularization is shrinking the RAPM coefficients pretty significantly, more than I’m used to from my previous efforts in other frameworks, actually. If you catch something in my code (or modeling approach) that you think might be cause for this, please let me know. For now, I hope you think of this article more as a recipe or strategy than a final final thingy. It’s obviously not that. But the idea going forward should be pretty clear. Get a bunch of data and computers and modelize all the things. Winning basketball will ensue as a result. If it’s not lit already, I sincerely hope Spark is the spark that ignites your modeling flame.