
How Predictable is the NBA?


One of the most common ways to evaluate a metric or algorithm is to see how well it can predict the future. Last semester, I worked on a project whose goal was to see how accurately we could predict future outcomes in the NBA. The full details of our findings are here; in this post I will summarize the key takeaways.

At a high level, our approach to predicting future outcomes was to split the data into training and test sets, fit a model to the training data, use that model to predict on the test data-set, and measure how well it does. For example, our data-set had the results of all games from the 2008-2013 NBA seasons. To predict the 2010 season, we would train a model using the data from 2008 and 2009 (training set), and then use that model to predict the 2010 season (test set). Our measure of how well we did (the loss function) is simply the percentage of games in the 2010 season in which we predicted the winning team correctly.
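The season-based split and accuracy measure described above can be sketched in a few lines. Everything here is a toy stand-in: the game records, the seasons, and the naive baseline are invented for illustration, not the project's actual data or models.

```python
# Minimal sketch of the season-based evaluation described above. Training on
# the earlier seasons happens outside this function; `predict` is any fitted
# model that maps a game to a predicted winning team.

def evaluate(games, test_season, predict):
    """Fraction of test-season games whose winner was predicted correctly."""
    test_games = [g for g in games if g["season"] == test_season]
    correct = sum(1 for g in test_games if predict(g) == g["winner"])
    return correct / len(test_games)

# Toy data: each game records the season, the home/away teams, and the winner.
games = [
    {"season": 2010, "home": "ATL", "away": "PHI", "winner": "ATL"},
    {"season": 2010, "home": "PHI", "away": "ATL", "winner": "PHI"},
    {"season": 2010, "home": "BOS", "away": "NYK", "winner": "NYK"},
]

# The naive baseline mentioned below: always pick the home team to win.
home_baseline = lambda g: g["home"]
print(evaluate(games, 2010, home_baseline))  # 2 of 3 games correct
```

The same `evaluate` call works for any of the algorithms compared later; only the `predict` function changes.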

The data and covariates we used for prediction were essentially a less refined version of FiveThirtyEight's new power rankings. For each player on a team, we assigned a weight based on how many minutes per game we expected him to play, and then assigned each team features based on a weighted sum of its individual player statistics.
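The roster-weighting idea can be sketched as a minutes-weighted average of a player statistic. The roster, minutes, and PER values below are made up for illustration; the real features covered many statistics per team.

```python
# Sketch of the roster-weighting scheme: each player's season statistic is
# weighted by expected minutes per game, and the team-level feature is the
# weighted average over the roster.

def team_feature(roster, stat):
    """Minutes-weighted average of a single player statistic over a roster."""
    total_minutes = sum(p["mpg"] for p in roster)
    return sum(p["mpg"] * p[stat] for p in roster) / total_minutes

# Hypothetical three-man roster with expected minutes per game and PER.
roster = [
    {"player": "A", "mpg": 36.0, "per": 20.0},
    {"player": "B", "mpg": 24.0, "per": 15.0},
    {"player": "C", "mpg": 12.0, "per": 10.0},
]
print(round(team_feature(roster, "per"), 2))  # 16.67: pulled toward the high-minutes player
```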

Algorithm Comparison

There are many different algorithms available to predict the results of future games. Below is a look at 8 different methods and their corresponding prediction accuracy over all the years in our data-set. On the far left, we also include a method that naively predicts the home team will win every game (more on this later).

Performance of various algorithms in predicting the 2009-2013 NBA seasons. The y-axis shows the test accuracy: the percentage of games each algorithm predicted correctly in each of the 5 different years.
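A comparison like the one in the plot amounts to running several predictors over the same test games and tabulating accuracy. The two predictors and the ratings below are trivial placeholders; the project's actual candidates were classifiers such as logistic regression, naive Bayes, and support vector machines trained on the team features.

```python
# Generic harness for comparing game predictors on a common test set.

def accuracy(predict, games):
    """Fraction of games whose winner the predictor gets right."""
    return sum(predict(g) == g["winner"] for g in games) / len(games)

# Hypothetical test games with a single made-up team-strength rating.
games = [
    {"home": "ATL", "away": "PHI", "home_rating": 5.1, "away_rating": -2.3, "winner": "ATL"},
    {"home": "PHI", "away": "BOS", "home_rating": -2.3, "away_rating": 3.0, "winner": "BOS"},
    {"home": "NYK", "away": "MIA", "home_rating": 0.5, "away_rating": 6.2, "winner": "NYK"},
]

predictors = {
    "home team always wins": lambda g: g["home"],
    "higher rating wins": lambda g: g["home"] if g["home_rating"] >= g["away_rating"] else g["away"],
}
for name, predict in sorted(predictors.items()):
    print(f"{name}: {accuracy(predict, games):.2f}")
```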

One of the main takeaways from this plot is that prediction accuracy for a given season generally hovers between 60 and 70 percent. Others have attempted this same problem, sometimes doing a little better than us, but it appears that the best we can do in predicting the outcomes of individual games is typically around this range.

From a basketball perspective, you can think of this as the percentage of games that result in an upset. Ideally, these methods should be choosing the favored team to win each game, so an accuracy of 60-70% implies that 30-40% of NBA games result in an upset. This is perhaps a simplistic way to view the result, as it doesn't consider the margin between how good each team is, which is probably an area to look at in the future.

This also shows the inherent randomness in NBA games. One common saying is that it's a "make or miss" league: some games the better team just misses a bunch of shots, and they lose. While the better team is more likely to win any given game, it doesn't always happen. This is simply an attempt to quantify how often you can expect an upset to take place.

From a statistical perspective, this appears to be a big win for logistic regression! While some other algorithms come close in prediction accuracy (notably naive Bayes and support vector machines), logistic regression consistently does the best job of predicting future games.
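To make the logistic regression setup concrete, here is a from-scratch sketch with a single made-up feature (home-minus-away team rating) and an intercept that absorbs home-court advantage. The training points are invented and the fit uses plain gradient ascent; a real analysis would use a library implementation on the full feature set.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Per-sample gradient ascent on the log-likelihood; returns (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # P(home team wins | x)
            b0 += lr * (y - p)
            b1 += lr * (y - p) * x
    return b0, b1

# Toy training data: rating difference -> did the home team win (1) or lose (0)?
xs = [-6.0, -3.0, -1.0, 0.5, 2.0, 5.0]
ys = [0, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

# Predict a home win whenever the modeled probability exceeds one half.
predict_home_win = lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) > 0.5
print([predict_home_win(x) for x in xs])
```

Note the positive intercept the fit tends toward: even at a rating difference of zero, the home team is modeled as the slight favorite, which is exactly the home-court signal discussed below.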

Projection System Comparison

If you look at the above plot closely enough, you can see that the prediction accuracy dips a little bit in the final, 2013-14 season. While it could just be a random blip (i.e., a season with more upsets), there may be some signal there indicating that the NBA is becoming less predictable. To demonstrate this further, let's look at a comparison of predictions for the 2012-13 and 2013-14 seasons from the Weak Side Awareness blog.

Comparison of prediction systems for the 2012-13 (blue) and 2013-14 (red) seasons. We measure how accurate the predictions are using the root-mean-squared-error (RMSE) loss function.
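For reference, the RMSE loss used in this comparison is just the square root of the mean squared gap between projected and actual values. The win totals below are invented purely to show the computation.

```python
import math

def rmse(projected, actual):
    """Root mean squared error between projected and actual values."""
    n = len(projected)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual)) / n)

# Hypothetical projected vs. actual season win totals for three teams.
projected = [50.0, 41.0, 33.0]
actual = [48.0, 45.0, 30.0]
print(round(rmse(projected, actual), 3))  # 3.109
```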

As you can see, the public projection systems did MUCH better in the 2012-13 season than in the 2013-14 season. There are many theories for why this is happening. Looking back at the original plot, we saw a slight dip in home-court advantage in the 2013 season compared with 2012. Since home-court advantage has been such a strong signal for predicting future games in the past, perhaps this slight dip hurt the 2013 projection systems. One can only imagine what this means for projection systems in the current season, as home-court advantage has reached an all-time low!

Metric Comparison

One final takeaway from our project is a comparison of metrics. There has been a long debate in the NBA statistics community over which single metric does the best job of capturing a player's value. We don't attempt to definitively answer this question here, but one way to investigate it is to fit models using only a single metric, and then compare how well the various metrics do at predicting future seasons.

Our takeaway from this investigation is that Jeremias Engelmann's RAPM did the best job of predicting future outcomes. To demonstrate this, let's take two of our most accurate algorithms from above, logistic regression and naive Bayes, and see how they perform using just PER or just RAPM as input data.
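The shape of the single-metric comparison looks roughly like this: build one team-level feature per metric and see which feature picks more winners under the same rule. Every number below is invented to illustrate the mechanics; it is not evidence for either metric, and the real comparison refit each algorithm per metric.

```python
# Sketch of a single-metric comparison: predict the team with the higher
# value of the chosen feature to win, and score each feature's accuracy.

def accuracy_with_feature(games, feature):
    """Accuracy of picking the team with the higher value of `feature`."""
    correct = sum(
        (g["winner"] == "home") == (g["home"][feature] >= g["away"][feature])
        for g in games
    )
    return correct / len(games)

# Hypothetical games with made-up minutes-weighted team PER and RAPM values.
games = [
    {"home": {"per": 16.1, "rapm": 2.3}, "away": {"per": 14.9, "rapm": -1.0}, "winner": "home"},
    {"home": {"per": 15.5, "rapm": -0.5}, "away": {"per": 13.8, "rapm": 1.9}, "winner": "away"},
    {"home": {"per": 14.2, "rapm": 0.8}, "away": {"per": 16.0, "rapm": 0.1}, "winner": "home"},
]
for metric in ("per", "rapm"):
    print(metric, accuracy_with_feature(games, metric))
```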

Comparison of prediction accuracy using just PER and just RAPM. We see that RAPM outperforms PER in all season/algorithm combinations.

From this plot, it's clear that RAPM outperforms PER in every season, with both of our algorithms. This isn't meant to be a grandiose statement about RAPM; you would need a much more careful analysis to formally show that it is the best estimator of player value. However, of all the measures in our project, it did the best job of predicting future outcomes.

This also doesn't mean that you should just stop watching basketball and check out a team's weighted RAPM to know who will win a game. The plot below shows that wins and losses for the home and away teams are still quite messy when you compare their weighted RAPM. There's some clear signal, in that the team with the higher RAPM wins the majority of its games, but there is still a ton of uncertainty regarding the result of any single game. This is why basketball is so fun to watch.

Weighted RAPM for the home and away teams, split by wins and losses. There's some signal here: more black dots sit on the side of the line where the away team is better, and more blue dots on the side where the home team is better.

