Nylon Calculus 101: Python Data Wrangling – Merging PBP Data and Shot Logs
In this edition of Nylon Calculus 101, the great and greatly unheralded Darryl Blackport walks us through the steps to creating one of the most important data sets used by authors at Nylon Calculus, which we’ve taken to calling the Detailed Shot Logs.
One of the great things about all the publically available data on stats.nba.com is it has created opportunities for people with some programming skills and some creativity to merge different data sets and create new data sets for deeper more complex analysis. A good, recent example of this is the positional estimates from Seth Partnow. A lot can be done using nothing more than publically available data, an idea for how to transform this data into something new and useful and some coding skills to put it together (and some frustration when your idea doesn’t work as originally planned on the first try). As anyone who has worked a lot with data will tell you, the hardest part is not the analysis itself but getting and organizing the data into a format that you can use to perform that analysis. This article will show you how to do just that with one of the most used data sets here at Nylon Calculus – the detailed shot logs created by merging the player tracking shot logs with play by play data. This allows you to include things like who assisted on the shot and who blocked the shot. Since this links the shot logs to the play by play, you can also figure out who was on the floor for each shot as well as the prior event. I will show you how to merge these two data sets for one player in one quarter, but you can expand on this and do it for the entire season. I will be doing this using Python 2.7. If you are more familiar with R this can still be of use to you since the steps and logic would be exactly the same.
We will start by importing the packages we need to extract the data from stats.nba.com and store it as a pandas data frame.
In [1]:
import pandas as pd import json import urllib2 from IPython.display import display
For this example, we will be merging Stephen Curry’s shots from the third quarter of one game this past season.
In [2]:
game_id = "0021400014" player_id = "201939" period = 3
First we will need to get the data. If you don’t know how to find the URLs below, here is a good resource.
In [3]:
pbp_url = "http://stats.nba.com/stats/playbyplayv2?GameId="+game_id+"&StartPeriod=0&EndPeriod=10&RangeType=2&StartRange=0&EndRange=55800" player_tracking_shot_logs_url = "http://stats.nba.com/stats/playerdashptshotlog?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&PlayerID="+player_id+"&Season=2014-15&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision="
I won’t go into details but this is a function to read the data we want from the URLs above and transform it into a pandas data frame.
In [4]:
def get_data_from_url(url): response = urllib2.urlopen(url) data = json.loads(response.read()) headers = data['resultSets'][0]['headers'] rows = data['resultSets'][0]['rowSet'] data_dict = [dict(zip(headers, row)) for row in rows] return pd.DataFrame(data_dict)
Get the play by play data for the game and examine the head of the data frame.
In [5]:
pbp_data = get_data_from_url(pbp_url) with pd.option_context('display.max_columns', None): display(pbp_data.head())
EVENTMSGACTIONTYPE | EVENTMSGTYPE | EVENTNUM | GAME_ID | HOMEDESCRIPTION | NEUTRALDESCRIPTION | PCTIMESTRING | PERIOD | PERSON1TYPE | PERSON2TYPE | PERSON3TYPE | PLAYER1_ID | PLAYER1_NAME | PLAYER1_TEAM_ABBREVIATION | PLAYER1_TEAM_CITY | PLAYER1_TEAM_ID | PLAYER1_TEAM_NICKNAME | PLAYER2_ID | PLAYER2_NAME | PLAYER2_TEAM_ABBREVIATION | PLAYER2_TEAM_CITY | PLAYER2_TEAM_ID | PLAYER2_TEAM_NICKNAME | PLAYER3_ID | PLAYER3_NAME | PLAYER3_TEAM_ABBREVIATION | PLAYER3_TEAM_CITY | PLAYER3_TEAM_ID | PLAYER3_TEAM_NICKNAME | SCORE | SCOREMARGIN | VISITORDESCRIPTION | WCTIMESTRING | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 12 | 0 | 0021400014 | None | None | 12:00 | 1 | 0 | 0 | 0 | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | None | 10:15 PM |
1 | 0 | 10 | 1 | 0021400014 | Jump Ball Thompson vs. Bogut: Tip to | None | 12:00 | 1 | 4 | 5 | 2 | 201574 | Jason Thompson | SAC | Sacramento | 1610612758 | Kings | 101106 | Andrew Bogut | GSW | Golden State | 1610612744 | Warriors | 1610612758 | None | None | None | NaN | None | None | None | None | 10:16 PM |
2 | 5 | 2 | 2 | 0021400014 | MISS Cousins 3′ Layup | None | 11:43 | 1 | 4 | 0 | 0 | 202326 | DeMarcus Cousins | SAC | Sacramento | 1610612758 | Kings | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | None | 10:16 PM |
3 | 0 | 4 | 3 | 0021400014 | None | None | 11:41 | 1 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | Curry REBOUND (Off:0 Def:1) | 10:16 PM |
4 | 1 | 2 | 4 | 0021400014 | None | None | 11:35 | 1 | 5 | 0 | 0 | 202691 | Klay Thompson | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Thompson 25′ 3PT Jump Shot | 10:16 PM |
5 rows × 33 columns
Get Stephen Curry’s shot logs and examine the head of the data frame. Note that this has all his shots from the regular season.
In [6]:
shot_log_data = get_data_from_url(player_tracking_shot_logs_url) with pd.option_context('display.max_columns', None): display(shot_log_data.head())
CLOSEST_DEFENDER | CLOSEST_DEFENDER_PLAYER_ID | CLOSE_DEF_DIST | DRIBBLES | FGM | FINAL_MARGIN | GAME_CLOCK | GAME_ID | LOCATION | MATCHUP | PERIOD | PTS | PTS_TYPE | SHOT_CLOCK | SHOT_DIST | SHOT_NUMBER | SHOT_RESULT | TOUCH_TIME | W | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lawson, Ty | 201951 | 2.3 | 4 | 0 | 7 | 10:17 | 0021401229 | H | APR 15, 2015 – GSW vs. DEN | 1 | 0 | 2 | 20.4 | 4.3 | 1 | missed | 4.5 | W |
1 | Gallinari, Danilo | 201568 | 3.5 | 1 | 1 | 7 | 7:32 | 0021401229 | H | APR 15, 2015 – GSW vs. DEN | 1 | 2 | 2 | 20.1 | 13.7 | 2 | made | 3.9 | W |
2 | Lawson, Ty | 201951 | 3.3 | 6 | 0 | 7 | 6:16 | 0021401229 | H | APR 15, 2015 – GSW vs. DEN | 1 | 0 | 2 | 20.5 | 23.8 | 3 | missed | 4.4 | W |
3 | Lawson, Ty | 201951 | 2.4 | 0 | 0 | 7 | 4:32 | 0021401229 | H | APR 15, 2015 – GSW vs. DEN | 1 | 0 | 3 | 11.0 | 25.8 | 4 | missed | 0.6 | W |
4 | Lawson, Ty | 201951 | 2.9 | 1 | 1 | 7 | 3:42 | 0021401229 | H | APR 15, 2015 – GSW vs. DEN | 1 | 3 | 3 | 23.0 | 24.9 | 5 | made | 1.7 | W |
5 rows × 19 columns
We need to filter out Curry’s shots for the period we are working with from both data sets. In the play by play data we can filter out shots by using the EVENTMSGTYPE column. Shots are EVENTMSGTYPE 1(makes) and 2(misses).
In [7]:
pbp_player_period = pbp_data[(pbp_data['PERIOD'] == period) & (pbp_data['PLAYER1_ID'] == int(player_id)) & ((pbp_data['EVENTMSGTYPE'] == 1) | (pbp_data['EVENTMSGTYPE'] == 2))] shot_logs_player_period = shot_log_data[(shot_log_data['PERIOD'] == period) & (shot_log_data['GAME_ID'] == game_id)]
Let’s look at the shots from the play by play.
In [8]:
with pd.option_context('display.max_columns', None): display(pbp_player_period)
EVENTMSGACTIONTYPE | EVENTMSGTYPE | EVENTNUM | GAME_ID | HOMEDESCRIPTION | NEUTRALDESCRIPTION | PCTIMESTRING | PERIOD | PERSON1TYPE | PERSON2TYPE | PERSON3TYPE | PLAYER1_ID | PLAYER1_NAME | PLAYER1_TEAM_ABBREVIATION | PLAYER1_TEAM_CITY | PLAYER1_TEAM_ID | PLAYER1_TEAM_NICKNAME | PLAYER2_ID | PLAYER2_NAME | PLAYER2_TEAM_ABBREVIATION | PLAYER2_TEAM_CITY | PLAYER2_TEAM_ID | PLAYER2_TEAM_NICKNAME | PLAYER3_ID | PLAYER3_NAME | PLAYER3_TEAM_ABBREVIATION | PLAYER3_TEAM_CITY | PLAYER3_TEAM_ID | PLAYER3_TEAM_NICKNAME | SCORE | SCOREMARGIN | VISITORDESCRIPTION | WCTIMESTRING | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
294 | 1 | 2 | 349 | 0021400014 | None | None | 10:18 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 3PT Jump Shot | 11:42 PM |
319 | 1 | 2 | 378 | 0021400014 | None | None | 8:02 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 25′ 3PT Jump Shot | 11:47 PM |
365 | 1 | 1 | 431 | 0021400014 | None | None | 4:26 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | 60 – 54 | -6 | Curry 25′ 3PT Jump Shot (15 PTS) | 11:59 PM |
385 | 5 | 2 | 454 | 0021400014 | None | None | 2:33 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 3′ Layup | 12:04 AM |
413 | 1 | 2 | 492 | 0021400014 | None | None | 0:42 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 25′ 3PT Jump Shot | 12:12 AM |
5 rows × 33 columns
Let’s look at the shots from the shot logs
In [9]:
with pd.option_context('display.max_columns', None): display(shot_logs_player_period)
CLOSEST_DEFENDER | CLOSEST_DEFENDER_PLAYER_ID | CLOSE_DEF_DIST | DRIBBLES | FGM | FINAL_MARGIN | GAME_CLOCK | GAME_ID | LOCATION | MATCHUP | PERIOD | PTS | PTS_TYPE | SHOT_CLOCK | SHOT_DIST | SHOT_NUMBER | SHOT_RESULT | TOUCH_TIME | W | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1309 | Collison, Darren | 201954 | 2.7 | 0 | 0 | 18 | 10:22 | 0021400014 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 10.6 | 23.3 | 8 | missed | 0.7 | W |
1310 | Thompson, Jason | 201574 | 7.1 | 0 | 0 | 18 | 8:05 | 0021400014 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 16.6 | 27.4 | 9 | missed | 2.0 | W |
1311 | Collison, Darren | 201954 | 4.2 | 6 | 1 | 18 | 4:29 | 0021400014 | A | OCT 29, 2014 – GSW @ SAC | 3 | 3 | 3 | 19.0 | 25.4 | 10 | made | 5.2 | W |
1312 | Landry, Carl | 201171 | 0.8 | 6 | 0 | 18 | 2:36 | 0021400014 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 2 | 17.0 | 2.9 | 11 | missed | 6.9 | W |
1313 | Stauskas, Nik | 203917 | 2.5 | 4 | 0 | 18 | 0:45 | 0021400014 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 20.3 | 24.9 | 12 | missed | 3.4 | W |
5 rows × 19 columns
So he took 5 shots in the quarter. Note that since we are dealing with two separate data sets, sometimes the number of shots a player took in the quarter is different in each data set.Now that we have the data for the quarter from both data sets we can combine them. The logical step is to combine them where the time is the same in both data sets. Take a look at the PCTIMESTRING column from the play by play data and the GAME_CLOCK column from the shot logs column. These columns both represent the time remaining in the quarter. You will notice that they don’t match up and are a few seconds off. The first shot in the player tracking logs was taken with 10:22 remaining and in the the play by play it was taken with 10:18 remaining. You will notice that for all shots the time is off by 3-4 seconds. This makes things a little more challenging since we can’t just merge where play by play time equals shot log time. Since they are only off by a few seconds we can just compare the times and merge them when they are within a few seconds. Note that not all cases are as simple as this example. Sometimes the times are off by so much that they probably aren’t the same shot. These are just normal issues that arise when dealing with data from different sources.
To make things easier, we should convert the times from mm:ss format to just the number of seconds remaining in the period. So if the time is 1:30 we will convert it to 90. Another thing we should do is order both by time. Since there are cases where a player may shoot twice within a few seconds, if we only look at the time and keep shots within a few seconds we might link a shot in the shot logs to multiple shots in the play by play. If we order the shots by time and only compare the first shot in the shot logs with the first shot in the play by play, and so on, then we can prevent this from happening.
In [10]:
pbp_split = pbp_player_period['PCTIMESTRING'].str.split(":") pbp_player_period['seconds'] = pbp_split.map(lambda x: int(x[0])*60 + int(x[1])) shot_logs_split = shot_logs_player_period['GAME_CLOCK'].str.split(":") shot_logs_player_period['seconds'] = shot_logs_split.map(lambda x: int(x[0])*60 + int(x[1])) pbp_player_period = pbp_player_period.sort('seconds', ascending=False) shot_logs_player_period = shot_logs_player_period.sort('seconds', ascending=False) pbp_player_period = pbp_player_period.reset_index(drop=True) shot_logs_player_period = shot_logs_player_period.reset_index(drop=True)
Now we are set up to link the shots from the shot logs to a play by play event. To do this we will simply iterate though each row in the shot logs table and compare the time with the time on the same numbered row in the play by play table. If they are within 5 seconds we will assume they are the same shot. Note that I chose 5 seconds to be safe. You could extend this if you wanted to and probably be fine and link a few more shots to a play by play event.
In [11]:
shot_log_event_num = [] # keep shots where times are within 5 seconds between datasets and shot number for period is equal for i, row in shot_logs_player_period.iterrows(): if abs(row['seconds'] - pbp_player_period['seconds'].iloc[i]) <= 5: shot_log_event_num.append({"GAME_ID": game_id, "PLAYER_ID": player_id, "SHOT_NUMBER": row['SHOT_NUMBER'], "EVENTNUM": pbp_player_period['EVENTNUM'].iloc[i]}) shot_log_pbp_links = pd.DataFrame(shot_log_event_num) shot_log_pbp_links
Out[11]:
EVENTNUM | GAME_ID | PLAYER_ID | SHOT_NUMBER | |
---|---|---|---|---|
0 | 349 | 0021400014 | 201939 | 8 |
1 | 378 | 0021400014 | 201939 | 9 |
2 | 431 | 0021400014 | 201939 | 10 |
3 | 454 | 0021400014 | 201939 | 11 |
4 | 492 | 0021400014 | 201939 | 12 |
5 rows × 4 columns
This table has all we need to merge the shots logs with the play by play so let’s merge them.
In [39]:
shot_logs_with_eventnum = pd.merge(shot_log_pbp_links, shot_logs_player_period, on=["GAME_ID", "SHOT_NUMBER"]) shot_logs_merged_pbp = pd.merge(shot_logs_with_eventnum, pbp_player_period, on=["EVENTNUM", "GAME_ID"]) with pd.option_context('display.max_columns', None): display(shot_logs_merged_pbp)
EVENTNUM | GAME_ID | PLAYER_ID | SHOT_NUMBER | CLOSEST_DEFENDER | CLOSEST_DEFENDER_PLAYER_ID | CLOSE_DEF_DIST | DRIBBLES | FGM | FINAL_MARGIN | GAME_CLOCK | LOCATION | MATCHUP | PERIOD_x | PTS | PTS_TYPE | SHOT_CLOCK | SHOT_DIST | SHOT_RESULT | TOUCH_TIME | W | seconds_x | EVENTMSGACTIONTYPE | EVENTMSGTYPE | HOMEDESCRIPTION | NEUTRALDESCRIPTION | PCTIMESTRING | PERIOD_y | PERSON1TYPE | PERSON2TYPE | PERSON3TYPE | PLAYER1_ID | PLAYER1_NAME | PLAYER1_TEAM_ABBREVIATION | PLAYER1_TEAM_CITY | PLAYER1_TEAM_ID | PLAYER1_TEAM_NICKNAME | PLAYER2_ID | PLAYER2_NAME | PLAYER2_TEAM_ABBREVIATION | PLAYER2_TEAM_CITY | PLAYER2_TEAM_ID | PLAYER2_TEAM_NICKNAME | PLAYER3_ID | PLAYER3_NAME | PLAYER3_TEAM_ABBREVIATION | PLAYER3_TEAM_CITY | PLAYER3_TEAM_ID | PLAYER3_TEAM_NICKNAME | SCORE | SCOREMARGIN | VISITORDESCRIPTION | WCTIMESTRING | seconds_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 349 | 0021400014 | 201939 | 8 | Collison, Darren | 201954 | 2.7 | 0 | 0 | 18 | 10:22 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 10.6 | 23.3 | missed | 0.7 | W | 622 | 1 | 2 | None | None | 10:18 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 3PT Jump Shot | 11:42 PM | 618 |
1 | 378 | 0021400014 | 201939 | 9 | Thompson, Jason | 201574 | 7.1 | 0 | 0 | 18 | 8:05 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 16.6 | 27.4 | missed | 2.0 | W | 485 | 1 | 2 | None | None | 8:02 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 25′ 3PT Jump Shot | 11:47 PM | 482 |
2 | 431 | 0021400014 | 201939 | 10 | Collison, Darren | 201954 | 4.2 | 6 | 1 | 18 | 4:29 | A | OCT 29, 2014 – GSW @ SAC | 3 | 3 | 3 | 19.0 | 25.4 | made | 5.2 | W | 269 | 1 | 1 | None | None | 4:26 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | 60 – 54 | -6 | Curry 25′ 3PT Jump Shot (15 PTS) | 11:59 PM | 266 |
3 | 454 | 0021400014 | 201939 | 11 | Landry, Carl | 201171 | 0.8 | 6 | 0 | 18 | 2:36 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 2 | 17.0 | 2.9 | missed | 6.9 | W | 156 | 5 | 2 | None | None | 2:33 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 3′ Layup | 12:04 AM | 153 |
4 | 492 | 0021400014 | 201939 | 12 | Stauskas, Nik | 203917 | 2.5 | 4 | 0 | 18 | 0:45 | A | OCT 29, 2014 – GSW @ SAC | 3 | 0 | 3 | 20.3 | 24.9 | missed | 3.4 | W | 45 | 1 | 2 | None | None | 0:42 | 3 | 5 | 0 | 0 | 201939 | Stephen Curry | GSW | Golden State | 1610612744 | Warriors | 0 | None | None | None | NaN | None | 0 | None | None | None | NaN | None | None | None | MISS Curry 25′ 3PT Jump Shot | 12:12 AM | 42 |
5 rows × 54 columns
And there we have it, a table that combines the shot log data with play by play data. This is a simple method for merging them and it gets the play by play event number for over 98% of shots. The above method can be tweaked to improve this to 99.4% of shots. If you are curious about those tweaks, I have an NBA scraping library on github that contains those tweaks as well as how to do this same thing with the rebound logs.