Nylon Calculus 101: Python Data Wrangling – Merging PBP Data and Shot Logs

Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) shoots against Cleveland Cavaliers guard Matthew Dellavedova (8) during the first quarter of game six of the NBA Finals at Quicken Loans Arena. Mandatory Credit: David Richard-USA TODAY Sports
Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) shoots against Cleveland Cavaliers guard Matthew Dellavedova (8) during the first quarter of game six of the NBA Finals at Quicken Loans Arena. Mandatory Credit: David Richard-USA TODAY Sports /
facebooktwitterreddit
Nylon1o1logo
Nylon1o1logo /

In this edition of Nylon Calculus 101, the great and greatly unheralded Darryl Blackport walks us through the steps to creating one of the most important data sets used by authors at Nylon Calculus, which we’ve taken to calling the Detailed Shot Logs.


Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) shoots against Cleveland Cavaliers guard Matthew Dellavedova (8) during the first quarter of game six of the NBA Finals at Quicken Loans Arena. Mandatory Credit: David Richard-USA TODAY Sports
Jun 16, 2015; Cleveland, OH, USA; Golden State Warriors guard Stephen Curry (30) shoots against Cleveland Cavaliers guard Matthew Dellavedova (8) during the first quarter of game six of the NBA Finals at Quicken Loans Arena. Mandatory Credit: David Richard-USA TODAY Sports /

One of the great things about all the publically available data on stats.nba.com is it has created opportunities for people with some programming skills and some creativity to merge different data sets and create new data sets for deeper more complex analysis. A good, recent example of this is the positional estimates from Seth Partnow. A lot can be done using nothing more than publically available data, an idea for how to transform this data into something new and useful and some coding skills to put it together (and some frustration when your idea doesn’t work as originally planned on the first try). As anyone who has worked a lot with data will tell you, the hardest part is not the analysis itself but getting and organizing the data into a format that you can use to perform that analysis. This article will show you how to do just that with one of the most used data sets here at Nylon Calculus – the detailed shot logs created by merging the player tracking shot logs with play by play data. This allows you to include things like who assisted on the shot and who blocked the shot. Since this links the shot logs to the play by play, you can also figure out who was on the floor for each shot as well as the prior event. I will show you how to merge these two data sets for one player in one quarter, but you can expand on this and do it for the entire season. I will be doing this using Python 2.7. If you are more familiar with R this can still be of use to you since the steps and logic would be exactly the same.

We will start by importing the packages we need to extract the data from stats.nba.com and store it as a pandas data frame.

In [1]:

import pandas as pd
import json
import urllib2
from IPython.display import display

For this example, we will be merging Stephen Curry’s shots from the third quarter of one game this past season.

In [2]:

game_id = "0021400014"
player_id = "201939"
period = 3

First we will need to get the data. If you don’t know how to find the URLs below, here is a good resource.

In [3]:

pbp_url = "http://stats.nba.com/stats/playbyplayv2?GameId="+game_id+"&StartPeriod=0&EndPeriod=10&RangeType=2&StartRange=0&EndRange=55800"

player_tracking_shot_logs_url = "http://stats.nba.com/stats/playerdashptshotlog?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&PlayerID="+player_id+"&Season=2014-15&SeasonSegment=&SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision="

I won’t go into details but this is a function to read the data we want from the URLs above and transform it into a pandas data frame.

In [4]:

def get_data_from_url(url):
    response = urllib2.urlopen(url)
    data = json.loads(response.read())
    headers = data['resultSets'][0]['headers']
    rows = data['resultSets'][0]['rowSet']
    data_dict = [dict(zip(headers, row)) for row in rows]
    return pd.DataFrame(data_dict)

Get the play by play data for the game and examine the head of the data frame.

In [5]:

pbp_data = get_data_from_url(pbp_url)

with pd.option_context('display.max_columns', None):
    display(pbp_data.head())
EVENTMSGACTIONTYPEEVENTMSGTYPEEVENTNUMGAME_IDHOMEDESCRIPTIONNEUTRALDESCRIPTIONPCTIMESTRINGPERIODPERSON1TYPEPERSON2TYPEPERSON3TYPEPLAYER1_IDPLAYER1_NAMEPLAYER1_TEAM_ABBREVIATIONPLAYER1_TEAM_CITYPLAYER1_TEAM_IDPLAYER1_TEAM_NICKNAMEPLAYER2_IDPLAYER2_NAMEPLAYER2_TEAM_ABBREVIATIONPLAYER2_TEAM_CITYPLAYER2_TEAM_IDPLAYER2_TEAM_NICKNAMEPLAYER3_IDPLAYER3_NAMEPLAYER3_TEAM_ABBREVIATIONPLAYER3_TEAM_CITYPLAYER3_TEAM_IDPLAYER3_TEAM_NICKNAMESCORESCOREMARGINVISITORDESCRIPTIONWCTIMESTRING
001200021400014NoneNone12:0010000NoneNoneNoneNaNNone0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneNone10:15 PM
101010021400014Jump Ball Thompson vs. Bogut: Tip toNone12:001452201574Jason ThompsonSACSacramento1610612758Kings101106Andrew BogutGSWGolden State1610612744Warriors1610612758NoneNoneNoneNaNNoneNoneNoneNone10:16 PM
25220021400014MISS Cousins 3′ LayupNone11:431400202326DeMarcus CousinsSACSacramento1610612758Kings0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneNone10:16 PM
30430021400014NoneNone11:411500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneCurry REBOUND (Off:0 Def:1)10:16 PM
41240021400014NoneNone11:351500202691Klay ThompsonGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Thompson 25′ 3PT Jump Shot10:16 PM

5 rows × 33 columns

Get Stephen Curry’s shot logs and examine the head of the data frame. Note that this has all his shots from the regular season.

In [6]:

shot_log_data = get_data_from_url(player_tracking_shot_logs_url)

with pd.option_context('display.max_columns', None):
    display(shot_log_data.head())
CLOSEST_DEFENDERCLOSEST_DEFENDER_PLAYER_IDCLOSE_DEF_DISTDRIBBLESFGMFINAL_MARGINGAME_CLOCKGAME_IDLOCATIONMATCHUPPERIODPTSPTS_TYPESHOT_CLOCKSHOT_DISTSHOT_NUMBERSHOT_RESULTTOUCH_TIMEW
0Lawson, Ty2019512.340710:170021401229HAPR 15, 2015 – GSW vs. DEN10220.44.31missed4.5W
1Gallinari, Danilo2015683.51177:320021401229HAPR 15, 2015 – GSW vs. DEN12220.113.72made3.9W
2Lawson, Ty2019513.36076:160021401229HAPR 15, 2015 – GSW vs. DEN10220.523.83missed4.4W
3Lawson, Ty2019512.40074:320021401229HAPR 15, 2015 – GSW vs. DEN10311.025.84missed0.6W
4Lawson, Ty2019512.91173:420021401229HAPR 15, 2015 – GSW vs. DEN13323.024.95made1.7W

5 rows × 19 columns

We need to filter out Curry’s shots for the period we are working with from both data sets. In the play by play data we can filter out shots by using the EVENTMSGTYPE column. Shots are EVENTMSGTYPE 1(makes) and 2(misses).

In [7]:

pbp_player_period = pbp_data[(pbp_data['PERIOD'] == period) & (pbp_data['PLAYER1_ID'] == int(player_id)) & ((pbp_data['EVENTMSGTYPE'] == 1) | (pbp_data['EVENTMSGTYPE'] == 2))]

shot_logs_player_period = shot_log_data[(shot_log_data['PERIOD'] == period) & (shot_log_data['GAME_ID'] == game_id)]

Let’s look at the shots from the play by play.

In [8]:

with pd.option_context('display.max_columns', None):
    display(pbp_player_period)
EVENTMSGACTIONTYPEEVENTMSGTYPEEVENTNUMGAME_IDHOMEDESCRIPTIONNEUTRALDESCRIPTIONPCTIMESTRINGPERIODPERSON1TYPEPERSON2TYPEPERSON3TYPEPLAYER1_IDPLAYER1_NAMEPLAYER1_TEAM_ABBREVIATIONPLAYER1_TEAM_CITYPLAYER1_TEAM_IDPLAYER1_TEAM_NICKNAMEPLAYER2_IDPLAYER2_NAMEPLAYER2_TEAM_ABBREVIATIONPLAYER2_TEAM_CITYPLAYER2_TEAM_IDPLAYER2_TEAM_NICKNAMEPLAYER3_IDPLAYER3_NAMEPLAYER3_TEAM_ABBREVIATIONPLAYER3_TEAM_CITYPLAYER3_TEAM_IDPLAYER3_TEAM_NICKNAMESCORESCOREMARGINVISITORDESCRIPTIONWCTIMESTRING
294123490021400014NoneNone10:183500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 3PT Jump Shot11:42 PM
319123780021400014NoneNone8:023500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 25′ 3PT Jump Shot11:47 PM
365114310021400014NoneNone4:263500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNone60 – 54-6Curry 25′ 3PT Jump Shot (15 PTS)11:59 PM
385524540021400014NoneNone2:333500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 3′ Layup12:04 AM
413124920021400014NoneNone0:423500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 25′ 3PT Jump Shot12:12 AM

5 rows × 33 columns

Let’s look at the shots from the shot logs

In [9]:

with pd.option_context('display.max_columns', None):
    display(shot_logs_player_period)
CLOSEST_DEFENDERCLOSEST_DEFENDER_PLAYER_IDCLOSE_DEF_DISTDRIBBLESFGMFINAL_MARGINGAME_CLOCKGAME_IDLOCATIONMATCHUPPERIODPTSPTS_TYPESHOT_CLOCKSHOT_DISTSHOT_NUMBERSHOT_RESULTTOUCH_TIMEW
1309Collison, Darren2019542.7001810:220021400014AOCT 29, 2014 – GSW @ SAC30310.623.38missed0.7W
1310Thompson, Jason2015747.100188:050021400014AOCT 29, 2014 – GSW @ SAC30316.627.49missed2.0W
1311Collison, Darren2019544.261184:290021400014AOCT 29, 2014 – GSW @ SAC33319.025.410made5.2W
1312Landry, Carl2011710.860182:360021400014AOCT 29, 2014 – GSW @ SAC30217.02.911missed6.9W
1313Stauskas, Nik2039172.540180:450021400014AOCT 29, 2014 – GSW @ SAC30320.324.912missed3.4W

5 rows × 19 columns

So he took 5 shots in the quarter. Note that since we are dealing with two separate data sets, sometimes the number of shots a player took in the quarter is different in each data set.Now that we have the data for the quarter from both data sets we can combine them. The logical step is to combine them where the time is the same in both data sets. Take a look at the PCTIMESTRING column from the play by play data and the GAME_CLOCK column from the shot logs column. These columns both represent the time remaining in the quarter. You will notice that they don’t match up and are a few seconds off. The first shot in the player tracking logs was taken with 10:22 remaining and in the the play by play it was taken with 10:18 remaining. You will notice that for all shots the time is off by 3-4 seconds. This makes things a little more challenging since we can’t just merge where play by play time equals shot log time. Since they are only off by a few seconds we can just compare the times and merge them when they are within a few seconds. Note that not all cases are as simple as this example. Sometimes the times are off by so much that they probably aren’t the same shot. These are just normal issues that arise when dealing with data from different sources.

To make things easier, we should convert the times from mm:ss format to just the number of seconds remaining in the period. So if the time is 1:30 we will convert it to 90. Another thing we should do is order both by time. Since there are cases where a player may shoot twice within a few seconds, if we only look at the time and keep shots within a few seconds we might link a shot in the shot logs to multiple shots in the play by play. If we order the shots by time and only compare the first shot in the shot logs with the first shot in the play by play, and so on, then we can prevent this from happening.

In [10]:

pbp_split = pbp_player_period['PCTIMESTRING'].str.split(":")
pbp_player_period['seconds'] = pbp_split.map(lambda x: int(x[0])*60 + int(x[1]))

shot_logs_split = shot_logs_player_period['GAME_CLOCK'].str.split(":")
shot_logs_player_period['seconds'] = shot_logs_split.map(lambda x: int(x[0])*60 + int(x[1]))

pbp_player_period = pbp_player_period.sort('seconds', ascending=False)
shot_logs_player_period = shot_logs_player_period.sort('seconds', ascending=False)

pbp_player_period = pbp_player_period.reset_index(drop=True)
shot_logs_player_period = shot_logs_player_period.reset_index(drop=True)

Now we are set up to link the shots from the shot logs to a play by play event. To do this we will simply iterate though each row in the shot logs table and compare the time with the time on the same numbered row in the play by play table. If they are within 5 seconds we will assume they are the same shot. Note that I chose 5 seconds to be safe. You could extend this if you wanted to and probably be fine and link a few more shots to a play by play event.

In [11]:

shot_log_event_num = []
# keep shots where times are within 5 seconds between datasets and shot number for period is equal

for i, row in shot_logs_player_period.iterrows():
    if abs(row['seconds'] - pbp_player_period['seconds'].iloc[i]) <= 5:
        shot_log_event_num.append({"GAME_ID": game_id, "PLAYER_ID": player_id, "SHOT_NUMBER": row['SHOT_NUMBER'], "EVENTNUM": pbp_player_period['EVENTNUM'].iloc[i]})

shot_log_pbp_links = pd.DataFrame(shot_log_event_num)
shot_log_pbp_links

Out[11]:

EVENTNUMGAME_IDPLAYER_IDSHOT_NUMBER
034900214000142019398
137800214000142019399
2431002140001420193910
3454002140001420193911
4492002140001420193912

5 rows × 4 columns

This table has all we need to merge the shots logs with the play by play so let’s merge them.

In [39]:

shot_logs_with_eventnum = pd.merge(shot_log_pbp_links, shot_logs_player_period, on=["GAME_ID", "SHOT_NUMBER"])

shot_logs_merged_pbp = pd.merge(shot_logs_with_eventnum, pbp_player_period, on=["EVENTNUM", "GAME_ID"])

with pd.option_context('display.max_columns', None):
    display(shot_logs_merged_pbp)
EVENTNUMGAME_IDPLAYER_IDSHOT_NUMBERCLOSEST_DEFENDERCLOSEST_DEFENDER_PLAYER_IDCLOSE_DEF_DISTDRIBBLESFGMFINAL_MARGINGAME_CLOCKLOCATIONMATCHUPPERIOD_xPTSPTS_TYPESHOT_CLOCKSHOT_DISTSHOT_RESULTTOUCH_TIMEWseconds_xEVENTMSGACTIONTYPEEVENTMSGTYPEHOMEDESCRIPTIONNEUTRALDESCRIPTIONPCTIMESTRINGPERIOD_yPERSON1TYPEPERSON2TYPEPERSON3TYPEPLAYER1_IDPLAYER1_NAMEPLAYER1_TEAM_ABBREVIATIONPLAYER1_TEAM_CITYPLAYER1_TEAM_IDPLAYER1_TEAM_NICKNAMEPLAYER2_IDPLAYER2_NAMEPLAYER2_TEAM_ABBREVIATIONPLAYER2_TEAM_CITYPLAYER2_TEAM_IDPLAYER2_TEAM_NICKNAMEPLAYER3_IDPLAYER3_NAMEPLAYER3_TEAM_ABBREVIATIONPLAYER3_TEAM_CITYPLAYER3_TEAM_IDPLAYER3_TEAM_NICKNAMESCORESCOREMARGINVISITORDESCRIPTIONWCTIMESTRINGseconds_y
034900214000142019398Collison, Darren2019542.7001810:22AOCT 29, 2014 – GSW @ SAC30310.623.3missed0.7W62212NoneNone10:183500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 3PT Jump Shot11:42 PM618
137800214000142019399Thompson, Jason2015747.100188:05AOCT 29, 2014 – GSW @ SAC30316.627.4missed2.0W48512NoneNone8:023500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 25′ 3PT Jump Shot11:47 PM482
2431002140001420193910Collison, Darren2019544.261184:29AOCT 29, 2014 – GSW @ SAC33319.025.4made5.2W26911NoneNone4:263500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNone60 – 54-6Curry 25′ 3PT Jump Shot (15 PTS)11:59 PM266
3454002140001420193911Landry, Carl2011710.860182:36AOCT 29, 2014 – GSW @ SAC30217.02.9missed6.9W15652NoneNone2:333500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 3′ Layup12:04 AM153
4492002140001420193912Stauskas, Nik2039172.540180:45AOCT 29, 2014 – GSW @ SAC30320.324.9missed3.4W4512NoneNone0:423500201939Stephen CurryGSWGolden State1610612744Warriors0NoneNoneNoneNaNNone0NoneNoneNoneNaNNoneNoneNoneMISS Curry 25′ 3PT Jump Shot12:12 AM42

5 rows × 54 columns

And there we have it, a table that combines the shot log data with play by play data. This is a simple method for merging them and it gets the play by play event number for over 98% of shots. The above method can be tweaked to improve this to 99.4% of shots. If you are curious about those tweaks, I have an NBA scraping library on github that contains those tweaks as well as how to do this same thing with the rebound logs.