Nylon Calculus 101: Visualizing the NBA Draft with Python

(Ed – This is the second in a series of tutorials for using the Python programming language to get, clean and analyze NBA statistical data. This post introduces using Python for data visualization. Presentation of analytical insights is key to adoption of findings by end users, so the ability to visually demonstrate and explain the meaning and import of an analytical discovery is vital. An earlier version of this post appeared at Savvas’ personal site.)

Using the data we scraped in the previous post, we will be creating a variety of data visualizations using the matplotlib and seaborn Python libraries.

Let's get started by importing all the necessary libraries.

In [1]:

import pandas as pd
import numpy as np

# we need this 'magic' function to plot within the ipython notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

Read in the CSV file

pandas allows us to easily read in CSV files using read_csv. The index_col parameter allows us to set the column that will act as the index for our rows. In our CSV file that is the first column.

In [2]:

draft_df = pd.read_csv("draft_data_1966_to_2014.csv", index_col=0)
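
To see what index_col changes, here is a minimal sketch that reads a tiny made-up CSV from an in-memory string instead of the real file. The three-column layout and the unnamed first column are assumptions for illustration, not the actual structure of draft_data_1966_to_2014.csv.

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV (illustrative only).
# The first, unnamed column is assumed to hold the row labels.
csv_text = """,Draft_Yr,Pk,Player
0,1966,1,Cazzie Russell
1,1966,2,Dave Bing
"""

# index_col=0 tells pandas to use the first column as the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, that column is read as ordinary data named 'Unnamed: 0'
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without index_col the old row labels show up as an extra data column, which is usually not what we want.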

Let's take a look at the data.

In [3]:

draft_df.head()

Out[3]:

	Draft_Yr	Pk	Tm	Player	College	Yrs	G	MP	PTS	TRB	…	FT_Perc	MP_per_G	PTS_per_G	TRB_per_G	AST_per_G	WS	WS_per_48	BPM	VORP
0	1966	1	NYK	Cazzie Russell	University of Michigan	12	817	22213	12377	3068	…	0.827	27.2	15.1	3.8	2.2	51.7	0.112	-2.0	0.1
1	1966	2	DET	Dave Bing	Syracuse University	12	901	32769	18327	3420	…	0.775	36.4	20.3	3.8	6.0	68.8	0.101	0.6	8.5
2	1966	3	SFW	Clyde Lee	Vanderbilt University	10	742	19885	5733	7626	…	0.614	26.8	7.7	10.3	1.1	33.5	0.081	-2.4	-0.6
3	1966	4	STL	Lou Hudson	University of Minnesota	13	890	29794	17940	3926	…	0.797	33.5	20.2	4.4	2.7	81.0	0.131	0.1	5.9
4	1966	5	BAL	Jack Marin	Duke University	11	849	24590	12541	4405	…	0.843	29.0	14.8	5.2	2.1	59.3	0.116	-2.8	-1.4

5 rows × 22 columns

In [4]:

draft_df.tail()

Out[4]:

	Draft_Yr	Pk	Tm	Player	College	Yrs	G	MP	PTS	TRB	…	3P_Perc	FT_Perc	MP_per_G	PTS_per_G	TRB_per_G	AST_per_G	WS	WS_per_48	BPM	VORP
6445	2014	56	DEN	Roy Devyn Marble	University of Iowa	1	16	208	37	31	…	0.182	0.313	13.0	2.3	1.9	1.1	-0.1	-0.031	-4.5	-0.1
6446	2014	57	IND	Louis Labeyrie	NaN	0	0	0	0	0	…	0.000	0.000	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0
6447	2014	58	SAS	Jordan McRae	University of Tennessee	0	0	0	0	0	…	0.000	0.000	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0
6448	2014	59	TOR	Xavier Thames	San Diego State University	0	0	0	0	0	…	0.000	0.000	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0
6449	2014	60	SAS	Cory Jefferson	Baylor University	1	50	531	183	145	…	0.133	0.574	10.6	3.7	2.9	0.3	0.8	0.071	-3.7	-0.2

5 rows × 22 columns

In [5]:

draft_df.info()

Int64Index: 5868 entries, 0 to 6449
Data columns (total 22 columns):
Draft_Yr     5868 non-null int64
Pk           5868 non-null int64
Tm           5868 non-null object
Player       5868 non-null object
College      5572 non-null object
Yrs          5868 non-null int64
G            5868 non-null int64
MP           5868 non-null int64
PTS          5868 non-null int64
TRB          5868 non-null int64
AST          5868 non-null int64
FG_Perc      5868 non-null float64
3P_Perc      5868 non-null float64
FT_Perc      5868 non-null float64
MP_per_G     5868 non-null float64
PTS_per_G    5868 non-null float64
TRB_per_G    5868 non-null float64
AST_per_G    5868 non-null float64
WS           5868 non-null float64
WS_per_48    5868 non-null float64
BPM          5868 non-null float64
VORP         5868 non-null float64
dtypes: float64(11), int64(8), object(3)
memory usage: 1.0+ MB

We can see a few summary statistics for each column using describe.

In [6]:

draft_df.describe()

Out[6]:

	Draft_Yr	Pk	Yrs	G	MP	PTS	TRB	AST	FG_Perc	3P_Perc	FT_Perc	MP_per_G	PTS_per_G	TRB_per_G	AST_per_G	WS	WS_per_48	BPM	VORP
count	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000	5868.000000
mean	1983.153033	81.589980	2.609407	148.226483	3587.682004	1535.693933	649.459952	345.194274	0.188203	0.075082	0.298985	7.567144	3.026943	1.348807	0.678459	7.603050	0.023319	-1.114025	1.807941
std	12.760479	60.990659	4.247474	274.421227	7742.299563	3692.162233	1578.298098	950.648127	0.222286	0.138717	0.358721	10.476943	4.723596	2.107614	1.254241	20.664087	0.071398	3.040026	7.588628
min	1966.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-3.500000	-1.264000	-53.600000	-8.300000
25%	1973.000000	30.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-1.400000	0.000000
50%	1981.000000	63.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	1989.000000	130.000000	4.000000	151.000000	2244.000000	767.500000	349.500000	147.250000	0.432000	0.095500	0.703000	14.525000	4.900000	2.200000	0.900000	2.100000	0.055000	0.000000	0.000000
max	2014.000000	239.000000	21.000000	1611.000000	57446.000000	38387.000000	17440.000000	15806.000000	1.000000	1.000000	1.000000	41.100000	30.100000	14.000000	11.200000	273.400000	1.442000	19.600000	104.400000

Let’s get the average Win Shares per 48 minutes for the 1966 draft. To do that we apply the Boolean operation draft_df['Draft_Yr'] == 1966 to draft_df, which returns a DataFrame containing only the rows for the 1966 draft. We then select its WS_per_48 column and call the mean() method.

In [7]:

draft_df[draft_df['Draft_Yr'] == 1966]['WS_per_48'].mean()

Out[7]:

0.012830357142857145

There are a lot of different ways to index and slice data using pandas. I suggest reading the documentation for more information.
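
As one example of those alternatives, the sketch below compares chained indexing with a single .loc selection. The toy DataFrame is made up and only borrows the column names from draft_df.

```python
import pandas as pd

# Toy data (made up) with the same column names as draft_df
toy_df = pd.DataFrame({
    'Draft_Yr': [1966, 1966, 1967, 1967],
    'WS_per_48': [0.10, 0.20, 0.05, 0.15],
})

# The comparison yields a Boolean Series, one True/False per row
is_1966 = toy_df['Draft_Yr'] == 1966

# Chained indexing: filter the rows first, then pick the column
chained_mean = toy_df[is_1966]['WS_per_48'].mean()

# Equivalent single-step selection with .loc[rows, column]
loc_mean = toy_df.loc[is_1966, 'WS_per_48'].mean()

print(chained_mean, loc_mean)
```

Both give the same result here. The .loc form selects rows and column in one step, which tends to matter more when assigning values than when reading them.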

Now that we can get the WS_per_48 mean for one year, let's get it for every year. We can do this using a list comprehension.

In [8]:

# draft_df.Draft_Yr.unique() contains all the years in our DataFrame
WS48_yrly_avg = [draft_df[draft_df['Draft_Yr']==yr]['WS_per_48'].mean()
                 for yr in draft_df.Draft_Yr.unique() ]

In [9]:

type(WS48_yrly_avg)

Out[9]:

list

Another way we can get the above information is by using groupby. It allows us to group our data by draft year and then find the mean WS/48 for each year.

In [10]:

WS48_yrly_avg = draft_df.groupby('Draft_Yr').WS_per_48.mean()
WS48_yrly_avg  # this is a pandas Series not a list

Out[10]:

Draft_Yr
1966    0.012830
1967    0.007049
1968    0.005869
1969    0.015862
1970    0.009289
1971    0.009215
1972    0.011747
1973    0.012057
1974    0.018758
1975    0.017494
1976    0.015890
1977    0.015006
1978    0.019411
1979    0.011842
1980    0.010051
1981    0.017910
1982    0.014582
1983    0.011938
1984    0.013162
1985    0.025883
1986    0.018735
1987    0.014509
1988    0.021013
1989    0.052796
1990    0.056167
1991    0.070204
1992    0.055889
1993    0.042259
1994    0.054519
1995    0.039052
1996    0.058138
1997    0.054579
1998    0.075724
1999    0.054552
2000    0.043397
2001    0.045807
2002    0.049684
2003    0.041466
2004    0.043339
2005    0.053617
2006    0.053817
2007    0.051817
2008    0.066217
2009    0.066367
2010    0.044583
2011    0.060383
2012    0.039100
2013    0.035567
2014    0.009617
Name: WS_per_48, dtype: float64

In [11]:

type(WS48_yrly_avg)

Out[11]:

pandas.core.series.Series
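
One difference between the two approaches is worth a small sketch: groupby sorts its keys by default, while unique() returns values in order of first appearance. The rows of draft_df appear to be ordered by year already, so the two agree there. The toy data below is made up and deliberately unsorted to show the difference.

```python
import pandas as pd

# Toy data (made up), deliberately NOT sorted by year
toy_df = pd.DataFrame({
    'Draft_Yr': [1967, 1966, 1967, 1966],
    'WS_per_48': [0.05, 0.10, 0.15, 0.20],
})

yearly_avg = toy_df.groupby('Draft_Yr').WS_per_48.mean()

# groupby sorts its keys by default; unique() keeps order of first appearance
groupby_years = yearly_avg.index.tolist()
unique_years = toy_df.Draft_Yr.unique().tolist()

# Taking x from the Series index keeps x and y aligned regardless of row order
x_values = yearly_avg.index
y_values = yearly_avg.values

print(groupby_years, unique_years)
```

When plotting a grouped Series, using its own index for the x values is a safe habit if the underlying rows might not be sorted.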

Visualizing the Draft

We can now take WS48_yrly_avg and plot it using matplotlib and seaborn.

When creating plots, less is more. So no unnecessary 3D effects, labels, colors, or borders.

In [12]:

# Plot WS/48 by year

# use seaborn to set our graphing style
# the style 'white' creates a white background for
# our graph
sns.set_style("white")  

# Set the size to have a width of 12 inches
# and height of 9 inches
plt.figure(figsize=(12,9))

# get the x and y values
x_values = draft_df.Draft_Yr.unique()  
y_values = WS48_yrly_avg

# add a title
title = ('Average Career Win Shares Per 48 minutes\nby Draft Year (1966-2014)')
plt.title(title, fontsize=20)

# Label the y-axis
# We don't need to label the year values
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)

# Limit the range of the axis labels to only
# show where the data is. This helps to avoid
# unnecessary whitespace.
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)

# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)

# Change the size of tick labels for both axes
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)
  
# get rid of borders for our graph using seaborn's
# despine function
sns.despine(left=True, bottom=True) 

# plot the line for our graph
plt.plot(x_values, y_values)

# Provide a reference to data source and credit yourself
# by adding text to the bottom of the graph.
# The first 2 arguments are the x and y axis coordinates of where
# we want to place the text.
# The coordinates given below should place the text below
# the xlabel and aligned left against the y-axis
plt.text(1966, -0.012,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
          fontsize=12)

# Display our graph
plt.show()

The huge jump in WS/48 coincides with the change to a two-round draft format in 1989. The jump makes sense, as better players made up a higher percentage of the total players drafted once the later rounds were eliminated.

Let's take a look at how the number of players drafted has changed over time. First we need to calculate the number of players drafted by year, then replace the y_values variable from the above code with those values.

In [13]:

players_drafted = draft_df.groupby('Draft_Yr').Pk.count()

In [14]:

sns.set_style("white")  
plt.figure(figsize=(12,9))

# set the x and y values
x_values = draft_df.Draft_Yr.unique()  
y_values = players_drafted

# set our title
title = ('The Number of Players Drafted in each Draft (1966-2014)')
plt.title(title, fontsize=20)

# set y label
plt.ylabel('Number of Players Drafted', fontsize=18)

# set the value limits for x and y axis
plt.xlim(1966, 2014.5)
plt.ylim(0, 250)

# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)


plt.tick_params(axis='both', labelsize=14) 
sns.despine(left=True, bottom=True) 
plt.plot(x_values, y_values)
plt.text(1966, -35,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
          fontsize=12)
plt.show()

Let's plot both of those lines on one plot with two y-axes. To do this we can use the matplotlib Figure object and the array of (or single) Axes objects that plt.subplots() returns. We can access some of the plot elements, like our x-axis and y-axis, through the Axes objects. To create the two different plots we will create two different Axes objects and call the plot method from each of them.

In [15]:

sns.set_style("white")  

# change the mapping of default matplotlib color shorthands (like 'b' 
# or 'r') to default seaborn palette 
sns.set_color_codes()

# the x values for the plot
x_values = draft_df.Draft_Yr.unique() 

# plt.subplots returns a tuple containing a Figure and an Axes
# fig is a Figure object and ax1 is an Axes object
# we can also set the size of our plot
fig, ax1 = plt.subplots(figsize=(12,9))  

title = ('The Number of Players Drafted and Average Career WS/48'
         '\nfor each Draft (1966-2014)')
plt.title(title, fontsize=20)

# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)

# Change the size of tick labels for x-axis and left y-axis
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)

# Plot our first line representing number of players drafted
# We assign it to plot1 to reference later for our legend
# We also give it a label, in order to use in the legend
plot1 = ax1.plot(x_values, players_drafted, 'b', label='No. of Players Drafted')
# Create the ylabel for our number-of-players-drafted line
ax1.set_ylabel('Number of Players Drafted', fontsize=18)
# Set limits for 1st y-axis
ax1.set_ylim(0, 240)
# Have tick color match corresponding line color
for tl in ax1.get_yticklabels():
    tl.set_color('b')

# Now we create our 2nd Axes object that will share the same x-axis
# To do this we call the twinx() method from our first Axes object
ax2 = ax1.twinx()
# Create our second line for avg WS/48
plot2 = ax2.plot(x_values, WS48_yrly_avg, 'r', label='Avg WS/48')
# Create our label for the 2nd y-axis
ax2.set_ylabel('Win Shares Per 48 minutes', fontsize=18)
# Set the limit for 2nd y-axis
ax2.set_ylim(0, 0.08)
# Set tick size for second y-axis
ax2.tick_params(axis='y', labelsize=14)
# Have tick color match corresponding line color
for tl in ax2.get_yticklabels():
    tl.set_color('r')


# Limit our x-axis values to minimize white space
ax2.set_xlim(1966, 2014.15)

# create our legend 
# First add our lines together
lines = plot1 + plot2
# Then create legend by calling legend and getting the label for each line
ax1.legend(lines, [l.get_label() for l in lines])

# Create evenly aligned tick marks for both y-axes.
# np.linspace allows us to get evenly spaced numbers over
# the specified interval given by first 2 arguments.
# Those 2 arguments are the outer bounds of the y-axis values
# the third argument is the number of values we want to create.
# ax1 - create 9 tick values from 0 to 240
ax1.set_yticks(np.linspace(ax1.get_ybound()[0], ax1.get_ybound()[1], 9))
# ax2 - create 9 tick values from 0.00 to 0.08
ax2.set_yticks(np.linspace(ax2.get_ybound()[0], ax2.get_ybound()[1], 9))

# need to get rid of spines for each Axes object
for ax in [ax1, ax2]:
    ax.spines["top"].set_visible(False)  
    ax.spines["bottom"].set_visible(False)  
    ax.spines["right"].set_visible(False)  
    ax.spines["left"].set_visible(False)  
    
# Create text by calling the text() method from our figure object    
fig.text(0.1, 0.02,
         'Data source: http://www.basketball-reference.com/draft/'
        '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
          fontsize=10)

plt.show()
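
The pieces used above can be seen in isolation in the following minimal sketch. It is not part of the original notebook: the "Agg" backend line is only there so the snippet runs outside a notebook without a display, and the 1x2 grid is an assumption added to show when plt.subplots() returns an array of Axes instead of a single one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# A single Axes comes back when no grid is requested
fig, ax1 = plt.subplots(figsize=(6, 4))

# With a 1x2 grid, the second return value is a NumPy array of Axes
fig_grid, axes = plt.subplots(1, 2)

# twinx() creates a second Axes that shares ax1's x-axis but has its own y-axis
ax2 = ax1.twinx()
ax1.set_ylim(0, 240)
ax2.set_ylim(0, 0.08)

# Nine evenly spaced ticks on each axis so the tick marks line up
left_ticks = np.linspace(0, 240, 9)
right_ticks = np.linspace(0, 0.08, 9)
ax1.set_yticks(left_ticks)
ax2.set_yticks(right_ticks)

print(left_ticks)
print(right_ticks)

plt.close('all')
```

Because both axes get the same number of ticks over their full range, each tick on the left lines up with one on the right, and a single set of grid lines serves both.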

Let's create a DataFrame of just the top 60 picks, and then grab the data we need to plot.

Note that drafts from 1989 to 2004 have fewer than 60 draft picks.

In [16]:

# Get the top 60 picks for each year
top60 = draft_df[(draft_df['Pk'] < 61)]
# Get the average WS/48 for each year
top60_yrly_WS48 = top60.groupby('Draft_Yr').WS_per_48.mean()

In [17]:

# Create a line graph for avg WS/48 for top 60 picks
sns.set_style("white")  

plt.figure(figsize=(12,9))
x_values = draft_df.Draft_Yr.unique() 
y_values = top60_yrly_WS48
title = ('Average Career Win Shares Per 48 minutes for'
         '\nTop 60 Picks by Draft Year (1966-2014)')
plt.title(title, fontsize=20)
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
plt.tick_params(axis='both', labelsize=14)
sns.despine(left=True, bottom=True) 
plt.plot(x_values, y_values)
plt.text(1966, -0.012,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
          fontsize=12)
plt.show()

Bar Plots

Let's create some bar plots for the average WS/48 of each pick in the top 60.

In [18]:

# Get the mean WS/48 for each pick
top60_mean_WS48 = top60.groupby('Pk').WS_per_48.mean()

In [19]:

sns.set_style("white")  

# Set the x and y values
x_values = top60.Pk.unique()
y_values = top60_mean_WS48

# Get our Figure and Axes objects
fig, ax = plt.subplots(figsize=(15,10))  
# Create a title
title = ('Average Win Shares per 48 Minutes for each' 
         '\nNBA Draft Pick in the Top 60 (1966-2014)')
# Set the title font size to 18
ax.set_title(title, fontsize=18)

# Set x and y axis labels
ax.set_xlabel('Draft Pick', fontsize=16)
ax.set_ylabel('Win Shares Per 48 minutes', fontsize=16)

# Set the tick label font size to 12
ax.tick_params(axis='both', labelsize=12)

# Set the x-axis limits
ax.set_xlim(0,61)

# Set the tick labels for picks 1 to 60
ax.set_xticks(np.arange(1,61)) 

# Create white y-axis grid lines
ax.yaxis.grid(color='white')

# overlay the white grid line on top of the bars
ax.set_axisbelow(False)

# Now add the bars to our plot
# this is equivalent to plt.bar(x_values, y_values)
ax.bar(x_values, y_values)

# Get rid of chart borders
sns.despine(left=True, bottom=True)

plt.text(0, -.05, 
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
          fontsize=12)
plt.show()

We can also create a horizontal bar plot, which will give us better spacing for our tick labels.

In [20]:

sns.set_style("white")  

# Note we flipped the value variable names
y_values = top60.Pk.unique()
x_values = top60_mean_WS48

fig, ax = plt.subplots(figsize=(10,15))  
title = ('Average Win Shares per 48 Minutes for each' 
         '\nNBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)
# We can rotate an axis label via the rotation argument.
# Here we set rotation to 0 so the ylabel is read horizontally
ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0)
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)

# Set a limit for our y-axis so that pick 1 is at the top
ax.set_ylim(61,0)
# Show all values for draft picks
ax.set_yticks(np.arange(1,61))
# pad the y-axis label so it doesn't overlap tick labels
ax.yaxis.labelpad = 25

# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')

# create white x-axis grid lines
ax.xaxis.grid(color='white')

# overlay the white grid line on top of the bars
ax.set_axisbelow(False)

# Now add the horizontal bars to our plot, 
# and align them centered with ticks
ax.barh(y_values, x_values, align='center')

# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)  
ax.spines["bottom"].set_visible(False)  
ax.spines["right"].set_visible(False)  
ax.spines["left"].set_visible(False)

plt.text(-0.02, 65, 
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
          fontsize=12)

plt.show()

Dot Plots/Point Plots

Instead of using a bar plot we can use a dot plot or point plot to represent the above information. seaborn allows us to create point plots using pointplot.

In [21]:

sns.set_style("white")  

plt.figure(figsize=(10,15))

# Create Axes object with pointplot drawn onto it.
# By default pointplot draws the mean of each group along with
# a 95% confidence interval.
# The join parameter, when set to True, draws a line connecting the points.
ax = sns.pointplot(x='WS_per_48', y='Pk', join=False, data=top60, 
                   orient='h')

title = ('Average Win Shares per 48 Minutes (with 95% CI)' 
         '\nfor each NBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)

ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0)
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
# pad the y-axis label to not overlap tick labels
ax.yaxis.labelpad = 25
# limit x-axis
ax.set_xlim(-0.1, 0.15)
# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')

# add horizontal lines for each draft pick
for y in range(len(y_values)):
    ax.hlines(y, -0.1, 0.15, color='grey', linestyle='-', lw=0.5)
    
# Add a vertical line at 0.00 WS/48
ax.vlines(0.00, -1, 60, color='grey', linestyle='-', lw=0.5)

# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)  
ax.spines["bottom"].set_visible(False)  
ax.spines["right"].set_visible(False)  
ax.spines["left"].set_visible(False)

plt.text(-0.1, 63, 
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
          fontsize=12)

plt.show()
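
seaborn estimates these confidence intervals by bootstrapping. To give a rough idea of what that means, here is a sketch of a percentile bootstrap for the mean using only the standard library. It is not seaborn's implementation. The sample values are made up, and the seed and the 1000 resamples are choices made for this illustration.

```python
import random
import statistics

# Made-up WS/48 values standing in for all players taken at one pick
ws48_values = [0.02, 0.05, 0.08, 0.10, 0.11, 0.13, 0.15, 0.18]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_boot = 1000

# Resample with replacement and record the mean of each resample
boot_means = sorted(
    statistics.mean(rng.choices(ws48_values, k=len(ws48_values)))
    for _ in range(n_boot)
)

# The middle 95% of the bootstrap means gives a percentile interval
ci_low = boot_means[int(0.025 * n_boot)]
ci_high = boot_means[int(0.975 * n_boot) - 1]
sample_mean = statistics.mean(ws48_values)

print(ci_low, sample_mean, ci_high)
```

Picks with fewer players or more spread-out values produce wider intervals, which is why some of the error bars in the plot are much longer than others.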

Boxplots

To create a boxplot using seaborn all we have to do is use boxplot. So, let's create a boxplot of the WS/48 for the top 30 draft picks.

In [22]:

top30 = top60[top60['Pk'] < 31]

In [23]:

sns.set_style("whitegrid")

plt.figure(figsize=(15,12))

# create our Axes that contains our boxplot
bplot = sns.boxplot(x='Pk', y='WS_per_48', data=top30, whis=[5,95], color='salmon')

title = ('Distribution of Win Shares per 48 Minutes for each' 
         '\nNBA Draft Pick in the Top 30 (1966-2014)')

# set title, axis labels, and change tick label size
bplot.set_title(title, fontsize=20)
bplot.set_xlabel('Draft Pick', fontsize=16)
bplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
bplot.tick_params(axis='both', labelsize=12)

# get rid of chart borders
sns.despine(left=True) 

plt.text(-1, -.5, 
         'Data source: http://www.basketball-reference.com/draft/'
        '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Whiskers represent the 5th and 95th percentiles',
          fontsize=12)
plt.show()

Each box contains the inter-quartile range, which means the bottom of the box represents the 25th percentile and the top represents the 75th percentile. The median is represented by the line within the box.

By default in seaborn and matplotlib, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the nearest quartile. So the top whisker reaches the highest value that is no more than 1.5 * IQR above the 75th percentile. The dots that fall outside the whiskers are considered outliers.

However, in our boxplot above, we set the whiskers to represent the 5th and 95th percentiles by setting the whis parameter to [5, 95]. The dots now represent outliers that fall within the top or bottom 5% of the distribution.
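
The two whisker rules can be compared on a small made-up sample. The sketch below uses only the standard library and linear interpolation between data points for the percentiles; it illustrates the rules and is not the code matplotlib or seaborn run internally.

```python
import statistics

# Made-up sample with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles using linear interpolation between data points
q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# 1.5 * IQR rule: whiskers stop at the most extreme points that lie
# within 1.5 * IQR of the box edges
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers_iqr = [x for x in data if x < low_fence or x > high_fence]

# Percentile rule, as with whis=[5, 95]: whiskers sit at the
# 5th and 95th percentiles
cuts = statistics.quantiles(data, n=20, method='inclusive')
p5, p95 = cuts[0], cuts[-1]
outliers_pct = [x for x in data if x < p5 or x > p95]

print(whisker_low, whisker_high, outliers_iqr)
print(p5, p95, outliers_pct)
```

Under the 1.5 * IQR rule only the value 30 is drawn as a dot. Under the percentile rule both the smallest and largest values fall outside the whiskers, since by construction roughly 5% of the data lies beyond each end.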

Let's get the top 5% for the 3rd overall draft pick. To do this we get all 3rd overall picks, then get their WS_per_48 and call the quantile() method. Passing 0.95 into quantile() returns the WS_per_48 value of the 95th percentile for all 3rd picks.

In [24]:

pick3_95 = top30[top30['Pk']==3]['WS_per_48'].quantile(0.95)
pick3_95

Out[24]:

0.17839999999999998

Now let's get the players that have a WS_per_48 greater than about 0.1784:

In [25]:

# Here we are accessing columns as attributes and then using
# Boolean operations
# Let's create a mask that contains our Boolean operations then index
# the data using the mask
mask = (top30.Pk == 3) & (top30.WS_per_48 > pick3_95)
pick3_top5_percent = top30[mask]

pick3_top5_percent[['Player', 'WS_per_48']]

Out[25]:

	Player	WS_per_48
3092	Kevin McHale	0.180
4052	Michael Jordan	0.250
6080	James Harden	0.207

We can rewrite the above code using the query method. To reference a local variable within our query string we must place ‘@’ in front of its name. pandas also allows us to use English instead of symbols in our query string.

In [26]:

pick3_top5_percent = top30.query('Pk == 3 and WS_per_48 > @pick3_95')

pick3_top5_percent[['Player', 'WS_per_48']]

Out[26]:

	Player	WS_per_48
3092	Kevin McHale	0.180
4052	Michael Jordan	0.250
6080	James Harden	0.207
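
The equivalence of the two styles can be checked on a toy DataFrame. Everything in the sketch below is made up for illustration: the player labels, the values, and the threshold variable, which stands in for pick3_95.

```python
import pandas as pd

# Toy data (made up) with the same column names as top30
toy_df = pd.DataFrame({
    'Pk': [3, 3, 3, 4],
    'Player': ['A', 'B', 'C', 'D'],
    'WS_per_48': [0.05, 0.18, 0.25, 0.30],
})
threshold = 0.10  # hypothetical cutoff, standing in for pick3_95

# Boolean mask: & combines conditions, and each needs its own parentheses
masked = toy_df[(toy_df.Pk == 3) & (toy_df.WS_per_48 > threshold)]

# query: column names are bare, 'and' replaces &, and @ marks a Python variable
queried = toy_df.query('Pk == 3 and WS_per_48 > @threshold')

print(queried[['Player', 'WS_per_48']])
```

Both selections return the same rows. The query string tends to be easier to read once several conditions are combined.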

Violin Plots

Creating violin plots using seaborn is pretty much the same as creating a boxplot, but we use the violinplot function instead of boxplot. We’ll create a violin plot for the top 10 draft picks.

In [27]:

top10 = top60[top60['Pk'] < 11]

In [28]:

sns.set(style="whitegrid")

plt.figure(figsize=(15,10))

# create an Axes object that contains our violin plot
vplot = sns.violinplot(x='Pk', y='WS_per_48', data=top10)

title = ('Distribution of Win Shares per 48 Minutes for each' 
         '\nNBA Draft Pick in the Top 10 (1966-2014)')

# set title, axis labels, and change tick label size
vplot.set_title(title, fontsize=20)
vplot.set_xlabel('Draft Pick', fontsize=16)
vplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
vplot.tick_params(axis='both', labelsize=12)

plt.text(-1, -.55, 
         'Data source: http://www.basketball-reference.com/draft/'
        '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
          fontsize=12)

sns.despine(left=True) 
           
plt.show()

Each violin in the above plot actually contains a box plot, with the white dot in the middle representing the median.

A violin plot is a combination of a boxplot and kernel density estimate. Instead of just having whiskers or dots to provide us information about the distribution, the violin plot also provides an estimated shape of the distribution.
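
To show what a kernel density estimate is, here is a hand-rolled Gaussian KDE using only the standard library. The helper name make_kde, the sample values, and the fixed bandwidth of 0.03 are all assumptions made for this sketch. seaborn chooses its bandwidth automatically and does not use this code.

```python
import math

def make_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum of Gaussian bumps, one centered on each sample
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density

# Made-up WS/48 values; the bandwidth is an arbitrary choice for illustration
samples = [0.05, 0.08, 0.10, 0.12, 0.20]
density = make_kde(samples, bandwidth=0.03)

# Evaluate on a grid; mirroring these heights left and right of a
# center line gives the outline of a violin
step = 0.001
grid = [-0.20 + step * i for i in range(651)]
heights = [density(x) for x in grid]

# A density should integrate to about 1 (trapezoidal rule)
area = sum((heights[i] + heights[i + 1]) / 2 * step
           for i in range(len(grid) - 1))

print(round(area, 4))
```

A smaller bandwidth produces a bumpier outline that follows individual points, while a larger one smooths the shape out. Because the estimate is smooth, a violin can extend past the most extreme observed values.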

Software Versions

In [29]:

import sys
print('Python version:', sys.version_info)
import IPython
print('IPython version:', IPython.__version__)
import matplotlib as mpl
print('Matplotlib version:', mpl.__version__)
print('Seaborn version:', sns.__version__)
print('Pandas version:', pd.__version__)

Python version: sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
IPython version: 3.2.0
Matplotlib version: 1.4.3
Seaborn version: 0.6.0
Pandas version: 0.16.2