Nylon Calculus 101: Visualizing the NBA Draft with Python
(Ed – This is the second in a series of tutorials for using the Python programming language to get, clean and analyze NBA statistical data. This post introduces using Python for data visualization. Presentation of analytical insights is key to adoption of findings by end users, so the ability to visually demonstrate and explain the meaning and import of an analytical discovery is vital. An earlier version of this post appeared at Savvas’ personal site.)
Using the data we scraped in the previous post, we will create a variety of data visualizations using the matplotlib and seaborn Python libraries.
Let’s get started by importing all the necessary libraries.
In [1]:
import pandas as pd
import numpy as np
# we need this 'magic' function to plot within the ipython notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
Read in the CSV file
pandas allows us to easily read in CSV files using read_csv. The index_col parameter allows us to set the column that will act as the index for our rows. In our CSV file that is the first column.
In [2]:
draft_df = pd.read_csv("draft_data_1966_to_2014.csv", index_col=0)
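To see what index_col changes, here is a minimal sketch that reads a tiny in-memory CSV twice. The CSV text is a made-up stand-in for draft_data_1966_to_2014.csv and is only assumed to share its layout (an unnamed first column holding the row labels).

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the unnamed first column holds the row labels
csv_text = (
    ",Draft_Yr,Pk,Player\n"
    "0,1966,1,Cazzie Russell\n"
    "1,1966,2,Dave Bing\n"
    "2,1966,3,Clyde Lee\n"
)

# With index_col=0 the first column becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps that column as data and builds a fresh index
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without index_col the first column shows up as an extra data column, which is usually not what we want for a file that was saved with its index.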
Let’s take a look at the data.
In [3]:
draft_df.head()
Out[3]:
| | Draft_Yr | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P_Perc | FT_Perc | MP_per_G | PTS_per_G | TRB_per_G | AST_per_G | WS | WS_per_48 | BPM | VORP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1966 | 1 | NYK | Cazzie Russell | University of Michigan | 12 | 817 | 22213 | 12377 | 3068 | … | 0 | 0.827 | 27.2 | 15.1 | 3.8 | 2.2 | 51.7 | 0.112 | -2.0 | 0.1 |
1 | 1966 | 2 | DET | Dave Bing | Syracuse University | 12 | 901 | 32769 | 18327 | 3420 | … | 0 | 0.775 | 36.4 | 20.3 | 3.8 | 6.0 | 68.8 | 0.101 | 0.6 | 8.5 |
2 | 1966 | 3 | SFW | Clyde Lee | Vanderbilt University | 10 | 742 | 19885 | 5733 | 7626 | … | 0 | 0.614 | 26.8 | 7.7 | 10.3 | 1.1 | 33.5 | 0.081 | -2.4 | -0.6 |
3 | 1966 | 4 | STL | Lou Hudson | University of Minnesota | 13 | 890 | 29794 | 17940 | 3926 | … | 0 | 0.797 | 33.5 | 20.2 | 4.4 | 2.7 | 81.0 | 0.131 | 0.1 | 5.9 |
4 | 1966 | 5 | BAL | Jack Marin | Duke University | 11 | 849 | 24590 | 12541 | 4405 | … | 0 | 0.843 | 29.0 | 14.8 | 5.2 | 2.1 | 59.3 | 0.116 | -2.8 | -1.4 |
5 rows × 22 columns
In [4]:
draft_df.tail()
Out[4]:
| | Draft_Yr | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P_Perc | FT_Perc | MP_per_G | PTS_per_G | TRB_per_G | AST_per_G | WS | WS_per_48 | BPM | VORP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6445 | 2014 | 56 | DEN | Roy Devyn Marble | University of Iowa | 1 | 16 | 208 | 37 | 31 | … | 0.182 | 0.313 | 13.0 | 2.3 | 1.9 | 1.1 | -0.1 | -0.031 | -4.5 | -0.1 |
6446 | 2014 | 57 | IND | Louis Labeyrie | NaN | 0 | 0 | 0 | 0 | 0 | … | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 |
6447 | 2014 | 58 | SAS | Jordan McRae | University of Tennessee | 0 | 0 | 0 | 0 | 0 | … | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 |
6448 | 2014 | 59 | TOR | Xavier Thames | San Diego State University | 0 | 0 | 0 | 0 | 0 | … | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 |
6449 | 2014 | 60 | SAS | Cory Jefferson | Baylor University | 1 | 50 | 531 | 183 | 145 | … | 0.133 | 0.574 | 10.6 | 3.7 | 2.9 | 0.3 | 0.8 | 0.071 | -3.7 | -0.2 |
5 rows × 22 columns
In [5]:
draft_df.info()
Int64Index: 5868 entries, 0 to 6449
Data columns (total 22 columns):
Draft_Yr 5868 non-null int64
Pk 5868 non-null int64
Tm 5868 non-null object
Player 5868 non-null object
College 5572 non-null object
Yrs 5868 non-null int64
G 5868 non-null int64
MP 5868 non-null int64
PTS 5868 non-null int64
TRB 5868 non-null int64
AST 5868 non-null int64
FG_Perc 5868 non-null float64
3P_Perc 5868 non-null float64
FT_Perc 5868 non-null float64
MP_per_G 5868 non-null float64
PTS_per_G 5868 non-null float64
TRB_per_G 5868 non-null float64
AST_per_G 5868 non-null float64
WS 5868 non-null float64
WS_per_48 5868 non-null float64
BPM 5868 non-null float64
VORP 5868 non-null float64
dtypes: float64(11), int64(8), object(3)
memory usage: 1.0+ MB
We can see a few summary statistics for each column using describe.
In [6]:
draft_df.describe()
Out[6]:
| | Draft_Yr | Pk | Yrs | G | MP | PTS | TRB | AST | FG_Perc | 3P_Perc | FT_Perc | MP_per_G | PTS_per_G | TRB_per_G | AST_per_G | WS | WS_per_48 | BPM | VORP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 | 5868.000000 |
mean | 1983.153033 | 81.589980 | 2.609407 | 148.226483 | 3587.682004 | 1535.693933 | 649.459952 | 345.194274 | 0.188203 | 0.075082 | 0.298985 | 7.567144 | 3.026943 | 1.348807 | 0.678459 | 7.603050 | 0.023319 | -1.114025 | 1.807941 |
std | 12.760479 | 60.990659 | 4.247474 | 274.421227 | 7742.299563 | 3692.162233 | 1578.298098 | 950.648127 | 0.222286 | 0.138717 | 0.358721 | 10.476943 | 4.723596 | 2.107614 | 1.254241 | 20.664087 | 0.071398 | 3.040026 | 7.588628 |
min | 1966.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -3.500000 | -1.264000 | -53.600000 | -8.300000 |
25% | 1973.000000 | 30.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.400000 | 0.000000 |
50% | 1981.000000 | 63.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 1989.000000 | 130.000000 | 4.000000 | 151.000000 | 2244.000000 | 767.500000 | 349.500000 | 147.250000 | 0.432000 | 0.095500 | 0.703000 | 14.525000 | 4.900000 | 2.200000 | 0.900000 | 2.100000 | 0.055000 | 0.000000 | 0.000000 |
max | 2014.000000 | 239.000000 | 21.000000 | 1611.000000 | 57446.000000 | 38387.000000 | 17440.000000 | 15806.000000 | 1.000000 | 1.000000 | 1.000000 | 41.100000 | 30.100000 | 14.000000 | 11.200000 | 273.400000 | 1.442000 | 19.600000 | 104.400000 |
Let’s get the average Win Shares per 48 minutes for the 1966 draft. To do that we index draft_df with the Boolean expression draft_df['Draft_Yr'] == 1966, which returns a DataFrame containing only the rows for the 1966 draft. We then select its WS_per_48 column and call the mean() method.
In [7]:
draft_df[draft_df['Draft_Yr'] == 1966]['WS_per_48'].mean()
Out[7]:
0.012830357142857145
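The Boolean indexing step can be broken into its parts. The sketch below uses a small made-up DataFrame (not the real draft data) to show the intermediate mask, and also shows .loc as an alternative way to select the rows and the column in one step.

```python
import pandas as pd

# Hypothetical mini version of draft_df
df = pd.DataFrame({
    'Draft_Yr': [1966, 1966, 1967, 1967],
    'WS_per_48': [0.10, 0.20, 0.05, 0.15],
})

# The comparison returns a Boolean Series, one True/False per row
mask = df['Draft_Yr'] == 1966

# Indexing with the mask keeps only the rows where it is True
rows_1966 = df[mask]
mean_chained = rows_1966['WS_per_48'].mean()

# .loc selects rows and a column in a single step
mean_loc = df.loc[mask, 'WS_per_48'].mean()

print(mask.tolist())
print(mean_chained, mean_loc)
```

Both forms give the same mean; .loc simply avoids building the intermediate DataFrame.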
There are a lot of different ways to index and slice data using pandas. I suggest reading the documentation for more information.
Now that we can get the WS_per_48 mean for one year, let’s get it for every year. We can do this using a list comprehension.
In [8]:
# draft_df.Draft_Yr.unique() contains all the years in our DataFrame
WS48_yrly_avg = [draft_df[draft_df['Draft_Yr']==yr]['WS_per_48'].mean()
                 for yr in draft_df.Draft_Yr.unique()]
In [9]:
type(WS48_yrly_avg)
Out[9]:
list
Another way we can get the above information is by using groupby. It allows us to group our data by draft year and then find the mean WS/48 for each year.
In [10]:
WS48_yrly_avg = draft_df.groupby('Draft_Yr').WS_per_48.mean()
WS48_yrly_avg # this is a pandas Series not a list
Out[10]:
Draft_Yr
1966 0.012830
1967 0.007049
1968 0.005869
1969 0.015862
1970 0.009289
1971 0.009215
1972 0.011747
1973 0.012057
1974 0.018758
1975 0.017494
1976 0.015890
1977 0.015006
1978 0.019411
1979 0.011842
1980 0.010051
1981 0.017910
1982 0.014582
1983 0.011938
1984 0.013162
1985 0.025883
1986 0.018735
1987 0.014509
1988 0.021013
1989 0.052796
1990 0.056167
1991 0.070204
1992 0.055889
1993 0.042259
1994 0.054519
1995 0.039052
1996 0.058138
1997 0.054579
1998 0.075724
1999 0.054552
2000 0.043397
2001 0.045807
2002 0.049684
2003 0.041466
2004 0.043339
2005 0.053617
2006 0.053817
2007 0.051817
2008 0.066217
2009 0.066367
2010 0.044583
2011 0.060383
2012 0.039100
2013 0.035567
2014 0.009617
Name: WS_per_48, dtype: float64
In [11]:
type(WS48_yrly_avg)
Out[11]:
pandas.core.series.Series
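One practical difference between the two approaches is that the Series returned by groupby is labeled by draft year, while the list carries no labels at all. The sketch below uses made-up rows that are deliberately not sorted by year to show that unique() and groupby can order the years differently. The draft data in this post appears to be sorted by year, so the two orders agree there, but taking the years from the Series index is the safer habit.

```python
import pandas as pd

# Hypothetical rows, deliberately NOT sorted by draft year
df = pd.DataFrame({
    'Draft_Yr': [1967, 1966, 1967, 1966],
    'WS_per_48': [0.05, 0.10, 0.15, 0.20],
})

# unique() keeps the order in which the years first appear
years_seen = df['Draft_Yr'].unique().tolist()

# groupby sorts the group keys and stores them as the index of the result
yearly_avg = df.groupby('Draft_Yr')['WS_per_48'].mean()

print(years_seen)                  # order of appearance
print(yearly_avg.index.tolist())   # sorted years
print(yearly_avg.loc[1966])        # look up one year by its label
```

When plotting, yearly_avg.index and yearly_avg.values are guaranteed to line up with each other, whatever the row order of the original DataFrame.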
Visualizing the Draft
We can now take WS48_yrly_avg and plot it using matplotlib and seaborn.
When creating plots, less is more. So no unnecessary 3D effects, labels, colors, or borders.
In [12]:
# Plot WS/48 by year
# use seaborn to set our graphing style
# the style 'white' creates a white background for
# our graph
sns.set_style("white")
# Set the size to have a width of 12 inches
# and height of 9 inches
plt.figure(figsize=(12,9))
# get the x and y values
x_values = draft_df.Draft_Yr.unique()
y_values = WS48_yrly_avg
# add a title
title = ('Average Career Win Shares Per 48 minutes\nby Draft Year (1966-2014)')
plt.title(title, fontsize=20)
# Label the y-axis
# We don't need to label the year values
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)
# Limit the range of the axis labels to only
# show where the data is. This helps to avoid
# unnecessary whitespace.
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)
# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
# Change the size of tick labels for both axes
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)
# get rid of borders for our graph using seaborn's
# despine function
sns.despine(left=True, bottom=True)
# plot the line for our graph
plt.plot(x_values, y_values)
# Provide a reference to data source and credit yourself
# by adding text to the bottom of the graph.
# The first 2 arguments are the x and y axis coordinates of where
# we want to place the text.
# The coordinates given below should place the text below
# the xlabel and aligned left against the y-axis
plt.text(1966, -0.012,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
         fontsize=12)
# Display our graph
plt.show()
The huge jump in WS/48 coincides with the change to a two round draft format in 1989. So it makes sense to see the jump in average WS/48 as better players made up a higher percentage of the total players drafted.
Let’s take a look at how the number of players drafted has changed over time. First we need to calculate the number of players drafted by year, then replace the y_values variable from the above code with those values.
In [13]:
players_drafted = draft_df.groupby('Draft_Yr').Pk.count()
In [14]:
sns.set_style("white")
plt.figure(figsize=(12,9))
# set the x and y values
x_values = draft_df.Draft_Yr.unique()
y_values = players_drafted
# set our title
title = ('The Number of Players Drafted in each Draft (1966-2014)')
plt.title(title, fontsize=20)
# set y label
plt.ylabel('Number of Players Drafted', fontsize=18)
# set the value limits for x and y axis
plt.xlim(1966, 2014.5)
plt.ylim(0, 250)
# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
plt.tick_params(axis='both', labelsize=14)
sns.despine(left=True, bottom=True)
plt.plot(x_values, y_values)
plt.text(1966, -35,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
         fontsize=12)
plt.show()
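Since the two line plots differ only in their data, title, y-label, and y-limits, the shared styling could be wrapped in a function. The sketch below is one hypothetical way to do that using matplotlib alone: the name plot_yearly_line and its parameters are not from the original post, the demo Series is made-up data, and the four spines are hidden directly instead of calling seaborn's despine.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_yearly_line(series, title, ylabel, ylim):
    """Draw a minimally styled line plot of a year-indexed Series.

    Hypothetical helper, not part of the original post.
    """
    fig, ax = plt.subplots(figsize=(12, 9))
    # The Series index supplies the x values, so no separate x array is needed
    ax.plot(series.index, series.values)
    ax.set_title(title, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=18)
    ax.set_xlim(series.index.min(), series.index.max() + 0.5)
    ax.set_ylim(*ylim)
    ax.grid(axis='y', color='grey', linestyle='--', lw=0.5, alpha=0.5)
    ax.tick_params(axis='both', labelsize=14)
    # Hide all four borders of the plot
    for side in ('top', 'bottom', 'left', 'right'):
        ax.spines[side].set_visible(False)
    return ax


# Made-up counts standing in for players_drafted
demo = pd.Series([112, 162, 214], index=[1966, 1967, 1968])
ax = plot_yearly_line(demo, 'Players Drafted (demo data)',
                      'Number of Players Drafted', (0, 250))
```

In a notebook, calling plt.show() after the function displays the figure. Returning the Axes leaves room for per-plot additions such as the data source text.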
Let’s plot both of those lines on one plot with two y-axis labels. To do this we can use the matplotlib Figure object and the array of (or single) Axes objects that plt.subplots() returns. We can access some of the plot elements, like our x-axis and y-axis, through the Axes objects. To create the two different plots we will create two different Axes objects and call the plot method from each of them.
In [15]:
sns.set_style("white")
# change the mapping of default matplotlib color shorthands (like 'b'
# or 'r') to default seaborn palette
sns.set_color_codes()
# the x values for the plot
x_values = draft_df.Draft_Yr.unique()
# plt.subplots returns a tuple containing a Figure and an Axes
# fig is a Figure object and ax1 is an Axes object
# we can also set the size of our plot
fig, ax1 = plt.subplots(figsize=(12,9))
title = ('The Number of Players Drafted and Average Career WS/48'
         '\nfor each Draft (1966-2014)')
plt.title(title, fontsize=20)
# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
# Change the size of tick labels for x-axis and left y-axis
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)
# Plot our first line representing number of players drafted
# We assign it to plot1 to reference later for our legend
# We also give it a label, in order to use in the legend
plot1 = ax1.plot(x_values, players_drafted, 'b',
                 label='No. of Players Drafted')
# Create the ylabel for our number of players drafted line
ax1.set_ylabel('Number of Players Drafted', fontsize=18)
# Set limits for 1st y-axis
ax1.set_ylim(0, 240)
# Have tick color match corresponding line color
for tl in ax1.get_yticklabels():
    tl.set_color('b')
# Now we create our 2nd Axes object that will share the same x-axis
# To do this we call the twinx() method from our first Axes object
ax2 = ax1.twinx()
# Create our second line for avg WS/48
plot2 = ax2.plot(x_values, WS48_yrly_avg, 'r', label='Avg WS/48')
# Create our label for the 2nd y-axis
ax2.set_ylabel('Win Shares Per 48 minutes', fontsize=18)
# Set the limit for 2nd y-axis
ax2.set_ylim(0, 0.08)
# Set tick size for second y-axis
ax2.tick_params(axis='y', labelsize=14)
# Have tick color match corresponding line color
for tl in ax2.get_yticklabels():
    tl.set_color('r')
# Limit our x-axis values to minimize white space
ax2.set_xlim(1966, 2014.15)
# create our legend
# First add our lines together
lines = plot1 + plot2
# Then create legend by calling legend and getting the label for each line
ax1.legend(lines, [l.get_label() for l in lines])
# Create evenly aligned tick marks for both y-axes.
# np.linspace allows us to get evenly spaced numbers over
# the specified interval given by first 2 arguments.
# Those 2 arguments are the outer bounds of the y-axis values
# the third argument is the number of values we want to create.
# ax1 - create 9 tick values from 0 to 240
ax1.set_yticks(np.linspace(ax1.get_ybound()[0], ax1.get_ybound()[1], 9))
# ax2 - create 9 tick values from 0.00 to 0.08
ax2.set_yticks(np.linspace(ax2.get_ybound()[0], ax2.get_ybound()[1], 9))
# need to get rid of spines for each Axes object
for ax in [ax1, ax2]:
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
# Create text by calling the text() method from our figure object
fig.text(0.1, 0.02,
         'Data source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
         fontsize=10)
plt.show()
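Stripped of its styling, the two-axis technique comes down to three ideas: twinx() for the second y-axis, np.linspace for matching tick counts, and list concatenation for a combined legend. The sketch below isolates them using made-up arrays in place of the draft data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the real series
years = np.array([1966, 1967, 1968])
n_drafted = np.array([112, 162, 214])
avg_ws48 = np.array([0.013, 0.007, 0.006])

fig, ax1 = plt.subplots()
line1 = ax1.plot(years, n_drafted, 'b', label='No. of Players Drafted')
ax1.set_ylim(0, 240)

# twinx() creates a second Axes that shares the x-axis of ax1
ax2 = ax1.twinx()
line2 = ax2.plot(years, avg_ws48, 'r', label='Avg WS/48')
ax2.set_ylim(0, 0.08)

# The same number of ticks on both sides makes the tick marks line up
left_ticks = np.linspace(0, 240, 9)
right_ticks = np.linspace(0, 0.08, 9)
ax1.set_yticks(left_ticks)
ax2.set_yticks(right_ticks)

# plot() returns a list of lines, so the two lists can be concatenated
lines = line1 + line2
ax1.legend(lines, [line.get_label() for line in lines])
```

Building the legend from the concatenated list is what lets a single legend box describe lines that live on two different Axes objects.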
Let’s create a DataFrame of just the top 60 picks, and then grab the data we need to plot.
Note that drafts from 1989 to 2004 have fewer than 60 draft picks.
In [16]:
# Get the top 60 picks for each year
top60 = draft_df[(draft_df['Pk'] < 61)]
# Get the average WS/48 for each year
top60_yrly_WS48 = top60.groupby('Draft_Yr').WS_per_48.mean()
In [17]:
# Create a line graph for avg WS/48 for top 60 picks
sns.set_style("white")
plt.figure(figsize=(12,9))
x_values = draft_df.Draft_Yr.unique()
y_values = top60_yrly_WS48
title = ('Average Career Win Shares Per 48 minutes for'
         '\nTop 60 Picks by Draft Year (1966-2014)')
plt.title(title, fontsize=20)
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
plt.tick_params(axis='both', labelsize=14)
sns.despine(left=True, bottom=True)
plt.plot(x_values, y_values)
plt.text(1966, -0.012,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
         fontsize=12)
plt.show()
Bar Plots
Let’s create some bar plots for the average WS/48 of each pick in the top 60.
In [18]:
# Get the mean WS/48 for each pick
top60_mean_WS48 = top60.groupby('Pk').WS_per_48.mean()
In [19]:
sns.set_style("white")
# Set the x and y values
x_values = top60.Pk.unique()
y_values = top60_mean_WS48
# Get our Figure and Axes objects
fig, ax = plt.subplots(figsize=(15,10))
# Create a title
title = ('Average Win Shares per 48 Minutes for each'
         '\nNBA Draft Pick in the Top 60 (1966-2014)')
# Set the title font size to 18
ax.set_title(title, fontsize=18)
# Set x and y axis labels
ax.set_xlabel('Draft Pick', fontsize=16)
ax.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
# Set the tick label font size to 12
ax.tick_params(axis='both', labelsize=12)
# Set the x-axis limits
ax.set_xlim(0,61)
# Set the tick labels for picks 1 to 60
ax.set_xticks(np.arange(1,61))
# Create white y-axis grid lines
ax.yaxis.grid(color='white')
# overlay the white grid lines on top of the bars
ax.set_axisbelow(False)
# Now add the bars to our plot
# this is equivalent to plt.bar(x_values, y_values)
ax.bar(x_values, y_values)
# Get rid of chart borders
sns.despine(left=True, bottom=True)
plt.text(0, -.05,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
         fontsize=12)
plt.show()
We can also create a horizontal bar plot, which will give us better spacing for our tick labels.
In [20]:
sns.set_style("white")
# Note we flipped the value variable names
y_values = top60.Pk.unique()
x_values = top60_mean_WS48
fig, ax = plt.subplots(figsize=(10,15))
title = ('Average Win Shares per 48 Minutes for each'
         '\nNBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)
# We can rotate an axis label via the rotation argument.
# Here we set rotation to 0 so the ylabel is read horizontally
ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0)
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
# Set a limit for our y-axis so that pick 1 is at the top
ax.set_ylim(61,0)
# Show all values for draft picks
ax.set_yticks(np.arange(1,61))
# pad the y-axis label so it doesn't overlap tick labels
ax.yaxis.labelpad = 25
# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
# create white x-axis grid lines
ax.xaxis.grid(color='white')
# overlay the white grid lines on top of the bars
ax.set_axisbelow(False)
# Now add the horizontal bars to our plot,
# and align them centered with ticks
ax.barh(y_values, x_values, align='center')
# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.text(-0.02, 65,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
         fontsize=12)
plt.show()
Dot Plots/Point Plots
Instead of using a bar plot we can use a dot plot or point plot to represent the above information. seaborn allows us to create point plots using pointplot.
In [21]:
sns.set_style("white")
plt.figure(figsize=(10,15))
# Create Axes object with pointplot drawn onto it.
# By default pointplot draws the mean along with a
# 95% confidence interval.
# The join parameter, when set to True, draws a line connecting the points.
ax = sns.pointplot(x='WS_per_48', y='Pk', join=False, data=top60,
                   orient='h')
title = ('Average Win Shares per 48 Minutes (with 95% CI)'
         '\nfor each NBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)
ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0)
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
# pad the y-axis label to not overlap tick labels
ax.yaxis.labelpad = 25
# limit x-axis
ax.set_xlim(-0.1, 0.15)
# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
# add horizontal lines for each draft pick
# (y_values still holds the pick numbers from the previous cell)
for y in range(len(y_values)):
    ax.hlines(y, -0.1, 0.15, color='grey', linestyle='-', lw=0.5)
# Add a vertical line at 0.00 WS/48
ax.vlines(0.00, -1, 60, color='grey', linestyle='-', lw=0.5)
# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.text(-0.1, 63,
         'Primary Data Source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
         fontsize=12)
plt.show()
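The confidence intervals drawn by pointplot are estimated by bootstrapping, that is, by repeatedly resampling the data and recomputing the mean. The sketch below illustrates that idea with NumPy for a single draft slot. It is not seaborn's own code: the WS/48 values are made up, and the seed and the choice of 1000 resamples are assumptions for the example.

```python
import numpy as np

# Fixed seed so the sketch is repeatable (an arbitrary choice)
rng = np.random.default_rng(0)

# Made-up WS/48 values for one draft slot
ws48 = np.array([0.02, 0.05, 0.08, 0.00, 0.11, 0.07, -0.01, 0.09, 0.04, 0.06])

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample with replacement, keeping the original sample size
    resample = rng.choice(ws48, size=ws48.size, replace=True)
    boot_means[i] = resample.mean()

# The middle 95% of the bootstrap means gives the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(ws48.mean(), lower, upper)
```

Draft slots with fewer or more scattered observations produce wider intervals, which is why the error bars in the point plot vary in length from pick to pick.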
Boxplots
To create a boxplot using seaborn all we have to do is use boxplot.
So, let’s create a boxplot of the WS/48 for the top 30 draft picks.
In [22]:
top30 = top60[top60['Pk'] < 31]
In [23]:
sns.set_style("whitegrid")
plt.figure(figsize=(15,12))
# create our Axes that contains our boxplot
bplot = sns.boxplot(x='Pk', y='WS_per_48', data=top30, whis=[5,95],
                    color='salmon')
title = ('Distribution of Win Shares per 48 Minutes for each'
         '\nNBA Draft Pick in the Top 30 (1966-2014)')
# set title, axis labels, and change tick label size
bplot.set_title(title, fontsize=20)
bplot.set_xlabel('Draft Pick', fontsize=16)
bplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
bplot.tick_params(axis='both', labelsize=12)
# get rid of chart borders
sns.despine(left=True)
plt.text(-1, -.5,
         'Data source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
         '\nNote: Whiskers represent the 5th and 95th percentiles',
         fontsize=12)
plt.show()
Each box contains the inter-quartile range (IQR), which means the bottom of the box represents the 25th percentile and the top represents the 75th percentile. The median is represented by the line within the box.
By default in seaborn and matplotlib, each whisker extends to the most extreme data point that lies within 1.5 * IQR of the nearest quartile. So the top whisker reaches at most 1.5 * IQR above the 75th percentile, and the bottom whisker at most 1.5 * IQR below the 25th percentile. The dots that fall outside the whiskers are considered outliers.
However, in our boxplot above, we set the whiskers to represent the 5th and 95th percentiles by setting the whis parameter to [5, 95]. The dots now represent outliers that fall within the top or bottom 5% of the distribution.
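The two whisker rules can be compared numerically. The sketch below computes both for a made-up set of WS/48 values using NumPy. It mirrors the standard definitions (fences at 1.5 times the inter-quartile range beyond the quartiles, versus fixed percentiles) and is not taken from the plotting libraries' source.

```python
import numpy as np

# Made-up WS/48 values for a single draft slot
values = np.array([-0.05, 0.00, 0.02, 0.04, 0.05, 0.06, 0.08, 0.10, 0.12, 0.30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Default rule: whiskers stop at the most extreme points inside the fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
default_whiskers = (inside.min(), inside.max())
default_outliers = values[(values < lower_fence) | (values > upper_fence)]

# whis=[5, 95]: whiskers sit at the 5th and 95th percentiles instead
p5, p95 = np.percentile(values, [5, 95])
percentile_outliers = values[(values < p5) | (values > p95)]

print(default_whiskers, default_outliers)
print((p5, p95), percentile_outliers)
```

With percentile whiskers, some points on each side are always drawn as dots (as long as the data is not all identical), whereas the default rule only flags points that are far from the box.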
Let’s get the top 5% for the 3rd overall draft pick. To do this we get all 3rd overall picks, then get their WS_per_48 and call the quantile() method. Passing 0.95 into quantile() returns the WS_per_48 value of the 95th percentile for all 3rd picks.
In [24]:
pick3_95 = top30[top30['Pk']==3]['WS_per_48'].quantile(0.95)
pick3_95
Out[24]:
0.17839999999999998
Now let’s get the players that have a WS_per_48 greater than that value (about 0.1784).
In [25]:
# Here we are accessing columns as attributes and then using
# Boolean operations
# Let's create a mask that contains our Boolean operations then index
# the data using the mask
mask = (top30.Pk == 3) & (top30.WS_per_48 > pick3_95)
pick3_top5_percent = top30[mask]
pick3_top5_percent[['Player', 'WS_per_48']]
Out[25]:
| | Player | WS_per_48 |
---|---|---|
3092 | Kevin McHale | 0.180 |
4052 | Michael Jordan | 0.250 |
6080 | James Harden | 0.207 |
We can rewrite the above code using the query method. To reference a local variable within our query string we must place ‘@’ in front of its name. pandas also allows us to use English words such as "and" and "or" instead of the symbols & and | in our query string.
In [26]:
pick3_top5_percent = top30.query('Pk == 3 and WS_per_48 > @pick3_95')
pick3_top5_percent[['Player', 'WS_per_48']]
Out[26]:
| | Player | WS_per_48 |
---|---|---|
3092 | Kevin McHale | 0.180 |
4052 | Michael Jordan | 0.250 |
6080 | James Harden | 0.207 |
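To confirm that the mask and query forms select the same rows, here is a small self-contained sketch. The DataFrame, the placeholder player names, and the cutoff value are all made up for illustration.

```python
import pandas as pd

# Hypothetical data; the names are placeholders
df = pd.DataFrame({
    'Pk': [3, 3, 3, 4],
    'Player': ['A', 'B', 'C', 'D'],
    'WS_per_48': [0.10, 0.18, 0.25, 0.30],
})
cutoff = 0.15  # hypothetical threshold standing in for pick3_95

# Mask form: each comparison needs parentheses because & binds
# more tightly than == and >
by_mask = df[(df.Pk == 3) & (df.WS_per_48 > cutoff)]

# query form: 'and' replaces &, and @ pulls in the local variable
by_query = df.query('Pk == 3 and WS_per_48 > @cutoff')

print(by_mask.equals(by_query))
```

The query string avoids the parentheses and the repeated DataFrame name, which tends to pay off as the number of conditions grows.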
Violin Plots
Creating violin plots using seaborn is pretty much the same as creating a boxplot, but we use the violinplot function instead of boxplot.
We’ll create a violin plot for the top 10 draft picks.
In [27]:
top10 = top60[top60['Pk'] < 11]
In [28]:
sns.set(style="whitegrid")
plt.figure(figsize=(15,10))
# create an Axes object that contains our violin plot
vplot = sns.violinplot(x='Pk', y='WS_per_48', data=top10)
title = ('Distribution of Win Shares per 48 Minutes for each'
         '\nNBA Draft Pick in the Top 10 (1966-2014)')
# set title, axis labels, and change tick label size
vplot.set_title(title, fontsize=20)
vplot.set_xlabel('Draft Pick', fontsize=16)
vplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
vplot.tick_params(axis='both', labelsize=12)
plt.text(-1, -.55,
         'Data source: http://www.basketball-reference.com/draft/'
         '\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
         fontsize=12)
sns.despine(left=True)
plt.show()
Each violin in the above plot actually contains a box plot, with a white dot in the middle representing the median.
A violin plot is a combination of a boxplot and kernel density estimate. Instead of just having whiskers or dots to provide us information about the distribution, the violin plot also provides an estimated shape of the distribution.
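The kernel density estimate behind each violin's outline can be sketched in a few lines of NumPy: place a small Gaussian bump on every observation and average the bumps. The values and the fixed bandwidth below are made up for illustration. seaborn chooses its bandwidth automatically, so this shows the idea rather than reproducing its output.

```python
import numpy as np

# Made-up WS/48 values for one draft slot
samples = np.array([0.02, 0.05, 0.08, 0.00, 0.11, 0.07, -0.01, 0.09, 0.04, 0.06])
bandwidth = 0.02  # hypothetical fixed smoothing width

# Points at which to evaluate the density estimate
grid = np.linspace(-0.15, 0.25, 401)

# Place a Gaussian bump on every sample and average them
z = (grid[:, None] - samples[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = bumps.mean(axis=1)

# A proper density should integrate to roughly 1 over the grid
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

Mirroring the density curve on both sides of a vertical axis gives the violin shape. A larger bandwidth produces a smoother, wider outline, and a smaller one follows individual points more closely.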
Software Versions
In [29]:
import sys
print('Python version:', sys.version_info)
import IPython
print('IPython version:', IPython.__version__)
import matplotlib as mpl
print('Matplotlib version:', mpl.__version__)
print('Seaborn version:', sns.__version__)
print('Pandas version:', pd.__version__)
Python version: sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
IPython version: 3.2.0
Matplotlib version: 1.4.3
Seaborn version: 0.6.0
Pandas version: 0.16.2