Nylon Calculus 101: Data Scraping With Python
(Ed – One of the most common questions we get at Nylon Calculus is “how/where did you get your data?” This isn’t a trivial question, as the majority of the time that goes into good analytics work is tracking, collating, cleaning and massaging the raw data into some sort of usable form. Still, the answer in most cases is we[1. And by “we” I quite often mean Darryl, which is why he’s da real MVP.] went out and found it ourselves. It doesn’t take that many instances of manually copying, pasting and cleaning large tables from various websites to realize there has to be a better way. Those better ways include programming languages like R and Python. To provide an introduction to the Python language and its application to basketball analytics, we’ve asked Savvas Tjortjoglou to join us to re-post expanded versions of the Python tutorials originally posted on his site, as well as to continue producing new ones as a resource for the aspiring DIYers out there in the basketball analytics community.)
What is Python?
Python is an open source programming language with an elegant and straightforward syntax that makes it easy to learn and fun to write code in. Python is very flexible and is used in a variety of fields from web development to machine learning. Since its release in 1991, a great community has developed and grown along with the Python language. Thanks to this community, there are tons of fantastic resources to help you learn the language. One of my favorite introductory books to Python is Automate the Boring Stuff with Python, which has a great chapter on web scraping. If you’ve never had any experience with Python, I suggest taking a look at any of the linked resources, as you will need some basic knowledge of Python to fully understand this tutorial.
What is Web Scraping?
Web scraping is the collecting of data from websites. When web scraping, you typically connect to the desired website, request the data (usually the HTML), and then extract the information you want.
In this tutorial we’ll be scraping and cleaning data from the 1966 draft (the first year without territorial picks) to the 2014 draft. To scrape the data we want, we will use the Beautiful Soup library. Beautiful Soup allows us to search through the HTML of a webpage and extract the information we need. We’ll then take the scraped data and store it in a 2-dimensional data structure called a DataFrame using the pandas library.
In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
Scraping the Data
Let’s get a feel for using Beautiful Soup by just scraping the data for the 2014 NBA draft.
First we need to get the HTML document we will be scraping. To do that we will use urlopen, which we imported from the urllib.request module.
In [2]:
# url that we are scraping
url = "http://www.basketball-reference.com/draft/NBA_2014.html"

# this is the html from the given url
html = urlopen(url)
Now we create a BeautifulSoup object by passing html to the BeautifulSoup() constructor.
In [3]:
soup = BeautifulSoup(html)
In [4]:
type(soup) # we see that soup is a BeautifulSoup object
Out[4]:
bs4.BeautifulSoup
Getting Column Headers
First let’s grab the information that will become the column headers of our DataFrame. That information is found in the content of the table header cell HTML elements (denoted by <th></th> tags), which are found within a table row element (<tr></tr>) of a table header element (<thead></thead>) of a table element (<table></table>).
The HTML structure would look something like this:
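<table>
  <thead>
    <tr> ... </tr>   <!-- the 1st header row (an over-header) -->
    <tr>             <!-- the 2nd header row, the one we want -->
      <th>Rk</th>
      <th>Pk</th>
      <th>Tm</th>
      <th>Player</th>
      <!-- ...more th elements for the remaining columns... -->
    </tr>
  </thead>
  <tbody>
    <tr> ... </tr>   <!-- one tr per draft pick -->
  </tbody>
</table>

(This is a simplified reconstruction on my part; the actual Basketball-Reference markup includes class and other attributes omitted here.)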
As an example, let’s take a look at the table header cell element for the Player column header, which would look something like <th>Player</th> (attributes omitted). We want to extract the text content (which would be “Player” in the above example) from all elements like the one above and store them in a list.
By inspecting the HTML document (in Chrome you do this by right-clicking on the webpage and selecting “Inspect element” from the dropdown menu) we see that the 2nd table row is the one that contains the table headers we want.
Using BeautifulSoup, getting this information is pretty straightforward. All we have to do is:

- Use the findAll() method to find the first 2 rows on the page.
  - We pass the element we want as the first argument, in this case tr.
  - We set the limit parameter to 2 so we only get the first 2 table row elements. So we have something like this: soup.findAll('tr', limit=2)
  - The findAll() method above returns us a list of BeautifulSoup Tag objects. In this case it returns the Tag objects that represent the table rows we wanted.
- We want the 2nd table row, so we have to extract the 2nd element of the list of Tag objects, which is done by just indexing the list as follows: soup.findAll('tr', limit=2)[1]
- Since the above returns us a Tag object, we can just call findAll() on it to extract the table header cell elements like this: soup.findAll('tr', limit=2)[1].findAll('th')
- Now we have a list containing the table header cell elements, which we can iterate through and extract the text we want via the getText() method. Let’s extract that text and construct a list of the column headers using a list comprehension.
In [5]:
column_headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]
In [6]:
column_headers # our column headers
Out[6]:
['Rk', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST', 'FG%', '3P%', 'FT%', 'MP', 'PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP']
NOTE: Some of the column headers need to be renamed, but we won’t be doing that until we scrape all the draft data.
Getting the Data
We now need to extract the data from the HTML table and fill up our DataFrame. We can follow a similar process to the one we used to extract the header information.
In this case the data we want is found in the table row elements after the first two header rows. So let’s get the list of table rows as follows:
In [7]:
data_rows = soup.findAll('tr')[2:] # skip the first 2 header rows
In [8]:
type(data_rows) # now we have a list of table rows
Out[8]:
list
The difference between extracting the player data and extracting the column headers is that the player data is in a 2-dimensional format (or a matrix), so we have to construct a 2-dimensional list. We can do this using a nested list comprehension.
In [9]:
player_data = [[td.getText() for td in data_rows[i].findAll('td')] for i in range(len(data_rows))]
Let’s take a look at what the above does. The outer for loop, for i in range(len(data_rows)), gets each table row, which represents each draft pick. The for loop in the inner list comprehension, [td.getText() for td in data_rows[i].findAll('td')], extracts the text from each table data element (<td></td>) in that row. This text represents the column data for each draft pick.
You could also construct the above list without a list comprehension:
In [10]:
player_data_02 = []  # create an empty list to hold all the data

for i in range(len(data_rows)):  # for each table row
    player_row = []  # create an empty list for each pick/player

    # for each table data element from each table row
    for td in data_rows[i].findAll('td'):
        # get the text content and append to the player_row
        player_row.append(td.getText())

    # then append each pick/player to the player_data matrix
    player_data_02.append(player_row)
Both player_data and player_data_02 are equivalent.
In [11]:
player_data == player_data_02
Out[11]:
True
Now that we have the column labels and the data, we can construct a pandas DataFrame. We pass in player_data as the first parameter to the DataFrame constructor and set the columns parameter to column_headers.
In [12]:
df = pd.DataFrame(player_data, columns=column_headers)
In [13]:
df.head() # head() lets us see the 1st 5 rows of our DataFrame by default
Out[13]:
Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | CLE | Andrew Wiggins | University of Kansas | 1 | 82 | 2969 | 1387 | 374 | … | .310 | .760 | 36.2 | 16.9 | 4.6 | 2.1 | 2.1 | .034 | -2.3 | -0.2 |
1 | 2 | 2 | MIL | Jabari Parker | Duke University | 1 | 25 | 738 | 308 | 138 | … | .250 | .697 | 29.5 | 12.3 | 5.5 | 1.7 | 1.3 | .088 | -1.2 | 0.1 |
2 | 3 | 3 | PHI | Joel Embiid | University of Kansas | … | |||||||||||||||
3 | 4 | 4 | ORL | Aaron Gordon | University of Arizona | 1 | 47 | 797 | 243 | 169 | … | .271 | .721 | 17.0 | 5.2 | 3.6 | 0.7 | 1.0 | .060 | -2.8 | -0.2 |
4 | 5 | 5 | UTA | Dante Exum | | 1 | 82 | 1817 | 393 | 131 | … | .314 | .625 | 22.2 | 4.8 | 1.6 | 2.4 | -0.1 | -.003 | -3.8 | -0.8 |
5 rows × 22 columns
In [14]:
df.head(3) # we can change from the default view by passing in an integer
Out[14]:
Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | CLE | Andrew Wiggins | University of Kansas | 1 | 82 | 2969 | 1387 | 374 | … | .310 | .760 | 36.2 | 16.9 | 4.6 | 2.1 | 2.1 | .034 | -2.3 | -0.2 |
1 | 2 | 2 | MIL | Jabari Parker | Duke University | 1 | 25 | 738 | 308 | 138 | … | .250 | .697 | 29.5 | 12.3 | 5.5 | 1.7 | 1.3 | .088 | -1.2 | 0.1 |
2 | 3 | 3 | PHI | Joel Embiid | University of Kansas | … |
3 rows × 22 columns
Cleaning the Data
There are a few things we have to do to make our data usable:
- Get rid of a couple of header rows that contain only missing values.
- Rename some of the columns.
- Change the data in each column to the proper data type.
- Deal with some more missing values in the form of “NaNs”.
- Add and drop certain columns.
Getting Rid of the Rows with Missing Values
Now let’s find the rows that contain only missing values. To do this we can use pandas boolean indexing. We can find the rows we want by calling the isnull() method (which returns True if there is a None object or a NaN) on the Pk column. If the Pk value is missing then there isn’t a draft pick in that row, so we can get rid of that row.
In [15]:
# Finding the None rows
df[df['Pk'].isnull()]
Out[15]:
Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 22 columns
We get rid of those rows by reassigning df a subset of itself that doesn’t contain the above rows. Note how we can access the same column of our DataFrame in different ways. In the cell below we are accessing the column as an attribute (df.Player refers to the same column as df['Player']).
In [16]:
df = df[df.Player.notnull()]
Now there aren’t any rows that are full of missing values.
In [17]:
df[df['Pk'].isnull()]
Out[17]:
Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 22 columns
Renaming the Columns
We should rename some of the columns since Python is not happy with having ‘%’ or ‘/’ in identifiers.
Let’s rename the WS/48 column to WS_per_48 using the rename() method. We set the columns parameter to a dictionary, with the key being the column we want to rename and the value being the new name for that column.
In [18]:
df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)
Since there are multiple columns with the ‘%’ character, let’s use the str.replace() method on the column index to replace ‘%’ with ‘_Perc’ in those column names.
In [19]:
# get the column names and replace all '%' with '_Perc'
df.columns = df.columns.str.replace('%', '_Perc')
We also need to differentiate between per game stats and total cumulative career stats. For example, we have two MP columns, one is the career total minutes played by that player and the other is his per game average.
To do this we can append ‘_per_G’ to the per game stats using a list comprehension.
In [20]:
# Get the columns we want by slicing the list of column names
# and then replace them with the appended names
df.columns.values[14:18] = [df.columns.values[14:18][col] + "_per_G"
                            for col in range(4)]
In [21]:
print(df.columns)
Index(['Rk', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST', 'FG_Perc', '3P_Perc', 'FT_Perc', 'MP_per_G', 'PTS_per_G', 'TRB_per_G', 'AST_per_G', 'WS', 'WS_per_48', 'BPM', 'VORP'], dtype='object')
Changing Data to the Proper Data Types
In [22]:
df.dtypes # Take a look at data types in each column
Out[22]:
Rk            object
Pk            object
Tm            object
Player        object
College       object
Yrs           object
G             object
MP            object
PTS           object
TRB           object
AST           object
FG_Perc       object
3P_Perc       object
FT_Perc       object
MP_per_G      object
PTS_per_G     object
TRB_per_G     object
AST_per_G     object
WS            object
WS_per_48     object
BPM           object
VORP          object
dtype: object
We can see that our DataFrame consists of a lot of columns with the data type object. We can call the DataFrame method convert_objects() and pass in the parameter convert_numeric=True to convert the columns containing numerical strings to the most suitable numeric data type.
In [23]:
df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[23]:
Rk             int64
Pk             int64
Tm            object
Player        object
College       object
Yrs          float64
G            float64
MP           float64
PTS          float64
TRB          float64
AST          float64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP_per_G     float64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS           float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object
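(A quick aside that isn’t part of the original notebook: convert_objects() was deprecated in later pandas releases and eventually removed. If it’s gone from your version, something along these lines should do the same job — a rough sketch on my part, not tested against every pandas release:)

# apply pd.to_numeric column by column; errors='ignore' leaves
# non-numeric columns like Tm, Player and College untouched
df = df.apply(pd.to_numeric, errors='ignore')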
Not everything is exactly as we want it to be. We should convert the columns Yrs, G, MP, PTS, TRB, and AST to integers using astype(). But before we can do that we need to deal with some NaNs (Not a Number) or else we will get an error like this:
ValueError: Cannot convert NA
Dealing with the NaN Values
The NaNs in our data indicate that a player has not played in the NBA. We should replace these NaNs with 0s to indicate that the player has not accumulated any stats. To replace the NaN values with 0s we just pass the value 0 into the fillna() method.
In [24]:
df = df[:].fillna(0) # index all the columns and fill in the 0s
Now that there are no more missing values in our data, we can call astype().
To get the columns we will convert, we can use loc, which allows us to access the rows and columns of our DataFrame by using their labels. Using loc we can slice consecutive rows or columns in a similar manner to the way we slice lists, except the slice ranges in loc are inclusive on both ends of the indicated slice.
The typical format for slicing with loc is as follows:
df.loc[first_row_label:last_row_label , first_col_label:last_col_label]
(Read more about indexing pandas DataFrames here)
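To make that inclusivity concrete, here is a toy example of mine (not drawn from the draft data):

toy = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# both endpoints are included: rows 0 through 1 AND columns 'A' through 'B',
# so this returns a 2x2 DataFrame -- unlike list slicing, where
# some_list[0:1] would contain only one element
toy.loc[0:1, 'A':'B']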
Since the columns Yrs, G, MP, PTS, TRB, and AST are consecutive, we can retrieve them by slicing our DataFrame using loc and then convert their data type.
In [25]:
df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
In [26]:
df.head() # All NaNs are now replaced with 0s
Out[26]:
Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | TRB | … | 3P_Perc | FT_Perc | MP_per_G | PTS_per_G | TRB_per_G | AST_per_G | WS | WS_per_48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | CLE | Andrew Wiggins | University of Kansas | 1 | 82 | 2969 | 1387 | 374 | … | 0.310 | 0.760 | 36.2 | 16.9 | 4.6 | 2.1 | 2.1 | 0.034 | -2.3 | -0.2 |
1 | 2 | 2 | MIL | Jabari Parker | Duke University | 1 | 25 | 738 | 308 | 138 | … | 0.250 | 0.697 | 29.5 | 12.3 | 5.5 | 1.7 | 1.3 | 0.088 | -1.2 | 0.1 |
2 | 3 | 3 | PHI | Joel Embiid | University of Kansas | 0 | 0 | 0 | 0 | 0 | … | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 |
3 | 4 | 4 | ORL | Aaron Gordon | University of Arizona | 1 | 47 | 797 | 243 | 169 | … | 0.271 | 0.721 | 17.0 | 5.2 | 3.6 | 0.7 | 1.0 | 0.060 | -2.8 | -0.2 |
4 | 5 | 5 | UTA | Dante Exum | | 1 | 82 | 1817 | 393 | 131 | … | 0.314 | 0.625 | 22.2 | 4.8 | 1.6 | 2.4 | -0.1 | -0.003 | -3.8 | -0.8 |
5 rows × 22 columns
In [27]:
df.dtypes # and we have the data types we want
Out[27]:
Rk             int64
Pk             int64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP_per_G     float64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS           float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object
Adding and Deleting Columns
Now let’s finally add a Draft_Yr column to indicate the draft class year. We do this using the insert() method. The first parameter is the location where we are inserting the column, the second parameter is the name of the column, and the third parameter is the value(s) for that column.
In [28]:
df.insert(0, 'Draft_Yr', 2014)
Now to delete the Rk column, as it is redundant since we have the Pk column. To delete a column we use the drop() method. We pass in the name of the column we want to delete, the axis we are deleting along, and set inplace to True so that the change occurs in place and we don’t have to reassign df to itself.
In [29]:
df.drop('Rk', axis='columns', inplace=True)
In [30]:
df.columns # check out our revised columns
Out[30]:
Index(['Draft_Yr', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST', 'FG_Perc', '3P_Perc', 'FT_Perc', 'MP_per_G', 'PTS_per_G', 'TRB_per_G', 'AST_per_G', 'WS', 'WS_per_48', 'BPM', 'VORP'], dtype='object')
Scraping and Cleaning Data for all Drafts from 1966 to 2014
Scraping and cleaning the data from all previous drafts follows a similar procedure to the one we used for the 2014 draft. The main difference is that we have to do it multiple times and then combine the data into one large DataFrame.
Scraping the Data
First let’s create a URL template that will allow us to access the web page for each year.
In [31]:
url_template = "http://www.basketball-reference.com/draft/NBA_{year}.html"
The draft year will be passed into {year} as we collect the data for each draft.
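For example, url_template.format(year=1966) gives us "http://www.basketball-reference.com/draft/NBA_1966.html".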
In [32]:
# create an empty DataFrame
draft_df = pd.DataFrame()
Now we are going to create a for loop to scrape and append our data to our big DataFrame. It does take a bit to get all the data.
In [33]:
for year in range(1966, 2015):  # for each year
    url = url_template.format(year=year)  # get the url
    html = urlopen(url)  # get the html
    soup = BeautifulSoup(html, 'html5lib')  # create our BS object

    # get our player data
    data_rows = soup.findAll('tr')[2:]
    player_data = [[td.getText() for td in data_rows[i].findAll('td')]
                   for i in range(len(data_rows))]

    # Turn yearly data into a DataFrame
    year_df = pd.DataFrame(player_data, columns=column_headers)
    # create and insert the Draft_Yr column
    year_df.insert(0, 'Draft_Yr', year)

    # Append to the big dataframe
    draft_df = draft_df.append(year_df, ignore_index=True)
NOTE: I had to pass html5lib as an extra argument into the BeautifulSoup constructor. html5lib is a third-party Python parser that Beautiful Soup can use. The reason I had to do this was that my default parser (lxml, another third-party parser) was not extracting all of the table rows for some of the years. For more about parsers and Beautiful Soup, check out the documentation.
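One more aside, not from the original tutorial: appending to draft_df inside the loop copies the whole DataFrame on every iteration. A common alternative is to collect the yearly DataFrames in a list and combine them once at the end with pd.concat — a sketch under the same assumptions as the loop above:

# collect each year's DataFrame in a list...
yearly_dfs = []
for year in range(1966, 2015):
    url = url_template.format(year=year)
    soup = BeautifulSoup(urlopen(url), 'html5lib')
    data_rows = soup.findAll('tr')[2:]
    player_data = [[td.getText() for td in row.findAll('td')]
                   for row in data_rows]
    year_df = pd.DataFrame(player_data, columns=column_headers)
    year_df.insert(0, 'Draft_Yr', year)
    yearly_dfs.append(year_df)

# ...then combine them all at once
draft_df = pd.concat(yearly_dfs, ignore_index=True)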
Let’s take a look at the first and last few rows of our data.
In [34]:
draft_df.head()
Out[34]:
Draft_Yr | Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1966 | 1 | 1 | NYK | Cazzie Russell | University of Michigan | 12 | 817 | 22213 | 12377 | … | | .827 | 27.2 | 15.1 | 3.8 | 2.2 | 51.7 | .112 | -2.0 | 0.1 |
1 | 1966 | 2 | 2 | DET | Dave Bing | Syracuse University | 12 | 901 | 32769 | 18327 | … | | .775 | 36.4 | 20.3 | 3.8 | 6.0 | 68.8 | .101 | 0.6 | 8.5 |
2 | 1966 | 3 | 3 | SFW | Clyde Lee | Vanderbilt University | 10 | 742 | 19885 | 5733 | … | | .614 | 26.8 | 7.7 | 10.3 | 1.1 | 33.5 | .081 | -2.4 | -0.6 |
3 | 1966 | 4 | 4 | STL | Lou Hudson | University of Minnesota | 13 | 890 | 29794 | 17940 | … | | .797 | 33.5 | 20.2 | 4.4 | 2.7 | 81.0 | .131 | 0.1 | 5.9 |
4 | 1966 | 5 | 5 | BAL | Jack Marin | Duke University | 11 | 849 | 24590 | 12541 | … | | .843 | 29.0 | 14.8 | 5.2 | 2.1 | 59.3 | .116 | -2.8 | -1.4 |
5 rows × 23 columns
In [35]:
draft_df.tail()
Out[35]:
Draft_Yr | Rk | Pk | Tm | Player | College | Yrs | G | MP | PTS | … | 3P% | FT% | MP | PTS | TRB | AST | WS | WS/48 | BPM | VORP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6445 | 2014 | 56 | 56 | DEN | Roy Devyn Marble | University of Iowa | 1 | 16 | 208 | 37 | … | .182 | .313 | 13.0 | 2.3 | 1.9 | 1.1 | -0.1 | -.031 | -4.5 | -0.1 |
6446 | 2014 | 57 | 57 | IND | Louis Labeyrie | … | |||||||||||||||
6447 | 2014 | 58 | 58 | SAS | Jordan McRae | University of Tennessee | … | ||||||||||||||
6448 | 2014 | 59 | 59 | TOR | Xavier Thames | San Diego State University | … | ||||||||||||||
6449 | 2014 | 60 | 60 | SAS | Cory Jefferson | Baylor University | 1 | 50 | 531 | 183 | … | .133 | .574 | 10.6 | 3.7 | 2.9 | 0.3 | 0.8 | .071 | -3.7 | -0.2 |
5 rows × 23 columns
Cleaning the Data
In [36]:
# Convert data to proper data types
draft_df = draft_df.convert_objects(convert_numeric=True)

# Get rid of the rows full of null values
draft_df = draft_df[draft_df.Player.notnull()]

# Replace NaNs with 0s
draft_df = draft_df.fillna(0)

# Rename Columns
draft_df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)

# Change % symbol
draft_df.columns = draft_df.columns.str.replace('%', '_Perc')

# Add per_G to per game stats
draft_df.columns.values[15:19] = [draft_df.columns.values[15:19][col] + "_per_G"
                                  for col in range(4)]

# Changing the Data Types to int
draft_df.loc[:,'Yrs':'AST'] = draft_df.loc[:,'Yrs':'AST'].astype(int)

# Delete the 'Rk' column
draft_df.drop('Rk', axis='columns', inplace=True)
Let’s see how our data looks now.
In [37]:
draft_df.dtypes
Out[37]:
Draft_Yr       int64
Pk           float64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP_per_G     float64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS           float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object
In [38]:
draft_df['Pk'] = draft_df['Pk'].astype(int) # change Pk to int
In [39]:
draft_df.dtypes
Out[39]:
Draft_Yr       int64
Pk             int64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP_per_G     float64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS           float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object
In [40]:
draft_df.isnull().sum() # No missing values in our DataFrame
Out[40]:
Draft_Yr     0
Pk           0
Tm           0
Player       0
College      0
Yrs          0
G            0
MP           0
PTS          0
TRB          0
AST          0
FG_Perc      0
3P_Perc      0
FT_Perc      0
MP_per_G     0
PTS_per_G    0
TRB_per_G    0
AST_per_G    0
WS           0
WS_per_48    0
BPM          0
VORP         0
dtype: int64
Writing the Data to a CSV File
Using the to_csv() method we can easily write out our DataFrame to a CSV file.
In [41]:
draft_df.to_csv("draft_data_1966_to_2014.csv")
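One small thing to note: by default, to_csv() also writes out the DataFrame’s index as an extra unnamed column. If you’d rather leave it out (a matter of preference, not something the original code does), you can pass index=False:

# write out the data without the index column
draft_df.to_csv("draft_data_1966_to_2014.csv", index=False)

# and to read the data back in later:
draft_df = pd.read_csv("draft_data_1966_to_2014.csv")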
Finally, the version of Python and versions of the libraries we used.
In [42]:
# sys allows us to get the info for the version of Python we use
import sys
import urllib.request
import bs4
In [43]:
print('Python version:', sys.version_info)
print('Urllib.request version:', urllib.request.__version__)
print('BeautifulSoup version:', bs4.__version__)
print('Pandas version:', pd.__version__)
Python version: sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
Urllib.request version: 3.4
BeautifulSoup version: 4.3.2
Pandas version: 0.16.2