(Ed. – As most regular readers of TNC must know, Evan Zamir owns and operates the invaluable NBAwowy.com, the go-to website for examining lineup combinations in all their gore and glory. When he’s not providing that service, he’s homering it up on Twitter over his beloved Golden State Warriors. He has previously written a how-to on building your own RAPM model, and today he brings you the goods on creating your own NBA (or really any other) stats website. Enjoy, and happy building!)
Build yourself an NBA stats website for great good
You got into this slowly. Maybe you heard a friend talk about true shooting percentage [1. “Double-U-Tee-Eff Carl” you remarked snidely at the time.] or maybe you read a comment in a forum dissing your favorite player[2. Monta was my guy “back then” as you now say fondly with a wry smile.]. A light bulb went off. Maybe these “analytics” folks were onto something. Maybe points per game wasn’t the most important stat anymore. You had to learn more. You found the old blog posts at 82games.com. You became a member of APBR, where you lurked for a while before becoming a regular contributor. You started following @ZachLowe_NBA. You read advanced stat primers. Eventually you would write your own. But that didn’t quench your thirst. You wanted to do more. You wanted to give more.
You wanted to build an NBA stats website.
Welcome. As a member of that club myself, I want to share with you advice on how to get started (btw, do you see how meta this is?). First, the TL;DR summary:
Build a data pipeline → Build a back-end API → Build a front-end client
Oh, and before we begin in earnest, I want to offer one piece of invaluable advice: make sure all your code is under version control, preferably in Git. Git is a distributed version control system that will make your life as a developer so much easier. If you don’t already know it, do yourself a favor and learn. It’s not very difficult, and there’s even a free book.
Build a data pipeline
“This is my data pipeline. There are many others like it, but this one is mine. My data pipeline is my best friend. It is my life. I must master it as I must master my life. Without me, my data pipeline is useless. Without my data pipeline, I am useless.”
Sometimes jargon sucks (“Big Data”), but sometimes it’s awesome. “Data pipeline” is awesome jargon in my book. It is pretty much exactly what it sounds like. Your website essentially is data, but you need to get the data from somewhere, and you need to do things to the data, sometimes even nasty things (don’t worry, the data doesn’t have feelings), and you need to send the data somewhere else so other people can tap into it. Your data pipeline is fundamental. You want to make sure it is done right.
Start at the source – Data Acquisition
If you did read the old blog posts at 82games.com, you’d find that some of the articles were actually based on data manually collected by the authors, who wrote stats down on paper (or maybe it was Palm Pilots) while watching games (with their eyes). For example, they estimated the value of an assist by charting games using upwards of 40 contributors! Of course, without a lot of financial resources, manual data collection is not very scalable. So the data for your website most likely won’t come from manual collection. Instead, it will come from one of three sources: 1) “flat” CSV or JSON text files that you find somewhere on the internet (or gasp, Excel files); 2) writing your own web scraper to pull data from static web pages; 3) using an API from a website, such as NBA.com.
Before we move on, it is important to note here that different sites may have different policies regarding the legality of using any of their data for commercial or even non-commercial purposes. You should do the investigation necessary to make sure you are not violating any TOS policies.
In Python, the csv and json modules are part of the standard library and very easy to use (i.e. to parse data with). Of course, there are also libraries for reading in Excel files. I tend to prefer the JSON format when available, simply because it is so easy to use for debugging and basically mirrors the data structures used in almost every modern scripting language. Here’s a real-world example of a JSON-formatted object from nbawowy that describes a single play-by-play event:
{
  “event”: “Start of the 3rd Quarter”,
  …
}
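Parsing such an event in Python is a one-liner with the standard library. In this sketch, every field besides “event” is made up for illustration:

```python
import json

# A play-by-play event in the spirit of the excerpt above;
# "period" and "clock" are hypothetical field names.
raw = '{"event": "Start of the 3rd Quarter", "period": 3, "clock": "12:00"}'
play = json.loads(raw)
print(play["event"])  # Start of the 3rd Quarter
```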
The advantages of text files are that you pretty much know the site meant to make the data publicly available and did most of the grunt work for you by aggregating the data in a convenient format. These days, however, if you find flat files on the web, it’s usually not in large quantities, and you’ll often have to turn to the more automated approaches described below to collect the requisite amount of data you want for your new site.
Web scraping is actually how I get the data for nbawowy. It involves using a library to load the HTML (or sometimes XML) from a website and parsing the DOM (the tree-like structure at the heart of every web page) for the data you want, whether it be in a table element or a link. This will involve learning at least a little CSS-selector or XPath syntax so you can write custom patterns for extracting the data. I would suggest using Nokogiri in Ruby or Scrapy in Python for this task. Both are great frameworks with big communities and excellent documentation, and they will make this part of the data pipeline relatively painless. Now, there are other ways to scrape data that don’t involve writing much, if any, code. Check out import.io for a good example of a service that can help automate your scraping process, with the caveat that such services tend not to be free.
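To make the DOM-walking idea concrete without pulling in a framework, here’s a bare-bones scraper built on Python’s standard-library HTMLParser. Scrapy or Nokogiri give you the same result with CSS selectors and far less typing; the table markup here is invented for illustration:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text inside <td> cells -- the same DOM-walking idea
    that Scrapy's and Nokogiri's selectors wrap for you."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells.append(data.strip())

# Toy page standing in for a real box-score table
html = "<table><tr><td>Stephen Curry</td><td>36</td></tr></table>"
scraper = TableScraper()
scraper.feed(html)
print(scraper.cells)  # ['Stephen Curry', '36']
```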
We’ll see later that when you build the back end for your site, you’ll want to build an API for your clients (i.e. the browser) to retrieve data, perform some calculation (or perhaps create a visualization on the fly), or even post data back to your site (for example, via a form). Given that many modern websites implement their own APIs, you may be able to directly call such an API (i.e. make an HTTP request) from your code to acquire the data necessary for your site. Check out the Requests module in Python for a good example of an HTTP request library. As an example, NBA.com/stats exposes an API call that retrieves pretty much every player in NBA history up through the 2013 season in JSON format (don’t all go hitting it at the same time!).
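With Requests, such a call is only a couple of lines. The endpoint and parameters below are illustrative, from memory of the NBA.com stats API; verify them (and the site’s terms of service) before building on them:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters -- check the current API before use.
BASE = "https://stats.nba.com/stats/commonallplayers"
params = {"LeagueID": "00", "Season": "2013-14", "IsOnlyCurrentSeason": "0"}
url = BASE + "?" + urlencode(params)

# With the Requests library installed, fetching and parsing is two lines:
#   resp = requests.get(url, timeout=10)
#   players = resp.json()
print(url)
```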
Be wary of “hitting” an API too frequently, though. Sites often put rate limits on API calls, lest you unwittingly mount a DoS (denial of service) attack by hitting their servers too hard. I can tell you that this has happened to me on nbawowy, and it was not appreciated! If you’re acting in good faith, most likely you will not be calling an API frequently enough to cause any problems for the site.
Before moving on to the next phase, you’ll want to think about how often you will need to acquire data. If it’s once or twice a year or even once or twice a month, you can probably run a script or download files manually. If the acquisition needs to happen more often, say daily or even hourly, you probably want to start thinking about setting up a cron job to automate the process.
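On a Unix box, the cron setup is a single line in your crontab. The paths below are hypothetical:

```shell
# min hour day month weekday  command -- run the scraper at 5:00 AM daily
0 5 * * * /usr/bin/python /home/you/pipeline/fetch_games.py >> /var/log/scrape.log 2>&1
```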
Nobody puts data in the corner!
Once you’ve gone to the trouble of acquiring the data, you probably don’t want it to disappear, so you need to store it somewhere. There are many options. You can store it locally on your computer in flat files (again, CSV or JSON come to mind), or, more likely, you will want to load the data into an appropriate database. All the data for nbawowy is stored in MongoDB, a so-called “NoSQL” database. Mongo is a convenient choice if your data needs are read-only (i.e. not transactional) and if each element of data is essentially a JSON document[1. Mongo uses a binary form of JSON called BSON.]. If you need transactions and/or you need to do joins in production (i.e. in real time), you will probably want to use a SQL database. In that case I would recommend MySQL or PostgreSQL (my preference in SQL land).
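If you go the SQL route, the core idea is small enough to sketch with SQLite from Python’s standard library. The schema here is invented; the same statements carry over to MySQL or PostgreSQL with minor syntax changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real storage
conn.execute("CREATE TABLE plays (game_id TEXT, period INTEGER, event TEXT)")
conn.execute("INSERT INTO plays VALUES (?, ?, ?)",
             ("0021300001", 3, "Start of the 3rd Quarter"))

# Read it back with an ordinary parameterized query
row = conn.execute("SELECT event FROM plays WHERE period = 3").fetchone()
print(row[0])  # Start of the 3rd Quarter
```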
Aside from choosing the format to store your data (i.e. flat files or in a database), you will need to decide where you will have the data hosted. Typically, and usually most cheaply, you will store the data on the same server where your site is hosted (whether in your home or on a cloud service such as Heroku). If you want to separate your data hosting from your web hosting, you could use a third-party (“cloud-based”) host, such as MongoLab for MongoDB (nbawowy uses this service currently) or one of the countless SQL hosts you can find online.
The ubiquitous Amazon Web Services (AWS) offers several different database services to suit almost every need, from cold storage to production environments requiring virtually 100% uptime. Amazon RDS (and more recently Aurora) and DynamoDB are Amazon’s production relational and NoSQL services, respectively. Cloud database services can be useful if the cost is reasonable, especially once the size of your database gets to a point where you need more advanced features such as sharding (i.e. distributing data across servers). Essentially, “DB-as-a-service” lets you spend more time building the features of your site and less time worrying about the responsibilities usually owned by full-time DBAs or sysadmins.
I would be remiss to leave out Amazon’s S3 service, which is arguably the gold standard on the internet for storing flat files. If you have a ton of data and want both archival ability and reasonably fast access times for transforming it in a data pipeline, you really can’t go wrong with storing your data on S3. In fact, as a general rule, it’s probably good form to store your raw data on S3 regardless of wherever else you send it down the pipeline. I’ve heard that S3 has never lost a single file, which is probably not at all true…but true enough that your files are probably safer on S3 than on the five-year-old Dell laptop sitting on your desk, precariously close to that cup of coffee you just poured, which is begging to be knocked over by your girlfriend’s cat onto the keyboard, thus resulting in an electrical fire and complete data, if not cat, loss. So good luck with that.
Build a server
Now that you have your data source and storage requirements nailed down, it’s time to build the site. Most modern websites are built using a client-server architecture. The server or “back end” is where you handle requests coming from the browser (also called the “client” or “front end”). You can choose a “full stack” framework, such as Ruby on Rails (obviously for Ruby developers) or Django (for Python developers), in which case you will build both the server and client in one unified (“from soup to nuts”) framework. These tools are great for getting something up and running quickly, and they include sophisticated features, such as user authentication and object-relational mappers (ORMs), for communicating with databases in a more language-idiomatic way than raw SQL queries.
Of course, there are other server frameworks and libraries to choose from. Sinatra is a very popular alternative to RoR in the Ruby community that is “leaner” and more focused on building APIs. Flask is essentially the Python equivalent of Sinatra and refers to itself as a “microframework”. Thin (Ruby) and Bottle (Python) are even, well, thinner versions of Sinatra and Flask, respectively.
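As a taste of the “micro” end of that spectrum, here is roughly what a one-endpoint JSON API looks like in Flask. The route and the numbers it returns are made up for illustration:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical endpoint: per-game scoring for one player
@app.route("/api/players/<name>")
def player_stats(name):
    # In a real app this handler would query your database
    return jsonify({"player": name, "ppg": 24.3})

# Run locally with: app.run(debug=True)
```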
To be honest, any of these frameworks will most likely be a good choice for building a modern site. The more difficult question you’ll likely face is what to actually make your server do for you and your users! One question you need to ask early in the development cycle of your site is where you want to place the heaviest loads.
There are four distinct points in the stack where, depending on your project, you might choose to perform heavier calculations/aggregations:
1) You can do pre-aggregation, transformation, and computation of data off-line in a batch process. For example, much of the data for nbawowy is already transformed and annotated before being uploaded to MongoLab. This is great if you can do it, because the users of your site will not see any performance penalty having to wait for computations to occur. But just as downloading pre-aggregated text files isn’t always a possibility, chances are your site will depend on some user interaction that can’t be accounted for in an off-line batch process and must be handled in real-time while the user is actually on the site.
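The batch idea in miniature: roll raw play rows up to per-player totals once, offline, and have the site serve the precomputed file. The data and the output filename are invented for illustration:

```python
import json
from collections import defaultdict

# Raw rows as they might come off the scraper
plays = [{"player": "Curry", "pts": 3}, {"player": "Curry", "pts": 2},
         {"player": "Thompson", "pts": 3}]

totals = defaultdict(int)
for p in plays:
    totals[p["player"]] += p["pts"]

# Write the aggregate once; the website only ever reads this file
with open("totals.json", "w") as f:
    json.dump(totals, f)
print(dict(totals))  # {'Curry': 5, 'Thompson': 3}
```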
2) You can do “on-the-fly” aggregations using SQL or NoSQL queries, which essentially offloads calculations from your server to the database (which, if you recall, may or may not physically reside on the same computer as your server process). Most websites will do this to some extent. As a general rule of thumb, I’ve learned that it is usually a good idea to take advantage of database queries whenever you can, because database code is probably more optimized than your code and because databases are typically built to handle heavier loads than your server, so you’re effectively leaning on that inherent robustness and scalability. It will make your job easier!
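Offloading work to the database looks like this in miniature (SQLite here, with an invented schema; any SQL database handles the aggregation the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shots (player TEXT, made INTEGER)")
conn.executemany("INSERT INTO shots VALUES (?, ?)",
                 [("Curry", 1), ("Curry", 0), ("Curry", 1)])

# One query, aggregated by the database -- no raw rows shipped back to Python
pct = conn.execute("SELECT AVG(made) FROM shots WHERE player = ?",
                   ("Curry",)).fetchone()[0]
print(round(pct, 3))  # 0.667
```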
3) If your queries are somehow too complex to be performed on the database directly, you can do aggregation/computation on the server. This isn’t ideal, but sometimes it’s necessary. For example, you might be working with statistical or machine learning libraries that simply can’t hand off computation to the database.
4) Finally, you can push computation all the way out to the client, letting each user’s browser do the work in JavaScript. This scales nicely (every user brings their own CPU), but you’re limited by the user’s device and by how much data you’re willing to ship over the wire.
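As a toy example of server-side computation (point 3), here is a per-game offensive-rating calculation using only the standard library; the numbers are invented, and this is the kind of custom arithmetic that is awkward to express as a SQL query:

```python
import statistics

points = [102, 110, 96, 121]
possessions = [98, 101, 95, 104]

# Points per 100 possessions, game by game
ortg = [100 * p / q for p, q in zip(points, possessions)]
print(round(statistics.mean(ortg), 1))  # 107.6
```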
Build a client
Pick a framework. Any framework.
While much of the work that goes into building a website actually involves the data pipeline and server, it’s the client or front end that people see when they open your site in Chrome or Safari (or, heaven forbid, the poor lost souls still using IE), and for that reason it’s what most people tend to think of as “web development”.
Technically, React (developed at Facebook) is more of a library than a framework, since it’s focused primarily on creating the UI for a site. In fact, it can be used in tandem with Angular or similar frameworks. I mention React because it seems to have gained a huge groundswell of support over the past year or so, taking much of the mindshare of JS front-end development away from the other frameworks. When choosing a framework, ask yourself which one fits your needs and your programming style and sensibility. If I were to rebuild nbawowy today, I think I would lean towards React, partly because the “cool kids are using it” (which was definitely also true of Angular when I started working on wowy), but also because it is built on some pretty neat ideas, including a “virtual” DOM, and because there is the potential to use React Native to build iOS or Android mobile apps (which, to be clear, I haven’t even touched on in this article). Here’s a nice tutorial on the “ReactJS way” to get you started.
Of course, if you don’t want to deal with any of these frameworks, and you want to basically roll your own front end, you can simply use plain old HTML/CSS/JS with a few helper libraries, such as jQuery and underscore.js. There’s nothing wrong with that! Probably.
Choose your own style. As long as it looks like mine.
There’s definitely something to be said for doing it your way, but when it comes to websites, you don’t necessarily want to stray too far from the pack, especially if the focus of your site is data. You probably want users to focus on your numbers, not your font selection. Unless you plan on hiring web designers (who are experts in graphic design), I would highly recommend using a front-end CSS framework such as Bootstrap (developed several years ago at Twitter) or Foundation. These frameworks will enable you to create professional-looking websites with minimal effort, assuming you’re OK with your website looking “Bootstrappy” like every other website (and believe me, once you start using Bootstrap, you will notice how many other sites use it these days). If you want your site to look “Googly”, check out Google’s Material Design manifesto for their vision of good design practices.
Building out a dynamic website is not trivial, but it can be extremely rewarding. I have connected with countless people online and offline through a mutually shared interest in the data service that nbawowy provides. It’s not all butterflies and puppies, though. No matter how useful or technically impressive you think your site is (and believe me, I’ve been there), someone will always want more or, at least, want to tell you how to do it better. And you know what? Often, they are right. Some of the best advice I can give you is to listen carefully to people and try to put aside your ego. Chances are that if one person is telling you something, many more out there are thinking the same thing. In the end, though, you should consider yourself the number one user of your site. I built nbawowy because I felt I needed a tool that was missing in the analytics community. Even if I had never released it to the public, the site would have served me well these last few years, just to be able to do the research. If you approach building out your site with this mentality, I promise you can’t fail.