Nylon 101: Q&A with Emmanuel Perry on “Intro to R”


At Nylon Calculus, we receive frequent questions along the lines of “how do I get into basketball analytics?” The answer is almost always some form of “do some analytics!” Unfortunately, for many the ability to do so is restricted by the tools and skills they use to form the questions they wish to ‘ask’ of basketball data. Being “stuck in Excel” is often a huge limiting factor on turning the germ of an insight into a full-fledged idea. To that end, we usually suggest learning a programming language, the two most popular for this sort of work being R and Python.


Nylon Calculus: First of all, thanks for taking the time. Correct me if I’m wrong, but you’re self-taught in terms of coding and have learned as you went with creating Corsica?

Emmanuel Perry: I guess you could say that. I haven’t formally studied computer science, but I imagine that describes most programmers these days. I started learning R a year and a half ago because I was frustrated by the limitations Excel and other tools imposed on my ability to conduct the type of analysis I wanted. I had previously dabbled in Python but I found R was far more intuitive to me by comparison. Early on, I had developed a Euclidean distance tool that would later become the Similarity Calculator on Corsica. Sam Ventura, a co-creator of WAR On Ice *pours out liquor* (ed. note: WAR On Ice was unquestionably one of the top online resources for hockey analytics before the proprietors were hired by the Pittsburgh Penguins and Minnesota Wild respectively), reached out to me and mentioned he would be willing to host the tool if I turned it into a web app. That motivated me to learn Shiny, an R framework for application development, and expand my understanding of R in general.

Corsica was similarly monumental in my education. Initially, it was two things: a project for me to learn how to scrape and maintain my own SQL database, and a repository for Shiny apps I had built and planned to build. When WAR On Ice announced the site would go dark, Corsica evolved into a larger-scale project. It was another opportunity for me to hone my skills in an effort to disappoint the fewest people possible.
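(ed. note: For readers wondering what a "Euclidean distance tool" looks like, here is a minimal sketch of the general idea: treat each player's stat line as a point and rank other players by straight-line distance from it. It is written in Python purely for illustration. It is not the code behind Corsica's Similarity Calculator, and the player names, stat columns and values are all invented.)

```python
import math

# Hypothetical stat lines (goals, assists, shots per 60 minutes).
# Names and numbers are made up for illustration only.
players = {
    "Player A": (1.0, 1.5, 9.0),
    "Player B": (1.0, 1.5, 8.0),
    "Player C": (0.5, 1.0, 6.0),
    "Player D": (1.5, 0.5, 12.0),
}


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length stat lines."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, pool):
    """Return the other players' names, closest to `target` first."""
    others = [name for name in pool if name != target]
    return sorted(others, key=lambda name: euclidean_distance(pool[target], pool[name]))


ranking = most_similar("Player A", players)
print(ranking)
```

In practice the stats would usually be put on a common scale first (for example, converted to z-scores), so that a high-volume column such as shots does not dominate the distance.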


NC: You sort of answered my next question about how Corsica came about, so thanks. Moving on, you mentioned moving beyond Excel or other spreadsheet programs, which is a pretty big step for most hobbyists in terms of their ability to produce analytic insights. Can you describe some examples of the limitations you faced and how learning R has addressed those problems?

EP: I had trouble working with larger data sets. Too many times, I had Excel crash when trying to treat Play-By-Play data. I also never was very adept at writing macros for automation and it took me unreasonable amounts of time to perform fairly elementary operations. The ability to write and save scripts in R was a game-changer for me, as was the assurance that I could import and manage bigger data.

NC: In terms of the time savings you mentioned, how much time are we talking about?

EP: I’ve spent hours struggling to achieve things with Excel that I can now do in a matter of minutes.

NC: “R or Python” seems like a “Coke or Pepsi” type question for many. What about R is more appealing to you?

EP: I don’t know enough about Python to definitively say one is “better” than the other. Programming languages are not dissimilar to spoken languages, and it would be unfair to claim one is better than the other simply because you learned it first or never learned the other. For whatever reason, R “stuck” with me at the very beginning in a way Python didn’t. That I’m offering a course on R and its applications in hockey analytics is not an indictment of other alternatives. I will say that there seems to be a larger community of R users than Python users in the hockey world. (ed. note: the staff at Nylon Calc is about 50/50 in terms of preference between the two.) This has an added benefit when it comes to being able to ask others for help, collaborate or share code and data.

NC: Moving on to the course itself, how much of it is going to be focused on R syntax, how much on doing “hockey” things, and how much on the “mini-package” functions you’ve created?

EP: The current course being offered, “An Intro to R”, is very much that. I decided to make it separate from the course on the coRsica package I’m building because I wanted to make absolutely sure that those enrolling in the advanced course could obtain the prerequisite knowledge. I do focus on analyzing hockey data, but several of the earlier lessons represent introductory theory on R syntax, basic functions, object types and more. Once the fundamentals are covered, the focus turns to statistics and plotting, at which point real-world hockey data is introduced and used in examples.

NC: Obviously, your focus is on hockey, but a large proportion of the course will be applicable to general sports analytics, and certainly to basketball given the similarity in a lot of the underlying data, right?

EP: I like to think the material covered in the “An Intro to R” course has applications in a number of fields. Of course, due to the commonality between statistical analysis of various sports, the lessons and exercises are especially relevant to basketball analytics. The skills you acquire should translate seamlessly to data analysis in any number of sports.

NC: So what will somebody be able to “do” after this intro course?

EP: Someone who has completed the “An Intro to R” course will be able to import and export data from and to various sources and locations; trim and transform that data in anticipation of performing analysis; ascertain vital information about the data set and its elements; perform a variety of statistical methods such as predictive modelling, testing internal consistency and evaluating distributions; visualize the data with ggplot and/or produce interactive charts with Plotly and more.

NC: You said “trim and transform” data. We at Nylon Calculus are very well aware that the least fun, most time-consuming part of doing this sort of analysis is “cleaning” data and getting it into usable form. Can you give an example or two of the kinds of problems you might run into that can be laborious to address in Excel, but can be fixed quickly and easily in R?

EP: I’m the kind of person that likes checkpoints – to take a moment to evaluate where I am before moving on. Often, I create temporary objects between A and B to take stock of what remains to be done and ensure everything looks right. This type of approach is a sure way to crowd your worksheets and can do more harm than good if you’re working in Excel. R is much better equipped to create, view and destroy objects without muddying the water. The functions offered by R and its packages are designed for optimal efficiency and ease of use. Creating a summary of a large data table can be done in one line of code.
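(ed. note: To make the "checkpoint" idea concrete, here is a small sketch of summarizing one column of a table with a single function call and holding the result in a temporary object for inspection. It is written in Python with only the standard library, for illustration; it is not taken from the course, and the `summarize` helper, column names and shot data are invented. In R itself, the built-in `summary()` function plays this role.)

```python
import statistics

# Invented shot records, standing in for a much larger play-by-play table.
shots = [
    {"team": "HOME", "distance_ft": 12.0, "goal": 0},
    {"team": "AWAY", "distance_ft": 35.0, "goal": 0},
    {"team": "HOME", "distance_ft": 8.0, "goal": 1},
    {"team": "AWAY", "distance_ft": 50.0, "goal": 0},
]


def summarize(rows, column):
    """Basic descriptive statistics for one numeric column (hypothetical helper)."""
    values = [row[column] for row in rows]
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


# The "checkpoint": one call, stored in a temporary object we can inspect
# and then discard without touching the original table.
distance_summary = summarize(shots, "distance_ft")
print(distance_summary)
del distance_summary
```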

NC: What are some of the operations you’ve functionalized for the advanced course/package, and how do they apply to hockey analysis?

EP: Of note, the coRsica package has summarized the entire process used on Corsica to scrape and process raw data, compile a bevy of stats, add that data to a local database and condense many of the tables for easier and quicker access, in three simple functions. This means that anybody can reproduce the Corsica database on their own machine. In addition, there will be a number of helper functions to retrieve and summarize this data, perform some basic statistical tests like year-to-year repeatability and visualize data in Plotly.
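(ed. note: "Year-to-year repeatability" is often measured by correlating the same players' numbers in consecutive seasons. The sketch below shows that idea using a Pearson correlation. It is written in Python for illustration and is an assumption about the general technique, not the coRsica package's implementation; the function name, player names and values are invented.)

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented shot-share percentages for the same players in back-to-back seasons.
season_1 = {"Player A": 52.1, "Player B": 48.3, "Player C": 55.0, "Player D": 45.9}
season_2 = {"Player A": 51.4, "Player B": 49.0, "Player C": 53.2, "Player E": 50.0}

# Only players who appear in both seasons can be compared.
common = sorted(set(season_1) & set(season_2))
r = pearson_r([season_1[p] for p in common], [season_2[p] for p in common])
print(common, round(r, 3))
```

A value of r near 1 suggests the stat carries over from one season to the next; a value near 0 suggests it mostly does not. With only a handful of players, as here, the result is far too noisy to mean anything.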

 

NC: Actually collecting the raw data is a big stumbling block for people just getting started. Obviously, part of that is knowing “where” on the internet to find the data you need. But the more daunting part for many relative novices, once the data has been located, is getting it into an environment where it can be manipulated. Cutting and pasting can be inexact and time-consuming! Will this intro course cover some tools and techniques for scraping?

EP: That’s not something I’ll cover in the introductory course. I feel this is a skill usually learned by intermediate or advanced programmers, and I don’t want to distract beginners from the importance of building a solid foundation. I also greatly facilitate the scraping of NHL game data with the functions contained in the coRsica package. As such, it’s a lesson I decided to include in the advanced course.

NC: Do you think the functions from the coRsica package are somewhat easily modified or adapted to a basketball or other sporting context?

EP: The coRsica package is very specific to hockey, but there are meta functions that can be used in any context.

NC: Last, and probably most important question; will someone with zero programming experience be able to keep up with the course?

EP: Yes! The intro course is designed to guide you through the very basics of R, starting with installation. It’s not necessary or expected that you have any previous knowledge of computer programming.