Update 04/28/2008 – I uncovered a bug in the parser script that was causing the nightly update to fail for all dates after the 10th of the month. Grab the new ZIP file for the fix. You’ll also need to run the parser manually for every date from April 10 onward. If you need any help, send a comment or an email.
The hot area of study in baseball today is detailed pitch analysis made possible by data from the PITCHf/x system. Analysts like John Walsh at the Hardball Times, Dan Fox from Baseball Prospectus and Mike Fast (among many others) are producing some amazing research on identifying pitch types, the consistency of release points and many other topics that were impossible to study before having the detailed PITCHf/x data. Mike Fast provides a running catalog of PITCHf/x studies at his FastBalls blog.
Last year, Mike also provided his method for capturing the PITCHf/x data and storing it in a relational database. He details the steps needed to download the XML from MLB, to parse it, and to write it to a MySQL database. These instructions are a great way to get started in downloading and analyzing the data, but there were a few areas for improvement I saw – namely, the process as described requires you to manually run the programs, and there’s no easy way to tie the PITCHf/x data to the play-by-play data from Retrosheet.
Let me take this in reverse order. Tying the data to Retrosheet is important if you want to pull in any information that’s not captured in the PITCHf/x data. In my case, I’m interested in the pitcher/catcher relationship, and that’s not explicitly available from PITCHf/x. But creating the relationship to Retrosheet isn’t necessarily that easy. First off, PITCHf/x data can be downloaded nightly throughout the season, whereas Retrosheet releases a complete season during the winter. Because of the time lapse, you need to anticipate what the Retrosheet data will look like while parsing the PITCHf/x data.
Mike provides both a spider to download the PITCHf/x information and a parser to transform the data and store it in a database. Both are written in Perl and are based on the Baseball Hacks book by Joseph Adler. The spider does exactly what I want, so that’s unchanged. However, I did need to make changes to the parser. Since I’m not great at Perl, I wrote a new parser from scratch in Python. The parser code can be found here and here. Don’t worry about downloading them now; I’ll provide a ZIP file at the end of the post that contains the whole package of code. The parser takes the PITCHf/x data and builds a Retrosheet-like game id and event number. Storing these forced me to change Mike’s database structure as well. A SQL script for creating the new structure can be found here. As with Mike’s setup, this is a MySQL database and everything he talks about still applies.
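To give a feel for what a “Retrosheet-like game id” means, here is a minimal sketch. It is not the code from the ZIP file. It assumes the MLB game directory name looks like gid_2007_10_01_sdnmlb_colmlb_1 (away team, then home team, then a game number) and that the three-letter MLB team code simply upper-cases to the Retrosheet code. The function name retrosheet_game_id and the doubleheader flag are made up for illustration. As I understand the Retrosheet format, the id ends in 0 for a single game and 1 or 2 for the games of a doubleheader, and the MLB name alone doesn’t say which case you’re in, so the caller has to supply that.

```python
import re

# Assumed layout: gid_YYYY_MM_DD_<away><league>_<home><league>_<game number>
GID_PATTERN = re.compile(
    r"^gid_(\d{4})_(\d{2})_(\d{2})_([a-z]{3})[a-z]{3}_([a-z]{3})[a-z]{3}_(\d)$"
)


def retrosheet_game_id(gid, doubleheader=False):
    """Build a Retrosheet-style id (home team + date + game digit) from an MLB gid."""
    match = GID_PATTERN.match(gid)
    if match is None:
        raise ValueError("unrecognized game id: %r" % gid)
    year, month, day, _away, home, game_number = match.groups()
    # Single games end in 0; doubleheader games keep their 1 or 2.
    suffix = game_number if doubleheader else "0"
    return home.upper() + year + month + day + suffix


print(retrosheet_game_id("gid_2007_10_01_sdnmlb_colmlb_1"))
```

Because the id can be built from the directory name alone, it can be stored the night the game is played, months before Retrosheet publishes the season.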
I found my parser to match up very, very well with the Retrosheet data from 2007. The only discrepancy I uncovered was the strange behavior of the PITCHf/x data missing the very last at-bat of the game. If it were only a handful of times, I would shrug it off, but it appears to have happened over 400 times, or roughly once every six games. I’ve confirmed that it’s not my code – the XML files really are missing the last at-bat. I can’t explain why it happens so frequently, but hopefully it’s something that will be resolved this season. I am aware of a few issues with my parser. Just like Mike, I don’t handle mid-at-bat pitching changes well. I will also be missing the pitches from a partial plate appearance, which happens when a runner makes the third out on the bases before the batter finishes. This is because I use the Retrosheet event number as part of the unique identifier for pitches, and when the plate appearance is partial there is no Retrosheet event number. I don’t think I’m missing anything else major, but please let me know if you find anything.
The setup as described by Mike requires Perl and MySQL. I’m adding Python (and some libraries) to that list. I’m not going to rewrite how to set up Perl or MySQL – Mike does a very good job of explaining what’s needed there. I will share how to get Python going though.
First, download and install the Python language interpreter. Next, download and install the EasyInstall package. This will make your life a whole lot easier going forward when you try to install other packages. We’re not really going to be using the full power of EasyInstall, but if you’re going to be doing more with Python you should really understand how it works. Finally, download the mysql-python file, which contains the code that allows you to connect to your database. You can download it in a variety of packaging formats. Personally, I’ve had good luck using the .egg format, but feel free to experiment with the others if you want. If you’ve downloaded the .egg format, go to where you’ve installed EasyInstall. For me, this was C:\Python25\Scripts. Run easy_install.exe, pointing it to where you downloaded the mysql-python .egg file. For example:
easy_install.exe C:\Download\MySQL_python-1.2.2-py2.4-win32.egg
You’ve now got everything you need to run the new scripts, so let’s talk about how they work.
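Before moving on, it’s worth checking that Python can actually see the database library. The mysql-python package installs a module named MySQLdb. The sketch below is written for a current Python 3, and the helper name module_available is made up for illustration; on an older interpreter, simply typing import MySQLdb at the Python prompt and seeing no error tells you the same thing.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if module_available("MySQLdb"):
    print("MySQLdb is installed")
else:
    print("MySQLdb was not found; re-run the install step")
```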
As I mentioned above, I use Mike’s spider software, so I’m not going to go into details about that. I will talk about the parser though. Open up a command window and navigate to where you downloaded the scripts. Type python pitchfxparser.py -h. This should give you some instructions on how you can use the script. Basically, you need to provide a location that represents the top-level directory for the PITCHf/x files – mine is C:\Baseball\pitchfx\games. You can also specify which dates to parse by adding arguments for year, month and day. If you don’t provide any date arguments it will only parse yesterday’s information.
Let’s look at a couple of examples. Say you wanted to parse the entire 2007 season. You would use the following command:
python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007
If you wanted to only parse the games in October of 2007, you would use this command:
python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007 -m 10
If you only wanted the games from October 1, 2007, use this command:
python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007 -m 10 -d 1
And finally, if you only wanted the games from yesterday (whatever date yesterday turns out to be), just use this command:
python pitchfxparser.py -l "C:\Baseball\pitchfx\games"
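To make the date options concrete, here is a minimal sketch of how -l, -y, -m and -d could be read and expanded into a list of dates. It is not the code from pitchfxparser.py. The names dates_to_parse and build_arg_parser are made up for illustration, and treating a month given without a year as the current year is my assumption rather than documented behavior.

```python
import argparse
import calendar
from datetime import date, timedelta


def dates_to_parse(year=None, month=None, day=None, today=None):
    """Expand optional year/month/day arguments into a list of dates."""
    today = today or date.today()
    if year is None and month is None and day is None:
        # No date arguments at all: only yesterday.
        return [today - timedelta(days=1)]
    if day is not None and month is None:
        raise ValueError("a day needs a month to go with it")
    year = year if year is not None else today.year  # assumption
    months = [month] if month is not None else range(1, 13)
    dates = []
    for m in months:
        last_day = calendar.monthrange(year, m)[1]
        days = [day] if day is not None else range(1, last_day + 1)
        dates.extend(date(year, m, d) for d in days)
    return dates


def build_arg_parser():
    """Command-line options mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Parse PITCHf/x files")
    parser.add_argument("-l", dest="location", required=True,
                        help="top-level directory of the PITCHf/x files")
    parser.add_argument("-y", dest="year", type=int)
    parser.add_argument("-m", dest="month", type=int)
    parser.add_argument("-d", dest="day", type=int)
    return parser
```

With this sketch, -y 2007 expands to every day of 2007, -y 2007 -m 10 to the 31 days of October, and -y 2007 -m 10 -d 1 to a single date.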
Getting a Nightly Update
The last piece of the puzzle is setting the scripts to automatically run every night. I’m going to provide the instructions on how to do this for Windows. For those of you running Linux (or if there are any BSD or OpenSolaris users out there), you’ll want to look into cron jobs.
The first thing you need to do is create a Windows batch (.bat) file that will run both scripts in order. I’ve already written one, but it’s a very easy thing to do. The really important thing is to make sure you have your directories identified correctly. In my batch file, I assume everything is in the same directory as the batch file itself.
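If you’d rather not write a batch file, the same “run the spider, then the parser” idea can be sketched in Python. This is only an illustration, not the batch file from the ZIP. The spider file name spider.pl is a placeholder for whatever your copy of Mike’s spider is called, and the helper run_in_order is made up. Stopping as soon as one step fails is my own design choice, so that the parser never runs against a download that didn’t finish.

```python
import subprocess

# Placeholder commands: "spider.pl" is a hypothetical name for the spider script.
NIGHTLY_COMMANDS = [
    ["perl", "spider.pl"],
    ["python", "pitchfxparser.py", "-l", r"C:\Baseball\pitchfx\games"],
]


def run_in_order(commands, runner=subprocess.call):
    """Run each command in turn and stop at the first non-zero exit code."""
    exit_codes = []
    for command in commands:
        code = runner(command)
        exit_codes.append(code)
        if code != 0:
            break  # skip the remaining steps after a failure
    return exit_codes
```

Calling run_in_order(NIGHTLY_COMMANDS) from the directory that holds both scripts would run the download first and the parse second, returning the exit code of each step that ran.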
Next, you’ll create a Windows scheduled task. Go to your Windows Control Panel (the full or advanced version) and click on “Scheduled Tasks” followed by “Add Scheduled Task.” You should see the opening dialog of the scheduled task wizard.
Click “Next”. The next screen will ask you to select a program to schedule. Click on “Browse” and select the batch file you created. Then click “Next”. You should see a screen asking how frequently you’d like to perform the task.
Select “Daily” and then click “Next”. After this you’ll be asked what time you want to run the program. Remember, the parser is set up to parse the previous day’s results, so you’ll want to run it after midnight. I use 8:00 AM EST since I don’t necessarily know what time the West Coast night games are going to end and this seemed safe enough. Hopefully I don’t need to mention that it needs to be a time when your computer is turned on.
Enter whatever time you want, make sure the task will run every day and choose a start date. If you want, you can wait until Opening Day, but spring training games are currently available. Click “Next”.
Now you reach the really critical part. This is where you enter your Windows user name and password. If you do not provide a password, the task will not run. If you’re like me and don’t have a password set up to log into Windows when it starts, you’ll need to set one up. Look through Windows Help or on Microsoft’s site if you need more information.
Click “Next” and you’ll be shown a success screen confirming that the task has been scheduled.
Congratulations, you have successfully set your computer up to automatically download and parse the PITCHf/x data every day.
Now you’re ready to set out analyzing the data. I’ll provide the link to Mike’s wonderful library of PITCHf/x resources again, in case you’re looking for some help on what it all means.
Pitchfx.zip - a zip file containing the database definition file, both parsers and a sample batch file