How to link PITCHf/x to Retrosheet

Update 04/28/2008: I uncovered a bug in the parser script that was causing the nightly update to fail for all dates after the 10th of the month. Grab the new ZIP file for the fix. You’ll also need to run the parser manually for each date from April 10 onward. If you need any help, send a comment or an email.

The hot area of study in baseball today is detailed pitch analysis made possible by data from the PITCHf/x system. Analysts like John Walsh at the Hardball Times, Dan Fox from Baseball Prospectus and Mike Fast (among many others) are producing some amazing research on identifying pitch types, the consistency of release points and many other topics that were impossible to study before having the detailed PITCHf/x data. Mike Fast provides a running catalog of PITCHf/x studies at his FastBalls blog.

Background

Last year, Mike also provided his method for capturing the PITCHf/x data and storing it in a relational database. He details the steps needed to download the XML from MLB, to parse it, and to write it to a MySQL database. These instructions are a great way to get started downloading and analyzing the data, but I saw two areas for improvement: the process as described requires you to run the programs manually, and there’s no easy way to tie the PITCHf/x data to the play-by-play data from Retrosheet.

Let me take these in reverse order. Tying the data to Retrosheet is important if you want to pull in any information that’s not captured in the PITCHf/x data. In my case, I’m interested in the pitcher/catcher relationship, and that’s not explicitly available from PITCHf/x. But creating the relationship to Retrosheet isn’t necessarily easy. PITCHf/x data can be downloaded nightly throughout the season, whereas Retrosheet releases a complete season during the winter. Because of that time lag, you need to anticipate what the Retrosheet data will look like while parsing the PITCHf/x data.

Mike provides both a spider to download the PITCHf/x information and a parser to transform the data and store it in a database. Both are written in Perl and are based on the Baseball Hacks book by Joseph Adler. The spider does exactly what I want, so that’s unchanged. However, I did need to make changes to the parser. Since I’m not great at Perl, I started from scratch on a parser written in Python. The parser code can be found here and here. Don’t worry about downloading them now; I’ll provide a ZIP file at the end of the post that contains the whole package of code. The parser takes the PITCHf/x data and builds a Retrosheet-like game id and event number. Storing these forced me to change Mike’s database structure as well. A SQL script for creating the new structure can be found here. As with Mike’s setup, this is a MySQL database and everything he talks about still applies.
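
To make the game id idea concrete, here is a minimal sketch. It is not the code from the parser itself. It assumes the downloaded game directories are named like gid_2007_10_01_sdnmlb_colmlb_1 (away team first, home team second), that a Retrosheet game id is the home team code followed by the date and a game number (0 for a single game, 1 or 2 for the games of a doubleheader), and that upper-casing the PITCHf/x team code gives the Retrosheet code. The function name, the TEAM_CODE_OVERRIDES table and the is_doubleheader flag are invented for this example.

```python
import re

# Hypothetical lookup for any team whose code does not simply upper-case
# to its Retrosheet code. Left empty here; fill in if you find mismatches.
TEAM_CODE_OVERRIDES = {}

# Assumed directory name layout: gid_YYYY_MM_DD_<away>mlb_<home>mlb_<game number>
GID_PATTERN = re.compile(
    r"^gid_(\d{4})_(\d{2})_(\d{2})_([a-z]{3})[a-z]{3}_([a-z]{3})[a-z]{3}_(\d)$"
)


def retrosheet_game_id(gid, is_doubleheader=False):
    """Build a Retrosheet-style game id from a PITCHf/x game directory name."""
    match = GID_PATTERN.match(gid)
    if match is None:
        raise ValueError("unrecognized game directory name: %r" % gid)
    year, month, day, _away, home, game_number = match.groups()
    home_code = TEAM_CODE_OVERRIDES.get(home, home.upper())
    # Retrosheet uses 0 for a single game and 1 or 2 within a doubleheader.
    suffix = game_number if is_doubleheader else "0"
    return home_code + year + month + day + suffix


print(retrosheet_game_id("gid_2007_10_01_sdnmlb_colmlb_1"))
```

Note that a single directory name cannot tell you whether the game was part of a doubleheader, which is why the sketch takes that as an argument. In practice you would check whether a second game exists for the same teams on the same date.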

I found my parser to match up very, very well with the Retrosheet data from 2007. The only discrepancy I uncovered was the strange behavior of the PITCHf/x data missing the very last at-bat of the game. If it had happened only a handful of times, I would shrug it off, but it appears to have happened over 400 times, or roughly once every six games. I’ve confirmed that it’s not my code: the XML files really are missing the last at-bat. I can’t explain why it happens so frequently, but hopefully it’s something that will be resolved this season.

I am aware of a few issues with my parser. Just like Mike, I don’t handle mid-at-bat pitching changes well. I will also miss the pitches from a partial plate appearance, which happens when a runner makes the third out on the bases before the batter finishes. This is because I use the Retrosheet event number as part of the unique identifier for pitches, and when the plate appearance is partial there is no Retrosheet event number. I don’t think I’m missing anything else major, but please let me know if you find anything.
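
The partial plate appearance limitation is easier to see with a small sketch. This is only an illustration of the idea, assuming a pitch is identified by the game id, the event number and the pitch's position within the event. The function and argument names are hypothetical and do not come from the parser or the database script.

```python
def pitch_key(game_id, event_number, pitch_sequence):
    """Return a (game_id, event_number, pitch_sequence) key, or None.

    None means the pitch cannot be stored, because there is no
    Retrosheet event to attach it to.
    """
    if event_number is None:
        # Partial plate appearance: no Retrosheet event number exists.
        return None
    return (game_id, event_number, pitch_sequence)


# A pitch from a completed plate appearance gets a usable key...
print(pitch_key("COL200710010", 12, 3))
# ...while a pitch from a partial plate appearance does not.
print(pitch_key("COL200710010", None, 1))
```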

Software Needs

The setup as described by Mike requires Perl and MySQL. I’m adding Python (and some libraries) to that list. I’m not going to rewrite how to set up Perl or MySQL, since Mike does a very good job of explaining what’s needed there. I will share how to get Python going though.

First, download and install the Python language interpreter. Next, download and install the EasyInstall package. This will make your life a whole lot easier when you install other packages later. We’re not really going to be using the full power of EasyInstall, but if you’re going to be doing more with Python you should really understand how it works. Finally, download the mysql-python file, which contains the code that allows you to connect to your database. It is available in a variety of packaging formats. Personally, I’ve had good luck using the .egg format, but feel free to experiment with the others if you want. Whichever you choose, pick the build that matches your installed Python version. If you’ve downloaded the .egg format, go to where you’ve installed EasyInstall. For me, this was C:\Python25\Scripts. Run easy_install.exe, pointing it to where you downloaded the mysql-python .egg file. For example:

easy_install.exe C:\Download\MySQL_python-1.2.2-py2.4-win32.egg

You’ve now got everything you need to run the new scripts, so let’s talk about how they work.
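
If you want to confirm the install worked before going further, a quick check along these lines should do it. This is just a sketch: MySQLdb is the module name that mysql-python installs, and the helper function name is made up for this example.

```python
def mysqldb_available():
    """Return True if the MySQLdb module (from mysql-python) can be imported."""
    try:
        import MySQLdb  # imported only to test that it is installed
    except ImportError:
        return False
    return True


print("MySQLdb installed: %s" % mysqldb_available())
```

If this reports False, the usual suspect is a package built for a different Python version than the one you are running.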

The Scripts

As I mentioned above, I use Mike’s spider software, so I’m not going to go into details about that. I will talk about the parser though. Open up a command window and navigate to where you downloaded the scripts. Type python pitchfxparser.py -h. This should give you some instructions on how you can use the script. Basically, you need to provide a location that represents the top-level directory for the PITCHf/x files – mine is C:\Baseball\pitchfx\games. You can also specify which dates to parse by adding arguments for year, month and day. If you don’t provide any date arguments it will only parse yesterday’s information.

Let’s look at a couple of examples. Say you wanted to parse the entire 2007 season. You would use the following command:

python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007

If you wanted to only parse the games in October of 2007, you would use this command:

python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007 -m 10

If you only wanted the games from October 1, 2007, use this command:

python pitchfxparser.py -l "C:\Baseball\pitchfx\games" -y 2007 -m 10 -d 1

And finally, if you only wanted the games from yesterday (whatever date yesterday turns out to be), just use this command:

python pitchfxparser.py -l "C:\Baseball\pitchfx\games"
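
The date handling in the examples above could be sketched roughly as follows. This is not the code from pitchfxparser.py. It reuses the -l, -y, -m and -d flags shown above, but the use of argparse, the function names, and the choice to default a missing year to the current year are all assumptions made for illustration.

```python
import argparse
import calendar
import datetime


def build_parser():
    """Command-line options mirroring the flags used in the examples."""
    parser = argparse.ArgumentParser(description="Select PITCHf/x dates to parse.")
    parser.add_argument("-l", dest="location", required=True,
                        help="top-level directory of the PITCHf/x files")
    parser.add_argument("-y", dest="year", type=int, help="year to parse")
    parser.add_argument("-m", dest="month", type=int, help="month to parse")
    parser.add_argument("-d", dest="day", type=int, help="day to parse")
    return parser


def dates_to_parse(year=None, month=None, day=None, today=None):
    """Expand the optional year/month/day arguments into a list of dates."""
    today = today or datetime.date.today()
    if year is None and month is None and day is None:
        # No date arguments at all: parse yesterday only.
        return [today - datetime.timedelta(days=1)]
    year = today.year if year is None else year  # assumption: default to this year
    months = range(1, 13) if month is None else [month]
    dates = []
    for m in months:
        last_day = calendar.monthrange(year, m)[1]
        days = range(1, last_day + 1) if day is None else [day]
        dates.extend(datetime.date(year, m, d) for d in days if d <= last_day)
    return dates


args = build_parser().parse_args(["-l", r"C:\Baseball\pitchfx\games", "-y", "2007", "-m", "10"])
print(len(dates_to_parse(args.year, args.month, args.day)))  # 31 dates in October 2007
```

Passing today as an argument makes the "yesterday" case easy to check without waiting for the calendar to change.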

Getting a Nightly Update

The last piece of the puzzle is setting the scripts to automatically run every night. I’m going to provide the instructions on how to do this for Windows. For those of you running Linux (or if there are any BSD or OpenSolaris users out there), you’ll want to look into cron jobs.

The first thing you need to do is create a Windows batch (.bat) file that will run both scripts in order. I’ve included one in the ZIP file, but it’s very easy to write your own. The really important thing is to make sure you have your directories identified correctly. In my batch file, I assume everything is in the same directory as the batch file itself.
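
If you would rather not use a batch file (for example, if you are going the cron route), the same "run the spider, then the parser" sequence can be sketched in Python. This is an illustration only, not the contents of the sample batch file. The spider file name spider.pl is a placeholder for whatever Mike's spider is called on your machine, and stopping when a step fails is a design choice made for this sketch.

```python
import subprocess
import sys


def nightly_commands(games_dir, spider_script="spider.pl"):
    """Commands to run each night, in order.

    "spider.pl" is a placeholder name for the Perl spider script.
    """
    return [
        ["perl", spider_script],
        [sys.executable, "pitchfxparser.py", "-l", games_dir],
    ]


def run_in_order(commands, cwd=None):
    """Run each command in turn; stop and return the exit code if one fails."""
    for command in commands:
        exit_code = subprocess.call(command, cwd=cwd)
        if exit_code != 0:
            return exit_code
    return 0


# Example call (not run here), with cwd set to the directory holding the scripts:
# run_in_order(nightly_commands(r"C:\Baseball\pitchfx\games"), cwd=r"C:\Baseball")
```

Stopping on the first failure means the parser never runs against a partial download.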

Next, you’ll create a Windows scheduled task. Go to your Windows Control Panel (the full or advanced version) and click on “Scheduled Tasks” followed by “Add Scheduled Task.” You should see a dialog that looks like this:

Scheduled Task - Screen 1

Click “Next”. The next screen will ask you to select a program to schedule. Click on “Browse” and select the batch file you created. Then click “Next”. You should see a screen asking how frequently you’d like to perform the task.

Scheduled Task - Screen 2

Select “Daily” and then click “Next”. After this you’ll be asked what time you want to run the program. Remember, the parser is set up to parse the previous day’s results, so you’ll want to run it after midnight. I use 8:00 AM EST since I don’t necessarily know what time the West Coast night games are going to end and this seemed safe enough. Hopefully I don’t need to mention that it needs to be a time when your computer is turned on.

Scheduled Task - Screen 3

Enter whatever time you want, make sure the task will run every day and choose a start date. If you want, you can wait until Opening Day, but spring training games are currently available. Click “Next”.

Now you reach the really critical part. This is where you enter your Windows user name and password. If you do not provide a password, the task will not run. If you’re like me and don’t have a password set up to log into Windows when it starts, you’ll need to set one up. Look through Windows Help or on Microsoft’s site if you need more information.

Scheduled Task - Screen 4

Click “Next” and you’ll be shown a success screen which should look something like this:

Scheduled Task - Screen 5

Congratulations, you have successfully set your computer up to automatically download and parse the PITCHf/x data every day.

Now you’re ready to set out analyzing the data. I’ll provide the link to Mike’s wonderful library of PITCHf/x resources again, in case you’re looking for some help on what it all means.

Resources

Pitchfx.zip - a zip file containing the database definition file, both parsers and a sample batch file

Comments

  1. Hi, I’m in the process of building the database like you described but I’m getting an error when I try to start parsing (both in IDLE and also the command line)

    SyntaxError: EOL while scanning single-quoted string

    Any advice on how to fix this?

    Comment by Corey Dawkins — June 10, 2008 #

  2. [...] Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.) Dan Turkenkopf has a great parser that links the Gameday XML files up with the appropriate Retrosheet IDs. A few SQL queries later, and bingo, data! I filtered for all ground balls and divided them up by [...]

    Pingback by Can we measure a fielder’s range? « MySportsScoop.com — March 6, 2009 #

  3. Thanks for making these scripts public!

    After setting up the pitchfx database, I run the python script and get a 1049 error: unknown database ‘pitchfx’. When I launch mysql and do “show databases;”, pitchfx is there along with all the necessary tables.

    And if I create a different connection to a different database with the MySQLdb call, it works fine (so the problem is not a bad host name or environmental variable, it seems).

    Any ideas on what’s going on here?

    Comment by Dan — August 13, 2009 #

  4. Hi Dan,

    I don’t know if the hardcoded username/password near the top of the pitchfxparser.py file might be causing your problem?

    If that’s not it, drop me an email and I’ll see if I can help some more.

    Comment by Dan Turkenkopf — August 13, 2009 #
