Thursday, October 3, 2013

The Sacrifice Bunt

It all started so simply--just another night of watching TV on my computer and keeping a casual eye on Twitter with the iPad. I wish I had captured the tweet but I didn't--it came from MLB Network's Brian Kenny (@MrBrianKenny) making a passing reference to the ridiculousness of the sacrifice bunt. I'm always on the prowl for something to write about, and this was right up my alley--investigate some "common knowledge" and see if the data backs it up. 

I amassed data going back to the 2000 season, more than 2.5 million lines in all. As such, I know the following:
1. How many bunts (not just sacrifice bunts) there were
2. The inning, score and number of outs
3. The batting order position and the INNING batting order position of the bunter
4. The runners on base
5. Most importantly, how often the sacrifice bunt was ABSOLUTELY VITAL to scoring a run
I'll illustrate point 5 with two examples from 2013--the first is from a Rangers-Tigers game in May:
The Rangers were leading the Tigers 9-7 and Leonys Martin led off the bottom of the 7th with a double. Leury Garcia (the #9 hitter), for reasons that are so beyond my comprehension that I'm not sure where to begin, laid down a sacrifice bunt. The play-by-play data suggests Detroit attempted to make a play on Martin and were unsuccessful, leaving everyone safe. Elvis Andrus (the #1 hitter) grounded out but advanced Garcia to 2nd (a productive out). David Murphy sacrificed to drive in Martin, and Garcia scored on a single by Lance Berkman.

Was Garcia's bunt necessary? NO--subsequent events would have brought home Martin if the same scenario had played out. This isn't the best example because Garcia's bunt did allow him to reach base, but this is my strict standard--a sacrifice has to put a base runner in position to score, he must score AND no further runs are scored in the inning. This will be my primary measure, because the underlying premise of the sacrifice bunt is that trading an out is worth the scoring of a run--the value of the run outweighs the cost of an out. In this case the Rangers got lucky--they didn't get an out, and they were able to score more than one run.

Contrast that with this play from a game between St. Louis and San Diego in July:

I apologize for not expanding the column wide enough to show the pitcher was Jason Marquis. The game was tied 0-0 as the Cardinals came to bat in the bottom of the 3rd. Jake Westbrook led off with a single and advanced to 2nd on a walk to Matt Carpenter. Jon Jay (the #2 hitter, who had a .250 average and a .671 OPS at that time) laid down a successful sacrifice, moving the runners to 2nd and 3rd. Carlos Beltran drove in Westbrook with a sac fly and Allen Craig popped out to end the inning. The bunt was ABSOLUTELY ESSENTIAL in scoring that run, since Westbrook would not have been driven in without being moved to 3rd.
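The strict standard can be written down as a tiny predicate. This is only a sketch: the function name and its two inputs are hypothetical, and it assumes "put a base runner in position to score" is already captured by knowing whether the advanced runner came home.

```python
def is_essential_sacrifice(advanced_runner_scored: bool, runs_after_bunt: int) -> bool:
    """Strict standard: the advanced runner scores AND it is the only run after the bunt."""
    return advanced_runner_scored and runs_after_bunt == 1


# Rangers-Tigers, May 2013: Martin scored, but Garcia scored as well (2 runs after the bunt).
rangers = is_essential_sacrifice(advanced_runner_scored=True, runs_after_bunt=2)

# Cardinals-Padres, July 2013: Westbrook scored on the sac fly and nothing else followed.
cardinals = is_essential_sacrifice(advanced_runner_scored=True, runs_after_bunt=1)

print(rangers, cardinals)
```

Under these assumptions the Rangers bunt fails the test and the Cardinals bunt passes it, matching the two verdicts above.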

This is where the problems begin to creep in. There is no argument that in this situation it was essential for Jay to advance the runners--consider how he could have done that AND how often he had done each so far in 2013:
1. Hit a home run, and drive everyone in--at that point he had hit 5 home runs in 368 PA, or in around 1.4% of his plate appearances
2. Hit a triple and drive everyone in--he'd hit 1 (.2%)
3. Hit a double and drive in one or both runners (both runners were driven from 1st and 2nd 47.7% of the time in 2013. It happened 40.7% of the time the double was hit to left, 65% when hit to center and 42% when hit to right). He had 12 (3.3%)
4. Hit a single and drive in Westbrook--he had 62 (16.8%)
5. Get on base via walk (32, 8.7%), HBP (8, 2.2%) or reach on an error (6, 1.6%)
6. Make a productive out and advance Westbrook to 3rd--he had 63 opportunities to make a productive out in 2013 and did so 22 times (34.9%). I don't have actual numbers, but chances are the 34.9% number was somewhat constant throughout the season.
7. Make an out and not advance the runners
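The percentages in this list are simple count-over-plate-appearance ratios, and they can be reproduced in a few lines. The counts below are the ones quoted in the list; the variable names are made up for illustration, and the last digit may differ slightly from the quoted figures depending on rounding versus truncation.

```python
# Jon Jay's 2013 counts as quoted in the list above (through the July game).
PA, AB = 368, 320
counts = {"HR": 5, "3B": 1, "2B": 12, "1B": 62, "BB": 32, "HBP": 8, "ROE": 6}

# Per-plate-appearance rate for each way of reaching base.
rates = {event: n / PA for event, n in counts.items()}

hits = sum(counts[e] for e in ("HR", "3B", "2B", "1B"))
hit_rate_per_pa = hits / PA        # denominator: plate appearances
batting_average = hits / AB        # denominator: at-bats
productive_out_rate = 22 / 63      # productive outs per opportunity

for event, rate in rates.items():
    print(f"{event:>3}: {rate:.1%}")
print(f"hits per PA: {hit_rate_per_pa:.1%}, batting average: {batting_average:.3f}")
print(f"productive outs: {productive_out_rate:.1%}")
```

The gap between hits per PA and batting average comes entirely from the choice of denominator.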

When you add up the hit percentages you'll notice that they don't add up to his batting average--that's because I'm using plate appearances (368) instead of at-bats (320). This table summarizes the above numbers:


When all this is taken together, it suggests that Jon Jay had a 37.8% chance of advancing the runner, a 25% chance (that would be his batting average) of getting a hit and a 33.1% chance (his OBP) of reaching base. In 2013 the batter reached base on 72 of the 1,383 sacrifice bunts (5.2%), like Garcia did in the first example, meaning that 94.8% of sacrifice attempts were indeed that--sacrifices.

Jon Jay at this point in the season had a 65.7% chance of making an out (productive and non-productive outs added together), vs. the 94.8% chance if sacrificing. In other words, he might make an out if he was swinging away, but almost certainly WOULD make an out if sacrificing. If all these numbers are turning your brain into jelly, let me re-state my main point:

A hitter MIGHT make an out if he's swinging away 
He almost certainly WILL if sacrificing
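For anyone who wants to check the arithmetic, here is a short sketch of the comparison. The counts and the 65.7% figure are taken straight from the text; the constant names are invented for the example.

```python
SAC_BUNTS_2013 = 1383   # sacrifice bunts in 2013, as quoted above
REACHED_ON_SAC = 72     # bunts where the batter reached base anyway
OUT_SWINGING = 0.657    # Jay's chance of making an out when swinging away, as quoted

# Chance that a sacrifice bunt really does cost an out.
out_sacrificing = 1 - REACHED_ON_SAC / SAC_BUNTS_2013

# Extra out risk taken on by bunting instead of swinging.
extra_out_risk = out_sacrificing - OUT_SWINGING

print(f"out when sacrificing: {out_sacrificing:.1%}")
print(f"added out risk:       {extra_out_risk:.1%}")
```

On these numbers, bunting raises the chance of an out by roughly 29 percentage points.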

As I was amassing the data I was beginning to get confused as to how best to present it (probably very similar to how you're feeling right now), and I had a vague notion that the Tom Tango Run Expectancy Index would give me some insight. I consider that index to be so important that it has its own link on the side of this blog, and it was after looking at it for the 10,000th time I realized that the index measures precisely these types of events--how many runs score AFTER something happens, and if that can be determined, the true value of an out can be placed in context. I did this at least two years ago with regard to stolen bases (I'd link to the post but it's extremely out-of-date--I'll update it before the beginning of the 2014 season). Regarding sacrifice bunts, I asked a simple question:

How often does a sacrifice bunt lead to a run? You'll find the answer surprising:


This chart shows the number of runs that score in an inning AFTER a sacrifice bunt is made. Runs prior to the sacrifice are irrelevant as the sacrifice had nothing to do with them scoring. In 2013, 23.4% of sacrifice bunts accomplished exactly what they were supposed to: advance a base runner who subsequently scored. Another 59.1% just gave away an out, and 17.5% ended up making an unnecessary out (not that an out might not have been made anyway). I must state that in this analysis I was EXTREMELY GENEROUS--I counted the sacrifice a success if only one run scored, regardless of the type of hit that drove in the run. For example, a runner on 1st is sacrificed over to 2nd and the next hitter hits a triple--that runner would have scored regardless of the sacrifice. I LEFT THAT IN, so the 23.4% of sacrifice bunts being absolutely essential in allowing a run to score is the absolute high-water mark. If I were to adopt more stringent standards, that number would be lower. These numbers are very constant over the 2000-2012 period as well.
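The three-way split can be tallied with a simple classifier on runs scored after the bunt. This is a sketch under one assumption that should be stated loudly: it reads "gave away an out" as zero runs after the bunt and "unnecessary out" as two or more runs. The sample list is hypothetical, not real 2013 data.

```python
from collections import Counter


def classify(runs_after_bunt: int) -> str:
    """Bucket a sacrifice bunt by the runs scored in the inning after it."""
    if runs_after_bunt == 0:
        return "gave away an out"
    if runs_after_bunt == 1:
        return "success (exactly one run)"
    return "unnecessary out (2+ runs)"


# Hypothetical sample: runs scored after each of ten sacrifice bunts.
sample = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3]

tally = Counter(classify(r) for r in sample)
shares = {label: n / len(sample) for label, n in tally.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Run against a full season of play-by-play rows instead of the toy list, the same tally would produce the kind of percentages quoted above.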

I never made myself clear at the beginning--I have ABSOLUTELY NO PROBLEM with bunting to get on base. These are the batting averages of players who bunted with the bases empty since 2000:

Works for me. I just needed to be 100% clear that bunting isn't the issue as much as sacrifice bunting. I'm not categorically stating that the sacrifice bunt is bad strategy--in the modern era (at least as far back as 2000), the bunt (any bunt, not just the sacrifice) is used significantly more often in the #9 slot:

I didn't stratify this data by league or by sacrifice vs. non-sacrifice situation, but it's clear the thinking on sacrifice bunting has progressed to a point where it is utilized in the situation in which the tradeoff between an out and a successful outcome is the smallest, i.e., with the pitcher at bat. I won't argue against this unless it involves Travis Wood or other good-hitting pitchers. As I accumulate further data (I'm back to 1998) it will be very interesting to see if this trend persists.

I'll finish with one pet peeve and one request. The pet peeve is simple--the awarding of the sacrifice bunt is very capricious, completely at the whim of the official scorer. Consider these two lines:

What's the difference? Eric Young was credited with a sacrifice hit and Gerardo Parra was not. Nothing about the data gives me any clue as to why this subjective judgment was made--in Parra's game it was close and it was a potential sacrifice situation. I suspect in the scorer's mind Parra wasn't attempting to sacrifice but was trying to reach base via a bunt, and why not? He's in the top 40 in terms of bunt hits, so he has the ability, but how did the scorer KNOW? 

As I entered game data throughout the year I noticed the at-bats didn't always agree with the box score even though I have checks for sacrifice hits. To give some insight into how often this happens, Baseball-Reference made a change in 2013 in their play-by-play descriptions, no longer using the term "Sacrifice" in the play description like they had in years prior. I had to manually clean up the 2013 data and remove around 100 bunts that looked like sacrifices (runners on base, 0 or 1 out) but weren't--and I have no idea why they weren't scored as sacrifice bunts.

VERY IMPORTANT--PLEASE READ
My last point is a very simple one--I have lots of data on this issue, but no clue if people care to read about it. I'll write more on this topic if I see interest. If there are enough readers of this post, I'll cover some of the other sacrifice bunt data I uncovered, and it's voluminous. I can tell you more about bunt doubles, sacrifice attempts when up or down by more than 10 runs (???) and a bunch of other stuff I find interesting but I'm not sure others will.

I would also greatly appreciate it if someone could get the Joes, Sheehan (@joe_sheehan) and Posnanski (@JPosnanski) to read this, since both have written and commented on this subject recently and I'd be very interested in what they thought if they were so inclined to take the time.

DATA AND DATA COLLECTION
I had a head start as I have play-by-play data going back to 2009, giving me 5 complete seasons with which to work, but the database had become indwelt by Satan as the 2013 season came to a close. I use Excel because I'm an idiot and haven't been able to teach myself database programs like MySQL, and this particular database was showing me the practical limits of Excel as it had around 960,000 lines of data, each line of which had around 75 columns of data, or a total of...A LOT of data. I used to be a pharmaceutical sales representative for GlaxoSmithKline, and I received my first company computer in 1994, a Mac with a hard drive of (wait for it) 150 MB. My primary database was pushing 700 MB until I pared it back to make it manageable again. It was becoming unstable, I couldn't update the Mistake Index because the tables were groaning under all those calculations, and I'm sure I've bored you silly by this point anyway.

So I found a new way to amass the data--using the Event Finder feature of Baseball-Reference's Play Index ($36 a year, you won't regret it) I was able to ALMOST replicate these databases and actually expand them. This is NOT as complete as grabbing play-by-play data on a daily basis, but I was able to amass a season's worth of play-by-play data, approximately 185,000 lines of data in about 45 minutes. Do the math--a VERY complete play-by-play database that essentially takes 162 man-hours to compile (it takes me an hour to process a typical 15-game slate of games), or a PRETTY GOOD one that takes 45 minutes. Guess which one won out?
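Doing the math explicitly, as a rough sketch: the 162-hour figure is assumed here to come from dividing a full season's games into 15-game slates at an hour apiece, which is an inference from the numbers given rather than something spelled out above. All names are illustrative.

```python
TEAMS, GAMES_PER_TEAM = 30, 162
games_per_season = TEAMS * GAMES_PER_TEAM // 2   # each game involves two teams

SLATE_SIZE = 15          # games in a typical daily slate
HOURS_PER_SLATE = 1.0    # processing time quoted in the post

manual_hours = games_per_season / SLATE_SIZE * HOURS_PER_SLATE
event_finder_hours = 45 / 60

speedup = manual_hours / event_finder_hours
lines_per_minute = 185_000 / 45

print(f"manual: {manual_hours:.0f} h, Event Finder: {event_finder_hours:.2f} h")
print(f"speedup: {speedup:.0f}x, about {lines_per_minute:,.0f} lines per minute")
```

That works out to a speedup of a couple of hundred times, which is why the PRETTY GOOD database won.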

In the Event Finder I chose "Plate Appearances" and then selected the tab which shows all the plate appearances for a year. I've honed my text-grabbing techniques so I can now get an entire season of data for all 30 teams in around 45 minutes, around 25 of which are waiting for either my computer or the B-R site to respond. My computer is 3 years old but has 4 GB of RAM and I use a relatively fast Internet connection, but when you're entering a request for around 6,000 lines of data, I suppose a little lag is a small price to pay. Besides, the other Reference sites (Pro-Football Reference, Basketball-Reference and Hockey-Reference) only show 100 lines of data at a time and don't give the number of instances there are of a given event.

For all you young Excel database creators, this formula will immediately leap to the top of your favorites if you do any kind of data collection that involves text:
=IF(ISNUMBER(FIND("text you want to find",A1)),1,"")
Assume this is in cell A1:
Bunt Groundout:C-2B (Front of Home); Duda to 2B
If you use the formula (note that FIND is case-sensitive):
=IF(ISNUMBER(FIND("Bunt",A1)),1,"")
Excel will put a 1 in your bunt column, which is how I tabulate just about every event I measure. I discovered this formula about a year or so ago and it expanded the power and capability of my database in ways that I can't describe. However, by utilizing this source of data vs. the data I entered manually (and I use the term "manually" loosely), I sacrifice some accuracy. This is 2013 data for Oakland:
The first line of data comes from deep inside B-R split data and the second is from gleaning data from the Event Finder. Let me make 100% clear that there is NOT an issue with B-R data, only with my methods of obtaining it and cleaning it up. These types of plays amount to around 40 of almost 2,800 bunt occasions, or around 1.5%. I'll live with this level of error instead of trying to truly dig in but I thought it only fair to make it known. If the day comes when I have to defend this while undergoing my orals to obtain my PhD in advanced baseball metrics and Ken Harrelson is part of the review panel, I'll make a concerted effort to clean this up. For now, the amount of time it would take to find those 40 mistakes is far out of proportion, and in the end doesn't affect my primary measure, runs scored after a sacrifice.
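For readers who work outside Excel, the text-flagging trick described earlier translates almost word for word into Python. This is a loose equivalent rather than a replacement for the spreadsheet: the `flag` helper is hypothetical, and the error-rate line uses 2,800 as a stand-in for "almost 2,800".

```python
def flag(text: str, needle: str):
    """Return 1 if needle appears in text, else an empty string.

    Like Excel's FIND, the `in` test is case-sensitive.
    """
    return 1 if needle in text else ""


play = "Bunt Groundout:C-2B (Front of Home); Duda to 2B"

bunt = flag(play, "Bunt")          # matches: goes in the bunt column as 1
lowercase = flag(play, "bunt")     # no match, because of case sensitivity

# Rough share of bunt rows that were misclassified, per the figures above.
error_rate = 40 / 2800

print(repr(bunt), repr(lowercase), f"{error_rate:.1%}")
```

The case sensitivity is worth remembering in either tool, since a play description that spells the event differently will slip past the flag.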

In case it's not obvious, these databases I created have virtually EVERYTHING that has happened on a baseball field since 1998. They do NOT include:
1. Stolen bases, caught stealing or pickoffs
2. Wild pitches, passed balls or balks
3. The information that comes along with pitches in the box score play-by-play data (like throws to 1st, the runner breaking for 2nd, etc.)
4. Runs scored. Consider the following:

Mark Trumbo drove in Erick Aybar on a groundout--my database will give Trumbo credit for the RBI but it WON'T give Aybar credit for the run. There are stunningly complicated ways in which I could correct this but for the purposes of these databases, it's more work than I would receive in value.

My primary database DOES have all this, making it far more valuable to me when I research arcane issues like who made the most throws to 1st (answer--can't remember). I can't replicate the Mistake Index with these databases, because I've adapted my primary database to measure pitching stats as well. That's why the big database is better--it takes more work, but has more information. The tradeoff is that sometimes all that information isn't necessary, just MOST of it, and that's where these databases can be handy.

If you've made it this far and are familiar with advanced database programs, have I got a proposition for you. Email me (at the top of the blog) and we'll see if we can work something out--I'll get the data, you "manage" it and teach me how to perform queries, and you'll have access to data that few have...and it can include the NFL, NBA, college football, college basketball and hockey.

1 comment :

  1. I notice in the league season totals and averages that sacrifices were at an all time low in 2013.
