Thursday, May 2, 2013

Fun With Linear Regression

I was playing around with one of my datasets and decided to combine my MBA statistics class with some curiosity and see what happened. I'm sure the post title alone piqued your interest--it only gets BETTER!

I started out looking at data values from 2000-2012 for all baseball teams. This table shows the r-squared, or the coefficient of determination. Every time I go to the explanatory Wikipedia article to be sure that I'm using the measure correctly, I get more confused, but I'll go with this quote:
R2 = 1 indicates that the fitted model explains all variability in y, while R2 = 0 indicates no 'linear' relationship (for straight line regression, this means that the straight line model is a constant line (slope = 0, intercept = ȳ) between the response variable and regressors). An interior value such as R2 = 0.7 may be interpreted as follows: "Seventy percent of the variation in the response variable can be explained by the explanatory variables. The remaining thirty percent can be attributed to unknown, lurking variables or inherent variability."
I did some bush-league scattergraph plots on numbers by themselves and Excel gave me these values:
I compared everything against a team's winning percentage. As an example, I'll show the runs scatterplot:

Don't worry about the odd shape of the chart, the important thing is that the axes are aligned to show as best as possible any linear (or lack thereof) relationship. This is illustrated by the blue square, showing my attempt to make sure the scales match up.

If we accept the r-squared value, it suggests that runs scored explain 28.2% of the variation in a team's winning percent, a real shock to baseball fans around the world.
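That single-variable r-squared is easy to compute without Excel. Here's a quick Python sketch--the runs and winning-percent numbers below are made up for illustration, not pulled from my actual 2000-2012 dataset:

```python
import numpy as np

# Hypothetical (runs scored, winning percent) pairs -- illustrative
# values only, not the real 2000-2012 team data.
runs = np.array([690.0, 735.0, 820.0, 640.0, 770.0, 705.0, 860.0, 615.0])
win_pct = np.array([0.460, 0.495, 0.540, 0.420, 0.510, 0.480, 0.580, 0.410])

# Fit a straight line win_pct = m*runs + b by least squares.
m, b = np.polyfit(runs, win_pct, 1)
predicted = m * runs + b

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((win_pct - predicted) ** 2)
ss_tot = np.sum((win_pct - win_pct.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```

For a single predictor, this comes out identical to squaring the ordinary correlation coefficient, which is what Excel reports on a trendline.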

It was at this point I went off the deep end. So far, I've only mentioned offensive variables, leaving out pitching altogether. I wasn't up for figuring out the r-squared values for pitching, so I turned to another trick--multiple linear regression. I added in a bunch of pitching data and told ye olde computer to fit a regression line using (or rejecting) these values:
Offense--Hits, Home Runs, Stolen Bases, Walks, Strikeouts, GDP, Batting Average, On-Base Percent, Slugging and OPS
Pitching--the OPPONENT values of the above except for SB and GDP, and adding WHIP, H/9, BB/9 and HR/9
So, 22 variables for the program to choose from, and it ultimately used nine of them--take a moment to consider what those were.

A brief aside to allow you more time to consider my question above. My computer isn't exactly a Cray supercomputer, but it's not some ancient piece of crap either. It's about 3 years old, has 4 GB of RAM and can get things done. The bottleneck isn't my computer, it's what I ask it to do. For example, my two primary baseball databases take about 5 minutes to open--that's not a typo. The play-by-play database has 800,000 lines and about 75 columns of data per line, so when I work with that mass of data, well, it takes a while. If I were smart I'd learn MySQL, but I've already bored you to death by this point.

Variable 1--WHIP. I use a set of macros that came with my MBA statistics class, and they took about 10 minutes to crunch this data. According to the output, WHIP alone accounted for 42.9% of the explanation. Offense comes and goes, but if a team can keep its opponents off base, it can win low-scoring games. The rest of the values are shown in this table:

Pretty obvious when you think about it--WHIP very effectively measures how well a team pitches and OPS is a useful measure of how effective its offensive output was. It's not exactly Pythagorean Wins, but it's walking down the path. In case it isn't obvious, the measures with an "o" in front refer to the opponent's values. In a sense there's some overlap, in that an opponent's hits are already accounted for in WHIP, but the difference it makes is minuscule.
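Pythagorean Wins, for anyone unfamiliar, is Bill James' estimate of expected winning percent from runs scored and runs allowed. It's a one-liner--the 780/700 team below is invented, and the exponent of 2 is the classic choice (about 1.83 is a common refinement):

```python
# Bill James' Pythagorean expectation:
# expected win% = RS^e / (RS^e + RA^e), with e = 2 classically.
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Hypothetical team: 780 runs scored, 700 allowed.
print(round(pythagorean_win_pct(780, 700), 3))  # → 0.554
```

The regression above arrives at a similar place by a very different route: instead of assuming a formula, it lets the data pick the inputs.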

What does all this mean (if I've calculated and presented this all correctly, and those are REALLY BIG IFs)? The "Amount" column is the r-squared value, so taken together, these nine variables account for 82.5% of how a team attained its win percentage. This chart shows it graphically:

To see if this is accurate, I'll pull a few teams and see how well they match up--I'll leave out the math:

Not too shabby. I chose teams that ranged from awful to pretty darn good to check for bias in the data. The values in yellow are what the regression model calculated as the expected winning percent, and the actual winning percent is the second-to-bottom column. Any model that can get within approximately 5% of reality probably has value.
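For anyone curious how the variable-picking step might work under the hood, here's a greedy forward-selection sketch in Python. I don't actually know what algorithm my macros use internally, and every number below is invented--but the idea is the same: keep adding whichever candidate raises r-squared the most, and stop when the improvement is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for three of the 22 candidate columns;
# names and numbers are made up, not the real 2000-2012 dataset.
n = 60
ops    = rng.normal(0.750, 0.030, n)   # team OPS
o_whip = rng.normal(1.350, 0.080, n)   # opponent WHIP
junk   = rng.normal(0.000, 1.000, n)   # an irrelevant column
win_pct = (0.500 + 1.2 * (ops - 0.750)
                 - 0.8 * (o_whip - 1.350)
                 + rng.normal(0.0, 0.02, n))

candidates = {"OPS": ops, "oWHIP": o_whip, "junk": junk}

def r_squared(columns, y):
    """R-squared of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Greedy forward selection: repeatedly add the candidate that raises
# R-squared the most, stopping when the gain is negligible.
chosen, used = [], []
while candidates:
    best = max(candidates, key=lambda k: r_squared(used + [candidates[k]], win_pct))
    gain = r_squared(used + [candidates[best]], win_pct) - r_squared(used, win_pct)
    if gain < 0.01:
        break
    used.append(candidates.pop(best))
    chosen.append(best)

print(chosen)  # the kept variables, in the order they were added
```

With data built this way, the junk column gets rejected and the two real predictors survive--which is exactly the behavior I'm trusting the macros to have with my 22 candidates.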

I have two daughters, one a recent college graduate and the other close to finishing. As we visited schools, I always checked the business section of the bookstores to see if there were any decent statistics books that included software, because what I use is pretty touchy and not exactly what I would consider intuitive. I also instructed my daughters to date men who took statistics classes so I could glean their textbooks, but oddly enough, that didn't occur. My point with this is that these macros did in 10 minutes what would have taken me HOURS in my undergrad days--linear regression isn't difficult as much as tedious.

I'm willing to share this small data set with anyone who wants to test it in their own ways, since I would very much like this to be double-checked for accuracy. It certainly supports the crazy notion that teams that score more runs than they give up have more success, but it helps isolate what is truly important. I'm probably going to run this analysis in some different ways, but I highly doubt the outcomes will be all that different--it's been true since organized baseball began in 1871.

One last chart--I couldn't resist. I wanted to see how well WAR (using Baseball-Reference values) correlated with winning--the r-squared measure suggests that the sum of a team's hitters' and pitchers' WAR can explain about 71.7% of the variation in winning percent:

So much for those new-fangled baseball metrics--they don't tell us ANYTHING.
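That WAR check is the easiest one to reproduce: with a single predictor, r-squared is just the squared correlation. A sketch with invented numbers (not the actual Baseball-Reference values):

```python
import numpy as np

# Hypothetical (total team WAR, winning percent) pairs --
# illustrative values, not real Baseball-Reference data.
war = np.array([12.4, 18.0, 25.5, 28.3, 31.0, 35.7, 40.2, 45.1])
win_pct = np.array([0.400, 0.420, 0.470, 0.490, 0.500, 0.540, 0.560, 0.600])

# For one predictor, R-squared equals the squared correlation.
r = np.corrcoef(war, win_pct)[0, 1]
print(round(r ** 2, 3))
```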

1 comment:

  1. Did you look at adjusted R-square to compare the regression models with multiple explanatory variables? I'm curious if the variables in your model are really contributing anything or if you are just getting a spurious fit to the data.
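For anyone following along with the comment above: adjusted r-squared scales plain r-squared by the number of predictors, so a variable that contributes nothing tends to lower it rather than raise it. A quick sketch--the 390 team-seasons below is my rough count for 30 teams over 2000-2012, and the 0.825 is the combined figure from the post:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors.

    Penalizes R-squared for each added predictor, so a useless
    variable tends to lower it rather than raise it.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Roughly 390 team-seasons (30 teams x 13 years), 9 predictors,
# plain R-squared of 0.825:
print(round(adjusted_r_squared(0.825, 390, 9), 3))  # → 0.821
```

With that many observations and only nine predictors, the penalty is small--the sample size, not the formula, is what makes the fit believable.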