Take the Baseball Data Analysis Challenge with Vertica Machine Learning

Posted April 7, 2022 by Jim Harris, Vertica Senior Product Marketing Manager

Today is Opening Day of the 2022 Major League Baseball (MLB) season. I was born in Boston, Massachusetts, and raised as a fan of the Boston Red Sox, who play at historic Fenway Park, about six miles from our Vertica headquarters in Cambridge.

Baseball data is mostly transaction data describing the statistical events of games played. Statistical analysis has been a beloved pastime even longer than baseball has been America’s Pastime. Number-crunching is far more than just a quantitative exercise in counting. The qualitative component of statistics – discerning what the numbers mean, analyzing them to discover predictive patterns and trends – is the very basis of data-driven decision making. And watching the Red Sox round the bases from an early age became the basis of my life-long fascination with data.

There may be no crying in baseball, but there is an awful lot of data analytics. At least there has been since the powerful paradigm shift pioneered by Bill James, the baseball writer, historian, statistician, and former longtime Red Sox consultant. James was the founder of Sabermetrics, which was made famous by Michael Lewis's best-selling 2003 book Moneyball: The Art of Winning an Unfair Game, the basis for the 2011 Academy Award-nominated film starring Brad Pitt and Jonah Hill. Interesting fact – Brad Pitt played Billy Beane of the Oakland Athletics, then general manager and now executive vice president of baseball operations, who was a keynote speaker at our very first user conference.

As part of my role as Senior Product Marketing Manager for Vertica, I create demonstrations and tutorials about unified, end-to-end, in-database data analytics, data science, and machine learning. A few weeks ago, I announced a Baseball Data Analysis Challenge on my personal blog. I shared an input dataset containing six years (2016-2021) of Red Sox regular season game results, including a Game_Result column labeled 0 for a loss and 1 for a win. I invited everyone to use whatever techniques and tools they like to discover insights and make predictions.

I completed my initial work in time for Opening Day, and you can find the results in this Microsoft Excel file: Baseball Data Analysis Challenge 2022-04-05.xlsx. I performed my baseball data analysis using Vertica’s in-database machine learning capabilities, and you can find my SQL scripts on GitHub.

I used logistic regression classification models to calculate win probabilities for the Red Sox across nine game metrics: opponent, opponent’s division, month of year, day of week, runs scored, hits, extra base hits, home runs, and walks versus strikeouts. I also used the input data to train a Naïve Bayes classification model that predicts wins and losses, each with an associated probability, from five of those metrics: runs scored, hits, extra base hits, home runs, and walks versus strikeouts (all of which are binned ranges of input data values). Its initial accuracy is only 77%, but I plan on making some adjustments. I also plan on using the 2022 baseball season as my test data. So not only will I be watching how many games the Red Sox win or lose this season, but I will also be watching how many games my machine learning model predicts correctly.
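For anyone who wants to try the idea outside the database, here is a minimal Python sketch of a Naïve Bayes classifier over binned game metrics. It is not my Vertica SQL, and everything specific in it is an assumption made for illustration: the handful of game rows are invented, the bin edges in bin_runs are hypothetical, and only two of the five metrics are used. Only the Game_Result column name and its 0/1 labels come from the challenge dataset.

```python
from collections import Counter, defaultdict

def bin_runs(runs):
    """Hypothetical bin edges for runs scored (not the ones used in the challenge)."""
    if runs <= 2:
        return "0-2"
    if runs <= 5:
        return "3-5"
    return "6+"

def train_naive_bayes(rows, label_key="Game_Result"):
    """Count class frequencies and per-class frequencies of each binned value."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (feature, label) -> Counter of values
    feature_values = defaultdict(set)    # feature -> distinct values seen
    for row in rows:
        label = row[label_key]
        class_counts[label] += 1
        for feature, value in row.items():
            if feature == label_key:
                continue
            value_counts[(feature, label)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values

def predict_proba(model, features):
    """Return {label: probability} using Laplace (add-one) smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for feature, value in features.items():
            k = len(feature_values[feature])
            score *= (value_counts[(feature, label)][value] + 1) / (n + k)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Invented sample games: (runs scored, home runs, Game_Result)
raw_games = [
    (1, 0, 0), (2, 0, 0), (0, 0, 0), (3, 0, 0), (4, 1, 0),
    (4, 1, 1), (5, 1, 1), (7, 2, 1), (9, 3, 1), (6, 0, 1),
]
rows = [
    {"runs_bin": bin_runs(r), "hr_bin": "0" if hr == 0 else "1+", "Game_Result": res}
    for r, hr, res in raw_games
]

model = train_naive_bayes(rows)

# Win probability for a game with 6+ runs and at least one home run
probs = predict_proba(model, {"runs_bin": "6+", "hr_bin": "1+"})

# Accuracy on the training rows: share of games whose predicted label matches
correct = 0
for row in rows:
    features = {k: v for k, v in row.items() if k != "Game_Result"}
    p = predict_proba(model, features)
    if max(p, key=p.get) == row["Game_Result"]:
        correct += 1
accuracy = correct / len(rows)

print(f"P(win | 6+ runs, 1+ HR) = {probs[1]:.3f}")
print(f"Training accuracy = {accuracy:.0%}")
```

Accuracy measured on the same rows used for training tends to flatter a model, which is why scoring predictions against a season the model has never seen is the more honest test.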

Think you can best my model? Game on! The baseball data analysis challenge continues. Play ball!