KBO Model Day 10: Beginnings of a Bottom-up Model

Back on the horse! After just a shade under 2 weeks of not placing wagers, I bet on 4 games last night. Went 2-2, all dogs, but finished up on the day with an ROI of 25%. I sized the wagers manually instead of using the robo-recommended bet sizes, and of course it’s a small sample size, but I’m excited nonetheless. Even the “bad” earlier model is starting to claw its way back. Note, these results are paper-traded, meaning I didn’t place these bets; I’m just tracking what would have happened if the model got its way. One more note: I also discovered my line tracking was flawed. I’ll come back to that later.

Original KBO Model Performance to Date


A rollercoaster.

So why did I resume? Well, it’s certainly not because the imaginary blue line started to go up again. I built a rough bottom-up model instead of relying on the Fangraphs ZiPS projections and random WAR from some Korean site. I tried to reverse engineer the datasets I used for last year’s MLB model, using a combination of scraping, the Python pandas data analysis package, and old-fashioned spreadsheets.

I wanted to rebaseline my win expectancy for each team. I still don’t have the ability to automatically grab lineups, so I used this year’s stats to date to determine the primary starters for each team. I simply took the player with the most plate appearances at each position to build a default starting lineup. Next, I looked at 2019 stats for those players, using the league average for a starter for anyone who had no matching 2019 record. I took the number of games played in 2019 per starter and scaled the stats up as if they had played all 144 games. I then aggregated their hitting stats to build an expected number of baseruns scored.
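Here is a minimal pandas sketch of the three steps above: picking a default lineup, scaling 2019 counting stats to 144 games, and turning a stat line into baseruns. The column names (`team`, `pos`, `player`, `pa`, `g`, and so on) and function names are hypothetical, not the ones in my actual sheets, and the BaseRuns formula is one commonly cited simple variant, which may differ from the exact version I ended up with.

```python
import pandas as pd

SEASON_GAMES = 144  # full-season length used in the post


def default_lineup(current: pd.DataFrame) -> pd.DataFrame:
    """Per team and position, keep the player with the most plate appearances."""
    idx = current.groupby(["team", "pos"])["pa"].idxmax()
    return current.loc[idx, ["team", "pos", "player"]].reset_index(drop=True)


def scale_to_full_season(prior: pd.DataFrame, counting_cols: list) -> pd.DataFrame:
    """Scale counting stats as if each player had appeared in every game."""
    scaled = prior.copy()
    factor = SEASON_GAMES / scaled["g"]
    scaled[counting_cols] = scaled[counting_cols].astype(float).mul(factor, axis=0)
    return scaled


def base_runs(ab, h, tb, hr, bb):
    """Simple BaseRuns variant (an assumption): A * B / (B + C) + D."""
    a = h + bb - hr                                       # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02   # advancement
    c = ab - h                                            # outs
    d = hr                                                # guaranteed runs
    return a * b / (b + c) + d


# Made-up rows, for illustration only.
current = pd.DataFrame({
    "team": ["A", "A", "A"],
    "pos": ["SS", "SS", "1B"],
    "player": ["Kim", "Lee", "Park"],
    "pa": [120, 40, 110],
})
lineup = default_lineup(current)

prior = pd.DataFrame({
    "player": ["Kim"], "g": [72], "ab": [250], "h": [75],
    "tb": [125], "hr": [10], "bb": [25],
})
full = scale_to_full_season(prior, ["ab", "h", "tb", "hr", "bb"])
kim = full.iloc[0]
kim_bsr = base_runs(kim["ab"], kim["h"], kim["tb"], kim["hr"], kim["bb"])
```

Summing the scaled stat lines across a lineup before calling `base_runs` would give a team-level figure. Note that rows with a blank position drop out of the `groupby`, which is exactly the kind of thing that can hide a DH or a mislabeled catcher.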

Next, I tackled pitchers. I flagged as starters any pitchers who’ve started 75% or more of the games they’ve appeared in. Again using 2019 stats (and league averages as defaults), I figured out the expected baseruns allowed by just the relievers, and divided that by their innings pitched to get a per-IP baseruns allowed number for each relieving staff. I’ll come back to why in a minute.
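The starter flag and the bullpen rate could look roughly like this in pandas. The column names (`g`, `gs`, `ip`, `bsr_allowed`) are hypothetical, and I'm assuming innings are already stored as true decimals rather than the box-score `.1`/`.2` notation for thirds of an inning.

```python
import pandas as pd

STARTER_SHARE = 0.75  # started at least 75% of appearances


def flag_starters(pitchers: pd.DataFrame) -> pd.DataFrame:
    """Add an is_starter column based on games started / games appeared."""
    out = pitchers.copy()
    out["is_starter"] = out["gs"] / out["g"] >= STARTER_SHARE
    return out


def bullpen_bsr_per_ip(flagged: pd.DataFrame) -> pd.Series:
    """Per team: relievers' total baseruns allowed divided by their total innings."""
    relievers = flagged[~flagged["is_starter"]]
    totals = relievers.groupby("team")[["bsr_allowed", "ip"]].sum()
    return totals["bsr_allowed"] / totals["ip"]


# Made-up rows, for illustration only.
pitchers = pd.DataFrame({
    "team": ["A", "A", "A"],
    "player": ["P1", "P2", "P3"],
    "g": [28, 50, 40],
    "gs": [28, 0, 5],
    "ip": [170.0, 55.0, 65.0],
    "bsr_allowed": [80.0, 30.0, 30.0],
})
flagged = flag_starters(pitchers)
pen_rate = bullpen_bsr_per_ip(flagged)
```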

Next I built full season projections of baseruns allowed based on expected starts and relief innings. I normalized the runs allowed for the league to the runs scored from the offensive numbers, then compared this with each offense’s expected runs scored to get a run differential. I used pythag expectations based on run differential to get a winning percentage baseline for each team.
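A rough sketch of the normalization and the pythag step follows. The function names are hypothetical, and the exponent is an assumption: 2 is the classic value and something near 1.83 is a common refinement, so it is left as a parameter.

```python
def normalize_runs_allowed(runs_scored: dict, runs_allowed: dict) -> dict:
    """Rescale every team's runs allowed so the league total matches runs scored."""
    factor = sum(runs_scored.values()) / sum(runs_allowed.values())
    return {team: ra * factor for team, ra in runs_allowed.items()}


def pythag_win_pct(rs: float, ra: float, exponent: float = 1.83) -> float:
    """Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)


# Made-up season totals, for illustration only.
scored = {"A": 720.0, "B": 680.0}
allowed = {"A": 650.0, "B": 700.0}

allowed_norm = normalize_runs_allowed(scored, allowed)
baseline = {t: pythag_win_pct(scored[t], allowed_norm[t]) for t in scored}
```

The normalization matters because every run scored in the league is also a run allowed, so hitting and pitching projections built separately need to be forced back onto the same scale before the differential means anything.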

I then fleshed out baseruns allowed expectations for every pitcher as if they started every game. I used their average number of innings per start, then multiplied the relievers’ per-IP baseruns allowed by the remaining innings up to 9. This gives me an expected runs allowed for each pitcher who may start a game. I compared this individualized baseruns allowed number with the offense’s expected runs scored to calculate a new run differential and winning expectation.
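In code, the per-starter number is just the starter's innings at his own rate plus the bullpen covering the rest of a 9-inning game. This is a sketch with hypothetical parameter names; it ignores extra innings and games where the home team doesn't pitch the 9th.

```python
def season_runs_allowed_with_starter(
    starter_bsr_per_ip: float,
    starter_ip_per_start: float,
    bullpen_bsr_per_ip: float,
    games: int = 144,
    innings: float = 9.0,
) -> float:
    """Expected season runs allowed if this pitcher started every game."""
    bullpen_ip = max(innings - starter_ip_per_start, 0.0)
    per_game = (
        starter_bsr_per_ip * starter_ip_per_start
        + bullpen_bsr_per_ip * bullpen_ip
    )
    return per_game * games


# Made-up rates: a starter averaging 6 IP in front of a slightly worse bullpen.
ace_runs = season_runs_allowed_with_starter(0.5, 6.0, 0.6)
```

Here the starter contributes 3.0 runs per game and the bullpen 1.8 over the last 3 innings, so the team projects to about 4.8 runs allowed per game with him on the mound.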

Now my model looks at each pitcher and gets a full season win expectancy for that starter, the team’s bullpen, and the default starting lineup. With this in place, I placed my first four wagers in a couple of weeks (there was no play on one game).

In the course of this work, I discovered another flaw elsewhere in my process that I fixed this morning. When I went to place the wagers last night, I noticed a couple of lines in the market didn’t match my sheet. I’m scraping and logging new lines every 10 minutes, and the model should grab the latest line. However, I discovered that I was only checking whether a line had never been seen before, not whether it differed from the most recently logged line. So if a line kept fluctuating between the same prices, I’d only display the latest unique one rather than the current one. I pushed the fix for that this morning.
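To make the bug concrete, here is a stripped-down sketch of the two checks. The function names and the list-based history are hypothetical stand-ins for my actual logging.

```python
def log_if_unique(history: list, price: int) -> None:
    """The flawed check: only log prices that have never been seen before."""
    if price not in history:
        history.append(price)


def log_if_changed(history: list, price: int) -> None:
    """The fixed check: log whenever the price differs from the latest logged one."""
    if not history or history[-1] != price:
        history.append(price)


# A line that moves from -110 to -105 and back again across three scrapes.
scrapes = [-110, -105, -110]

buggy, fixed = [], []
for price in scrapes:
    log_if_unique(buggy, price)
    log_if_changed(fixed, price)

# buggy ends on -105 (stale); fixed ends on -110 (the current market price).
```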

My starting lineup projections are also not perfect, and I’d like to address that today. There are a couple of issues:

  1. For some reason I don’t see catchers in the lineups. I think that they are there, but their position is blank for some reason. There are legit blank positions which represent DHs. Edit: Yep, my scraper was forgetting to label catchers’ positions. Fixed manually in the last mile spreadsheet as well as the scraper code.
  2. Because I’m looking at total plate appearances, 6 of the 10 lineups have 10 starters instead of 9, so I’m overstating the runs they should produce. This should be an easy fix. Edit: more manual work. In one instance, my master database builder missed a guy completely because he had yet to play in the field. For the rest, they were tagged for their non-DH positions as backups.
  3. Most importantly, I’m not taking into account injuries, of which there seem to have been a lot lately. I need to validate that the starters I’m using can actually play in games right now. Ideally I’d tune each one for the starting lineups for each game, but that’s not possible right now.

Lastly, learning from last year, I’d like to account for what has happened this year and weight it proportionally against my full season expectation.
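I haven’t built this yet, but one simple way to do it would be a linear blend where this year’s results get a weight equal to the share of the season already played. That weighting scheme and the names below are assumptions for illustration, not a settled design.

```python
SEASON_GAMES = 144


def blended_win_pct(observed: float, projected: float, games_played: int) -> float:
    """Weight this season's results by the fraction of the schedule completed."""
    weight = min(max(games_played / SEASON_GAMES, 0.0), 1.0)
    return weight * observed + (1.0 - weight) * projected


# A team playing .600 ball through 36 games against a .500 preseason baseline.
blend = blended_win_pct(observed=0.600, projected=0.500, games_played=36)
```

With a quarter of the season played, the blend moves a quarter of the way from the projection toward the observed record.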