I’ve taken a few days off of writing, but not the grind. Am I too invested in the KBO?
Yup, starting to get ads in Korean. Anyway, no results to post today. Not because there haven’t been any results, it’s just that the results are bad. End of Update.
So what’s wrong? Well I’ve spent the last few days trying to:
- Improve the model
- Figure out what is going on
Some improvements I’ve made:
- I’m now scraping the lines every 10 minutes, and updating the sheet when there are changes
- I’m factoring the starting pitchers into win calculations
- I’m messaging myself on Slack to alert me to new line changes
I wish this was a bit more nuanced, but basically the model sucks right now. This is most clearly evident each day when it wants to put as much money as possible on the SK Wyverns. Now, the Wyverns tied for the league lead in wins last year, with an automatic ticket punched to the semi-finals of the KBO’s laddered playoff. The model, which starts with Fangraphs ZiPS Projected Standings, has the Wyverns as the third best team in the league, with a projected win percentage of .563. Their current win percentage is…
┻┳| •.•) 0.083
Not great Bob dot gif
I was discussing this conundrum with a friend at 11:30pm on Friday. He was feeding his baby. I was feeding my model.
Me: Now I’m digging into why the model is doing so poorly. It turns out that a team that tied for the most wins last year is…
1-8 (Editor’s note: They have lost three more times since this conversation)
Very cool. Now I’m trying to figure out if they were lucky last year, unlucky this year, or if I should not worry about 9 games in a 144 game season.
Friend: I think it’s the latter.
Reader, I no longer think it’s just small sample size. The Wyverns are bad.
Now I had read the season previews. SK had a commanding lead last season and fell apart down the stretch, losing the pennant on the last day of the regular season. (Do they call it the pennant?) They also lost their two best pitchers to the MLB and Japan. BUT! I read good things about their organization. Plus, the kernel of my model, the Fangraphs ZiPS projections, is based on the current roster, so it should in theory account for the roster turnover.
But when I started the model all I had to go on was the Fangraphs projections and my own home field advantage calculation.
Over the last week I’ve been factoring in the starting pitchers to adjust the win percentages based on WAR I’ve again sourced from elsewhere (theme here). And while it seems there has been a rash of injuries lately that I’m not taking into account, the model still LOVES the Wyverns, picking them as favorites in matchups with +170 and up prices. I had convinced myself that the casual observer would see a 1-8 team as huge underdogs and the market would respond appropriately. I’m also smart enough to know that the markets are usually fairly accurate.
Time to go back to the drawing board. So I’ve spent the last couple of days reviewing the 2019 MLB model that was the genesis of my KBO model. Here’s what the MLB model did:
Every day in the middle of the night, a cron job would kick off a scraper to go get the latest data for me. This meant grabbing:
- The Fangraphs win projections
- Fangraphs’ batters, pitchers and starting pitcher depth charts and associated stats and projections
- Baseball Prospectus’s PECOTA win projections
- BP’s PECOTA projections for batters and pitchers
Separately, I’d scrape MLB and the market sites every 10 minutes. I’d isolate the parts of the specific pages I cared about (eliminating noise from things like changing banner ads), chop them up into the elements that made sense for the context (games/players for MLB, game lines for the markets), and then I’d sort the lists of those objects and hash them and store them in a database. This allowed me to store big chunks of content in very small packages, and then look for changes over time.
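Here is a minimal sketch of that sort-and-hash change detection. It is not the original code: the `fingerprint` and `has_changed` names, the dict standing in for the database, and the choice of JSON plus SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(items):
    """Return a short, order-independent hash for a list of scraped items."""
    # Serialize each item deterministically, then sort so page ordering does not matter.
    canonical = sorted(json.dumps(item, sort_keys=True) for item in items)
    joined = "\n".join(canonical).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def has_changed(store, target, items):
    """Save the latest fingerprint for a target and report whether it differs from the last one."""
    new_hash = fingerprint(items)
    changed = store.get(target) != new_hash
    store[target] = new_hash  # hypothetical stand-in for a database write
    return changed


store = {}
lines = [{"game": "SK @ LG", "home_ml": -150, "away_ml": 130}]  # made-up sample line
first_seen = has_changed(store, "market_lines", lines)   # nothing stored yet, so this counts as a change
unchanged = has_changed(store, "market_lines", lines)    # same content, so no change
```

Only a 64-character digest is stored per target, so a large chunk of page content collapses to a small value that is cheap to compare between scrapes.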
I’ve been reviewing all this code over the last couple days, and this certainly was not the most efficient approach, but whenever a change was detected across any of my targets, it would kick off the function to calculate odds.
The secret sauce of the model processes the lineups and compares them to the market lines. To build the lineups and win expectancy adjustments, the model:
- Takes the home and away baseline win percentages from PECOTA (NOTE: This is where I screwed up. Baseball Prospectus’s PECOTA team projections were just projections of the remaining games, ignoring everything that had happened up to that point in the season. On the last day of the season it projected every team to be either 162-0 or 0-162.)
- Grabs the PECOTA projections for the starting pitchers, mainly PECOTA’s Deserved Runs Average, and figures out how many more or fewer runs the team would give up if these particular pitchers pitched every single game vs the baseline expectations. It would then divide this run differential by 10, given the assumption that 10 runs of run differential is worth about 1 win or loss. For example, if all of a team’s starters would give up 500 runs, and today’s starter would have only given up 480 runs in the same number of innings, the offset would be 20 runs of positive run differential. Divide that by 10, and the team would win 2 more games. The opposite applies as well. If a different starter for the same team would give up 530 runs, the offset would be -30 runs of differential, divided by 10 for -3 wins. Note: I’m using Deserved Runs Average, which tries to be defense independent and focuses just on the pitching. I’m making up for the defensive part with the position players below.
- Does the same for each position player vs the baseline expectation for the position they are in that game, but does so with WAR, including offensive and defensive contributions. Say a team’s catching platoon was worth 7 WAR over a full season, but the primary backup would be worth only 1 WAR over a full season. With the backup starting, the adjustment would be -6 wins.
- Adjusts the baseline team win expectancies based on the starting pitcher and lineup adjustments and builds a revised expected win percentage for each team’s full season if this exact lineup played all 162 games.
- The two win expectancies for each match up are normalized, with the home team getting an 8% home field advantage to determine what the chances of each team are in each game.
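The steps in this list can be sketched roughly as follows. The function names are hypothetical. The post does not say exactly how the 8% home field advantage is applied, so treating it as a multiplier on the home team's win expectancy before normalizing is an assumption.

```python
RUNS_PER_WIN = 10.0   # from the post: about 10 runs of differential per win in MLB
SEASON_GAMES = 162


def pitcher_adjustment_wins(staff_runs_allowed, starter_runs_allowed):
    """Wins gained or lost if today's starter pitched every game instead of the whole staff."""
    return (staff_runs_allowed - starter_runs_allowed) / RUNS_PER_WIN


def position_adjustment_wins(baseline_war, player_war):
    """Wins gained or lost at one position if today's player played the full season."""
    return player_war - baseline_war


def adjusted_win_pct(baseline_win_pct, adjustments_in_wins):
    """Full-season win percentage if this exact lineup played every game."""
    wins = baseline_win_pct * SEASON_GAMES + sum(adjustments_in_wins)
    return min(max(wins / SEASON_GAMES, 0.0), 1.0)


def game_win_probabilities(home_pct, away_pct, home_field=0.08):
    """Normalize the two win expectancies into single-game probabilities."""
    # Assumption: the home edge is a multiplier applied before normalizing.
    home_strength = home_pct * (1 + home_field)
    total = home_strength + away_pct
    return home_strength / total, away_pct / total


# The worked examples from the list above.
better_starter = pitcher_adjustment_wins(500, 480)   # 2.0 wins
worse_starter = pitcher_adjustment_wins(500, 530)    # -3.0 wins
backup_catcher = position_adjustment_wins(7, 1)      # -6 wins

home_pct = adjusted_win_pct(0.500, [better_starter, backup_catcher])
home_prob, away_prob = game_win_probabilities(home_pct, 0.500)
```

Clamping the adjusted win percentage to the 0 to 1 range is also an addition here, to keep extreme lineups from producing impossible values.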
Now that we have a win expectancy for each team and each game, the model converts the moneyline for each side into an implied probability for each team. Because of the juice, the probabilities for each matchup sum to greater than 100%. I’m not normalizing this as I want to compare my expected odds to the breakeven percentage of each bet. Note: Early versions of the KBO model were comparing my win expectancy with normalized market implied odds. This would recommend plays where the chances of a team winning were less than the bet’s breakeven point.
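As a small illustration of the moneyline conversion, here is a sketch using the standard formula for American odds. The `implied_probability` name and the -200/+170 pairing are made up for the example.

```python
def implied_probability(moneyline):
    """Breakeven win probability for an American moneyline such as -200 or +170."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win moneyline.
    return 100 / (moneyline + 100)


favorite = implied_probability(-200)   # 200 / 300
underdog = implied_probability(170)    # 100 / 270
overround = favorite + underdog        # greater than 1.0 because of the juice

# Normalizing strips out the juice, which lowers the bar below the true breakeven.
underdog_normalized = underdog / overround
```

In this example a model estimate of 36% for the underdog beats the normalized number (about 35.7%) but not the actual breakeven (about 37.0%), which is the trap the early KBO versions fell into.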
Where the model would find a higher win expectancy than the bet breakeven percentage, it would recommend a play. The higher the delta, the larger the bet.
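A sketch of that rule might look like this. The post only says bigger deltas get bigger bets, so the linear scaling, the `scale` and `max_stake` parameters, and the function name are all assumptions.

```python
def recommend_stake(model_win_prob, breakeven_prob, scale=1000.0, max_stake=100.0):
    """Return a stake size, or 0.0 when the model sees no edge over the bet's breakeven."""
    edge = model_win_prob - breakeven_prob
    if edge <= 0:
        return 0.0  # no play: the model does not beat the breakeven percentage
    # Assumption: stake grows linearly with the edge, capped at max_stake.
    return min(max_stake, edge * scale)


no_play = recommend_stake(0.40, 0.45)      # model below breakeven, stake is 0.0
small_play = recommend_stake(0.45, 0.40)   # 5-point edge
capped_play = recommend_stake(0.90, 0.40)  # huge edge, capped at max_stake
```

A cap like `max_stake` also limits the damage when the model falls in love with one team.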
The KBO Model Gap
After reviewing last year’s MLB model, here’s the score card for the current KBO model:
- Rich baseline win projections for each team: INCOMPLETE. Yes, I started with a bottom-up model, the ZiPS projections from Fangraphs. However, I don’t have visibility into the distribution of total wins by player by position. And because this is a novel data set, there’s no visibility into year-over-year performance. Which isn’t really fair for me to ask for, but still, it’s an assumption in the model.
- Access to each day’s starters and lineups: INCOMPLETE. I have been using MyKBO to get starters about 15 hours in advance. However, I still haven’t found a great source for lineup info. RotoGrindersMLB tweets out lineups at like 5 AM. Not super easy to parse and process. I’m still on the hunt for a source that I can process.
- Win adjustments based on the lineups: INCOMPLETE. Again, I’m using a separate WAR number for starters, and have no way to deal with position player adjustments.
The Path Forward
I can try to recreate the structure of the data and model from last year. I’m slowly finding more sources. Alternatively, I can try to build my own bottom-up model, using Runs Scored and Allowed, and Base Runs. I’ve already validated that the relationship between run differential and win percentage over the last few years of the KBO (since the league expanded to 10 teams) lines up with the MLB, so that win % = .500 + 0.0006 × (Run Differential). However, because of the shorter season in the KBO, 1 win equals a little over 11.5 in run differential, instead of 10 in the MLB.
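The arithmetic behind those numbers can be checked with a quick sketch. The function names are made up, and the 0.0006 slope is the rounded figure quoted above.

```python
def win_pct_from_run_diff(run_diff, slope=0.0006):
    """Linear estimate: every run of differential is worth `slope` of win percentage."""
    return 0.500 + slope * run_diff


def runs_per_win(season_games, slope=0.0006):
    """Run differential needed to add one win over a season of the given length."""
    # One win is 1 / season_games of win percentage; divide by the slope to get runs.
    return 1.0 / (season_games * slope)


kbo_runs_per_win = runs_per_win(144)   # a little over 11.5
mlb_runs_per_win = runs_per_win(162)   # about 10.3 with the rounded slope, near the usual 10
plus_100 = win_pct_from_run_diff(100)  # roughly a .560 team
```

The same slope produces a different runs-per-win figure in each league only because a single win is a bigger share of a 144-game season than of a 162-game one.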
I may also have to build a Base Runs model for the KBO. I could start with the MLB one, but if I’m going to that effort I might as well do it specifically for the KBO. This would help me evaluate position players’ expected offensive run production. I’m still not sure where to find defensive grading for position players, so perhaps I should use pitchers’ ERA instead of DRA if I can’t make it up on the defensive side?
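For reference, here is a sketch of one commonly published simple version of Base Runs. The coefficients in the B term were fitted to MLB data, so treat them as placeholders that would need to be refit for the KBO. The sample team totals are made up.

```python
def base_runs(ab, h, bb, hr, tb):
    """Estimate runs scored from basic batting totals using a simple Base Runs form."""
    a = h + bb - hr                                        # baserunners, excluding home runs
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02    # advancement factor (MLB-fitted, an assumption for KBO)
    c = ab - h                                             # outs
    d = hr                                                 # home runs always score
    return a * b / (b + c) + d


# Made-up full-season team totals, just to show the call.
estimated_runs = base_runs(ab=5500, h=1400, bb=500, hr=150, tb=2200)
```

The structure is baserunners times the share of them that score, plus home runs, which is why it behaves sensibly at the extremes where simpler linear run estimators break down.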
Or I can try to find a more complete data set that works across the whole model. Baseball Prospectus and FanGraphs have been around the block and I trust their work a lot more than just random stuff I come across translated from Korean by Google.
However, in the meantime, I’m going to do the following intermediate steps:
- See if I can dig up injury info and just downgrade base rosters to get better estimates of win expectancy contribution from position players
- Stop betting until the model starts to perform better.
More to come soon. Well, soon-ish.