KBO Model Day 10: Beginnings of a Bottom-up Model

Back on the horse! After just a shade under 2 weeks of not placing wagers, I bet on 4 games last night. Went 2-2, all dogs, but finished up on the day with an ROI of 25%. I sized the wagers manually instead of using the robo-recommended bet sizes, and of course it’s a small sample size, but I’m excited nonetheless. Even the “bad” earlier model is starting to claw its way back. Note, these results are paper-traded, meaning I didn’t place these bets, I’m just tracking what would have happened if the model got its way. One more note, I also discovered my line tracking was flawed. I’ll come back to that later.

Original KBO Model Performance to Date

A rollercoaster.

So why did I resume? Well, it’s certainly not because the imaginary blue line started to go up again. I built a rough bottom-up model instead of relying on the Fangraphs ZiPS projections and random WAR from some Korean site. I tried to reverse engineer the datasets I used for last year’s MLB model, using a combination of scraping, the Python pandas data analysis package, and old-fashioned spreadsheets.

I wanted to rebaseline my win expectancy for each team. I still don’t have the ability to automatically grab lineups, so I used this year’s stats to date to determine the primary starters for each team. I simply took the most plate appearances by position to build a default starting lineup. Next, I looked at 2019 stats for those players, using the league average for a starter for anyone who didn’t have a match in last year’s data. I took the number of games played in 2019 per starter and scaled the stats up as if they had played all 144 games. I then aggregated their hitting stats to build an expected number of baseruns scored.
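Here’s a rough sketch of the scale-then-aggregate step in Python. The post doesn’t say which BaseRuns variant was used, so the coefficients below are the basic published version and should be read as an assumption. The stat line and the `scale_to_full_season` helper are made up for illustration.

```python
SEASON_GAMES = 144

def base_runs(ab, h, bb, hr, tb):
    """Basic BaseRuns estimate (assumed variant, not confirmed by the post)."""
    a = h + bb - hr                                      # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02  # advancement
    c = ab - h                                           # outs
    d = hr                                               # automatic runs
    return a * b / (b + c) + d

def scale_to_full_season(stats, games_played):
    """Scale counting stats as if the player had appeared in all 144 games."""
    factor = SEASON_GAMES / games_played
    return {name: value * factor for name, value in stats.items()}

# Hypothetical 2019 line for one starter who played 120 games.
player = {"ab": 450, "h": 130, "bb": 50, "hr": 15, "tb": 210}
scaled = scale_to_full_season(player, games_played=120)

# A real lineup would hold nine scaled players; sum first, then estimate runs.
lineup = [scaled]
totals = {name: sum(p[name] for p in lineup) for name in scaled}
print(round(base_runs(**totals), 1))
```

BaseRuns is not linear in its inputs, so the sketch sums the lineup’s stats before applying the formula rather than adding up per-player estimates.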

Next, I tackled pitchers. I flagged pitchers who’ve started 75% or more of the games they’ve appeared in as starters. Again using 2019 stats (and league averages as defaults), I figured out the expected baseruns allowed by just the relievers, and divided that by their innings pitched to get a per-IP baseruns allowed number for a relieving staff. I’ll come back to why in a minute.
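As a minimal sketch, the starter flag and the bullpen rate could look like this. The pitcher rows and field names are hypothetical, and innings are assumed to be stored as true decimals rather than in baseball’s .1/.2 notation.

```python
# Hypothetical pitcher rows: games, games started, innings, BaseRuns allowed.
pitchers = [
    {"name": "A", "g": 30, "gs": 30, "ip": 180.0, "bsr_allowed": 80.0},
    {"name": "B", "g": 40, "gs": 10, "ip": 90.0, "bsr_allowed": 50.0},
    {"name": "C", "g": 60, "gs": 0, "ip": 60.0, "bsr_allowed": 25.0},
]

def is_starter(pitcher, threshold=0.75):
    """Starter = started at least 75% of the games he appeared in."""
    return pitcher["g"] > 0 and pitcher["gs"] / pitcher["g"] >= threshold

relievers = [p for p in pitchers if not is_starter(p)]

# Per-inning BaseRuns allowed for the relieving staff as a whole.
bullpen_ip = sum(p["ip"] for p in relievers)
bullpen_rate = sum(p["bsr_allowed"] for p in relievers) / bullpen_ip
print([p["name"] for p in relievers], bullpen_rate)
```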

Next I built full season projections of baseruns allowed based on expected starts and relief innings. I normalized the runs allowed for the league to the runs scored from the offensive numbers, then compared this with each offense’s expected runs scored to get a run differential. I used pythag expectations based on run differential to get a winning percentage baseline for each team.
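A sketch of the normalize-then-pythag step, using a three-team toy league with invented run totals. The exponent of 2 is the classic Pythagorean default and is an assumption here; the post doesn’t state which exponent was used.

```python
def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean expectation; exponent=2 is an assumed default."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Hypothetical full-season projections.
scored = {"A": 700.0, "B": 650.0, "C": 600.0}
allowed = {"A": 620.0, "B": 640.0, "C": 740.0}

# Rescale runs allowed so the league allows exactly as many runs as it scores.
scale = sum(scored.values()) / sum(allowed.values())
allowed = {team: ra * scale for team, ra in allowed.items()}

baseline = {team: pythag_win_pct(scored[team], allowed[team]) for team in scored}
for team, pct in baseline.items():
    print(team, round(scored[team] - allowed[team], 1), f"{pct:.3f}")
```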

I then fleshed out baseruns allowed expectations for every pitcher as if they started every game. I took their average number of innings per start, then multiplied the relievers’ per-inning baseruns allowed by the remaining innings up to 9. This gives me an expected runs allowed for each pitcher who may start a game. I used this individualized baseruns allowed number to calculate a new run differential and winning expectation.
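Roughly, the per-starter number could be built like this. The rates, the innings per start, and the 720-run lineup projection are all invented, and treating the starter’s own per-inning rate as the other half of the calculation is an assumption.

```python
SEASON_GAMES = 144

def runs_allowed_per_game(starter_rate, starter_ip_per_start, bullpen_rate):
    """Starter covers his usual innings; the bullpen covers the rest up to 9."""
    relief_ip = max(0.0, 9.0 - starter_ip_per_start)
    return starter_rate * starter_ip_per_start + bullpen_rate * relief_ip

# Hypothetical inputs: per-inning BaseRuns allowed for the starter and bullpen.
per_game = runs_allowed_per_game(
    starter_rate=0.45, starter_ip_per_start=6.0, bullpen_rate=0.55
)
season_runs_allowed = per_game * SEASON_GAMES

# Compare with a hypothetical lineup projection to get this starter's run differential.
run_diff = 720.0 - season_runs_allowed
print(round(per_game, 2), round(season_runs_allowed, 1), round(run_diff, 1))
```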

Now my model looks at each pitcher and gets a full season win expectancy for that starter, the team’s bullpen, and the default starting lineup. With this in place I placed my first four wagers in a couple of weeks (there was no play on one game.)

In the course of this work, I discovered another flaw elsewhere in my process that I fixed this morning. When I went to place the wagers last night, I noticed a couple of lines in the market didn’t match my sheet. I’m scraping and logging new lines every 10 minutes and the model should grab the latest line. However, I discovered that I was only checking to see if a line was unique, not that it matched the previous latest line. Therefore if a line kept fluctuating between the same prices, I’d only display the latest unique one. I pushed the fix for that this morning.
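To make the bug concrete, here’s a small sketch with a made-up price log. The function names are hypothetical, not the actual scraper code.

```python
# Each log entry: (scrape number, price). Prices are hypothetical moneylines.
log = [(1, -120), (2, -125), (3, -120), (4, -120)]

def unique_lines_buggy(entries):
    """Keeps a price only the first time it is ever seen."""
    seen, kept = set(), []
    for ts, price in entries:
        if price not in seen:
            seen.add(price)
            kept.append((ts, price))
    return kept

def line_changes(entries):
    """Keeps a price whenever it differs from the previously stored price."""
    kept = []
    for ts, price in entries:
        if not kept or kept[-1][1] != price:
            kept.append((ts, price))
    return kept

print(unique_lines_buggy(log)[-1])  # stale: the move back to -120 was dropped
print(line_changes(log)[-1])        # current: the move back to -120 is kept
```

When a line bounces between two prices, the first version never records the move back, so the “latest” line it shows is stale.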

My starting lineup projections are also not perfect, and I’d like to address that today. There are a couple of issues:

  1. For some reason I don’t see catchers in the lineups. I think that they are there, but their position is blank for some reason. There are legit blank positions which represent DHs. Edit: Yep, my scraper was forgetting to label catchers’ positions. Fixed manually in the last mile spreadsheet as well as the scraper code.
  2. Because I’m looking at total plate appearances, 6 of the 10 lineups have 10 starters instead of 9, so I’m overstating the runs they should produce. This should be an easy fix. Edit: more manual work. In one instance, my master database builder missed a guy completely because he was yet to play in the field. For the rest, they were tagged for their non-DH positions as backups.
  3. Most importantly, I’m not taking into account injuries, of which there seem to have been a lot lately. I need to validate that the starters I’m using can actually play in games right now. Ideally I’d tune each one for the starting lineups for each game, but that’s not possible right now.

Lastly, learning from last year, I’d like to account for what has happened this year and weight it proportionally against my full season expectation.

KBO Model Day 9: Home Field Advantage Completed

Still in the lull of being productive on the KBO model. I’ve done more background work just to wrap my head around the game. Things like:

  • Some baseruns calculations + pythag for win expectancy. I did this via just manually grabbing data from Baseball Reference, but can automate because…
  • Built a scraper for Baseball Reference to grab all KBO team stats (batting, pitching, and defense) for 2019 and 2020 for building my bottom-up win expectancy
  • Been more closely reviewing my old MLB model to game plan my approach for KBO Model 2.0

One thing that has been bugging me is all the missing games from my home field advantage modeling. I did some digging and discovered one thing I expected, and another that I did not.

As expected, the scraper I used as the core of my system couldn’t handle doubleheaders, but in a way that was different than I expected. The scraper would take the league’s daily schedule, grab some attributes, then build the game recap IDs from the attributes. There were two issues with this:

  1. The daily schedule page didn’t even list the second game of a doubleheader
  2. The factory that built the game IDs assumed a component that was just 0. However, in the first game of a doubleheader this value was 1, and it was 2 for the second game. Therefore, it didn’t include either game. I resolved this by grabbing results from each team’s individual results page, then deduping.
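The dedupe step could be sketched like this. The row layout and the `game_key` helper are hypothetical; the only detail taken from above is that the game number is 0 for a regular game and 1 or 2 for the halves of a doubleheader.

```python
# Hypothetical rows gathered from each team's results page. A game can show up
# twice (once per team), and doubleheader halves share a date.
rows = [
    {"date": "2019-06-01", "home": "Heroes", "away": "Twins", "game_no": 0, "score": "3-2"},
    {"date": "2019-06-01", "home": "Heroes", "away": "Twins", "game_no": 0, "score": "3-2"},
    {"date": "2019-06-02", "home": "Heroes", "away": "Twins", "game_no": 1, "score": "1-4"},
    {"date": "2019-06-02", "home": "Heroes", "away": "Twins", "game_no": 2, "score": "5-0"},
    {"date": "2019-06-02", "home": "Heroes", "away": "Twins", "game_no": 2, "score": "5-0"},
]

def game_key(row):
    """Identity of a game; the game number keeps doubleheader halves apart."""
    return (row["date"], row["home"], row["away"], row["game_no"])

games = {}
for row in rows:
    games.setdefault(game_key(row), row)  # first copy wins, repeats are dropped

print(len(rows), "rows ->", len(games), "games")
```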

Unexpectedly, the scraper has a list of teams as a constant, and one of the teams for 2019 was just missing. It turns out the Kiwoom Heroes were missing because prior to 2019 they were the Nexen Heroes. I fixed this by just adding the team’s current name in both Korean and English to the constants module.

I’ve re-run the data, and now have 410 more games for the decade than before. Overall, it pushes the KBO home field advantage to around 5% for the decade (bringing up numbers lower than that and dropping those that are higher). I’ve also included 2020’s numbers through today’s games.

KBO Home Field Advantage, 2010-2020 (through 5/27)

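| Season | Home Team Wins | Visiting Team Wins | Total Games | Home Field Advantage |
|---|---|---|---|---|
| 2010 | 264 | 268 | 532 | -0.75% |
| 2011 | 276 | 248 | 524 | 5.34% |
| 2012 | 271 | 261 | 532 | 1.88% |
| 2013 | 318 | 309 | 627 | 1.44% |
| 2014 | 304 | 279 | 583 | 4.29% |
| 2015 | 411 | 373 | 784 | 4.85% |
| 2016 | 422 | 379 | 801 | 5.37% |
| 2017 | 410 | 363 | 773 | 6.08% |
| 2018 | 412 | 347 | 759 | 8.56% |
| 2019 | 420 | 339 | 759 | 10.67% |
| 2020 | 56 | 46 | 102 | 9.80% |
| Totals | 3,564 | 3,212 | 6,776 | 5.19% |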
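The last column is consistent with home wins minus visiting wins, divided by total games. Here’s a quick check of that reading against three rows of the table. The function name is just illustrative, and total games is taken as home wins plus visiting wins, which is how the table’s totals add up.

```python
def home_field_advantage(home_wins, away_wins):
    """Home win share minus visiting win share, over decided games."""
    return (home_wins - away_wins) / (home_wins + away_wins)

# (home wins, visiting wins) copied from the table above.
seasons = {2018: (412, 347), 2019: (420, 339), 2020: (56, 46)}

for year, (home, away) in seasons.items():
    print(year, f"{home_field_advantage(home, away):.2%}")
```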

KBO Model Day 8-ish

I’ve taken a few days off of writing, but not the grind. Am I too invested in the KBO?

Like a Good Neighbor...

Yup, starting to get ads in Korean. Anyway, no results to post today. Not because there haven’t been any results, it’s just that the results are bad. End of Update.

So what’s wrong? Well I’ve spent the last few days trying to:

  1. Improve the model
  2. Figure out what is going on

Some improvements I’ve made:

  • I’m now scraping the lines every 10 minutes, and updating the sheet when there are changes
  • I’m factoring the starting pitchers into win calculations
  • I’m messaging myself on Slack to alert me to new line changes

What’s Wrong

I wish this was a bit more nuanced, but basically the model sucks right now. This is most clearly evident each day when it wants to put as much money as possible on the SK Wyverns. Now, the Wyverns tied for the league lead in wins last year, with an automatic ticket punched to the semi-finals of the KBO’s laddered playoff. The model, which starts with Fangraphs ZiPS Projected Standings, has the Wyverns as the third best team in the league, with a projected win percentage of .563. Their current win percentage is…

┳┻|
┻┳|
┳┻|
┻┳|
┳┻|
┻┳|
┳┻|
┻┳|
┳┻|
┻┳|
┳┻|
┻┳|
┳┻| _
┻┳| •.•) 0.083
┳┻|⊂ノ
┻┳|

Not great Bob dot gif

I was discussing this conundrum with a friend at 11:30pm on Friday. He was feeding his baby. I was feeding my model.

Me: Now I’m digging into why the model is doing so poorly. It turns out that a team that tied for the most wins last year is…

1-8 (Editor’s note: They lost three more times since this conversation)

Very cool. Now I’m trying to figure out if they were lucky last year, unlucky this year, or if I should not worry about 9 games in a 144 game season.

Friend: I think it’s the latter.

Reader, I no longer think it’s just small sample size. The Wyverns are bad.

Now I had read the season previews. SK had a commanding lead last season and fell apart down the stretch, losing the pennant on the last day of the regular season. (Do they call it the pennant?) They also lost their two best pitchers to the MLB and Japan. BUT! I read good things about their organization. Plus, the kernel of my model, the Fangraphs ZiPS projections, is based on the current roster, so it should in theory account for the roster turnover.

But when I started the model all I had to go on was the Fangraphs projections and my own home field advantage calculation.

Over the last week I’ve been factoring in the starting pitchers to adjust the win percentages based on WAR I’ve again sourced from elsewhere (theme here). And while it seems there has been a rash of injuries lately that I’m not taking into account, the model still LOVES the Wyverns, picking them as favorites in matchups with +170 and up prices. I had convinced myself that the casual observer would see a 1-8 team as huge underdogs and the market would respond appropriately. I’m also smart enough to know that the markets are usually fairly accurate.

MLB Model

Time to go back to the drawing board. So I’ve spent the last couple of days reviewing the 2019 MLB model that was the genesis of my KBO model. Here’s what the MLB model did:

Baseline

Every day in the middle of the night, a cron job would kick off a scraper to go get the latest data for me. This meant grabbing:

  • The Fangraphs win projections
  • Fangraphs’ batters, pitchers and starting pitcher depth charts and associated stats and projections
  • Baseball Prospectus’s PECOTA win projections
  • BP’s PECOTA projections for batters and pitchers

Scraping

Separately, I’d scrape MLB and the market sites every 10 minutes. I’d isolate the parts of the specific pages I cared about (eliminating noise from things like changing banner ads), chop them up into the elements that made sense for the context (games/players for MLB, game lines for the markets), and then I’d sort the lists of those objects and hash them and store them in a database. This allowed me to store big chunks of content in very small packages, and then look for changes over time.

I’ve been reviewing all this code over the last couple days, and this certainly was not the most efficient approach, but whenever a change was detected across any of my targets, it would kick off the function to calculate odds.
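For what it’s worth, the sort-hash-compare idea can be sketched in a few lines. The dict standing in for the database, the SHA-256 choice, and the shape of the scraped items are all assumptions for illustration.

```python
import hashlib
import json

def fingerprint(items):
    """Hash a list of scraped objects, ignoring the order they were scraped in."""
    canonical = sorted(json.dumps(item, sort_keys=True) for item in items)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()

stored = {}  # stand-in for the database: target name -> last fingerprint

def has_changed(target, items):
    """Store the new fingerprint and report whether it differs from the last one."""
    digest = fingerprint(items)
    changed = stored.get(target) != digest
    stored[target] = digest
    return changed

# Hypothetical game lines for one matchup.
lines = [
    {"game": "LG@NC", "side": "LG", "price": -120},
    {"game": "LG@NC", "side": "NC", "price": 110},
]
print(has_changed("market", lines))        # first sighting counts as a change
print(has_changed("market", lines[::-1]))  # same content, different order
```

Only the fixed-size digest needs to be stored per target, which is what keeps big chunks of page content down to very small packages.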

The Meat

The secret sauce of the model processes the lineups and compares them to the market lines. To build the lineups and win expectancy adjustments, the model:

  • Takes the home and away baseline win percentages from PECOTA (NOTE: This is where I screwed up. Baseball Prospectus’s PECOTA projections for each team were just the projection of remaining wins, ignoring everything that had happened up to that point in the season. On the last day of the season it projected every team to be either 162-0 or 0-162.)
  • Grabs the PECOTA projections for the starting pitchers, mainly PECOTA’s Deserved Runs Average, and figures out how many more or fewer runs the teams would give up if these particular pitchers pitched every single game vs the baseline expectations. It would then divide this run differential by 10, given the assumption that 10 runs of run differential is worth about 1 win or loss. For example, if all of a team’s starters would give up 500 runs, and today’s starter would have only given up 480 runs in the same number of innings, the offset would be 20 runs of positive run differential. Divide that by 10, and the team would win 2 more games. The opposite applies as well. If a different starter for the same team would give up 530 runs, the offset would be -30 runs of differential, divided by 10 for -3 wins. Note: I’m using Deserved Runs Average which tries to be defense independent and focuses just on the pitching. I’m making up for the defensive part with the position players below.
  • Does the same for each position player vs the baseline expectation for the position they are in that game, but does so with WAR, including offensive and defensive contributions. Say a team’s catching platoon’s full season WAR was 7, but the primary backup’s WAR was 1 for a full season, the adjustment would be -6 wins.
  • Adjusts the baseline team win expectancies based on the starting pitcher and lineup adjustments and builds a revised expected win percentage for each team’s full season if this exact lineup played all 162 games.
  • The two win expectancies for each match up are normalized, with the home team getting an 8% home field advantage to determine what the chances of each team are in each game.
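The arithmetic in the pitcher and position-player bullets works out like this. The function names and the 81-win baseline are invented for illustration. The final normalization and home field step is left out because the list above doesn’t spell out exactly how it was done.

```python
RUNS_PER_WIN = 10.0   # the rule of thumb described above
SEASON_GAMES = 162

def pitcher_win_adjustment(staff_runs_allowed, starter_runs_allowed):
    """Wins gained (or lost) if today's starter pitched every game."""
    return (staff_runs_allowed - starter_runs_allowed) / RUNS_PER_WIN

def position_win_adjustment(baseline_war, player_war):
    """Wins gained (or lost) if today's player held the position all season."""
    return player_war - baseline_war

print(pitcher_win_adjustment(500, 480))   # starter better than the staff
print(pitcher_win_adjustment(500, 530))   # starter worse than the staff
print(position_win_adjustment(7, 1))      # backup catcher in for the platoon

# Hypothetical 81-win baseline with the better starter and the backup catcher.
adjusted_wins = 81 + pitcher_win_adjustment(500, 480) + position_win_adjustment(7, 1)
adjusted_pct = adjusted_wins / SEASON_GAMES
print(round(adjusted_pct, 3))
```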

Now that we have a win expectancy for each team and each game, the model converts the moneyline for each side into an implied probability for each team. Because of the juice, the probabilities for each matchup sum to greater than 100%. I’m not normalizing this as I want to compare my expected odds to the breakeven percentage of each bet. Note: Early versions of the KBO model were comparing my win expectancy with normalized market implied odds. This would recommend plays where the chances of a team winning were less than the bet’s breakeven point.

Where the model would find a higher win expectancy than the bet breakeven percentage, it would recommend a play. The higher the delta, the larger the bet.
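A minimal sketch of the moneyline-to-breakeven conversion, using the standard American odds formulas. The prices and the 47% win expectancy are made up, and the bet sizing rule is not shown because it isn’t specified here.

```python
def breakeven_probability(moneyline):
    """Win probability at which a bet at this American price breaks even."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

def edge(my_win_prob, moneyline):
    """Positive when my win expectancy beats the bet's breakeven point."""
    return my_win_prob - breakeven_probability(moneyline)

# Hypothetical matchup: favorite at -150, dog at +130.
fav, dog = breakeven_probability(-150), breakeven_probability(130)
print(round(fav, 4), round(dog, 4), round(fav + dog, 4))  # sum exceeds 1: the juice

# If the model gives the dog a 47% chance, that clears the breakeven point.
print(round(edge(0.47, 130), 4))
```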

The KBO Model Gap

After reviewing last year’s MLB model, here’s the scorecard for the current KBO model:

  • Rich baseline win projections for each team: INCOMPLETE. Yes, I started with a bottom-up model, the ZiPS projections from Fangraphs. However, I don’t have visibility into the distribution of total wins by player by position. And because this is a novel data set, there’s no visibility into year-over-year performance. Which isn’t really fair for me to ask for, but still, it’s an assumption in the model.
  • Access to each day’s starters and lineups: INCOMPLETE. I have been using MyKBO to get starters about 15 hours in advance. However, I still haven’t found a great source for lineup info. RotoGrindersMLB tweets out lineups at like 5 AM. Not super easy to parse and process. I’m still on the hunt for a source that I can process.
  • Win adjustments based on the lineups: INCOMPLETE. Again, I’m using a separate WAR number for starters, and have no way to deal with position player adjustments.

The Path Forward

I can try to recreate the structure of the data and model from last year. I’m slowly finding more sources. Alternatively, I can try to build my own bottom-up model, using Runs Scored and Allowed, and Base Runs. I’ve already validated that the relationship between run differential and winning percentage over the last few years of the KBO (since the league expanded to 10 teams) lines up with the MLB, so that win % = .500 + 0.0006 × (Run Differential). However, because of the shorter season in the KBO, 1 win equals a little over 11.5 runs of run differential, instead of 10 in the MLB.
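Reading the relationship as win % = .500 + 0.0006 × run differential, the runs-per-win figures fall out of the season length. A quick worked check (the function names are just illustrative, and the +50 run differential is made up):

```python
SLOPE = 0.0006  # win% gained per run of season run differential

def win_pct(run_diff):
    return 0.500 + SLOPE * run_diff

def runs_per_win(season_games):
    """One extra win is 1/games of win%, so divide that by the slope."""
    return (1 / season_games) / SLOPE

print(round(runs_per_win(144), 2))  # KBO, 144-game season
print(round(runs_per_win(162), 2))  # MLB, 162-game season
print(round(win_pct(50) * 144, 1))  # expected wins for a hypothetical +50 KBO team
```

With the same slope, the shorter season makes each win a bigger slice of winning percentage, which is why it costs more runs in the KBO.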

I may have to also build a base runs model for the KBO, or could start with the MLB one, but if I’m going to that effort I might as well do it specifically for the KBO. This would help me evaluate position players’ expected offensive run production. I’m still not sure where to find defensive grading for position players, so perhaps I should use pitchers’ ERA instead of DRA if I can’t make it up on the defensive side?

Or I can try to find a more complete data set that works across the whole model. Baseball Prospectus and FanGraphs have been around the block and I trust their work a lot more than just random stuff I come across translated from Korean by Google.

However, in the meantime, I’m going to do the following intermediate steps:

  1. See if I can dig up injury info and just downgrade base rosters to get better estimates of win expectancy contribution from position players
  2. Stop betting until the model starts to perform better.

More to come soon. Well, soon-ish.

KBO Model Day 7: WAR

First of all, for today’s RESULTS: None. As I started to clean up the model yesterday, I decided that my win percentage estimates were too crude for what I’m trying to accomplish. Today I bit the bullet and started to build a preliminary WAR model. Here’s how I approached it:

  • I used the roster data I had previously grabbed from MyKBOStats
  • Then I used the old version of Statiz to grab players’ WAR for 2017-2020
  • Next I weighted 2017-2019 season WARs to get a WAR estimate for 2020. This is still VERY rough and certainly an area for improvement
  • I estimated who the starting pitchers were for each team and summed their estimated WARs.
  • Then for each pitcher, I compared what their WAR would be for pitching an entire season to the summed WAR of all starters. I divided this win delta by 144 to determine how much of an adjustment I should apply to my base win expectancies.
  • Finally, I updated the spreadsheet-based model to include the starting pitchers and look up the win expectancy adjustments to compare with the market numbers.
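One way the per-pitcher adjustment could be computed is sketched below. The rotation, WAR values, and expected starts are invented, and scaling a pitcher’s WAR by 144 divided by his expected starts to get a “whole season” figure is an assumption, since the steps above don’t say how that number was derived.

```python
SEASON_GAMES = 144

# Hypothetical rotation: (estimated 2020 WAR, expected starts) per pitcher.
rotation = {
    "Ace": (5.0, 30),
    "Two": (3.0, 30),
    "Three": (2.0, 28),
    "Four": (1.0, 28),
    "Five": (0.5, 28),
}
staff_war = sum(war for war, _ in rotation.values())

def win_pct_adjustment(war, starts):
    """Win% shift versus the baseline when this pitcher starts the game."""
    full_season_war = war * SEASON_GAMES / starts  # assumed scaling
    return (full_season_war - staff_war) / SEASON_GAMES

for name, (war, starts) in rotation.items():
    print(name, f"{win_pct_adjustment(war, starts):+.3f}")
```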

I also stubbed out a scraper to grab opening lines. I’m just waiting for the lines to be posted to finish it.

Future To Dos/Ideas:

  • Factor in season-to-date records in win expectancy calcs. I made this mistake in last year’s MLB model. I had intended to factor in actual records over the course of the season, but screwed it up. We’re only a few games into the season, so the adjustments at this point will be small, but better to do it before I forget.

KBO Model Day 6: Next Steps

Just wanted to jot some quick notes to start the day. I may come back later as time allows.

First, recap of today’s action:

  • 2-3 (thanks to a walkoff homerun by the Dinos in the 10th a few minutes ago)

Yesterday, I adjusted the spreadsheet that is standing in for the model to compare my win percentages against the breakeven odds of the bets, rather than the market implied win percentages. I did so after placing today’s bets. Had I done so earlier, I would have placed only 4 wagers, because the Dinos had a -0.75% value against the breakeven percentage. That would have made me 1-3.

I duplicated my current sheet and made a few cleanups:

  • Made sure my new homefield advantage constant was being used
  • Calculated edge by delta between my win expectancy and the breakeven percentages
  • Like my MLB model, I added a bet sizing function based on edge size
  • Added some better calculations for spitting out what the play is, what the risk and to win numbers are
  • Calculating results based on who won each match up
  • Applied all of the above back to the beginning of the season to get a look at how the more “mature” model would have performed to date.

Here is a comparison between what’s happened to date, and the more mature model:

| Model | Actual To Date | Original Model | Mature Model |
|---|---|---|---|
| Wins | 6 | 10 | 7 |
| Losses | 13 | 17 | 15 |
| Win % | 31.58% | 37.04% | 31.82% |
| Total Plays | 22 | 31 | 24 |
| ROI | -38.39% | -16.50% | -38.15% |

None of that is great. The original model is more aggressive, both making more plays and betting higher, flat amounts. So while the ROI on it seems best, it’s down more than either reality or the revised model.

The big takeaway here for me is that the real secret sauce in the model should be the bottom-up, player-based adjustments. I need to work on that before going any further.

To that end, on Sunday I expanded my scraper to grab the rosters for each team from MyKBOStats and just dumped them into a text file for now so that I wouldn’t have to keep scraping. I’ve saved the following attributes:

  • Team Name
  • Position group (Pitchers, Catchers, Infielders, Outfielders)
  • Anglicized name
  • MyKBOStats player page URL
  • Whether the player is on the active roster or not
  • Korean name

Now that I have a database, I’ll need to figure out how to fetch stats and build statistical profiles. MyKBO has some cursory stats, but Fangraphs or even some sites in Korean will likely be a better source.

Future potential ideas:

  • How is homefield advantage tracking this year? I know it’s small sample size, but the stadiums are empty and well, just the entire world is different this year. Might give me an indicator on the potential impact of playing in empty stadiums for upcoming sports.

KBO Model Day 5

In general, the KBO takes Mondays off, so there was no action overnight. However, with the COVID-delay to the season, it seems like they will be playing Mondays going forward.

But today, I needed a break. First, the recap from Sunday:

  • 0 - 5

Got crushed. There was some crazy action, like multiple huge comebacks IN THE SAME GAME. Bad days happen, so I’m not going to dwell. But I figured I’d describe my process until the model gets fully automated.

First, I manually grab each matchup for the day and put them into a Google Sheet that for now is the model. The sheet does a lookup on the Fangraphs ZiPS full season win projections to get the baseline win expectancy. I then apply my proprietary KBO home field advantage to the home team, then calculate the new win expectancy for each team in each matchup.

Next, I grab the market prices for each team in each matchup. The sheet then calculates the following:

  • Implied odds based on the market prices
  • Adjusted win % based on market implied odds
  • The delta between my win expectancy and those adjusted odds*
  • Which team the model favors more against the adjusted win %*

I ignore the results of these calculations until I eye-ball test the pitchers. I do this by going to MyKBOStats, selecting each matchup, then each pitcher, which I then look up on Fangraphs. Since I’m looking at just pitchers, I try to compare the last 3 years of FIP to see which pitcher seems better. This isn’t perfect because a lot of pitchers only have KBO stats, but some are washed-out MLB players whose projections are based on MLB expectations.

I’ll color code each team based on which pitcher I like better.

  • Yellow for both if it’s a wash
  • Lighter to darker shades of green for pitchers I like more
  • Lighter to darker shades of red for pitchers I like less

I like to compare the pitching matchups “blind” to the model so that I don’t overvalue the starter for a team the model likes more, which keeps my assessments a bit more objective. Then, I compare the starter evaluations to the model’s recommendations. I’ll nudge recommendations based on the net pitching assessment.

Finally, after weighing how strongly the model recommends a team along with the pitching matchup, I’ll determine my bet size. The stronger the net recommendation, the larger the bet.

*Starting with Wednesday’s games, I’ll be comparing my win expectancy with the raw market implied odds, without adjusting them. These are the “break even” odds of the bet, and are a better judge of value for the bet. The way I have been doing it, the adjusted odds and my odds always added up to 1, which meant wherever the model found value on one team, it found an inverse value on the other side. However, I should not be ignoring the “juice”. One of the plays I made for Tuesday’s games has a negative value for BOTH sides when comparing to the breakeven percentages. This will help me weed out games that I should avoid playing. This is how my original MLB model worked as well.
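A small sketch of the difference between the two comparisons, with made-up prices and win expectancies:

```python
def breakeven(moneyline):
    """Breakeven win probability for an American moneyline."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

def normalize(p_home, p_away):
    """Strip the juice so the two sides sum to exactly 1."""
    total = p_home + p_away
    return p_home / total, p_away / total

# Hypothetical game: home -150, away +130, and my own win expectancies.
my_home, my_away = 0.58, 0.42
be_home, be_away = breakeven(-150), breakeven(130)
adj_home, adj_away = normalize(be_home, be_away)

# Against adjusted odds the two edges are mirror images of each other.
edges_adjusted = (my_home - adj_home, my_away - adj_away)
# Against breakeven odds both sides can be negative: no play.
edges_breakeven = (my_home - be_home, my_away - be_away)

print([round(e, 4) for e in edges_adjusted])
print([round(e, 4) for e in edges_breakeven])
```

With adjusted odds there is always some “value” on one side, however thin. Against the breakeven numbers this game is a pass on both sides.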

KBO Model Day 4

Yesterday I refined my KBO home field advantage constant for my KBO model. I also tried to see how important starting pitchers are, as I’m still not factoring them directly into the picks. Turns out, it looks like starting pitchers surrender 60% of runs scored in KBO games. I’ve started to explore Fangraphs KBO data, and while it seems like it’s important, it will take some time to figure out how to automate its use in the model.

In the meantime, I wanted to check the performance of the model to date. First, here’s today’s results. Best day so far!

  • 5 bets (one on each game)
  • Took all 5 favorites
  • Went against the model in 3 of the games based on pitching matchups
  • 4 home teams, 1 road
  • 2 wins, no losses, and 3 no action (games were rained out)
  • 1 win was the model’s pick, 1 was betting against the model

I haven’t been religiously sticking to the recommendations, so here’s the performance of what I’ve actually done to date:

  • 12 bets
  • 6 home, 6 road
  • 6 favs, 6 dogs
  • 4 wins, 5 losses, 3 no action
  • Win percentage: 44% (not great, Bob.gif)
  • -14.57% ROI (cool cool cool.gif)

To recap:

  • Went 1-4 on day 1
  • I took the second day off after the first day was a bloodbath
  • Placed only 2 bets out of 5 on day 3, but bet AGAINST the model on one of the games after looking at the starting pitchers. I won the model play and lost the one I went against the model.
  • Day 4 (today): won 2, 3 rained out

So, all around bad start to my KBO experience. Now let’s look at just the model’s performance:

  • 20 plays
  • 13 home, 7 road
  • 3 favs, 17 dogs (!)
  • 8 wins, 9 losses, 3 no action
  • Win percentage: 47%
  • 8.25% ROI

So, if I had just stuck to the model I’d be in better shape. However:

  • Both sets are EXTREMELY low sample size
  • A model like this, if at all accurate, benefits from higher volumes of play
  • The model-only plays are profitable, even though it’s won less than half of its plays. Why? Because we bet on moneylines in baseball and the model has found value on dogs: you win more than you risk when underdogs win.
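To see how a sub-.500 record can still be profitable, here’s a toy example with flat 1-unit bets. The +150 price is invented and is not the model’s actual set of lines, so the ROI it produces will not match the 8.25% above.

```python
def profit(stake, moneyline, won):
    """Net profit on one American moneyline bet."""
    if not won:
        return -stake
    if moneyline > 0:
        return stake * moneyline / 100
    return stake * 100 / -moneyline

# Hypothetical: 17 decided bets, all on +150 underdogs, 8 wins and 9 losses.
bets = [(150, True)] * 8 + [(150, False)] * 9

total = sum(profit(1.0, moneyline, won) for moneyline, won in bets)
win_rate = sum(won for _, won in bets) / len(bets)
roi = total / len(bets)
print(f"win rate {win_rate:.0%}, profit {total:+.1f} units, ROI {roi:.1%}")
```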

I’m still not ready to blindly bet the model, but will likely take a blended approach of looking at starting pitchers manually until I can factor them directly into the model.

Sports are starting to come to life again. Tonight is a big UFC slate, NASCAR is coming back in the next couple weeks, as well as the Bundesliga. Since I know nothing about any of those, I’ll probably keep focusing on KBO. I have a friend who’s pretty knowledgeable about UFC, so I may kick the tires in exploring how those markets work.

KBO Model Day 3: KBO Home Field Advantage Update and Starting Pitchers

Note: I’ve fixed an issue with the scraper that didn’t include double headers. You can see the updated KBO home field advantage here.

Yesterday I took a crack at finding the KBO’s home field advantage as an input into my KBO prediction model. I had hoped to find a unique value for the league’s home field advantage to bake into my model. Unfortunately, after grabbing results from the last 4 years, it turns out that the league’s home field advantage was the same as the MLB number I had used as a placeholder. For some reason my scraper would die in the 2015 season.

I did some debugging and realized that there was just a single game that had a bad data layer that the parser couldn’t deal with. After hard coding it to skip that game, I was able to compile records for the last 10 seasons.

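| Season | Home Team Wins | Visiting Team Wins | Total Games | Home Field Advantage |
|---|---|---|---|---|
| 2019 | 331 | 262 | 593 | 11.64% |
| 2018 | 387 | 322 | 709 | 9.17% |
| 2017 | 402 | 348 | 750 | 7.20% |
| 2016 | 401 | 360 | 761 | 5.39% |
| 2015 | 392 | 358 | 750 | 4.53% |
| 2014 | 266 | 253 | 519 | 2.50% |
| 2013 | 299 | 296 | 595 | 0.50% |
| 2012 | 285 | 277 | 562 | 1.42% |
| 2011 | 267 | 238 | 505 | 5.74% |
| 2010 | 258 | 262 | 520 | -0.77% |
| Total | 3288 | 2976 | 6264 | 4.98% |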

Whoa! So while over the last four seasons, the KBO home field advantage has been around 8%, over the last decade it’s less than 5%! And at the beginning of that span, it was negligible! Graphically, here’s each year’s HFA, and the cumulative HFA since the beginning of the sample:

KBO Home Field Advantage, 2010-2019

The home field advantage has been steadily climbing. While I could probably go further back, I’m not sure it would be useful for my model, which is what I really care about here. And given the shape here, I’ve decided to weigh the last 3 years at 3x, previous 3 at 2x and first 4 at 1x. My new HFA constant is 6.67%.

For today’s games, I factored that new HFA in. The model had 2 games with about a 10% disagreement with the market, and 3 games with less than a 3% disagreement. I opted to play the top 2, but looked at the starting pitchers first. I did this manually as I still don’t have a bottom-up way to factor in individual players. My “eye-ball” test said I should stick with the home dog Samsung Lions over the Kia Tigers. However, I liked the pitcher better for the LG Twins, so I went against the model and bet against the NC Dinos.

The results: 1-1. Would have been 2-0 if I had stuck with the NC Dinos, and 4-1 if I had played all 5 of the model’s recommendations.

Since the games happen at 5:30, I’ve been checking them out bleary-eyed around 6 AM when my 6-year-old bursts through the door. I’m starting to get a sense that runs happen in bunches in the second half of the game. Perhaps starting pitchers are less important than I thought, or at least less significant in the KBO than in the MLB? There are lots of theories like this to develop and test.

Since I have the data saved in JSON files thanks to my scraper, I was able to whip up a quick analysis of runs by inning over time, as well as the length of an average start, re-using the code I wrote to determine the home field advantage.

KBO Runs Scored by Inning, 2010-2019

Percent of KBO Runs Scored by Inning, 2010-2019

Over the last 5 years, starters have averaged just a shade over 5 innings per game in the KBO. Over that same period, the first 5 innings have accounted for an average of 58.5% of scoring. Seems valuing starting pitchers will still be an important input to the model.
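The share-of-scoring calculation is simple once the line scores are loaded. The JSON shape below is a guess at what a saved game might look like, not the scraper’s actual format, and the single game in it is made up.

```python
import json

# Hypothetical saved data: one entry per game with runs per inning for each side.
raw = """
[
  {"away": [0, 1, 0, 2, 0, 0, 3, 0, 1], "home": [1, 0, 0, 0, 2, 0, 0, 1, 0]}
]
"""
games = json.loads(raw)

# Tally runs by inning number across both teams and all games.
runs_by_inning = {}
for game in games:
    for side in ("away", "home"):
        for inning, runs in enumerate(game[side], start=1):
            runs_by_inning[inning] = runs_by_inning.get(inning, 0) + runs

total_runs = sum(runs_by_inning.values())
first_five = sum(runs for inning, runs in runs_by_inning.items() if inning <= 5)
first_five_share = first_five / total_runs
print(f"{first_five_share:.1%} of runs scored in innings 1-5")
```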

Luckily, I just saw this morning that Fangraphs now has player level ZiPS projections. I’ll dig into that later today as well.

KBO Model Day 2: KBO Home Field Advantage

Yesterday I discussed the first simplistic step in building my KBO model. Given my reservations after its first performance, I skipped action today, but for completeness, it again recommended all 5 dogs, 3 home and 2 visitors. The results: 2-3. Progress!

The most important part of my MLB model was the adjustments made to win expectancy based on who was slated to start each game. Given that I still don’t have a path forward on that, I wanted to revisit home field advantage. As a recap, I could not find anything online about an established KBO home field advantage, so I figured I’d try to calculate it myself.

I found the KBOPrediction Python repo on GitHub. While it seems like the objective of the code is to “employ the notion of Deep Learning to predict” individual game results, the only part I care about here is the scraping and legwork on translating to English. While it is true that I too want to predict the results of individual games, this model’s approach just relies on the previous k games for each team. That’s not as sophisticated as I need.

After doing some hacking, I was able to run the scraper through a number of seasons to get the record of home and away teams. The results:

Season  Home Team Wins  Visiting Team Wins  Total Games  Home Field Advantage
2019    331             262                 593          11.64%
2018    387             322                 709          9.17%
2017    402             348                 750          7.20%
2016    401             360                 761          5.39%
Total   1521            1292                2813         8.14%

Well, shoot. I used a flat 8% for home field advantage in the model, and that’s what it’s been in the KBO on average over the last four seasons. I was trying to go back 10 seasons, but the scraper keeps breaking in 2015. I’ll see if I can fix that later.
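The home field advantage column is just the gap between the home and visiting win percentages. Here is a quick sketch in Python that reproduces it from the win counts in the table. The function and variable names are mine for illustration, and it assumes every game in the sample was decided (total games = home wins + visiting wins, so no ties).

```python
def home_field_advantage(home_wins, away_wins):
    """Home win % minus visiting win %, over decided games only."""
    total = home_wins + away_wins
    return (home_wins - away_wins) / total


# (home wins, visiting wins) per season, copied from the table above.
seasons = {
    2019: (331, 262),
    2018: (387, 322),
    2017: (402, 348),
    2016: (401, 360),
}

for year, (home, away) in seasons.items():
    print(f"{year}: {home_field_advantage(home, away):.2%}")

total_home = sum(home for home, _ in seasons.values())
total_away = sum(away for _, away in seasons.values())
overall = home_field_advantage(total_home, total_away)
print(f"Total: {overall:.2%}")
```

Note that the total is computed from the pooled win counts rather than by averaging the four season percentages, so seasons with more games carry more weight.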

Another thing to address later is the number of games in my sample. For some reason, I’m short games in 2019. It looks like I have roughly the right number of games for 2018, and it seems the playoffs are included in the 2016-2017 numbers. The scraper I’m using takes years and months as input, so maybe I need to be more rigorous about which games to grab.

Previously: KBO Model Day 1


KBO Model Day 1

Yesterday was opening day for the KBO, the Korean baseball major league. While I missed it, as in, I didn’t follow it, I was excited for the first semblance of live sports returning.

To celebrate, and atone for missing opening day, I’ve decided to make a model similar to my 2019 MLB model, only applying what I have learned since.

BaseballMitchster

There is a ton more I can and should write about last year’s MLB model, but here’s the gist:

  • I based the work on Joe Peta’s Trading Bases
  • As a baseline, I used full season win projections.
  • For each match up, I’d use the announced starting pitchers and lineups to adjust the baseline winning percentages. For example, if a team was expected to win 82 games (roughly a .500 record, or 50% of their games), but was starting their ace, whose WAR was significantly higher than the rest of the staff, the team might be expected to win 98 games (.605, or 60% of their games). These adjustments were applied to every position player as well.
  • Historically, home teams in the MLB have won 54% of their games. This has been an extremely stable fact for over 100 years. This equates to an 8% home field advantage (54% - 46% = 8%). So I’d add 8% win percentage to the home team.
  • To calculate the final win expectancy for both teams, I’d divide each team’s adjusted win percentage by the sum of the two percentages.
  • I’d compare those percentages to the implied probability from the betting market moneylines. Where my win expectancy for a team exceeded the market’s implied odds, I’d bet on that team. The higher the discrepancy, the larger the wager.
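To make the last three steps concrete, here is a rough sketch in Python. It illustrates the arithmetic described in the list and is not the actual code behind the model. The function names, the example win percentages, and the example moneyline are all made up. The moneyline conversion is the standard one for American odds.

```python
HOME_FIELD_ADVANTAGE = 0.08  # flat bump added to the home team, per the list above


def win_expectancy(home_pct, away_pct, hfa=HOME_FIELD_ADVANTAGE):
    """Add the home bump, then normalize so the two sides sum to 1."""
    home_adj = home_pct + hfa
    total = home_adj + away_pct
    return home_adj / total, away_pct / total


def implied_probability(moneyline):
    """Convert an American moneyline to the market's implied win probability."""
    if moneyline < 0:  # favorite, e.g. -150: risk 150 to win 100
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)  # underdog, e.g. +130: risk 100 to win 130


def edge(model_prob, moneyline):
    """Positive when the model likes a team more than the market does."""
    return model_prob - implied_probability(moneyline)


# Hypothetical matchup: a .500 home team hosting a .550 visitor, home priced at +130.
home_prob, away_prob = win_expectancy(0.500, 0.550)
home_edge = edge(home_prob, 130)
print(f"Home win expectancy: {home_prob:.1%}, edge vs. +130: {home_edge:+.1%}")
```

One caveat: the implied probabilities on both sides of a real line add up to more than 100% because of the book’s vig, which this sketch ignores.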

I started all this work before spring training, and tested the model and automations throughout it so that I was ready for opening day.

For the KBO, I was already a day late. So I started as simply as I could for today’s slate.

KBOMitchster, Day 1

Not only am I late to the party, I’m data poor. There is a lot of really rich, interesting MLB data thanks to sabermetric sites like Baseball Prospectus and Fangraphs. I’m still trying to wrap my head around what data is available for the KBO. A lot of resources are, unsurprisingly, in Korean. Which I don’t read.

Luckily, Fangraphs published ZiPS-based projected standings, including total win projections. So the first cut of the model starts with that as a wins baseline. I don’t have granular WAR or similar player values yet, so I’m having to roll with just the baseline wins for now.

For home field advantage, some quick googling did not uncover a published figure for the KBO, so I rolled with the 8% MLB number. For today’s games (which happen at 5:30am ET), the simplified model spit out picks on all 5 games. All 5 were heavy dogs, 3 home and 2 away, with “edges” of 3-10%.

The results: 1-4. Not great. Small sample size and all, but I’m pausing until I can refine a bit. I’ve found a scraper for historical KBO results, so I’m going to try to work out what the real home field advantage in KBO is. Perhaps I’m overestimating the impact of home field in KBO, especially given that all the stadiums are empty right now.

Next up: KBO Model Day 2: KBO Home Field Advantage