Remember back in 2021 when Gen Z tried to tell everyone to move their side parts to the middle and swap their skinny jeans for a looser variety? While most Millennials responded with outward indignation, offline they begrudgingly tried on high-waisted mom jeans and posted up in the bathroom blowing out their hair in a new direction. But before long they let their hair go back to lying in the manner to which it had become accustomed and eschewed jeans completely in favor of athleisure-wear. Even as many of us considered complying with the directive of our teenaged overlords, it felt absurd that people who haven’t even finished developing their prefrontal cortexes are left in charge of dictating what’s cool. As it turns out, though, that’s exactly why teenagers decide what’s cool. Teenagers are the only members of society with the time, energy, and lack of rationality to care so deeply about something that matters so very little. Those who stuck to their dated stylings and weathered the petty hail storm of Zoomer mocking were vindicated a couple of months ago, when the celebrity and influencer cohort brought back the side part, declaring it on-trend once more.
Around that same time another trend was taking hold among the baseball commentariat: Using strength of schedule to determine which teams had actually earned their W-L records. Mostly, this meant arguing that the Phillies weren’t a top team in the league because they’d played a soft schedule. The discourse eventually spawned multiple articles arguing that while yes, Philadelphia hadn’t exactly been slaying dragons while walking a tightrope, its act wasn’t entirely smoke (generated by the clubhouse fog machine) and mirrors either.
Strength of schedule is not typically a prominent talking point when comparing MLB teams. It might occasionally come up when comparing September schedules in a tight postseason race, but as a phrase uttered in May, it typically means you’re in a college baseball discussion or you’ve wandered into a BCS-era college football forum. College sports need strength-of-schedule metrics because teams don’t all play one another and the variation in team quality spans the Big Ten’s new geographical footprint. But in the major professional leagues, the schedule is fairly balanced, and even though the White Sox and Rockies exist, dominating the worst teams in MLB presents a tougher task than rolling over the University of Maryland Baltimore County Golden Retrievers.
But even though strength of schedule seemingly lacks utility in a professional baseball context, the amount of mud slung at seemingly good teams had me questioning my own assumptions. Maybe there is useful information to uncover in the muck.
So with roughly 90 games on each team’s odometer, I decided to pump the brakes and figure out if a team’s early-July winning percentage, combined with its strength-of-schedule rating (SoS), could more accurately predict its final record than its midseason record alone. I gathered up each team’s strength of schedule and W-L record through a comparable point in mid-July for 2021, ’22, and ’23. Next, I calculated each club’s remaining strength of schedule as it would have stood at that point in the season, threw all three values (Win%, SoS so far, and remaining SoS) into a basic linear regression model, and trained it to predict the team’s record at the end of the year. As a baseline comparison, I also trained a model that considered only the midseason win rates of teams.
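For anyone who wants to try a comparison like this at home, here is a minimal sketch of the two-model setup in plain Python. It is not the code behind the results in this piece: the 90 team-seasons are randomly generated, the noise levels are made up, every function name is hypothetical, and the models are scored on the same data they were fit to, which is an assumption of the sketch.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def evaluate(X, y):
    """Fit on (X, y) and return in-sample R^2 and mean absolute error."""
    beta = fit_ols(X, y)
    y_hat = [beta[0] + sum(b * v for b, v in zip(beta[1:], x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)
    return 1 - ss_res / ss_tot, mae

# Synthetic data only: 30 teams x 3 seasons, all values as deviations from .500.
random.seed(0)
X_base, X_sos, y = [], [], []
for _ in range(90):
    talent = random.gauss(0.0, 0.06)                 # made-up true talent
    mid_win = talent + random.gauss(0.0, 0.04)       # midseason Win%
    sos_so_far = random.gauss(0.0, 0.01)             # SoS through midseason
    sos_remaining = random.gauss(0.0, 0.01)          # SoS for remaining games
    final_win = talent - 0.5 * sos_remaining + random.gauss(0.0, 0.03)
    X_base.append([mid_win])
    X_sos.append([mid_win, sos_so_far, sos_remaining])
    y.append(final_win)

r2_base, mae_base = evaluate(X_base, y)
r2_sos, mae_sos = evaluate(X_sos, y)
print(f"baseline: R^2={r2_base:.3f}  MAE={mae_base:.4f}")
print(f"with SoS: R^2={r2_sos:.3f}  MAE={mae_sos:.4f}")
```

One thing to keep in mind when scoring this way: adding features can never lower in-sample R², so the gap between the two models matters more than the sign of the difference.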
Did the SoS model outperform the baseline model? No, it did not. Both models explained roughly 78% of the variation in the final winning percentages and made predictions with an average error of about 30 points of winning percentage (.030). In the SoS model, neither of the SoS features was deemed statistically significant, though the remaining SoS metric came closer to providing some useful input.
But part of the pushback against the strength-of-schedule girlies concerned the context that gets tossed aside when you flatten a team into a single value. On the one hand, there’s the old Bill Parcells quote, “You are what your record says you are.” On the other hand, your record can only say so much. Standard SoS averages the winning percentages of a team’s opponents, with some debate over whether to use the team’s record from the time the game was played or update the calculation continuously throughout the season. Teams can be streaky, and how well a team is playing at the time of a matchup, in addition to the health of the roster, factors into the difficulty of the matchup. Winning percentages based on larger samples are more likely to represent a team’s true talent, but they discard in-the-moment context.
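To make the two flavors of standard SoS concrete, here is a small sketch. The team names, records, and function names are hypothetical, and it treats every game equally, which is a simplification rather than any site's official formula.

```python
def win_pct(wins, losses):
    """Winning percentage, defaulting to .500 for a team with no games."""
    games = wins + losses
    return wins / games if games else 0.5

def sos_current_records(opponents, records):
    """Average opponent Win% using each opponent's record as it stands now.

    opponents: one entry per game played (repeat a team for repeat matchups).
    records:   {team: (wins, losses)} as of today.
    """
    return sum(win_pct(*records[o]) for o in opponents) / len(opponents)

def sos_at_game_time(game_log):
    """Average opponent Win% using each opponent's record when the game was played.

    game_log: one (opp_wins, opp_losses) pair per game played.
    """
    return sum(win_pct(w, l) for w, l in game_log) / len(game_log)

# Hypothetical example: three games, two against B and one against C.
records_today = {"B": (6, 4), "C": (3, 7)}
print(sos_current_records(["B", "B", "C"], records_today))

# Same three games, but with the opponents' records frozen at first pitch.
print(sos_at_game_time([(2, 0), (5, 5), (3, 7)]))
```

The two functions can disagree quite a bit early in a season, when an opponent's record at the time of the game rests on only a handful of results.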
Fortunately, we’ve known for a while that winning percentage doesn’t tell a team’s whole story, leading to updated versions of the classic W-L record that capture at least some additional context. Pythagorean W-L was developed by Bill James and uses a team’s run differential to determine its expected W-L record. Here run differential acts as a proxy for a team’s proclivity for both scoring and preventing runs, which tends to be more indicative of its actual ability to win games than its record is, since wins and losses are garnished with a larger dollop of randomness and luck.
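As a quick illustration, the classic form of the Pythagorean expectation squares runs scored and runs allowed. The sketch below uses that exponent of 2 as its default; other exponents are in circulation, the function names are hypothetical, and the run totals are invented for the example.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected Win% from runs scored and allowed (classic exponent of 2)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def pythagorean_record(runs_scored, runs_allowed, games, exponent=2.0):
    """Expected (wins, losses) over a given number of games."""
    wins = pythagorean_win_pct(runs_scored, runs_allowed, exponent) * games
    return wins, games - wins

# Hypothetical team: 450 runs scored, 400 allowed through 90 games.
exp_wins, exp_losses = pythagorean_record(450, 400, 90)
print(f"expected record: {exp_wins:.1f}-{exp_losses:.1f}")
```

Comparing that expected record to the actual one gives a rough read on how much of a team's standing comes from run scoring and prevention versus the order in which the runs happened to arrive.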
BaseRuns record goes a step further in cleansing the calculation of randomness by using the average run value associated with players’ actions on the field to determine the team’s expected run differential, rather than a run differential that may be inflated due to a fortuitous sequencing of hits.
Does calculating strength of schedule using win percentages based on Pythagorean W-L or BaseRuns W-L add enough context to create a metric that improves on the baseline model’s predictions? Answer: a little. Both the Pythagorean and BaseRuns versions of the model were able to explain 81% of the variation in teams’ 162-game win rates, up from 78%. The average error dropped a few points as well, from 30 down to 27. It’s a slight improvement, but still not enough to suddenly convince me that strength of schedule as a metric has any super dishy secrets to spill about the true talent of a team.
In one final attempt to make fetch happen, I figured since we’re already borrowing tactics from college sports analysis, we might as well really do the thing. Boyd’s World posts Iterative Strength Ratings (ISRs) for college baseball, which work similarly to Elo ratings in chess, to assign each team a score based on the quality of its opponents and its outcomes against said opponents. Which is to say, a team gets more credit for beating a good team than a bad one, and is docked more for losing to a bad team than a good one. Lastly, I added Relative Power Index (RPI) to the pile, which ESPN defines as “25% team winning percentage, 50% opponents’ average winning percentage, and 25% opponents’ opponents’ average winning percentage.”
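The quoted 25/50/25 weighting is simple enough to sketch directly. The version below is a bare-bones reading of that definition, not ESPN's implementation: the function name and the three-team schedule are hypothetical, each opponent is counted once per game played, and no adjustment is made for a team's own games when averaging its opponents' records.

```python
from collections import defaultdict

def rpi(games):
    """RPI per team from a list of (winner, loser) results.

    Weights follow the quoted definition: 25% own Win%, 50% opponents'
    average Win%, 25% opponents' opponents' average Win%.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    opponents = defaultdict(list)  # one entry per game played
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    teams = list(opponents)
    wp = {t: wins[t] / (wins[t] + losses[t]) for t in teams}
    owp = {t: sum(wp[o] for o in opponents[t]) / len(opponents[t]) for t in teams}
    oowp = {t: sum(owp[o] for o in opponents[t]) / len(opponents[t]) for t in teams}
    return {t: 0.25 * wp[t] + 0.50 * owp[t] + 0.25 * oowp[t] for t in teams}

# Hypothetical round robin: A beats B and C, B beats C.
ratings = rpi([("A", "B"), ("A", "C"), ("B", "C")])
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(team, round(ratings[team], 5))
```

Notice that only a quarter of the rating comes from the team's own results, which is worth keeping in mind when reading how an RPI-based model fares.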
The ISR version of the model performed comparably to the BaseRuns and Pythagorean models, with a slightly worse average error on the predictions. Meanwhile, the RPI model was worse than the baseline model across the board.
Despite the internet’s best efforts to shake me from what I thought was a fairly non-controversial belief that strength of schedule doesn’t matter all that much in a professional league with a 162-game season, I believe we have successfully touched grass and locked back in with reality and what actually matters, and it ain’t SoS. But with that said, we also learned that when calculated with a bit more context, SoS does matter a teeny, tiny bit. So as we enter trade deadline season, is there anything SoS can offer to sway our opinions on whether teams should buy or sell? If a team at the back of the pack in the wild card hunt has played a tough schedule so far, but has a relatively easier slate in the second half, is that a large enough factor…