Common refrains among SOM players are "he was lucky !" or "MY team was much better !" Is it possible to separate the hyperbole from the fact ? When can you state that one team is clearly better than another ? Or to put it another way, how much 'distance' between two teams is really significant ? This question can be answered with hard numbers if you are willing to stomach a little math. This article reviews the appropriate statistical theory and gives a summary 'rule-of-thumb' that can be used for a quick answer. The focus of this discussion is on analyzing winning % (WP), although the technique could be applied to any proportion-of-success measure such as batting average, on-base %, etc.
A Table for Various Confidence Levels
We begin with a review of some necessary statistical preliminaries.
Assume that the random event under consideration can be modeled as independent Bernoulli trials. This means three things: (1) each outcome can be classified into one of two categories, "success" or "failure", (2) the probability of a success, denoted by p, is constant from trial to trial, and (3) each trial result has no influence on any other trial result. The number of successes in n such trials is termed a Binomial random variable (RV). Typically one assigns a "1" to success and a "0" to failure for computational purposes. Finally, one denotes the probability of a failure by q, and clearly p+q=1.
Purists will (correctly) argue that (2) isn't satisfied in the case of baseball W/L results - your odds of winning against Greg Maddux are much different than when facing Felipe Lira ! It is possible to abandon assumption (2) and still end up in about the same place, but the math becomes much more complicated (you have to employ something called random process theory). For the purposes of this discussion, suffice it to say that as long as (2) isn't significantly violated (p fluctuates between say 0.35 and 0.65) the results are essentially unchanged.
The expected value (or average in common parlance, otherwise known as the mean) of an n-sample Binomial RV is (big surprise !) n*p, where n is the number of trials. All this really says is you can anticipate getting about n*p successes in n trials. Of course, you won't always see exactly n*p successes as otherwise we wouldn't have a RV !! However, the larger the number of trials, the closer the proportion of successes is likely to be to p, and in the limit (which is a fancy way of saying take trials until you wear out your dice !) that proportion converges to p. That is to say, if you divide the number of successes by the number of trials, the answer settles on p as the number of trials grows without bound.
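As a rough illustration of that convergence, here is a small simulation sketch. The function name, the 0.55 win probability, and the fixed seed are all assumptions made for the example, and the computer's random number generator stands in for the dice.

```python
import random


def simulate_win_proportion(p, n, seed=0):
    """Play n independent games with win probability p; return wins / n.

    Illustrative helper only: the seed is fixed so a run can be repeated.
    """
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n) if rng.random() < p)
    return wins / n


# Assumed example: a team whose true WP is 0.55.
for n in (10, 100, 1000, 10000, 100000):
    print(n, round(simulate_win_proportion(0.55, n), 4))
```

The short runs can land well away from 0.55, while the long runs should sit close to it.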
Now, if all this talk of infinite trials seems silly to you, well, you are right - clearly there is not much useful about statements like this. Fortunately, you can say something about how quickly the results will approach the 'right' answer. A quantity known as the variance gives you an idea of how close you can expect to get to the average given n trials. Variance can be defined as the average of the square of the RV minus the square of the average of the RV. In the case of a Binomial RV, you end up with Var=n*p*(1-p)=n*p*q. The square root of the Variance is called the Standard Deviation (SD).
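The variance formula can be checked numerically. The sketch below builds the Binomial probabilities for an assumed case (n=30, p=0.6, chosen only for illustration) and applies the definition above: the average of the square minus the square of the average. The function name is hypothetical.

```python
from math import comb


def binomial_mean_and_variance(n, p):
    """Mean and variance of a Binomial(n, p) count, computed from its probabilities."""
    q = 1.0 - p
    # Probability of exactly k successes in n trials, for k = 0..n.
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    mean_of_square = sum(k * k * pk for k, pk in enumerate(pmf))
    variance = mean_of_square - mean**2
    return mean, variance


mean, variance = binomial_mean_and_variance(30, 0.6)
print(mean, 30 * 0.6)            # compare with n*p
print(variance, 30 * 0.6 * 0.4)  # compare with n*p*q
```

Both printed pairs should agree up to floating-point rounding.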
Variance is a really useful measure because of something known as the Central Limit Theorem. The details are really boring, but here's the highlight as applied to our situation: For large enough n, a Binomial RV acts just like a Gaussian RV (Gaussian RVs follow the infamous bell-shaped curve you heard about in high school). This is handy because lots of statistically important stuff is known about Gaussian RVs. In particular, a Gaussian RV will fall within 1 SD of its mean 68% of the time, within 2 SD of the mean 95% of the time, and within 3 SD of the mean over 99% of the time.
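Those three percentages can be reproduced with the error function from Python's standard math module. This is only a quick check of the quoted figures, and the helper name is made up for the example.

```python
from math import erf, sqrt


def chance_within_k_sd(k):
    """Probability that a Gaussian RV falls within k SD of its mean."""
    return erf(k / sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(chance_within_k_sd(k), 4))
```

The results come out at roughly 0.683, 0.954 and 0.997, matching the 68%, 95% and "over 99%" figures.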
Now we need to make a small refinement. Since we are interested in WP and not the gross number of wins, we can define our RV to be the proportion of wins observed in n games. It's easy to show that this RV has mean p (no big surprise there !) and Variance p*q/n. This gives us a handy yardstick with which to assess performance data.
As an example, say you win 20 of your first 30 games this year - is this good evidence to show that you really have a better than 0.500 team ? Well, we can form an answer by assuming that the observed data gives an accurate picture of the team's abilities, and then see if 0.500 is distant in a statistical sense from the observed performance. The WP data has a mean of 0.667 and a SD of 0.086. Then 0.500 is 0.167/0.086=1.94 SD below the mean. The chances of your team really being a below 0.500 club are somewhere around 3%. This is because a Gaussian RV has a 0.974 chance of falling AT OR ABOVE 1.94 SD below the mean.
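The arithmetic in this example can be retraced in a few lines. This is a sketch under the same Gaussian approximation used in the text, and the variable and function names are assumptions for illustration.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Chance that a standard Gaussian RV falls at or below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


wins, games = 20, 30
p_hat = wins / games                        # observed WP, about 0.667
sd = sqrt(p_hat * (1.0 - p_hat) / games)    # SD of the WP, about 0.086
z = (p_hat - 0.500) / sd                    # distance from 0.500 in SDs
chance_below_500 = normal_cdf(-z)           # chance of sitting below 0.500

print(round(p_hat, 3), round(sd, 3), round(z, 2), round(chance_below_500, 3))
```

The last figure is the "somewhere around 3%" quoted above, and one minus it is the 0.974 chance of falling at or above 0.500.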
When trying to get a handle on information about a RV (like its mean), it is common to construct what is known as a confidence interval (CI) for the item in question. This is really less complicated than it sounds. A CI is just a restatement of what was said above. In the previous example, we could just as well have said "a 0.974 confidence interval for the true winning % is (0.500,1.0)." Note that (x,y) is a common mathematical shorthand for "the range of values between x and y". In other words, a CI is just a probability statement about a given range of values for the target parameter of interest (the mean of the RV in this example).
Typically CIs are symmetric about the mean of the target parameter. Also, usually the probability that the target parameter falls within the CI is set to at least 90% (0.95 is a common value). This gives us our basic rule of thumb: A 95% CI for WP is + or - 2*Sqrt[p*(1-p)/n] about p, where p is the observed proportion of success (the measured WP), and Sqrt is the square root function. That is, there is a 95% chance that the true WP (let's call it x) falls within the interval
p-2*Sqrt[p*(1-p)/n]<= x <= p+2*Sqrt[p*(1-p)/n]
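The interval above translates directly into a small function. This is a minimal sketch: the function name is hypothetical, and clipping the ends to the range 0 to 1 is an extra assumption that is not part of the formula itself.

```python
from math import sqrt


def wp_confidence_interval(wins, games, num_sd=2.0):
    """Approximate CI for the true WP: p +/- num_sd * Sqrt[p*(1-p)/n].

    num_sd=2.0 gives the roughly 95% interval from the rule of thumb.
    """
    p = wins / games
    half_width = num_sd * sqrt(p * (1.0 - p) / games)
    # Assumption for tidiness: a WP cannot fall outside 0..1, so clip the ends.
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = wp_confidence_interval(20, 30)
print(round(low, 3), round(high, 3))
```

For a 20-10 start the symmetric interval runs from just under 0.500 to about 0.839, so 0.500 sits barely inside it.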
Since we are dealing with WP, we can further simplify our rule of thumb by noting that p*(1-p) has a peak value of 0.25 at p=0.5. Then Sqrt[p*(1-p)] is never larger than 0.5, and stays close to 0.5 anywhere near p=0.5, so 2*Sqrt[p*(1-p)/n] is at most 1/Sqrt[n]. You will therefore always get a conservative interval (conservative in this context means a larger interval than the one you would get by using the more precise expression for SD) by using (p-1/Sqrt[n],p+1/Sqrt[n]) as your confidence interval. In other words, a nice, quick, upper bound expression for a 2 SD distance is 1/Sqrt[n], which I call the "upper rule of thumb".
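To see how conservative the upper rule of thumb is, the sketch below compares the exact 2 SD distance with 1/Sqrt[n] for a few WP values in a 162-game season. The function names and the sample WP values are assumptions for illustration.

```python
from math import sqrt


def two_sd_distance(p, n):
    """Exact 2 SD distance for an observed WP p over n games."""
    return 2.0 * sqrt(p * (1.0 - p) / n)


def upper_rule_of_thumb(n):
    """Conservative 2 SD distance that ignores p: 1/Sqrt[n]."""
    return 1.0 / sqrt(n)


n = 162
for p in (0.35, 0.45, 0.50, 0.55, 0.65):
    print(p, round(two_sd_distance(p, n), 4), round(upper_rule_of_thumb(n), 4))
```

The two columns agree at p=0.5 and drift apart only slowly as p moves away from it, which is why the shortcut costs so little for WP.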
Having been exposed to a bit of Statistics 101, we are now ready to do something actually useful for SOM !
The unfortunate bottom line of all this is that most of the time WP differences are too small to be significant in SOM since the number of games played is generally small. In a standard 162 game season, 0.079 is the upper rule of thumb 0.95 confidence level WP distance. In other words, you couldn't state that a 0.500 and a 0.550 team were truly different with high (95%) confidence in 162 games (since 0.050 < 0.079). You'd have to be willing to accept a confidence level of about 80% in order to consider 0.500 statistically distant from 0.550 in 162 games, which is perhaps more risky than most folks would accept in such a declarative statement. If you're gonna say team A and B are really different, you will generally want less than a 20% chance of error !
A Little Theory
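The 162-game figures can be reproduced as follows. This sketch assumes, as the discussion does, that the SD is taken at p=0.5 and that the Gaussian approximation holds. The variable names are made up for the example.

```python
from math import erf, sqrt

games = 162
upper_distance = 1.0 / sqrt(games)   # upper rule of thumb, about 0.079

# How many SDs apart are 0.500 and 0.550 over 162 games ?
sd_at_500 = sqrt(0.5 * 0.5 / games)
num_sd = 0.050 / sd_at_500

# Symmetric (two-sided) confidence level that corresponds to that distance.
confidence = erf(num_sd / sqrt(2.0))

print(round(upper_distance, 3), round(num_sd, 2), round(confidence, 2))
```

A 0.050 gap works out to about 1.27 SD, which corresponds to a confidence level of roughly 80%.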
A Rule of Thumb
The Bad News