When is it Luck?

Common refrains among SOM players are "he was lucky!" or "MY team was much better!" Is it possible to separate the hyperbole from the fact? When can you state that one team is clearly better than another? Or, to put it another way, how much 'distance' between two teams is really significant? This question can be answered with hard numbers if you are willing to stomach a little math. In this article the appropriate statistical theory will be reviewed and a summary 'rule-of-thumb' will be given that can be used to give a quick answer. The focus of this discussion will be on analyzing winning % (WP), although the technique could be applied to any proportion-of-success type measure like batting average, on-base %, etc.

Table of Contents

A Little Theory

A Rule of Thumb

The Bad News

A Table for Various Confidence Levels

One Sided Intervals

A Little Theory

We begin with a review of some necessary statistical preliminaries.

Assume that the random event under consideration can be modeled as independent Bernoulli trials. This means three things: (1) each occurrence can be classified into two categories, "successes" and "failures", (2) the probability of a success, denoted by p, is constant from trial to trial, and (3) each trial result has no influence on any other trial result. A random variable (RV) that obeys the above is termed a Binomial random variable. Typically one assigns a "1" to success and a "0" to failure for computational purposes. Finally, one denotes the probability of a failure by q and clearly p+q=1.
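
As a concrete sketch of the model (Python is assumed here purely for illustration; nothing in the article depends on it), a Binomial RV can be simulated by flipping a weighted coin n times and summing the 1s:

```python
import random

def binomial_sample(n, p, seed=None):
    """Number of successes in n independent Bernoulli(p) trials:
    assign 1 to each success and 0 to each failure, then sum."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n))

# One simulated 162-game season for a true 0.550 team:
wins = binomial_sample(162, 0.550, seed=1)
```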

Purists will (correctly) argue that (2) isn't satisfied in the case of Baseball W/L results - your odds of winning against Greg Maddux are much different than when facing Felipe Lira! It is possible to abandon assumption (2) and still end up in about the same place, but the math becomes much more complicated (you have to employ something called random process theory). For the purposes of this discussion, suffice it to say that as long as (2) isn't significantly violated (p fluctuates between, say, 0.35 and 0.65) the results are essentially unchanged.

The expected value (or average in common parlance, otherwise known as the mean) of an n-sample Binomial RV is (big surprise!) n*p, where n is the number of trials. All this really says is that you can anticipate getting n*p successes in n trials. Of course, you won't always see exactly n*p successes, as otherwise we wouldn't have a RV! However, the larger the number of trials, the closer the observed proportion of successes tends to be to p, and in the limit (which is a fancy way of saying take trials until you wear out your dice!) the proportion of successes converges to p. That is to say, if you divide the number of successes by the number of trials, you will get p for an answer after an infinite number of trials.

Now, if all this talk of infinite trials seems silly to you, well, you are right - clearly there is not much useful about statements like this. Fortunately, you can say something about how quickly the results will approach the 'right' answer. A quantity known as the variance gives you an idea of how close you can expect to get to the average given n trials. Variance can be defined as the average of the square of the RV minus the square of the average of the RV. In the case of a Binomial RV, you end up with Var=n*p*(1-p)=n*p*q. The square root of the Variance is called the Standard Deviation (SD).
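
To make the formulas concrete, here is a small sketch (in Python, chosen only for illustration) that computes the theoretical mean n*p and variance n*p*q, and checks them against a simulation of many 162-game seasons:

```python
import random
import statistics

def binomial_mean_var(n, p):
    """Theoretical mean n*p and variance n*p*q of a Binomial count."""
    q = 1.0 - p
    return n * p, n * p * q

# Simulate many 162-game seasons for a true 0.500 team and compare.
rng = random.Random(42)
seasons = [sum(rng.random() < 0.5 for _ in range(162)) for _ in range(5000)]
mean, var = binomial_mean_var(162, 0.5)   # 81.0 and 40.5
```

The empirical mean and variance of `seasons` land close to the theoretical 81.0 and 40.5, which is all the variance formula promises.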

Variance is a really useful measure because of something known as the Central Limit Theorem. The details are really boring, but here's the highlight as applied to our situation: for large enough n, a Binomial RV acts just like a Gaussian RV (Gaussian RVs follow the infamous bell-shaped curve you heard about in high school). This is handy because lots of statistically important stuff is known about Gaussian RVs. In particular, a Gaussian RV will fall within 1 SD of its mean 68% of the time, within 2 SD of the mean 95% of the time, and within 3 SD of the mean over 99% of the time.

Now we need to make a small refinement. Since we are interested in WP and not the gross number of wins, we can define our RV to be the proportion of wins observed in n games. It's easy to show that this RV has mean p (no big surprise there !) and Variance p*q/n. This gives us a handy yardstick with which to assess performance data.

As an example, say you win 20 of your first 30 games this year - is this good evidence that you really have a better than 0.500 team? Well, we can form an answer by assuming that the observed data gives an accurate picture of the team's abilities, and then seeing if 0.500 is distant in a statistical sense from the observed performance. The WP data has a mean of 0.667 and an SD of 0.086. Then 0.500 is 0.167/0.086=1.94 SD below the mean. The chances of your team really being a below-0.500 club are somewhere around 3%. This is because a Gaussian RV has a 0.974 chance of falling AT OR ABOVE 1.94 SD below the mean.
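
The arithmetic of this example is easy to reproduce. The sketch below (Python, with the standard error function standing in for a Gaussian table) is one way to do it:

```python
import math

def sds_below(p_hat, p0, n):
    """How many SDs the reference value p0 sits below the observed WP p_hat."""
    sd = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - p0) / sd

def gaussian_cdf(z):
    """P(Z <= z) for a standard Gaussian RV, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = sds_below(20 / 30, 0.500, 30)      # about 1.94 SD
p_sub_500 = 1.0 - gaussian_cdf(z)      # around 0.03, as in the text
```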

When trying to get a handle on information about a RV (like its mean), it is common to construct what is known as a confidence interval (CI) for the item in question. This is really less complicated than it sounds. A CI is just a restatement of what was said above. In the previous example, we could just as well have said "a 0.974 confidence interval for the true winning % is (0.500, 1.0)." Note that (x,y) is a common mathematical shorthand for "the range of values between x and y". In other words, a CI is just a probability statement about a given range of values for the target parameter of interest (the mean of the RV in this example).

A Rule of Thumb

Typically CIs are symmetric about the mean of the target parameter. Also, usually the probability that the target parameter falls within the CI is set to at least 90% (0.95 is a common value). This gives us our basic rule of thumb: A 95% CI for WP is + or - 2*Sqrt[p*(1-p)/n] about p, where p is the observed proportion of success (the measured WP), and Sqrt is the square root function. That is, there is a 95% chance that the true WP (let's call it x) falls within the interval

p-2*Sqrt[p*(1-p)/n]<= x <= p+2*Sqrt[p*(1-p)/n]
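
As a sketch, this rule of thumb translates directly into code (Python here, with z=2 giving the 95% level):

```python
import math

def wp_confidence_interval(p_hat, n, z=2.0):
    """Rule-of-thumb CI for the true WP: p_hat +/- z*Sqrt[p_hat*(1-p_hat)/n]."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

# A 0.550 team over a 162-game season:
lo, hi = wp_confidence_interval(0.550, 162)   # roughly (0.472, 0.628)
```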

Since we are dealing with WP, we can further simplify our rule of thumb by noting that p*(1-p) has a peak value of 1/4 at p=0.5. Thus Sqrt[p*(1-p)] is never larger than 1/2, and the 2 SD distance 2*Sqrt[p*(1-p)/n] is never larger than 1/Sqrt[n]. Then you will always get a conservative interval (conservative in this context means a larger interval than the one you would get by using the more precise expression for the SD) by using (p-1/Sqrt[n], p+1/Sqrt[n]) as your confidence interval. In other words, a nice, quick, upper-bound expression for a 2 SD distance is 1/Sqrt[n], which I call the "upper rule of thumb".
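
A quick check (again a Python sketch) confirms that 1/Sqrt[n] really is an upper bound on the exact 2 SD half-width, with equality at p=0.5:

```python
import math

def exact_half_width(p, n):
    """Exact 2 SD half-width: 2*Sqrt[p*(1-p)/n]."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)

def upper_rule_of_thumb(n):
    """Conservative half-width: 1/Sqrt[n]."""
    return 1.0 / math.sqrt(n)

# The bound holds across the plausible WP range, and is tight at p = 0.5:
widths = [exact_half_width(p, 162) for p in (0.35, 0.50, 0.65)]
```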

Having been exposed to a bit of Statistics 101, we are now ready to do something actually useful for SOM!

The Bad News

The unfortunate bottom line of all this is that most of the time WP differences are too small to be significant in SOM, since the number of games played is generally small. In a standard 162-game season, 0.079 is the upper-rule-of-thumb WP distance at the 0.95 confidence level. In other words, you couldn't state that a 0.500 and a 0.550 team were truly different with high (=95%) confidence in 162 games (since 0.050 < 0.079). You'd have to be willing to accept a confidence level of about 80% in order to consider 0.500 statistically distant from 0.550 in 162 games, which is perhaps more risky than most folks would accept in such a declarative statement. If you're gonna say teams A and B are really different, you will generally want less than a 20% chance of error!
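
The 80% figure quoted above can be checked with a short sketch: solve for the z that makes the conservative half-width equal the 0.050 gap, then convert z to a two-sided confidence level.

```python
import math

def confidence_for_gap(gap, n):
    """Two-sided confidence level at which a WP gap just reaches significance,
    using the conservative bound Sqrt[p*(1-p)] <= 1/2."""
    z = gap * math.sqrt(n) / 0.5          # gap = z*(1/2)/Sqrt[n], solved for z
    return math.erf(z / math.sqrt(2.0))   # equals 2*Phi(z) - 1

conf = confidence_for_gap(0.050, 162)     # about 0.80
```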

A Table for Various Confidence Levels

Given below is a table for 80%, 90%, and 95% Confidence level WP interval half-widths for the upper rule of thumb:

Conf. Level    width
    80       0.64/Sqrt[n]
    90       0.82/Sqrt[n]
    95       1.0/Sqrt[n]
Use this table as in the example above; that is, the difference between two observed WP values has to be greater than the width given above to declare them 'statistically different' at the given confidence level in the table.

Remember that these widths are larger than the true widths when dealing with data that has an average value far from 0.500. You can correct the widths given above by multiplying by 2*Sqrt[p*q]. For example, if the observed data has a mean of 0.650 (or a mean of 0.350 since the computation is symmetric about 0.5), then the 80% confidence level width would be 2*Sqrt[0.65*0.35]*0.64/Sqrt[n]=0.61/Sqrt[n].
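
The table and its correction combine into a one-line computation. The sketch below (Python) simply hardcodes the tabulated coefficients:

```python
import math

# Upper-rule-of-thumb coefficients from the table above.
COEFF = {80: 0.64, 90: 0.82, 95: 1.0}

def wp_width(conf_pct, n, p=0.5):
    """Half-width at a given confidence level, corrected by 2*Sqrt[p*q]
    when the observed mean is far from 0.500."""
    correction = 2.0 * math.sqrt(p * (1.0 - p))
    return correction * COEFF[conf_pct] / math.sqrt(n)
```

For instance, `wp_width(80, n, p=0.65)` reproduces the 0.61/Sqrt[n] figure worked out in the text.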

One Sided Intervals

When the item of interest is a very rare (or very frequent) event, it is generally much more useful to compute a one-sided CI, since an upper bound (in the case of a rare event) or a lower bound (in the case of a frequent event) is all that is desired. One-sided intervals are of interest in SOM for certain categories that are always small or large, such as extra-base hit frequency or fielding percentage. For example, you don't gain much information by specifying an upper bound on an observed fielding percentage of 0.995!

In these cases it's much harder to get a nice, clean expression for a bound because for extreme p values the Gaussian approximation isn't very accurate for anything but VERY large n. It is possible to get an exact answer using the Binomial distribution directly, but in this case only discrete values of the confidence level are available, and the math is a bit complicated. Basically, it's hard to give a useful numerical answer in this case without the use of a computer. Note that most modern spreadsheet programs can support this kind of computation, so with a little reading and effort you can construct a spreadsheet program to compute your own one-sided binomial CIs if you are so inclined.
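
For the curious, here is one way such a computation can be sketched (in Python rather than a spreadsheet; the 2-errors-in-400-chances numbers are made up for illustration). It finds an exact one-sided upper bound by bisecting on the Binomial distribution directly:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly from the definition."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def one_sided_upper_bound(k, n, conf=0.95):
    """Exact upper confidence bound on p given k successes in n trials:
    the largest p with P(X <= k | p) >= 1 - conf, found by bisection."""
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) >= 1.0 - conf:
            lo = mid   # mid is still consistent with the data; push upward
        else:
            hi = mid
    return lo

# Upper bound on a true error rate after observing 2 errors in 400 chances:
ub = one_sided_upper_bound(2, 400, 0.95)
```

The bisection works because the Binomial CDF decreases smoothly as p grows, so there is a single crossing point to home in on.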


96JUL22