In the 5.4.2 Rotation Analysis post, I mentioned that I was looking into some odd behavior in the SimC error statistics:
I’m actually doing a little statistical analysis on SimC results right now to investigate some deviations from this prediction, but that’s enough material for another blog post, so I won’t go into more detail yet. What it means for us, though, is that in practice I’ve found that when you run the sim for a large number of iterations (i.e. 50k or more) the reported confidence interval tends to be a little narrower than the observed confidence interval you get by calculating it from the data. So for example, at 250k iterations we regularly get a DPS Error of approximately 40. In theory, that means we feel pretty confident that the DPS we found is within +/- 40 of the true value. In practice, it might be closer to +/- 100 or so.
Over the past two weeks, I’ve been running a bunch of experiments to try to track down and correct the source of this effect. The good news is that with the help of two other SimC devs, we’ve fixed it, and future rotation analysis posts will be much more accurate as a result.
But before we discuss the solution, we have to identify the problem. And to do that, we need a little bit of statistics. I find that most people’s understanding of statistical error is, humorously enough, rather erroneous. So in the interest of improving the level of discourse, let’s take a few minutes and talk about exactly what it means to measure or report “error.”
Disclaimer: While I’m 99.9% sure everything in this post is accurate, keep in mind that I am not a statistician. I just play one on the internet to do math about video games (and in real life to analyze experimental results). If I’ve made an error or misspoken, please point it out in the comments!
Lies, Damn Lies, and Statistics
Let’s start out with a thought experiment. If we’re given a pair of standard 6-sided dice, what’s the probability of rolling a seven?
There’s a number of ways to solve this problem, but the simplest is probably to do some basic math. Each die has 6 sides, so there are 6 x 6 = 36 possible combinations. Out of those combinations, how many give us a sum of seven? Well, there are three ways to do that with the numbers one through six: 1+6, 2+5, and 3+4. However, we have two dice, so either one could contribute the “1” in 1+6. If we decide on a convention of reporting the rolls in the format (die #1)+(die #2), then we could also have 4+3, 5+2, and 6+1. So that’s six total ways to roll a seven with a pair of dice, out of thirty-six possible combinations; our probability of rolling a seven is 6/36=1/6=0.1667, or 16.67%.
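That enumeration is easy to verify with a few lines of Python – a quick sketch that just brute-forces all 36 combinations (none of this is from the original analysis; it’s only a check of the counting argument above):

```python
from itertools import product

# Enumerate all 36 ordered (die 1, die 2) combinations of two six-sided dice.
combos = list(product(range(1, 7), repeat=2))

# Keep only the combinations that sum to seven.
sevens = [c for c in combos if sum(c) == 7]

print(len(combos))                # 36
print(len(sevens))                # 6
print(len(sevens) / len(combos))  # 0.16666666666666666
```

The six matching pairs are exactly the ones listed above: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1).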
We could ask this same question for any other possible outcome, like 2, 5, 9, or 11. If we did that for every possible outcome (anything from 2 to 12), and then plotted the results, it would look like this:
This gives a visual interpretation of the numbers. It’s clear from the plot that an 8 is less likely than a 7 (as it turns out, there are only five ways to roll an 8) and that rolling a 9 is even less likely (four ways) and that rolling a 2 or 12 is the least likely (one way each). What we have here is the probability distribution of the experiment. It tells us that on any given roll of the dice there’s a ~2.78% chance of rolling a 2 or 12, a 5.56% chance of rolling a 3 or 11, and so on.
Now let’s talk about two terms you’ve probably heard before: mean and standard deviation. These terms show up a lot in the discussion of error, so making sure we have a clear definition of them is a good foundation on which to build the discussion. The mean and the standard deviation describe a probability distribution, but provide slightly different information about that distribution.
The mean tells us about the center of the distribution. You’re probably more familiar with it by another name: the average. Though both of those names are a bit ambiguous. “Average” can refer to several different metrics, though it’s most commonly used to refer to the arithmetic mean. “Mean” is used slightly differently in different areas of math, but when we’re talking about statistics it’s used synonymously with the term “expected value.” The Greek letter $\mu$ is commonly used to represent the mean. If you want the mathy details, it’s calculated this way:
$$ \mu = \sum_k x_k P(x_k)$$
where $x_k$ is the outcome (i.e. “5”) and $P(x_k)$ is the probability of that outcome (i.e. “11.11%” or 0.1111). For our purposes, though, it’s enough to know that the mean tries to measure the middle of a distribution. If the data is perfectly symmetric (like ours is), it tells you what value is in the center. In the case of our dice, the mean is seven, which is what we’d expect the average to be if we made many rolls.
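The formula is easy to evaluate for our two dice; here’s a short Python sketch that builds the distribution $P(x_k)$ by enumeration and applies the sum (a check of the formula, not part of the original post):

```python
from collections import Counter
from itertools import product

# Probability distribution P(x) for the sum of two fair six-sided dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
P = {x: n / 36 for x, n in counts.items()}

# mu = sum over all outcomes x of x * P(x)
mu = sum(x * p for x, p in P.items())
print(round(mu, 10))  # 7.0
```

As expected, the weighted sum lands exactly on the center of the symmetric distribution.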
The standard deviation (usually represented by $\sigma$), on the other hand, describes the spread or width of the distribution. Its definition is a little more complicated than the mean:
$$ \sigma = \sqrt{\sum_k P(x_k) (x_k-\mu)^2} $$
But again, for our purposes it’s enough to know that it’s a measurement of how wide the distribution is, or how much it deviates from the mean. A distribution with a larger $\sigma$ is wider than a distribution with a smaller $\sigma$, which means that any given roll could be farther away from the mean. For our distribution, the standard deviation is about 2.42.
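Working that definition out by hand gives $\sigma = \sqrt{35/6} \approx 2.42$ for two dice; here’s the same calculation as a Python sketch (again just a verification of the formula, reusing the enumerated distribution from before):

```python
import math
from collections import Counter
from itertools import product

# Probability distribution P(x) for the sum of two fair six-sided dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
P = {x: n / 36 for x, n in counts.items()}

# Population mean (7.0 for two fair dice).
mu = sum(x * p for x, p in P.items())

# Population standard deviation from the definition above.
sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in P.items()))
print(round(sigma, 3))  # 2.415
```

Equivalently, one die has variance 35/12, so the sum of two independent dice has variance 35/6.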
The thing I want you to note is that neither of these terms tell us anything about error. We aren’t surprised if we roll the dice and get a 10 or 12 instead of a 7. We don’t return them to the manufacturer as defective. The mean and standard deviation tell us a little bit about the range of results we can get when we roll two dice. To talk about error, we need to start looking at actual results of dice rolls, not just the theoretical probability distribution for two dice.
Things Start Getting Dicey
Okay, so let’s pretend we have two dice, and we roll them 100 times. We keep track of the result each time, and plot them on a histogram like so:
Now, this doesn’t look quite the same as our expected distribution. For one thing, it’s definitely not symmetric – there were more high rolls than low rolls. We could express that by calculating the sample mean $\mu_{\rm sample}$, which is the mean of a particular set of data (a “sample”). By calling this the sample mean, we can keep straight whether we’re talking about the mean of the sample or the mean of the entire probability distribution (often called the “population mean”). The sample mean of this data set is 7.40, as shown in the upper right-hand corner of the plot, which is higher than our expected value of 7.00 by a fair amount.
We can also calculate a sample standard deviation $\sigma_{\rm sample}$ for the data, which again is just the standard deviation of our data set. The sample standard deviation for this run is 2.52, which is a bit higher than the expected 2.42 because the distribution is “broader.” Note that the maximum extent isn’t any wider – we don’t have any rolls above 12 or below 2 – but because the distribution is a little “flatter” than usual, with more results than expected in some of the extremes and fewer in the middle, the sample standard deviation goes up a little.
But note that, by themselves, neither $\mu_{\rm sample}$ nor $\sigma_{\rm sample}$ tell us about the error! They’re still just describing the probability distribution that the data in the sample represents. At best, we might be able to compare our results to the theoretical $\mu$ and $\sigma$ we found for the ideal case to identify how our results differ. But it’s not at all clear that this tells us anything about error. Why?
Because maybe these dice aren’t ideal. Maybe they differ in some way from our model. For example, maybe you’ve heard the term “weighted dice” before? What if one of them is heavier on one side? That might cause it to roll e.g. 6 more often than 1, and give us a slightly different distribution. You could call that an “error” in the manufacturing of the dice, perhaps, but that’s not what we generally mean when we talk about statistical error.
So perhaps it’s time we seriously considered what “error” means. After all, it’s hard to identify an “error” if we haven’t clearly defined what “error” is. Let’s say that we perform an experiment – we make our 100 die rolls and keep track of the results, and generate a figure like the one above. And in addition, let’s say we’re primarily interested in the mean of this distribution; we want to know what the average result of rolling these particular two dice will be. We know that if they were ideal dice, it should be seven. But when we ran our experiment, we got a mean of 7.40.
What we really want to know is the answer to the question, “how accurate is that result of 7.40?” Do we trust it so much that we’re sure these dice are nonstandard in some way? Or was it just a fluke? Remember, there’s absolutely no reason we couldn’t roll 100 twelves in a row, because each dice roll is independent of the last, and it’s a random process. It’s just really unlikely. So how do we know this value we came up with isn’t just bad luck?
So let’s say the “error” in the sample mean is a measure of accuracy. In other words, we want to be able to say that we’re pretty confident that the “true” value of the population mean $\mu$ happens to fall within the interval $\mu_{\rm sample}-E < \mu < \mu_{\rm sample} + E$, where $E$ is our measure of error. We could call that range our confidence interval, because we feel pretty confident that the actual mean $\mu$ of the distribution for our dice happens to be in that interval. We’ll talk about exactly how confident we are a little bit later.
It should be clear now why comparing our distribution to the “ideal” distribution doesn’t tell us anything about how reliable our results are. We might know that the sample mean differs from the ideal, but we don’t know why. It could be that our dice are defective, but it could also just be a random fluctuation. But since nothing we’ve discussed so far tells us how accurate our measured sample mean is, we don’t know for sure. To get that, we need to figure out how to represent $E$, the number that sets the bounds on our confidence interval.
It’s a common misconception that $E$ should just be the sample standard deviation $\sigma_{\rm sample}$. You may have seen results presented like $\mu \pm \sigma$, or $7.40 \pm 2.52$, to suggest an interval of confidence. That is, generally speaking, not correct. Or at least, very misleading. Because that’s not what the standard deviation means.
What we really want here is something called the standard error, though it’s also commonly called the standard error of the mean. It’s also sometimes (mistakenly or carelessly) called the “standard deviation of the mean,” but we’ll clarify the difference in a second. I like the term “standard error of the mean,” because it makes it clear that this is a measurement of accuracy of the sample mean. As you might guess, it’s closely related to the sample standard deviation, but not quite the same. It’s calculated by dividing the sample standard deviation by the square root of the number of individual “trials,” or dice rolls, $N$:
$${\rm SE_{\mu}} = \frac{\sigma_{\rm sample}}{\sqrt{N}}.$$
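As a quick illustration, here’s a minimal Python sketch that simulates a 100-roll sample and computes its standard error (the seed is arbitrary, chosen only for repeatability). One detail the formula above glosses over: Python’s `statistics.stdev` uses the $N-1$ “Bessel-corrected” denominator for the sample standard deviation, which is the usual convention for estimating $\sigma$ from data:

```python
import math
import random
import statistics

# Simulate 100 rolls of two fair six-sided dice (arbitrary seed for repeatability).
random.seed(1)
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100)]

mean = statistics.mean(rolls)
stdev = statistics.stdev(rolls)     # sample standard deviation (N-1 denominator)
se = stdev / math.sqrt(len(rolls))  # standard error of the mean

print(f"sample mean = {mean:.2f}, sample stdev = {stdev:.2f}, SE = {se:.3f}")
```

With $N = 100$, the standard error comes out roughly a tenth of the sample standard deviation, exactly as the $\sqrt{N}$ in the denominator predicts.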
This, at long last, is a good measurement of error. It’s worth noting that the standard deviation of the mean is defined similarly, but uses the true standard deviation of the distribution:
$${\rm SD_{\mu}} = \frac{\sigma}{\sqrt{N}}.$$
The reason the two are often used interchangeably is that we generally don’t know what the actual distribution looks like, nor do we know the expected values of $\mu$ and $\sigma$. Sometimes we do, of course; if we have a theory describing the process we’re measuring, then we can often calculate the theoretical values of $\mu$ and $\sigma$. But we don’t always know if our experiment matches the theory as well as we’d like – for example, if one of the dice is weighted and rolls more sixes than ones.
And sometimes, we don’t have a well-described theory at all, we just have a pile of data. This is the case for most Simulationcraft data runs, because we don’t have an easy analytical function that accurately describes your DPS due to any number of factors: procs, avoidance, movement, and so on. In that sort of situation, we can never truly know $\sigma$, so the lines between ${\rm SE}_{\mu}$ and ${\rm SD}_{\mu}$ blur a little bit, and we tend to get sloppy with terminology.
Double Standards
Now, we’ve thrown around a lot of terms that have “standard deviation” in them. It’s no wonder the layperson is easily confused by statistics. So it’s worth spending a moment to make the differences between these terms abundantly clear. Let’s reiterate quickly why we use standard error to describe the accuracy of the sample mean rather than just using $\sigma$ or $\sigma_{\rm sample}$.
We have a theoretical probability distribution describing the result of rolling two 6-sided dice. Here’s what each of the terms we’ve discussed so far tells us:
 The mean (or “population mean”) $\mu$ tells us the average value of a single roll.
 The standard deviation $\sigma$ tells us about the fluctuations of any single dice roll. In other words, if we make a single roll, $\sigma$ tells us how much variation we can expect from the mean. When we make a single roll, we’re not surprised if the result is $\sigma$ or $2\sigma$ away from the mean (ex: a roll of 9 or 11). The more $\sigma$s a roll is away from the mean, the less likely it is, and the more surprised we are. Our distribution here is finite, in that we can never roll less than two or more than 12, but in the general case a probability distribution could have nonzero probabilities farther out in the wings, such that talking about $4\sigma$ or $5\sigma$ is relevant.
 The sample mean $\mu_{\rm sample}$ tells us the average value of a particular sample of rolls. In other words, we roll the dice 100 times and calculate the sample mean. This is an estimate of the population mean.
 The sample standard deviation $\sigma_{\rm sample}$ tells us about the fluctuations of our particular sample of rolls. If we roll the dice 100 times, we can calculate the sample standard deviation by looking at the spread of the results. Again, this is an estimate of the population’s standard deviation, and it tells us how much variation we should expect from a single dice roll.
 The standard deviation of the mean $SD_{\mu}$ tells us about the fluctuations of the mean of an arbitrary sample. In other words, if we proposed an experiment where we rolled the dice 100 times, we would go into that experiment expecting to get a sample mean that’s pretty close to (but not exactly) $\mu$. $SD_{\mu}$ tells us how close we’d expect to be. For example, under normal conditions we’d expect to get a result for $\mu_{\rm sample}$ that is between $\mu-2{\rm SD}_{\mu}$ and $\mu+2{\rm SD}_{\mu}$ about 95% of the time, and between $\mu-2.5{\rm SD}_{\mu}$ and $\mu+2.5{\rm SD}_{\mu}$ about 99% of the time.
 The standard error of the mean $SE_{\mu}$ tells us about the fluctuations of the mean of our particular sample of rolls. Once we actually make those 100 rolls, and calculate the sample mean and sample standard deviation, we can state that we’re 95% confident that the “true” population mean $\mu$ is between $\mu_{\rm sample}-2{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2{\rm SE}_{\mu}$, and 99% confident that it’s between $\mu_{\rm sample}-2.5{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2.5{\rm SE}_{\mu}$.
You can see why this gets confusing. But the key is that the standard deviation and sample standard deviation are telling you about single rolls. If you roll the dice once, you expect to get a value between $\mu-2\sigma$ and $\mu+2\sigma$ about 95% of the time.
Whereas the standard deviation of the mean and standard error tell us about groups of rolls. If we make 100 rolls the sample mean should be a much better estimate of the population mean than if we made only a handful of rolls. And if we make 1000 rolls, we should get a better estimate than if we only made 100 rolls.
So we use the standard deviation of the mean to answer the question, “if we made 100 rolls, how close do we expect $\mu_{\rm sample}$ (our sample mean) to be to $\mu$ (the population mean)?” And we use the standard error to answer the related (but different!) question, “now that I’ve made 100 rolls, how accurately do I think my calculated $\mu_{\rm sample}$ (sample mean) approximates $\mu$ (the population mean)?”
You might wonder what voodoo tricks I played to get these “95%” and “99%” values. These come from analysis of the normal distribution, which is a probability distribution that comes up frequently in statistics. If your probability distribution is normal, then about 68% of the data will fall within one standard deviation in either direction. Put another way, the region from $\mu-\sigma$ to $\mu+\sigma$ contains 68% of the data. Likewise, the region from $\mu-2\sigma$ to $\mu+2\sigma$ contains about 95% of the data, and over 99.7% of the data will fall between $\mu-3\sigma$ and $\mu+3\sigma$.
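If you want to check those coverage numbers yourself, the fraction of a normal distribution lying within $k$ standard deviations of the mean is ${\rm erf}(k/\sqrt{2})$, which Python’s standard library can evaluate directly:

```python
import math

# Fraction of a normal distribution within k standard deviations of the mean:
# P(|X - mu| < k*sigma) = erf(k / sqrt(2))
for k in (1, 2, 3):
    frac = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {frac:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

(So the commonly quoted “95%” for two sigma is really 95.45%, and the exact 95% point sits at about 1.96 standard deviations.)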
Our probability distribution isn’t a normal distribution. First of all, it’s truncated on either side, while the normal distribution goes on infinitely in either direction (we’ll never be able to roll a one or 13 or 152 with our two dice). Second, it’s a little too discrete to be a good normal distribution – there isn’t quite enough granularity between 2 and 12 to flesh the distribution out sufficiently. It’s really more of a triangle than a nice Gaussian, though it’s not an awful approximation given the constraints. Luckily, none of that matters! As it turns out, the reason our distribution looks vaguely normal is closely related to the reason that we use the normal distribution to determine confidence intervals.
Limit Break
The Central Limit Theorem is the piece that completes our little puzzle. Quoth the Wikipedia,
the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed.
That’s a bit technical, so let’s break that down and make it a bit clearer with an example. We start with a dice roll (a “random variable”) that has some probability distribution that doesn’t change from roll to roll (“a well-defined expected value and well-defined variance”) and each roll doesn’t depend on any of the previous ones (“independent”). Now we roll those dice 10 times and calculate the sample mean. And then roll another 10 times and calculate the sample mean. And then do it again. And again, and again, and… you get the idea (“a sufficiently large number of iterates”). If we do that, and plot the probability distribution of those sample means, we’ll get a normal distribution centered on the population mean $\mu$.
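Here’s a small Python sketch of exactly that procedure (the sample size of 10 rolls and the 10,000 repetitions are arbitrary choices, as is the seed): repeatedly take the mean of a 10-roll sample, and check that those sample means cluster around the population mean of 7 with a spread close to $\sigma/\sqrt{10}$:

```python
import random
import statistics

random.seed(42)  # arbitrary seed for repeatability

def sample_mean(n_rolls):
    """Mean of n_rolls rolls of two fair six-sided dice."""
    return statistics.mean(random.randint(1, 6) + random.randint(1, 6)
                           for _ in range(n_rolls))

# Collect many sample means; by the CLT they should be approximately
# normally distributed around mu = 7 with spread sigma / sqrt(10).
means = [sample_mean(10) for _ in range(10_000)]

print(round(statistics.mean(means), 2))   # close to 7.0
print(round(statistics.stdev(means), 2))  # close to 2.42 / sqrt(10), i.e. ~0.76
```

Plotting a histogram of `means` would show the familiar bell curve, even though each individual roll follows the triangular two-dice distribution.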
The beautiful part of this is that it doesn’t matter what the probability distribution you started with looks like. It could be our triangular dice roll distribution or a “top-hat” (uniform) distribution or some other weird shape. Because we’re not interested in that; we’re interested in the sample means of a bunch of different samples of that distribution. And those are normally distributed about the mean, as long as the CLT applies. Which means that when we find a sample mean, we can use the normal distribution to estimate the error, regardless of what probability distribution that the individual rolls obey.
Now, there are two major caveats here that cause the CLT to break down if they aren’t obeyed:

The random variables (rolls) need to be independent. In other words, the CLT will not necessarily be true if the result of the next roll depends on any of the previous rolls. Usually this is the case (and it is in our example), but not always. There are two WoW-related examples I can think of off the top of my head.
Quest items that drop from mobs aren’t truly random, at least post-BC (and possibly post-Vanilla). Most quest mobs have a progressively increasing chance to drop quest items, such that the more of them you kill, the higher the chance of an item dropping. This prevents the dreaded “OMG I’ve killed 8000 motherf@$#ing boars and they haven’t dropped a single tusk” effect (yes, that’s the technical term for it).
Similarly, bonus rolls have a system where every failed bonus roll will cause a slight increase in the chance of success with your next bonus roll against that boss. So this would be another example where the CLT won’t apply, because the rolls aren’t truly independent.

The random variables need to be identically distributed. In other words, the probability distribution can’t be changing in between rolls. If we swapped one of our 6-sided dice out for an 8-sided or 10-sided die, all of a sudden our probability distribution would change and there would be no guarantee that the CLT would apply.
You might ask if you could cite either of the two examples of dependence here as examples of non-identical distributions. After all, in each case the probability distribution is changing between rolls. However, that change is due to dependence on previous outcomes – in a sense, the definition of dependence is “changing the probability distribution between rolls based on prior outcomes.” So dependence is a more specific subset of this category.
If either of those things occur, then we can’t be sure that the CLT is valid for our situation. Luckily, none of that applies to our dice-rolling example, so we can properly apply the CLT to estimate the error in our set of 100 rolls.
Keep Rollin’ Rollin’ Rollin’ Rollin’
So now that we’ve talked through all of that probability theory, let’s actually put it to use. The standard error of our 100-roll sample is,
$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.52/\sqrt{100} = 0.252 $$
To get our 95% confidence interval (CI), we’d want to look at values between $\mu_{\rm sample}-2{\rm SE}_{\mu}$ and $\mu_{\rm sample}+2{\rm SE}_{\mu}$, or $7.40 \pm 0.504$. And sure enough, the actual value of the population mean (7.00) falls within that confidence interval. Though note that it didn’t have to – there was still a 5% chance it wouldn’t!
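Plugging the numbers from the 100-roll sample into a few lines of Python makes the check concrete (7.40, 2.52, and 100 come straight from the sample above):

```python
import math

# Values from the 100-roll sample discussed above.
mu_sample, sigma_sample, N = 7.40, 2.52, 100

se = sigma_sample / math.sqrt(N)
lo, hi = mu_sample - 2 * se, mu_sample + 2 * se

print(f"SE = {se:.3f}")                  # SE = 0.252
print(f"95% CI = ({lo:.3f}, {hi:.3f})")  # 95% CI = (6.896, 7.904)
print(lo <= 7.00 <= hi)                  # True: the population mean is inside
```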
We could improve the estimate by increasing the number of dice rolls. For example, what if we rolled 1000 dice instead? That might look something like this:
We see that our new sample mean is $\mu_{\rm sample}=6.95$ and our sample standard deviation is $\sigma_{\rm sample}=2.41$. But now $N=1000$, so our standard error is much smaller:
$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.41/\sqrt{1000} = 0.0762$$
As before, we’re 95% confident that our sample mean is within $\pm 2{\rm SE} = 0.1524$ of the population mean in one direction or the other, and sure enough it is.
Of course, we could keep going. Here’s what 10000 rolls looks like:
And if we calculate our standard error for this distribution, we get:
$$ {\rm SE}_{\mu} = \sigma_{\rm sample}/\sqrt{N} = 2.43/\sqrt{10000} = 0.0243$$
So now we’re pretty sure that the value of 7.01 is correct to within $\pm 0.0486$, again with 95% confidence. Like before, there’s no guarantee that it will be – there’s still that 5% chance it falls outside that range. But we can solve that by increasing our confidence interval (say, looking at $\pm 3{\rm SE}_{\mu}$) or by repeating the experiment a few times and thinking about the results. If we repeat it 100 times, we’d expect about 95 of them to cluster within $\pm 2{\rm SE}_{\mu}$ of 7.00.
You may have noticed that while the confidence interval is shrinking, it’s not doing so as fast as it did going from 100 to 1000. That’s because we’re dividing by the square root of $N$, which means that to improve the standard error by a factor of $a$, we need to make $a^2$ times as many rolls. So if we want to increase our accuracy by a whole decimal place (a factor of 10), we need to make 100 times as many rolls. This is important stuff to know if you’re designing an experiment, because you don’t want your graduate thesis to rely on making five trillion dice rolls. Trust me.
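The $1/\sqrt{N}$ scaling is easy to see numerically. Using our two-dice $\sigma \approx 2.42$, each factor-of-100 increase in $N$ buys exactly one extra decimal place of accuracy:

```python
import math

sigma = 2.42  # approximate standard deviation of a two-dice roll, sqrt(35/6)

for N in (100, 10_000, 1_000_000):
    print(f"N = {N:>9}: SE = {sigma / math.sqrt(N):.5f}")
# N =       100: SE = 0.24200
# N =     10000: SE = 0.02420
# N =   1000000: SE = 0.00242
```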
You probably also noticed that the more rolls we make, the more the sample probability distribution resembles the ideal “triangular” case we arrived at theoretically. That’s to be expected – the more rolls we make, the better the sample approximates the real distribution. This is related to another law (the amusingly named law of large numbers) that’s important for the CLT, but I don’t have time to go into that here. But it was worth mentioning just because “law of large numbers” is probably the best name for a mathematical law ever.
Finally, I mentioned that our “triangular” distribution for two dice looks vaguely normal, and that this relates to the CLT somehow. Here’s how. Each die is essentially its own random variable with a “flat” or “uniform” probability distribution (you have an equal chance to roll any number on the die). So when we take two of them and calculate the sum, we’re really performing two experiments and finding two sample means (with a sample size of 1 roll each). The sum of those two sample means, which is just twice the average of the sample means, is our result. This is exactly how we phrased our description of the CLT!
The reason we get a triangle rather than a nice Gaussian is that two dice is not “a sufficiently large number of iterates.” There is, unfortunately, no clean closed-form expression for this probability distribution for arbitrary numbers of $s$-sided dice (something called the binomial distribution works when $s$=2, i.e. for coin flips). But if we rolled 5 dice or 10 dice instead of two, and added all of those up, we’d start to get a distribution that looked very much like a normal distribution. And in fact, if you read either of the articles linked in this paragraph, you’ll see that they both become well-approximated by a normal distribution as you increase the number of experiments (die rolls).
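As a quick illustration (the choice of 10 dice and 100,000 samples is arbitrary, as is the seed), here’s a sketch that sums 10 dice many times and checks that the resulting distribution has the mean $3.5n$ and standard deviation $\sqrt{n\cdot 35/12}$ that a normal approximation would use:

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed for repeatability

def roll_sum(n_dice):
    """Sum of n_dice fair six-sided dice."""
    return sum(random.randint(1, 6) for _ in range(n_dice))

n = 10
sums = [roll_sum(n) for _ in range(100_000)]

# A single die has mean 3.5 and variance 35/12, so the sum of n dice
# has mean 3.5*n and standard deviation sqrt(n * 35/12) ~ 5.40 for n = 10.
print(round(statistics.mean(sums), 1))   # close to 35.0
print(round(statistics.stdev(sums), 2))  # close to 5.40
```

A histogram of `sums` would already look convincingly bell-shaped at $n = 10$, in contrast to the triangle we get at $n = 2$.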
World of Statcraft?
Now that you’ve read through 4000 words on probability theory, you may ask where the damn World of Warcraft content is. The short answer: next blog post. But as a teaser, let’s consider a graph that shows up in your Simulationcraft output:
When you simulate a character in SimC, you run some number of iterations. Each iteration gives you an average DPS result, which is essentially one result of a random variable. In other words, each iteration is comparable to a single roll of the dice in our example experiment. If we run a simulation for 1000 iterations, that gives us 1000 different data points, from which we can calculate a sample mean (367.7k in this case), a sample standard deviation, and a standard error value.
And all of the same statistics apply here. This plot gives us the “DPS distribution function,” which is equivalent to the triangular distribution in our experiment. The DPS distribution looks Gaussian/normal, but be aware that there’s no reason it has to be. It generally will look close to normal just because each iteration is the result of a large number of “RNG rolls,” many of which are independent. But some of those RNG rolls are not independent (for example, they may be contingent on a previous roll succeeding and granting you a specific proc, like Grand Crusader). With certain character setups you can definitely generate DPS distributions that deviate significantly from a normal distribution (skewed heavily to one side, for example).
But again, because of the Central Limit Theorem, we don’t care that much what this DPS distribution function looks like. As long as each iteration is independent, we can use the normal distribution to estimate the accuracy of the sample mean. So we can calculate the standard error and report that as a way of telling the user how confident they should be in the average DPS value of 367.7k DPS.
At the very beginning of this post, I said I was looking into a strange deviation from the expected error. What I was finding was that my observed errors were larger than what Simulationcraft was reporting. Next time, we’ll look a little more closely into how Simulationcraft reports error, and discuss the specifics of that effect – why it was happening, and how we fixed it.