As I said in Part I, I observed some strange error behavior in the 5.4.2 Rotation Analysis post. Now that we’ve had a thorough (and lengthy) review of the statistics of error analysis, it’s time we looked more carefully at the problem that started this whole mess.

**Mo’ Iterations, Mo’ Problems**

Once again, here was my comment about error from the Rotation Analysis blog post:

The “DPS Error” that Simulationcraft reports is really the half-width of the 95% confidence interval (CI). In other words, it is 1.96 times the standard error of the mean. To put that another way, we feel that there’s a 95% chance that the actual mean DPS of a particular simulation is within +/- DPS_Error of the mean reported by that simulation. There are some caveats to this statement, insofar as it makes some reasonably good but not air-tight assumptions about the data, but it’s pretty good.

I’m actually doing a little statistical analysis on SimC results right now to investigate some deviations from this prediction, but that’s enough material for another blog post, so I won’t go into more detail yet. What it means for us, though, is that in practice I’ve found that when you run the sim for a large number of iterations (i.e. 50k or more) the reported confidence interval tends to be a little narrower than the observed confidence interval you get by calculating it from the data.

So for example, at 250k iterations we regularly get a DPS Error of approximately 40. In theory that means we feel pretty confident that the DPS we found is within +/-40 of the true value. In practice, it might be closer to +/- 100 or so.

So let’s talk about these “deviations.” What caught my attention at first was that, even though the DPS Error reported by SimC was $\pm$ 40 DPS, I could sim the same rotation several times and get values that differed by much more than that, often by hundreds of DPS. After looking into it more carefully, I’d say that the “$\pm$ 100 or so” I quoted in the last blog post was probably an underestimate; $\pm$ 200 to 300 DPS is closer to the actual variation I was seeing.

And while this is less than a 0.1% relative error given that we’re talking about DPS means near 400k, it’s still a little disconcerting. First, on a theoretical level, I believe in statistics, so it’s unsettling when they appear not to be behaving properly. Second, it struck me as very odd that going from 50k iterations to 250k iterations didn’t seem to have a meaningful impact on the error fluctuations. As an experimentalist, I’m familiar with the process of determining how much error I can accept and how much integration time (in this case, iterations) it will take to achieve that level of confidence. So when these sims failed to meet the spec that I set, I took notice.

But a handful of assorted simulations that violate spec isn’t enough information to base a hypothesis on. I knew it wasn’t demonstrating the desired behavior. But to figure out what was wrong, I needed to first figure out exactly what behavior the system *was* exhibiting. And to do that, I needed more data.

**Confidence Boost**

In the quoted passage above, I said that what Simulationcraft reports as “DPS Error” is really $1.96 {\rm SE}_{\mu}$, which is the half-width of the 95% confidence interval (CI). The full 95% CI is $\mu_{\rm sample} \pm 1.96 {\rm SE}_{\mu}$, so it’s appropriate to say that when you look at a SimC report, the “DPS” value it reports is accurate to about $\pm$ “DPS Error.” This is a pretty natural way of reporting error, as we’ve seen in Part I.
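To make that concrete, here’s a minimal sketch of how a “DPS Error” figure falls out of a set of per-iteration DPS values (Python rather than SimC’s C++, and the DPS numbers are made up for illustration):

```python
import math
import random

def dps_error(samples):
    """Half-width of the 95% CI: 1.96 times the standard error of the mean."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    se = math.sqrt(var / n)
    return mean, 1.96 * se

random.seed(42)
# Fake per-iteration DPS values: normally distributed around 400k.
samples = [random.gauss(400_000, 20_000) for _ in range(100_000)]
mean, err = dps_error(samples)
print(f"DPS: {mean:.0f} +/- {err:.0f}")
```

Note how the half-width shrinks like $1/\sqrt{n}$: quadrupling the iteration count should halve the reported error, which is exactly the scaling that broke down in my data.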

Thinking back to our dice experiment in Part I, we said that if we repeated the experiment 100 times, we’d expect that about 95 of them would fall within the range $\mu_{\rm sample}\pm 2{\rm SE}_{\mu}$ (I’m rounding 1.96 to 2 here for simplicity). That was the meaning we ascribed to the 95% confidence interval. So one way to test the system is to do exactly that: run the simulation 100 times and take a look at the distribution of sample means.

And just to be abundantly clear about what that means, let’s assume we’re interested in the simulation error when we run it for 100k iterations. We can do that once to get a sample mean and 95% CI. We can then do it 99 more times, running the sim for 100k iterations *each time*, which gives us 100 sample means from the 100 independent simulations.

Our best guess at the population mean $\mu$ is the mean of those 100 sample means $\mu_{\rm sample}$ (I feel like I need an Xzibit image here…). And we could then empirically determine a value $\delta$ such that 95 of those means fit in the range $(\mu-\delta,\mu+\delta)$. If we did that, then $2\delta$ is our empirical estimate of the 95% CI. We could compare that to twice the value SimC reports as “DPS Error” to check for consistency.

There are a number of ways to make that empirical estimate, but two of them are relatively easy in MATLAB. The first is to use the prctile() function, which we can use to find the DPS values at the 2.5th and 97.5th percentiles of the data set. The difference of those two values is the empirical estimate of the 95% CI.
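Here’s what the percentile method looks like in Python (standing in for MATLAB’s prctile() here; the sorting-based percentile lookup is a simplified sketch, and the numbers are invented):

```python
import random

def empirical_ci95(means):
    """Empirical 95% CI width: the spread between the 2.5th and
    97.5th percentiles of the collection of sample means."""
    s = sorted(means)
    n = len(s)
    lo = s[int(round(0.025 * (n - 1)))]
    hi = s[int(round(0.975 * (n - 1)))]
    return hi - lo

random.seed(1)
# 100 fake "sample means", one per independent simulation, with an
# invented spread of 50 DPS around a 400k mean.
means = [random.gauss(400_000, 50) for _ in range(100)]
width = empirical_ci95(means)
# For normal data, the 95% interval spans about 3.92 standard deviations.
print(f"empirical 95% CI width: {width:.0f} DPS")
```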

The second method is more involved, and uses Principal Component Analysis, or PCA. It also goes by a number of other names: eigenvalue decomposition, empirical component analysis, singular value decomposition, and several more. It’s related to finding principal axes in mechanics, if you’re familiar with mechanical engineering concepts. It attempts to find the confidence region (or “confidence ellipsoid”) of the data set, which is the generalization of a confidence interval into higher dimensions. When you apply it to a one-dimensional data set, though, you get the usual confidence interval.

In any event, it’s a powerful linear algebra technique that would require another whole blog post to explain, so if you’re really interested in the guts of it I suggest you read the Wikipedia article. For those that care, I’m using a function from this thread on MATLAB Central, which uses the finv() and princomp() methods from the statistics toolbox. (Fun coincidence: I worked in the same building as the author of this code as a postdoc, though in a different department.) The only change I’ve made is a minor correction; I’m fairly certain that the line

ab = diag(sqrt(k*lat));

should be

ab = diag(k*sqrt(lat));

so I’ve made that correction. Without it, the 95% CIs the code produces are approximately half the size they should be (because $k\approx 4$), as verified with a normally-distributed test data set I generated for that purpose. With the correction, the PCA estimate agrees very well with the percentile-based estimate (as it should!).

So, armed with two techniques to empirically estimate the 95% confidence interval, I set to the task of doing that for various simulation lengths. In other words, run 100 simulations with 50 iterations each, then do it again for 100 iterations, and again for 250 iterations, and then for 500, 1000, 2500, 5000, 10000, 25000, 50000, 100000. I did all of this with the T16H protection paladin profile and “default” settings in SimC.

That takes a while – the whole set of runs takes 5-8 hours depending on how many threads I use. But at the end, we get a graph that looks something like this:

It’s a little harder to tell what’s going on in the top plot because it’s a semilog, but the bottom log-log plot shows the problem very clearly. At 1000 iterations ($10^3$) the three error estimates agree very well. However, around 5000 iterations we see the observed error exceeding the reported error, and as we increase the number of iterations further the gap just gets larger. By 100000 iterations ($10^5$), we’re reporting a confidence interval of almost 100 DPS, but observing a confidence interval of nearly 500 DPS.

This is a problem – it means that we’ve effectively hit an “error floor” in SimC, because no matter how many iterations we throw at the problem, the error doesn’t seem to improve. And that’s pretty weird. But why?

**Results Hazy, Ask Again Later**

The “why” took a little more thinking. I’ve had several discussions over the past month with other SimC devs and a few academics about what might cause this sort of thing. As it turns out, everyone I spoke to had the same first guess that I did. If you remember back to Part I, we said that our error estimates were based on the Central Limit Theorem. Maybe we were violating the CLT somehow, and as a result our actual errors were larger than we expected?

If you recall, the constraints on the CLT were that each iteration needed to be **independent** and **identically distributed**. In other words, none of the iterations should depend on any of the previous iterations, and the probability distribution we’re sampling shouldn’t change from iteration to iteration. Of the two, dependence seemed like the more likely culprit.

I should note that while this was the first thought I had, the second thought I had was “but how?” Most people I talked with were similarly stumped at first. The thing that stuck out to us as the most likely culprit also seemed… somewhat unlikely. And that was the “vary_combat_length” option in SimC.

See, the default setting in SimC is to vary the combat length from iteration to iteration to smooth out the impact of cooldowns and other fixed-time-interval effects. To illustrate that concept, let’s say you had a spell with a 1-minute cooldown that gave you a 30-second buff that significantly increased your DPS (say, Avenging Wrath on steroids). If you ran the sim for exactly 1 minute and 30 seconds, you’d get two casts of that spell (once at the pull, once at the 1-minute mark) and you’d have 66.67% uptime on that buff. But if you ran the sim for exactly 2 minutes, you’d have the same two casts but only 50% uptime. Your DPS would look really good in the first sim, and significantly lower in the second sim.

So to try and reduce that problem and give a more holistic view of your DPS that accounts for fluctuations in fight length, SimC varies the fight length by up to 20% from the default of 450 seconds. That way you get a spread of cooldown uptimes that more accurately represents an average encounter.
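To put numbers on the cooldown example above, here’s a quick sketch (hypothetical Python, using the 1-minute cooldown and 30-second buff from the text):

```python
def buff_uptime(fight_length, cooldown=60.0, duration=30.0):
    """Fraction of the fight covered by a buff that's cast on cooldown,
    starting with a cast at t = 0."""
    covered, t = 0.0, 0.0
    while t < fight_length:
        covered += min(duration, fight_length - t)
        t += cooldown
    return covered / fight_length

print(buff_uptime(90))   # two casts in 90 s:  60/90, ~66.7% uptime
print(buff_uptime(120))  # two casts in 120 s: 60/120, 50% uptime
```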

The reason that we thought this was an unlikely candidate was that it wasn’t clear how this violated either of the CLT constraints. See, SimC doesn’t just run arbitrarily for 450 seconds by default. It does that for the first iteration, during which it tallies up the amount of damage you do, and then for subsequent iterations it gives the boss that much health and lets you go to town on it, varying the health accordingly to get longer or shorter runs.

So varying the combat time doesn’t change the relative amount of time you spend in execute range, for example. That’s important, because if you spent e.g. half of the fight in execute range, and you do more DPS in execute range, then you’ve changed the probability distribution being sampled, so we’d be violating the “identically distributed” constraint.

However, the variation in combat length isn’t random either – it follows a predetermined sequence, where it alternates between extremes. As a rough example, it might start with a run that’s 20% shorter than the average, which we’ll call “-20%.” It would follow that with a run that’s 20% longer than average, or “+20%.” And then one that’s -19%, followed by another at +19%, followed by -18%, and so on. Note that these aren’t relative to the previous iteration – they’re all relative to the target length of 450 seconds. So in theory, these shouldn’t be violating the independence clause on that account. But they *are* somewhat deterministic because of the patterning.
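The pattern as I’ve roughly described it (the real SimC sequence may differ in its details) is easy to sketch:

```python
def length_offsets(max_pct=20, step=1):
    """Alternating fight-length offsets, all relative to the target
    length of 450 seconds: -20%, +20%, -19%, +19%, and so on."""
    offsets = []
    pct = max_pct
    while pct > 0:
        offsets.append(-pct)
        offsets.append(+pct)
        pct -= step
    return offsets

print(length_offsets()[:6])  # [-20, 20, -19, 19, -18, 18]
```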

So, it felt unlikely that this was the problem. But we really weren’t sure. So we tested it by repeating the experiment with the command-line argument “vary_combat_length=0” to disable the combat length variation code. And five to eight hours later, the result was this:

Well, that didn’t help. So at the very least, the combat length variation code isn’t the only problem. We can’t rule it out completely based on this data, because it’s possible (if unlikely) that it is one of two or more contributing factors. But it certainly looks like the culprit lies elsewhere.

**Death and Decay**

The next candidate we came up with was a quirk of how the boss health calculation works. I glossed over this above by saying that we determine the boss’s health based on the damage done in the first iteration. But that’s not really the whole story.

There’s no guarantee that the first iteration is a representative sample of your DPS. Maybe in that first iteration you had an unusually low number of crits or Grand Crusader procs, so your DPS was below average. In that case, the health we assign the boss for iteration #2 will be a little low, and you might blow through it in 425 seconds rather than 450 seconds. If we kept using that boss health value, we may find that after a large number of iterations the mean combat length is only 430 seconds rather than our target of 450 seconds.

So Simulationcraft incorporates that information by performing a moving average on boss health as we go. If iteration #2 was significantly shorter, it will add a little health to the boss for the next one. It basically makes an educated guess at how much more health it would take to bring the average back up to 450 seconds. It does that for each iteration, though with some amount of decay built-in to keep things from oscillating out of control. The technique is very good at homing in on an average of 450 seconds of combat after many iterations. This is called the “enemy health estimation model,” and it’s what SimC uses by default.
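The idea behind that moving average can be sketched like this (hypothetical Python; the update rule and the decay constant are illustrative, not SimC’s actual implementation):

```python
def update_health(health, actual_length, target_length=450.0, decay=0.5):
    """Nudge boss health toward the value that would have made the last
    iteration hit the target length, damped by a decay factor to keep
    the estimate from oscillating."""
    # Health implying an exact target_length kill at last iteration's DPS.
    implied = health * target_length / actual_length
    return health + decay * (implied - health)

h = 171_000_000.0
for length in [425, 440, 460, 448]:  # observed kill times, in seconds
    h = update_health(h, length)
print(f"adjusted boss health: {h:,.0f}")
```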

Unfortunately, it also means that each iteration is slightly dependent on the previous ones. If iterations one through 50 were a little short, then iteration 51 gets a little longer. Again, it’s not clear that this is a strong enough effect to matter, but we just weren’t sure, and it’s a pretty obvious place to check if you’re worried that dependence between iterations is a problem.
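To see why even a weak coupling between iterations is worth worrying about, here’s a toy Monte Carlo (hypothetical Python, with numbers that have nothing to do with SimC) in which each draw carries over part of the previous one, an AR(1)-style stand-in for the boss-health feedback. The naive $\sigma/\sqrt{n}$ estimate badly understates the true spread of the sample mean:

```python
import math
import random

def sample_mean(n, rho, rng):
    """Mean of n correlated draws: an AR(1) chain with unit variance,
    where rho controls how strongly each draw depends on the last."""
    x, total = 0.0, 0.0
    for _ in range(n):
        x = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        total += x
    return total / n

rng = random.Random(7)
n, rho, trials = 1000, 0.9, 400
means = [sample_mean(n, rho, rng) for _ in range(trials)]
observed = math.sqrt(sum(m * m for m in means) / trials)
reported = 1 / math.sqrt(n)  # what the CLT predicts for independent draws
print(f"reported SE: {reported:.4f}, observed SE: {observed:.4f}")
```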

There are two ways we can reduce the impact of health recalculation in SimC. The first is to use a time-based model with the command-line option “fixed_time=1”, which tells the sim to run for exactly 450 seconds, period. It will still perform the boss health recalculation from iteration to iteration, but since we’re stopping the sim based on time, that won’t cause excessively long or short runs. This option also respects the user’s choice of the vary_combat_length option, and adjusts the time accordingly unless it’s disabled.

The second way is to use the Fixed Enemy Health model by setting “override.target_health=X”. This forces the boss to have exactly X health every iteration, and the sim ends when the boss runs out of health. So it automatically disables combat length variation and the health recalculation effect. This is the pinnacle of having independent trials, because it removes any possible dependence on previous runs.

So I ran three more configurations: One with fixed_time=1 and vary_combat_length left at the default of 0.2, one with fixed_time=1 and vary_combat_length=0, and one with target_health=171000000 (roughly appropriate for a 450-second run at ~400k sustained DPS).

Did I mention that each of these takes 5-8 hours?

Days later, here’s what I got out of the experiments:

Now we’re getting somewhere. It seems from this data that the fixed_time setting didn’t change anything, but fixing the target health *did*. The fixed health simulation gives us results in excellent agreement with the theoretical results. So we really are looking for a violation of the Central Limit Theorem, at least somewhere.

But where? Was it in the health recalculation? Or the combat length variation? Or something else entirely that I overlooked?

**Class Warfare**

Around this time, one of the other SimC devs asked me if I had tested this with other specs or classes. The thought being that maybe it was an issue specific to paladins. And of course, I hadn’t yet, because each experiment takes five to eight hours to run, and I was in the middle of the last of the three runs above. But it was definitely on my to-do list to run a few other specs as a control group.

So I queued up a few more of these experiments for other specs. For example, using the T16H retribution paladin profile:

Note that this is with default settings – the same settings that cause the error anomaly with protection. I ran a few more experiments to test enhancement shamans and protection warriors, with similar results. All of the *other* classes seemed to be obeying the CLT, even with combat length variation and health recalculation active. And even retribution seemed to be working properly under those conditions. It’s as if the problem was specific to protection paladins!

Which really meant that I thought the problem was something I did in the paladin module – i.e. it was my fault. So of course, I immediately went digging through the paladin module looking for anything that would link one iteration to the next. Maybe I wasn’t re-initializing everything properly, so the state at the end of one iteration was somehow influencing the next? But after a few hours of combing through the code, I came up empty-handed. Nothing seemed to be persisting between iterations.

So I started debugging, literally running a few simulations and checking the state of the paladin at different break points in the process of the simulation. Combing through all of the relevant properties of the paladin object in Visual Studio, searching in vain for something – anything – that wasn’t being reset properly. And while I didn’t find anything, it did cause me to stumble over the answer in the dark almost by accident.

**Fix Me Up, Before You Go Go**

What I stumbled across was the fixed_time flag. I was running the T16H protection paladin profile through the simulation with completely default settings, and at one of my breakpoints I happened to notice that the fixed_time flag was active. Needless to say, this was… odd. It shouldn’t be on in a default simulation. Unable to figure out why it was on, I consulted the other devs, and was pointed to an old piece of code that had been hiding in the shadows:

if ( p -> primary_role() != ROLE_HEAL && p -> primary_role() != ROLE_TANK && ! p -> is_pet() ) zero_dds = false;

If you’re not fluent in C++, that’s checking to see that the actor’s role is not “healer” or “tank”, and also that the actor is not a pet. And if the actor is none of those things, it sets a flag to false. Later on, that flag is used to forcibly enable “fixed_time=1” if the flag is true. So in other words, the sim automatically shifts into fixed-time mode if you’re simming a healer or a tank!

Now, at the time it was written, this code made sense. Keep in mind that Simulationcraft started out primarily as a DPS spec simulator. While it has the guts to support healers and tanks, it wasn’t until fairly recently that either of those roles was really supported *well*. Arguably, healers still aren’t, for a variety of reasons, and a lot of the reason that it’s been improving for tanks is that I got involved and started implementing stuff that we wanted to see.

That’s not meant as a shot at the existing SimC devs either, by the way. These folks work incredibly hard to improve and maintain the project, but it’s a hobby for all of us, and there’s more than enough work to be done keeping it running properly for DPS specs. Getting solid support for, say, tanking pretty much requires a dev who has the interest and time to spend implementing tanking stuff, not to mention other devs who are willing to maintain the tanking part of each class module. And that didn’t really happen until I got involved and gave tanks a reason to care about the results (being able to calculate TMI, and correcting a bunch of minor errors in combat, mitigation, and Vengeance calculations).

It’s also why I suspect healers won’t be well-supported until a serious healing theorycrafter decides to say, “here’s what we need the sim to do in order to be useful to us,” and then wade in and make those changes.

But back to the point, if you’re simming a healer, you’re not putting out any DPS. It makes very little sense to base the simulation time on boss health in that scenario, so you’d clearly want to default to a fixed-time model for a healer. That line of code was basically just a catch-all to say, “only use the boss health estimation model for DPS classes/specs.” The fact that it enabled it for tanks was mostly an afterthought, because nobody was using Simulationcraft to simulate tanking at that point.

In any event, this was a giant clue that the problem had to do with the fixed_time option, so we dug into that in more detail. What I learned, mostly from discussion with the other devs, is that fixed-time mode did a bunch of things it really shouldn’t. The root of the problem was that it was still basing the boss’s health percentage on the health recalculation algorithm in this mode. That poses two major problems:

- The boss still “died” when it reached 0% health, which meant that you could end the simulation earlier than your target time if you happened to be lucky on that iteration (i.e. had above-average DPS).
- If you had an iteration of below-average DPS, the simulation would hard-stop at the target time. So if you were supposed to run for 450 seconds, and the boss wasn’t dead yet, tough – the simulation just ends.

That seems perfectly logical, but it causes some major CLT violations. Hard-stopping the simulation at 450 seconds is essentially throwing a Heaviside function (or “step function”) into the mix. It’s saying, “we don’t care what happens after this point, and we’re going to ignore it.” But natural variations in DPS output should cause some iterations to be shorter than 450 seconds and other iterations to be longer. The hard-stop only applies to the longer runs, which means we’re artificially affecting some of our iterations but not others.

To see why this is a problem, consider the following two scenarios:

- An iteration where you had exactly average DPS, and the boss dies exactly at 450 seconds. You enter execute range ~370 seconds into the fight, so you spend about 80 seconds in execute range. Note that this is a little less than 20% of the time, because I’m assuming your DPS goes up in execute range.
- An iteration where you had bad luck and produced below-average DPS. You don’t enter execute range until ~400 seconds into the fight as a result, so you only get 50 seconds of execute range. The simulation forcibly ends combat at 450 seconds with the boss still having 5%-10% health remaining.

The second scenario should produce even lower DPS than expected, because not only did you have bad luck during the initial part of the iteration, but you were robbed of 30 seconds (or more) of higher-DPS execute time. Statistically, that means we’re changing the underlying probability distribution, because the relative time spent in execute range is changing *significantly* from iteration to iteration.

And that violates one of our CLT conditions – each iteration needs to be **identically distributed** if we want to be able to use the CLT. If we spend 10% of our time in execute range on one iteration, but 20% on another iteration, and 15% on a third iteration, that condition isn’t being adhered to anymore, and we can’t expect our error to conform to the predictions of the CLT.
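That effect is easy to reproduce in a toy model (hypothetical Python with invented numbers: boss health tuned so that an average 400k DPS run kills at exactly 450 seconds, and a 1.3x DPS “execute” phase over the last 20% of boss health). The hard stop stretches the low side of the per-iteration DPS distribution and inflates its spread:

```python
import random
import statistics

D0, T, BONUS = 400_000.0, 450.0, 1.3
# Boss health implying a 450-second kill at exactly average DPS.
H = T * D0 / (0.8 + 0.2 / BONUS)

def reported_dps(d, hard_stop):
    """DPS recorded for an iteration whose 'luck' gives base DPS d."""
    t_normal = 0.8 * H / d                 # time to reach execute range
    t_kill = t_normal + 0.2 * H / (BONUS * d)
    if not hard_stop or t_kill <= T:
        return H / t_kill                  # boss dies; full execute benefit
    # Unlucky run: hard stop at T seconds with the boss still alive.
    damage = min(T, t_normal) * d
    if T > t_normal:
        damage += (T - t_normal) * BONUS * d
    return damage / T

rng = random.Random(3)
draws = [rng.gauss(D0, 20_000) for _ in range(20_000)]
sd_free = statistics.pstdev([reported_dps(d, False) for d in draws])
sd_stop = statistics.pstdev([reported_dps(d, True) for d in draws])
print(f"DPS spread without hard stop: {sd_free:.0f}, with: {sd_stop:.0f}")
```

The spread only grows on the low-DPS side, which is exactly the kind of iteration-to-iteration distribution change the CLT forbids.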

**Ti-i-i-ime Is On My Side**

The correction, which was made in this commit, was to fix the way we calculate health percentage. Instead of using boss health in fixed_time mode, we now ignore boss health entirely and use time to estimate boss health. For example, if you’re running a 450-second simulation and you’re 270 seconds into it, the health_percentage() function just returns the percentage of time left in the simulation: $100\%\times(1-270/450)=40\%$. This fixes both of the problems above: we’re no longer chopping off low-DPS runs and skewing our distribution, and the boss can’t die early on high-DPS runs because the sim calls health_percentage() to determine if the boss is dead yet. And if we rebuild the simulation after that commit and run the T16H protection paladin profile through it, we get this:

Excellent. We’re now getting proper agreement with the CLT estimate even out as far as $10^5$ iterations. And we can expect that trend to continue as iteration numbers increase, because we’re not violating any of the CLT conditions anymore.
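For reference, the corrected calculation is simple enough to sketch (hypothetical Python mirroring the $100\%\times(1-270/450)=40\%$ example above):

```python
def health_percentage(current_time, max_time=450.0):
    """In fixed_time mode after the fix, boss 'health' is just the
    percentage of simulation time remaining."""
    return 100.0 * (1.0 - current_time / max_time)

print(health_percentage(270))  # 40% remaining at t = 270 s
```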

In a later commit, the line of code quoted above was changed to remove the tank role check as well. In other words, we’re no longer running in fixed-time mode all the time, which is fine because we produce stable enough DPS that the health recalculation algorithm should work properly. While that doesn’t have a significant impact on the results, it’s nice to know that we use the same defaults as most other specs (excepting healers, of course).

If you were attentive, you may have noticed that I tested protection warriors and found that they weren’t exhibiting the same error behavior. Now that you know what the problem was, you may ask, “why not?” After all, they’re tanks, so they were also being forced into a fixed-time mode when being simmed. So what gives?

If you guessed “they don’t have an execute range,” pat yourself on the back. Oh sure, warriors have Execute – the entire “execute range” term is named for it, after all. But if you take a quick look at the T16H protection warrior profile, you’ll notice that it isn’t being used. Which makes sense, because a tank that’s actually in danger would rather use that rage on Shield Barrier for more survivability. Since the T16H protection warrior profile doesn’t change the player’s behavior during execute range, it’s irrelevant how much time they spend there, because their DPS doesn’t change when the boss drops below 20%. So the types of variations that caused error bloat for the protection paladin profile simply don’t exist in the protection warrior profile.

**OMG TLDR**

If you’re thinking to yourself, “Man I really don’t want to read 4600 words, could you get to the point already,” this section is for you. In short, here’s what happened:

- The simulation was forcing tanks into a “fixed-time” mode, where the sim runs for X seconds and stops if it reaches that time regardless of boss health.
- As a result, the relative amount of time spent in execute range could change significantly from iteration to iteration based on your DPS, changing the underlying probability distribution.
- Changing the underlying probability distribution violates the Central Limit Theorem, and makes Simulationcraft’s reported error estimate inaccurate, far lower than the actual error.
- We fixed it by (a) changing the way we calculate the boss’s health percentage in fixed-time mode, and (b) not forcing tanks into fixed-time mode in the first place.

For anyone who didn’t skip to the end, I hope this was an enjoyable read, and less technically grueling than Part I was. It was fun (if time-consuming) to write, if only because I get to mix in concepts that I use frequently in a professional (experimental physics) context, like error analysis and experimental design, with theorycrafting and simulation.

I think many people don’t realize how intertwined the two are in practice. I’m sure a lot of theorycrafters, especially newer ones or ones without a strong science background, ignore error entirely. It’s a lot easier to just look at things like mean DPS or HPS, damage per resource spent, or similar metrics. But especially when it comes to simulation, it’s important to know how good your estimates are and whether you can trust them.

Part of my goal in this pair of posts was to provide a good example of how one goes about doing that, and why. Together, they’re a good introduction to how to properly perform error analysis on results and what to look for when you find results that don’t meet your expectations. Hopefully, at least a few theorycrafters come out of reading these posts feeling like they’ve added a new tool to their skill set.

And more generally, that non-theorycrafters leave with a sense of what it means to talk about statistical (i.e. random) error. I’ll consider it a success if a few people walk away from this set of posts saying, “You know, I never understood how this works before, but now I get it.”