In the last post, we went through the analytical calculation to derive the total damage reduction (TDR) scale factors for dodge/parry, mastery, hit, expertise and haste. While the results seem reasonable, it’s always good to double-check, often using a different method entirely. Today, we’re going to use the Monte Carlo simulation to see if it predicts similar scale factors. In addition, the Monte Carlo gives us some information that would be trickier (though not impossible) to get from the analytical version.

As usual, the code is publicly available through the matlabadin project. The two files involved are:

montecarlo.m

montecarlo3.m

The first file is the guts of the Monte Carlo simulation, and handles all of the game mechanics. The second file is primarily a loop that runs the simulation the appropriate number of times. We’ll end up running this with a few different configurations to see how the simulation time and number of simulations affect the outputs. Character stats are chosen to be similar to the analytical formulation: 2% hit/expertise, 30% avoidance (5% miss, 15% parry, 10% dodge), 20 mastery, 33% block chance. I errantly used 10% haste instead of 5% haste (i.e., melee instead of spell) for the cooldown reductions granted from Sanctity of Battle, but that shouldn’t cause a huge discrepancy.

To start, let’s look at the results for $N=10$ samples of $\tau = 10{,}000$ minutes of combat each, for a total “integration time” of $N\tau = 100{,}000$ minutes of combat. The simulation calculates the damage reduction afforded by 600 of each type of rating. To put those results in a form that’s a little easier to compare to the scale factors from the analytical calculation, we’ll normalize by the mastery value (since that came out to be close to 1 in the analytical formulation). The table below presents the mean ($\mu$), standard deviation ($\sigma$), and standard deviation of the mean ($\sigma_{\rm mean}$) of the 10 samples.

| Stat | Dodge | Hit | Exp | Haste | Mast |
|---|---|---|---|---|---|
| $\mu$ | 1.2471 | 0.6192 | 0.6033 | 0.4841 | 1.0000 |
| $\sigma$ | 0.0638 | 0.0608 | 0.0634 | 0.0648 | 0.0531 |
| $\sigma_{\rm mean}$ | 0.0202 | 0.0192 | 0.0200 | 0.0205 | 0.0168 |

The scale factors (means) tell an interesting story. Just as in the analytical calculation, dodge comes out ahead of mastery, though the lead is slightly diminished. However, hit, expertise, and haste all perform much better than they did in the analytical version. If you’ll recall, the hit/exp scale factor was around 0.2 and the haste factor was around 0.15. The holy power generation stats all came out to be about 3x more important in the Monte Carlo simulation for some reason. Looking at the code, I can’t find any obvious factor-of-three error, but sometimes it’s tough to find an error like that in your own code. I encourage the programmer types in the audience to take a look at it.

If it’s a less obvious bug in the code, I’d need to do some debugging while the code is running. I usually do that by looking for observables that don’t make sense in the context of the simulation parameters. For example, we might find that hit/exp/haste were somehow triple-dipping somewhere, or producing negative miss probabilities, and so on. While I don’t have time to do that now, I may do it later on to figure out what the problem is. And that’s assuming there’s a problem in the first place – this Monte Carlo performed pretty well before, and it didn’t need to be changed very much to support the new mechanics, so it’s possible that the error is in the analytical derivation. That’s probably the first thing I’d check (and again, I encourage the calculus types in the audience to check my math).

Nonetheless, we can draw a few conclusions from this simulation. We’re still seeing that dodge/parry beat out mastery for TDR, and mastery maintains its lead over hit, expertise, and haste. So the order of stat priority hasn’t changed, even though the gap between mastery and the HPG stats has narrowed. Again, there’s the question of whether these TDR stat weights are the best measure of survivability, but that’s a topic for another blog post.

Since statistical error is a concern in these sorts of situations, let’s look at the standard deviations for a moment. For those with less background in statistics, the standard deviation describes the spread of the sample data. The interval $\mu \pm 2\sigma$ contains about 95% of the data, provided the samples are roughly normally distributed. So if $\sigma$ is large, it means that any individual trial (i.e., any one of our ten samples) may deviate fairly significantly from the mean. On the other hand, if $\sigma$ is small, it means that each trial is fairly consistent. In this simulation, $\tau$ is very large, so each sample is fairly consistent, leading to a fairly small $\sigma$. Thus, any given 10,000-minute simulation will be a pretty good representative of the overall stat weights.

The standard deviation of the mean ($\sigma_{\rm mean} = \sigma/\sqrt{N}$) is a subtly different quantity. It tells us about the variation *of the mean* rather than about an individual sample. While we’d expect that an individual 10,000-minute simulation would fall within $\mu \pm 2\sigma$ 95% of the time, the mean of 10 runs will experience a smaller variation. If we ran the entire simulation again (all $N=10$ runs at 10,000 minutes each), the mean of the simulation will fall within $\mu \pm 2\sigma_{\rm mean}$ 95% of the time. Thus, the standard deviation of the mean represents our best estimate of the confidence we have in the value of $\mu$. And as you can see, it’s pretty good – with this amount of integration time, we feel pretty confident that the mean is within $\pm 0.04$ of the calculated value. Note that this is only talking about the statistical confidence of the calculation – if there’s a bug in the code, the values could be much farther off. So we know that the calculation is fairly consistent at this amount of integration time, but it doesn’t mean we haven’t made an error that causes us to be consistently *wrong*.
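As a quick sanity check on the relation between $\sigma$ and $\sigma_{\rm mean}$, the short Python sketch below recomputes the standard deviation of the mean from the $\sigma$ row of the $N=10$ table above. It is not part of matlabadin; the function and variable names are illustrative.

```python
import math

N = 10  # number of samples in the first run

# Sample standard deviations copied from the sigma row of the N = 10 table.
sigma = {"Dodge": 0.0638, "Hit": 0.0608, "Exp": 0.0634, "Haste": 0.0648, "Mast": 0.0531}


def std_of_mean(sigma_value, n_samples):
    """Standard deviation of the mean of n_samples independent samples."""
    return sigma_value / math.sqrt(n_samples)


sigma_mean = {stat: std_of_mean(s, N) for stat, s in sigma.items()}

for stat, sm in sigma_mean.items():
    # 2 * sigma_mean is the half-width of the approximate 95% interval on the mean.
    print(f"{stat:5s}  sigma_mean = {sm:.4f}   95% interval = +/- {2 * sm:.4f}")
```

The recomputed values should agree with the $\sigma_{\rm mean}$ row of the table to within rounding, and doubling them gives the roughly $\pm 0.04$ interval quoted above.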

Next, let’s see how things change if we tweak our simulation settings a bit. What if we bump up $N$ and drop $\tau$ to compensate? We’ll keep the total integration time $N\tau$ constant, which should keep $\mu$ and $\sigma_{\rm mean}$ about the same. If we run the simulation with $N=100$ samples of $\tau= 1{,}000$ minutes, we get the following results:

| Stat | Dodge | Hit | Exp | Haste | Mast |
|---|---|---|---|---|---|
| $\mu$ | 1.2133 | 0.5825 | 0.6034 | 0.4853 | 1.0000 |
| $\sigma$ | 0.2069 | 0.2056 | 0.2044 | 0.2047 | 0.1887 |
| $\sigma_{\rm mean}$ | 0.0207 | 0.0206 | 0.0204 | 0.0205 | 0.0189 |

The only significant difference here is that $\sigma$ has ballooned from 0.06 to 0.2. That means that each individual trial has more variation than it used to, which makes sense because the integration time of each individual trial has been reduced. We still feel pretty confident about the means, just not about any particular 1,000-minute segment of combat.
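The scaling here is what you’d expect if each sample behaves like an average over $\tau$ roughly independent minutes: then $\sigma \propto 1/\sqrt{\tau}$, and $\sigma_{\rm mean} = \sigma/\sqrt{N} \propto 1/\sqrt{N\tau}$ depends only on the total integration time. The toy Python model below illustrates that behavior. It is not the matlabadin simulation: it replaces all of the game mechanics with independent Gaussian noise per minute, and the per-minute spread of 6.5 is an assumption, picked only because the table above implies something of that order ($0.2069\sqrt{1000} \approx 6.5$).

```python
import math
import random
import statistics

PER_MINUTE_SIGMA = 6.5  # assumed spread of one minute of combat (not a matlabadin value)
TRUE_MEAN = 1.0         # assumed true normalized scale factor


def run_config(n_samples, tau_minutes, rng):
    """Return (mu, sigma, sigma_mean) for n_samples samples of tau_minutes each."""
    samples = []
    for _ in range(n_samples):
        # One sample is the average of tau_minutes independent one-minute values.
        minutes = [rng.gauss(TRUE_MEAN, PER_MINUTE_SIGMA) for _ in range(tau_minutes)]
        samples.append(statistics.fmean(minutes))
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)          # spread of individual samples
    sigma_mean = sigma / math.sqrt(n_samples)  # uncertainty in mu
    return mu, sigma, sigma_mean


if __name__ == "__main__":
    rng = random.Random(0)
    # Same total integration time (N * tau = 100,000 minutes) in every row.
    for n, tau in [(100, 1000), (1000, 100), (10000, 10)]:
        mu, sigma, sigma_mean = run_config(n, tau, rng)
        print(f"N={n:6d}  tau={tau:5d}  mu={mu:.4f}  sigma={sigma:.4f}  sigma_mean={sigma_mean:.4f}")
```

In this toy model, each tenfold cut in $\tau$ should raise $\sigma$ by about $\sqrt{10} \approx 3.2$ while $\sigma_{\rm mean}$ stays near $6.5/\sqrt{100{,}000} \approx 0.02$. Real combat has correlations between minutes (cooldowns, holy power pooling), so the actual simulation need not follow this exactly.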

Let’s run this again for $N=1{,}000$ samples of $\tau= 100$ minutes:

| Stat | Dodge | Hit | Exp | Haste | Mast |
|---|---|---|---|---|---|
| $\mu$ | 1.1932 | 0.5953 | 0.5952 | 0.4390 | 1.0000 |
| $\sigma$ | 0.6914 | 0.6537 | 0.6732 | 0.6581 | 0.6605 |
| $\sigma_{\rm mean}$ | 0.0219 | 0.0207 | 0.0213 | 0.0208 | 0.0209 |

Again, $N\tau$ hasn’t changed, so our means and standard deviations of the mean are still pretty consistent. But now the standard deviation is up to roughly 0.65 to 0.69, which is larger than the hit, expertise, and haste means. In fact, some of the samples actually show a net TDR *loss* when adding 600 haste rating, simply due to statistical fluctuations. And the situation just gets worse as we reduce $\tau$ further. If we try $N=10{,}000$, $\tau= 10$ minutes, we get:

| Stat | Dodge | Hit | Exp | Haste | Mast |
|---|---|---|---|---|---|
| $\mu$ | 1.2418 | 0.6105 | 0.6228 | 0.5032 | 1.0000 |
| $\sigma$ | 2.1871 | 2.1687 | 2.1676 | 2.1975 | 2.1888 |
| $\sigma_{\rm mean}$ | 0.0219 | 0.0217 | 0.0217 | 0.0220 | 0.0219 |

Now every one of our standard deviations is larger than the mean. Note that this data represents the statistics of a large number of 10-minute fights, which is a fairly realistic encounter length. And to reiterate, the fact that $\sigma$ is larger than $\mu$ doesn’t mean we’re not confident in the accuracy of the mean; $\sigma_{\rm mean}$ is still around 0.02, so we’re confident the means are accurate. But we aren’t confident in the outcome of any individual encounter. *On average*, you gain more TDR by stacking mastery instead of hit, expertise, or haste. But for any individual sample, it’s a crapshoot.
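To put rough numbers on that single-sample uncertainty, the Python sketch below estimates how often one sample would show a net TDR loss from 600 haste rating, using the haste rows of the $\tau = 100$ and $\tau = 10$ tables above. It assumes the samples are normally distributed, which this post does not verify, so treat the output as an order-of-magnitude estimate. The function name is illustrative.

```python
import math


def prob_below_zero(mu, sigma):
    """P(X < 0) for X ~ Normal(mu, sigma), computed from the normal CDF."""
    return 0.5 * (1.0 + math.erf((0.0 - mu) / (sigma * math.sqrt(2.0))))


# Haste means and standard deviations copied from the tables above.
p_loss_100min = prob_below_zero(0.4390, 0.6581)  # tau = 100 minutes
p_loss_10min = prob_below_zero(0.5032, 2.1975)   # tau = 10 minutes

print(f"Chance of a net loss in one 100-minute sample: {p_loss_100min:.1%}")
print(f"Chance of a net loss in one 10-minute sample:  {p_loss_10min:.1%}")
```

Under the normality assumption, roughly one in four 100-minute samples and about two in five 10-minute samples would show haste making things worse, even though the mean benefit is clearly positive.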

This illustrates why it’s difficult to draw meaningful conclusions from logs or in-game data. Most of us simply don’t generate enough data under a stable set of conditions to get statistically significant results. Comparing data from an evening with one gear set to another evening’s data using a different gear set isn’t going to cut it, because random fluctuations prevent you from being too certain of the results. And that’s if you’re even lucky enough to pretend all of the other variables are constant (i.e. what healers did you have, were buff uptimes from other players similar, did you spend the same amount of time being targeted or taking damage, etc.).

But in any event, this set of simulations lends some credence to the analytical calculation. We’re seeing the same patterns in the scale factor data, even if the hit/exp/haste values don’t line up perfectly. With any luck, we’ll find an error or bug that explains the discrepancy in those values by doing some debugging of the code and double-checking the analytical calculation. Once we understand the lack of agreement, we can draw some more convincing conclusions about the different stat weights, and discuss whether TDR is the proper metric to consider.
