Simulationcraft now has full support for protection paladin mechanics, and hopefully in a week or two I’ll get around to writing a how-to blog post on using it and interpreting the data. Once I write a few batch files, it should produce all of the DPS results I could generate from the MATLAB FSM code and more.
What it doesn’t have yet is a good smoothness metric with which we can assess our survivability. Of course, that’s not really Simcraft’s fault, because a good smoothness metric doesn’t exist yet.
So if we want a metric, we have to build it ourselves.
Up until now, we’ve looked at data and made qualitative assessments about the results to come up with statements like “X smooths better than Y.” But now we want to quantify that thought, which is a lot more difficult. We want a numerical estimate of how much better X is than Y. And to do that, we need to get a little introspective and think about what we’re doing when we make those qualitative assessments. We need to analyze our process and figure out how to translate it into numbers.
Data
So let’s start with some data. Below are the results of a 10k-minute sim like I usually show on the blog. We’ll use this sample data set for all of the following analysis. The gear sets are variants on the Control/Haste setup – the first is just C/Ha, followed by sets where I add 1000 of a given stat. In the case of hit and expertise, I subtract 1000 since we start at the cap. We’ll use a boss that swings for 350k after mitigation every 1.5 seconds, the standard SH1 finisher priority, back-calculated Seal of Insight with no overhealing apart from inherent, and Sacred Shield enabled.
Here are the gear sets:
Set:      C/Ha   Stam   Hit    Exp    Haste  Mast   Dodge  Parry
Str       15000  15000  15000  15000  15000  15000  15000  15000
Sta       28000  29000  28000  28000  28000  28000  28000  28000
Parry     1500   1500   1500   1500   1500   1500   1500   2500
Dodge     1500   1500   1500   1500   1500   1500   2500   1500
Mastery   1500   1500   1500   1500   1500   2500   1500   1500
Hit       2550   2550   1550   2550   2550   2550   2550   2550
Exp       5100   5100   5100   4100   5100   5100   5100   5100
Haste     12000  12000  12000  12000  13000  12000  12000  12000
And here are the simulation results:
Finisher = SH1, Boss Attack = 350k, SoI model = no-overheal, data set metric100003

Set:   C/Ha    Stam    Hit     Exp     Haste   Mast    Dodge   Parry
mean   0.261   0.261   0.270   0.269   0.255   0.255   0.255   0.255
std    0.103   0.103   0.106   0.106   0.102   0.102   0.104   0.104
S%     0.522   0.522   0.507   0.513   0.531   0.522   0.523   0.523
HP     755k    775k    755k    755k    755k    755k    755k    755k
nHP    2.158   2.215   2.158   2.158   2.158   2.158   2.158   2.158

4-Attack Moving Avg.
50%    46.002  45.122  48.989  48.683  43.896  44.417  43.900  43.963
60%    29.901  27.764  32.974  32.571  27.952  27.702  28.195  28.208
70%    16.203  15.421  18.813  18.474  14.560  15.093  14.995  15.120
80%    7.291   6.228   9.267   9.050   6.320   6.214   6.705   6.811
90%    2.526   1.744   3.623   3.422   2.056   2.137   2.264   2.311
100%   0.617   0.394   1.046   0.966   0.443   0.571   0.543   0.544
110%   0.103   0.060   0.232   0.196   0.066   0.080   0.088   0.088
120%   0.024   0.013   0.070   0.055   0.016   0.018   0.023   0.021
130%   0.002   0.001   0.009   0.007   0.001   0.001   0.002   0.001
140%   0.000   0.000   0.002   0.001   0.000   0.000   0.001   0.001
For now, I’ve narrowed our focus to 4-attack moving averages. In theory this will be equally applicable to any string size, but there’s little point in providing data that we’re not going to use at the moment. And for Simcraft, we’re going to have to choose a default time window for our moving average. The 3- and 4-attack moving averages are the ones we focus on the most, and 4 attacks is closest to the time window I had in mind (5-6 seconds).
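To make the quantity in those tables concrete, here’s a minimal Python sketch of the N-attack moving sum (the function name and toy numbers are mine, not Simcraft’s; the 755k health value comes from the table above):

```python
def moving_sums(damage_events, window=4):
    """Total damage taken over every window of `window` consecutive boss attacks."""
    return [sum(damage_events[i:i + window])
            for i in range(len(damage_events) - window + 1)]

# Toy example: six 350k swings, two of them fully avoided/absorbed
events = [350_000, 0, 350_000, 350_000, 0, 350_000]
spikes = moving_sums(events)   # [1_050_000, 700_000, 1_050_000]

# Express each window as a fraction of max health (755k, from the table above)
fractions = [s / 755_000 for s in spikes]
# the worst windows here come out around 1.39, i.e. roughly "140% health" spikes
```

The tables bin these per-window fractions into 10%-of-health categories and report what percentage of all windows lands in each bin.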
Now, let’s think about how we analyze this data. Normally we look at the top few categories and draw qualitative conclusions from that. For example, adding 1000 haste vs. 1000 mastery, we see that haste comes in lower (or equal) in spike representation for all rows at or above 90% player health. On the other hand, we pay less attention to rows that contain a large percentage of attacks across the board, because those are very likely to happen, and reducing the amount is rarely meaningful. If something happens 8% of the time rather than 9%, that’s not a huge change because it’s still going to happen a lot during an encounter, so you need to plan for it happening anyway.
Within those top few categories that we consider, we tend to put heavier emphasis on the larger spikes than the smaller ones. If we can significantly reduce spikes that are 130% of our health, then that’s perceived as a lot more important than an equal reduction in spikes that are 110% of our health, especially if there’s still a sizable percentage of those events.
So according to this data, we would conclude that haste is better than mastery, though not by a huge amount. Dodge and parry are both worse than haste, but stamina is a little better. It’s hard to say how hit/exp fare since we subtracted 1000 points instead of adding 1000 points, so ignore them for now.
Analyzing the analysis
Our qualitative assessment primarily looked at two factors:
Spike magnitude – A 130% spike is more important than a 120% spike, which in turn is more important than a 110% spike, and so on. We mentally assigned more importance to the largest spikes than the smaller ones. This has to be accounted for in our numerical analysis.
Spike frequency – This one is more complicated, because it’s not as straightforward as “bigger is worse.” We care about how frequently things happen, and we care about changes in that frequency. But not all changes are created equal. Some examples:
 If we have 5%-10% representation in a category, those are going to happen unless we eliminate them. Going from 7% to 6% (as we do in the 80% spike category when we add 1000 stamina to C/Ha) isn’t really all that meaningful a change, and shouldn’t be worth a whole lot.
 But a representation of 0.002% isn’t that likely at all, and may not even be worth worrying about. Going from 0.002% to 0.001% is probably not that meaningful either even though you’re halving the number of spikes, because those spikes weren’t very likely in the first place.
 Reducing a 1% chance to 0.1% would be a meaningful change, because that’s a pretty noticeable reduction from a nontrivial amount. You’re taking something that was very likely and making it fairly unlikely. Similarly for 0.1% to 0.01% – something unlikely to something really unlikely.
 Going from 0.01% to 0.001% is the same change (a factor of 10), but maybe not as important because it wasn’t very likely to begin with. Going from 0.001% to 0.0001% is almost irrelevant, because both are so unlikely.
 That said, there’s always some comfort in the certainty that an event can’t happen, so there’s almost always some (admittedly nebulous) value in reducing a representation to 0.000%. But it’s obviously more valuable when you reduce a nontrivial chance into 0%.
It’s easier to describe how to do this with some pictures. Below is the overall spike histogram for the haste and mastery gear sets. This is exactly the same data that’s in the table, just in bar-plot format, and instead of using bins that are 10% of your health wide, I’m using 2%-wide bins. The x-axis is the percentage of your health (expressed in decimal form, so 1.00 = 100%) and the y-axis is the number of events. The distribution is roughly centered around 50% health, with a large spike at 0% due to 4-attack avoidance or full-absorb strings. This is pretty standard for these sorts of histograms.
However, we don’t look at the whole histogram (in fact, the table only ever shows the top half of it). We look only at the highest-damage parts, which you can’t really see on that plot because the bars are tiny compared to the bulk of events in the middle. So in the plot below I’ve zoomed in on the very top end of the distribution. This figure shows the top 5% of all events – i.e., the 5% of events that have the highest damage value. Relating this to the table, it cuts off every row that has a number greater than 5.000 in the C/Ha column (around 82% player health is the cutoff).
We see that haste and mastery are pretty similar here, which isn’t surprising since their data isn’t that different. Haste is a little better, but it’s easier to see that on the table than in this plot because the table uses a coarser binning. Nonetheless, we want to use this plot (or rather, the data it’s showing) to generate a quantitative metric.
The first thought is to subtract the two bar plots from one another. In other words, do something like this:
We could just sum all of the bins and get a number that represents the difference between haste and mastery. However, that’s ignoring the fact that these events are not equal: the events at 1.3 (130% of your health) are much, much worse than the ones at 0.85 (85% of your health). So what we really want to do is apply a weight function. Basically, we multiply each bar by a number representing how important that bin is, and then perform the sum. You may be familiar with the term “weighted average,” which is exactly the operation we’re performing.
For a simple example, consider the table below. The first two data columns show the representations for the +1000 haste and +1000 mastery gear sets. The third data column shows the difference between these values, just like we’re showing in the differential histogram above. The next column is the weight factor, which represents how much we care about a certain category. The final column is the product of the difference and the weight factor, which gives us a numerical representation of how much “value” we’ve gained or lost in that spike category. And if we sum that column we get the weighted average, which is an overall “score” that tells us how much better or worse haste performs than mastery at smoothing.
Percentile   Haste Set   Mastery Set   Diff     Weight   Weighted Value
80%          6.320       6.214          0.106    0.25     0.0265
90%          2.056       2.137         -0.081    0.5     -0.0405
100%         0.443       0.571         -0.128    1.0     -0.128
110%         0.066       0.080         -0.014    2.0     -0.028
120%         0.016       0.018         -0.002    4.0     -0.008
130%         0.001       0.001          0.000    8.0      0
140%         0.000       0.000          0.000   16.0      0
                                                Sum:    -0.178
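The whole toy calculation fits in a few lines; here’s a sketch in Python with the representations copied straight from the table (the list names are mine):

```python
# Spike representations (percent of events) for the 80% .. 140% categories
haste   = [6.320, 2.056, 0.443, 0.066, 0.016, 0.001, 0.000]
mastery = [6.214, 2.137, 0.571, 0.080, 0.018, 0.001, 0.000]
weights = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

# Weighted average of the differences; a negative score means the haste set
# has fewer (weighted) spikes than the mastery set
score = sum(w * (h - m) for h, m, w in zip(haste, mastery, weights))
print(round(score, 4))  # -0.178
```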
In practice we’d do things slightly differently to get scale factors. We would start the same way, by calculating the histograms for a baseline (C/Ha) and for new configurations with 1000 of each stat added. But then we would subtract all of the +1000 sets from the baseline (rather than from each other), and then perform the same weighted average sum to get the scale factors describing how well each stat “smoothed” our damage intake.
Of course, we’ve glossed over the most important part of the problem: what weight function do we use? This is a critical consideration, because the weight function is everything. Get it right and you have a robust metric that works well; get it wrong and you have garbage. This is where our breakdown of the factors we used in the qualitative assessment comes into play. We want to weight the higher spike categories more heavily than the lower spike categories, and we want larger changes to be more valuable than smaller changes.
There are lots of functions we could choose – a simple linear function, the Fermi function we explored when modeling Seal of Insight, and dozens of others. But what felt natural to me was an exponential function. For example, $w(x) = e^{a(x-1)}$, where $a$ is some constant that determines how quickly the function changes and $x$ is the spike size (again, in decimal form, so 100% of your health is 1.0, 90% of your health is 0.9, and so on). For those who aren’t familiar with what an exponential function looks like, here it is:
So for example, the bin corresponding to 100% of your health is worth exactly $w(1)=e^{a(1-1)}=e^{0}=1$. The bin corresponding to 90% of your health is worth $w(0.9)=e^{a(-0.1)}=e^{-a/10}$. This form is a bit unwieldy because it’s hard to translate between the constant $a$ and the behavior of the function. We know that a larger $a$ will make the function steeper and increase the value change between one bin and the next, and that a smaller $a$ will reduce that value change. But it’s not obvious from looking at it what happens to the relative valuation of one spike size to another with an arbitrary change in $a$.
To make this more intuitive, let’s make a substitution. Let $a=10 \ln(h)$. That makes the equation $w(x)=e^{10 \ln(h) (x-1)} = h^{10(x-1)}$. Why does this make it easier? Well, consider what happens if we evaluate $w(x)$ for increments of 10% of your health, as I’ve done on the table below (shown for $h=2$):
x      50%   60%   70%   80%   90%   100%  110%  120%  130%  140%
w(x)   1/32  1/16  1/8   1/4   1/2   1     2     4     8     16
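Expressed in code (the helper name is mine), the HDF form makes the factor-of-$h$-per-decade behavior easy to verify:

```python
def weight(x, h=2.0):
    """Spike weight w(x) = h**(10*(x - 1)), with x the spike size as a fraction of max health."""
    return h ** (10.0 * (x - 1.0))

# Each 10% of health is worth one factor of h; with h = 2 this reproduces the table
for pct in range(50, 150, 10):
    print(f"{pct}%: {weight(pct / 100):g}")
```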
In other words, for every 10% of your health you get a factor of $h$. A spike that’s 10% larger is $h$ times more important, and a spike that’s 10% smaller is $1/h$ times as important. The single variable $h$ controls how much weight a bin gains or loses if it’s one health decade away from 100% health. So we could call it the health decade factor, or HDF for short. If our top events are at 140% health and the smallest event we want to consider is at 90%, the top events will be approximately $h^5$ more valuable than the bottom ones. Very straightforward.
The other reason this is good is that it works no matter where you are on the xaxis. If our top events were between 50% and 100% of our health, each bin would get multiplied by a smaller factor, but you still have the same relative $h^5$ weight between the top and the bottom. If we used a different function this might not be the case, and the metric wouldn’t work as well for an arbitrary distribution.
Of course, we still need a good value for $h$ that represents the effects we want. There’s no obvious choice here, either. It is, by its nature, sort of arbitrary. We’re not attempting to model an exact number we can expect to see in game, like DPS or DTPS. We’re trying to come up with a number that represents smoothness in damage intake, but the actual value isn’t that important. What is important is that the value mimics a thorough qualitative assessment. It doesn’t matter whether the value we get is 10 or 100 or 1000 as long as a similar amount of improvement from another stat gives a similar value and a larger or smaller amount of improvement gives a larger or smaller value, respectively.
Meloree and I discussed this at some length and guessed at a value of $h = 2$, which is also what I used for the weight function plot above. The idea here is that a 90% health spike is about half as important as a 100% spike, while a 110% spike is twice as important. To see how this weight function affects the histogram, let’s multiply the entire histogram by the weight function. That gives you the plot below.
If you compare this to the unweighted distribution provided earlier, you can clearly see that the weighted distribution gets shifted up by the nonlinear weighting function. The events near 0 are practically worthless, while the events near the top get more valuable. If we zoom in on the top 5% of events again, we get the following plot:
This has clearly increased the value of the higher-magnitude spikes compared to the lower-lying spikes. Mission accomplished, perhaps?
Well, no, not quite. There are a few problems with this.
First, note that the few events occurring near 1.35 on the x-axis are still not worth very much – far less than the thousands of events that occur at ~0.85. Those events at 0.85 are going to happen – that’s around 1% of all events in that bin alone. I’m not sure they should have so much more weight than the far more dangerous ones at the top.
Second, if you calculate the weighted average of the difference between the haste and mastery data, we get a number that suggests that mastery is better than haste. But our qualitative assessment gave us exactly the opposite answer! What’s going on here? Did we screw something up?
The result seems paradoxical at first until you look a little deeper and realize what’s happening. With this value of $h$ we’re still very, very sensitive to “edge effects.” You get a different answer if you look at the top 5% of events than if you look at the top 4%, top 3%, top 7%, etc. To illustrate that, here are the differential figures for the top 3%, 4%, 5%, and 7% of all events:
You can see that the last bin on the left is generally the largest one, and while these plots are unweighted, the problem is that the number of events in that last bin tends to increase in size faster than the weight function dies off. To quantify that further, let’s actually calculate some stat weights. Below is a table of the stat weights calculated this way for different cutoffs, starting from the top 1% and going to the top 10% most damaging attacks, along with a final row where we include 100% of attacks (i.e. all-inclusive) for reference. Note that since we’re considering stat weights, a bigger number means a better smoothing stat. I’ve also properly adjusted for the fact that we’re subtracting hit and expertise instead of adding them, which just requires an inversion (e.g. −5922 becomes 5922).
hdf=2.00, N=200, vary pct

pct     Stam   Hit    Exp    Haste  Mast   Dodge  Parry
0.010   1885   4481   3586   1507    712    688    680
0.020   3471   5448   4361   1986   2127    977    877
0.030   2947   5922   4710   2206   1724   1129    998
0.040   4013   6309   5070   2388   1547   1233   1034
0.050   4290   6652   5406   2597   2884   1358   1149
0.060   4290   6652   5406   2597   2884   1358   1149
0.070   3636   6843   5651   2715   2409   1427   1230
0.080   4907   6946   5643   2886   3173   1617   1433
0.090   4907   6946   5643   2886   3173   1617   1433
0.100   4814   7055   5740   2921   2188   1630   1436
1.000   4627   7309   6001   3257   2923   2096   1906
As you can see, the values fluctuate a lot as we change the percentage of attacks we consider. If you go down the haste and mastery rows, you see that they’re swapping places depending on which row you look at. That’s not good. Ideally they would be fairly consistent from row to row. Even if the values change (which is fine), the relative value of the two shouldn’t. But because that last bin is so important, our results depend heavily on what exactly we choose as a cutoff, even though the events near that cutoff are the least important!
Apart from the oddity with haste/mastery, the order is generally Hit>Exp>Stam>(Haste/Mast)>Dodge>Parry. This is about what we expected, so that part at least is good. Dodge beats out parry because of diminishing returns, as in these gear sets dodge is diminished much less. Hit and expertise are both very strong, just as we know they are.
The bottom row includes 100% of events, so it sums the entire histogram. This tends to inflate the value of dodge and parry more. In general, the more exclusive you make the metric, the worse dodge and parry do because they trade the presence of worse spikes for lower average damage taken. Why? Well, when you restrict your view to just the top X% of events, those worse spikes really hurt dodge and parry. As you increase the percentage of events being considered though, dodge and parry start to perform better because it starts adding value to the large mass of events in the middle of the unweighted distribution.
This all leads to the third problem: the choice of bin size matters. I’ve made the plots with 100 bins (spanning 0% to 200% health, so each bin is 2% of health wide), but the data table above uses 200 (bins 1% of health wide). You get slightly different answers with the exclusive metrics because changing the number of bins sometimes shifts a chunk of events into or out of the part of the data you’re considering. That’s not good because, again, the events near the bottom of the region we’re considering are supposed to be the least valuable ones, and thus have the smallest influence on the distribution. Being so sensitive to edge effects introduces this artificial dependence on bin size. And we don’t want to get significantly different results just because we altered the bin size slightly.
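To see where the edge sensitivity comes from, here’s a sketch (Python, with invented names and toy numbers – not the actual sim data) of the “top X% of events” truncation. The walk down from the top can only stop at a bin boundary, so whichever chunk of events the cutoff bin holds gets included or excluded wholesale – exactly the cutoff and bin-size dependence described above:

```python
def top_pct_score(counts, centers, pct, h=3.0):
    """Weighted sum over the highest-damage bins holding roughly the top pct of events.

    counts[i] is the number of events in the bin centered at centers[i],
    where centers are spike sizes as a fraction of max health.
    """
    total = sum(counts)
    included = 0
    score = 0.0
    # Walk downward from the highest-damage bin until pct of events is covered
    for x, c in sorted(zip(centers, counts), reverse=True):
        if included >= pct * total:
            break  # everything below the cutoff bin is ignored entirely
        score += c * h ** (10 * (x - 1))
        included += c
    return score

# Toy histogram: 9000 events total, so "top 5%" targets 450 events, but the
# walk overshoots to 1000 because it can only stop at a bin edge
centers = [0.5, 0.7, 0.9, 1.1, 1.3]
counts  = [5000, 3000, 900, 90, 10]
print(top_pct_score(counts, centers, 0.05))  # ~840.0
```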
There are a few ways to try and fix this, but the most obvious to me was to increase the HDF. That increases the weight discrepancy between lower-damage bins and higher-damage bins, making the lower bins less valuable and the higher bins more valuable. So I repeated the calculation for an HDF of 3:
hdf=3.00, N=200, vary pct

pct     Stam   Hit     Exp    Haste  Mast   Dodge  Parry
0.010   3126   8488    6824   2259   1318    537    781
0.020   4320   9221    7414   2622   2383    752    926
0.030   3954   9529    7642   2762   2086    847   1001
0.040   4548   9747    7845   2864   1984    905   1021
0.050   4670   9919    8014   2970   2647    968   1080
0.060   4670   9919    8014   2970   2647    968   1080
0.070   4373   10006   8126   3024   2418    999   1117
0.080   4879   10048   8122   3094   2711   1077   1200
0.090   4879   10048   8122   3094   2711   1077   1200
0.100   4844   10091   8159   3108   2331   1082   1201
1.000   4772   10194   8260   3226   2590   1221   1342
Much better-looking. The variation with percentage is smaller, though we’re still seeing edge effects. If you go down the haste and mastery rows you’ll see mastery jump around relative to haste a good bit. But the increased HDF has created a larger ramp on the weight function, making those edge effects smaller overall. The other interesting thing here is that the 100% (all-inclusive) version is a pretty good average – we’re still seeing the approximate 3-to-1 ratio of haste-to-avoidance value in the 100% row that we’re seeing in the 5% row. This all-inclusive version has another perk – it isn’t subject to edge effects at all, so it’s quite a bit more robust.
Just for completeness, here are the weighted histograms we get with $h=3$:
Of course, 2 and 3 are pretty arbitrary choices for the HDF. To get the best metric, we really need to nail down the best value for $h$. So the next step is to explore what happens when we change $h$ while keeping the percentage of events fixed. When we do this for the top 5%, 10%, and 100% of events, we get the results below:
pct=0.05, N=200, vary hdf

hdf   Stam   Hit     Exp     Haste  Mast   Dodge   Parry
1.5   4836   6256    5193    2735   3668    1566   1261
1.6   4660   6258    5170    2680   3443    1511   1227
1.7   4523   6301    5183    2640   3259    1465   1201
1.8   4419   6384    5228    2614   3108    1425   1179
1.9   4343   6501    5303    2600   2984    1390   1163
2.0   4290   6652    5406    2597   2884    1358   1149
2.1   4259   6836    5537    2602   2803    1328   1139
2.2   4247   7051    5695    2616   2738    1299   1130
2.3   4251   7297    5881    2637   2689    1270   1123
2.4   4271   7574    6094    2666   2652    1240   1118
2.5   4305   7883    6335    2701   2627    1207   1113
2.6   4353   8223    6607    2743   2613    1170   1108
2.7   4414   8596    6908    2790   2608    1130   1103
2.8   4488   9003    7242    2844   2613    1083   1097
2.9   4573   9443    7610    2904   2626    1030   1089
3.0   4670   9919    8014    2970   2647     968   1080
3.1   4779   10431   8455    3041   2676     897   1069
3.2   4900   10982   8936    3119   2713     815   1056
3.3   5032   11571   9460    3202   2757     720   1039
3.4   5177   12201   10029   3292   2809     611   1018
3.5   5333   12873   10645   3388   2868     486    994
3.6   5501   13588   11312   3490   2935     344    964
3.7   5681   14349   12033   3599   3009     182    929
3.8   5874   15157   12812   3715   3090      -2    887
3.9   6080   16014   13651   3837   3179    -210    839
4.0   6299   16922   14554   3967   3276    -444    783
4.1   6531   17883   15526   4105   3382    -707    719
4.2   6777   18899   16571   4251   3495   -1002    646
4.3   7038   19971   17692   4404   3617   -1331    562
4.4   7314   21103   18895   4566   3748   -1698    468
4.5   7605   22295   20184   4737   3888   -2105    362
pct=0.10, N=200, vary hdf

hdf   Stam   Hit     Exp     Haste  Mast   Dodge   Parry
1.5   5949   6996    5796    3333   2441    2068   1790
1.6   5601   6903    5698    3201   2363    1949   1689
1.7   5327   6869    5649    3098   2301    1849   1606
1.8   5111   6887    5642    3020   2252    1765   1538
1.9   4943   6950    5673    2962   2215    1693   1482
2.0   4814   7055    5740    2921   2188    1630   1436
2.1   4719   7199    5839    2895   2170    1573   1397
2.2   4653   7380    5969    2881   2161    1521   1364
2.3   4611   7597    6131    2879   2159    1472   1336
2.4   4592   7848    6323    2887   2165    1424   1312
2.5   4593   8134    6546    2904   2177    1375   1291
2.6   4611   8455    6801    2929   2195    1326   1272
2.7   4647   8810    7088    2963   2220    1273   1254
2.8   4698   9201    7409    3004   2251    1215   1236
2.9   4764   9627    7766    3052   2288    1152   1219
3.0   4844   10091   8159    3108   2331    1082   1201
3.1   4937   10592   8591    3170   2379    1003   1182
3.2   5045   11131   9064    3239   2433     914   1161
3.3   5165   11711   9580    3315   2494     813   1138
3.4   5298   12333   10141   3398   2560     699   1111
3.5   5444   12997   10751   3487   2633     569   1081
3.6   5604   13705   11412   3584   2711     422   1046
3.7   5776   14460   12128   3688   2797     255   1006
3.8   5962   15262   12901   3799   2889      67    961
3.9   6161   16114   13736   3917   2988    -144    908
4.0   6374   17016   14635   4043   3094    -382    849
4.1   6601   17972   15603   4177   3208    -648    781
4.2   6843   18984   16644   4319   3329    -946    705
4.3   7099   20052   17762   4469   3458   -1278    619
4.4   7370   21180   18962   4628   3596   -1647    522
4.5   7658   22369   20248   4796   3742   -2056    413
pct=100.00, N=200, vary hdf

hdf   Stam   Hit     Exp     Haste  Mast   Dodge   Parry
1.5   5269   7122    6009    3834   3753    3033   2764
1.6   5122   7133    5982    3692   3552    2793   2538
1.7   4971   7139    5953    3557   3361    2578   2339
1.8   4834   7165    5941    3437   3191    2393   2169
1.9   4719   7220    5956    3337   3045    2233   2026
2.0   4627   7309    6001    3257   2923    2096   1906
2.1   4558   7434    6077    3194   2823    1978   1805
2.2   4511   7596    6186    3149   2742    1874   1720
2.3   4485   7793    6328    3118   2678    1781   1648
2.4   4478   8027    6501    3101   2630    1696   1587
2.5   4489   8297    6708    3097   2595    1616   1534
2.6   4516   8603    6948    3103   2573    1540   1488
2.7   4558   8946    7222    3120   2563    1464   1447
2.8   4616   9324    7531    3147   2562    1387   1410
2.9   4687   9740    7876    3182   2572    1307   1375
3.0   4772   10194   8260    3226   2590    1221   1342
3.1   4870   10686   8683    3278   2617    1130   1310
3.2   4981   11218   9148    3339   2652    1029   1277
3.3   5105   11791   9657    3406   2695     918   1243
3.4   5241   12406   10213   3482   2746     794   1208
3.5   5390   13065   10817   3565   2805     657   1169
3.6   5552   13768   11474   3656   2872     502   1128
3.7   5727   14518   12185   3754   2946     330   1081
3.8   5916   15316   12954   3861   3028     136   1030
3.9   6117   16164   13785   3975   3118     -81    973
4.0   6332   17064   14681   4097   3215    -323    909
4.1   6561   18017   15646   4227   3321    -593    837
4.2   6804   19025   16684   4366   3436    -895    757
4.3   7062   20091   17799   4513   3559   -1230    667
4.4   7335   21216   18997   4670   3690   -1602    567
4.5   7624   22403   20281   4835   3831   -2014    455
There are a few effects happening here.
First, a small HDF tends to keep the scaling between stats small, which means they’re more vulnerable to edge effects. We can see this from the first two tables, as haste and mastery swap positions from one table to the other. We already sort of knew that, but it’s good that the data reaffirms that expectation.
A really high HDF tends to increase that gap dramatically, which reduces edge noise.
On the other hand, a high HDF does something strange to dodge and parry. As you can see, the stat weights of dodge and parry both plummet, and the dodge value even goes negative at the high end.
Your first thought might be to rationalize these results. For example, dodge and parry tend to give a wider distribution, allowing for larger spikes but reducing overall damage taken. But remember, we’re not comparing a dodge/parry gear set to a control gear set here, we’re adding 1000 dodge or parry to it. There should be absolutely no circumstance where adding 1000 dodge actually makes your smoothness metric worse! It should always make it better, just not necessarily as good as other stats.
In fact, it’s a different issue entirely. If you look at the source data, adding 1000 dodge reduces spike presence in every category except one: the very top one, 140%, where all of a sudden we have a 0.001 instead of a pure zero. This isn’t because adding dodge suddenly created higher-damage spikes; it’s simulation noise. That’s what’s causing the huge drop in dodge’s value here, and a similar (though less pronounced) effect is happening in the parry data. This is bad: it essentially means that our HDF is so large that it’s become super-sensitive to simulation noise.
So we’re stuck between a rock and a hard place. If the HDF is too high it causes simulation-noise problems but reduces our edge-effect issues. If it’s too low we have the reverse: edge-effect noise problems, but low sensitivity to simulation noise in the higher categories. One solution is to just simulate longer and reduce the noise, but that’s not a great solution either. Ideally, we want to pick an HDF somewhere in the middle, so that we get good data fidelity and low sensitivity to noise without having to simulate for hours at a time.
For the tables where we look at only the top 5% and 10% of spike events, it’s clear that $h=2$ is too low. 2.5 is on the border of what I’d deem acceptable, but even that is a little volatile. On the high end, 3.5 is pretty good, but is right on the borderline of the region where simulation noise is a problem. But probably our ideal value lies somewhere between 2.5 and 3.5. Nominally, I’m going to say it’s around 3.0, because that data seems pretty consistent between both tables. But I have to do a more detailed analysis of this range to really feel comfortable picking a final value.
However, there’s also another solution to consider. The table that uses 100% of the data eliminates boundary issues entirely because it includes all of the data, and the weight function makes sure that each category is worth less as we work our way from higher to lower damage ranges. That makes it more stable at low $h$-values than either of the percentile-based versions. I’d want an HDF larger than 2, because if it’s too low it risks watering down the strength of small changes near the top end and overvaluing avoidance. But it’s even passable at $h=2$ in this table, unlike the percentile versions.
From considering all of this data, I’m leaning towards defining the metric using an all-inclusive histogram with a moderately high HDF, probably around $h=3$. However, we have a lot of further testing to do before we settle on final values. And of course, we’ll want to implement some sort of normalization scheme such that the values are independent of the number of iterations we use. I’ll detail all of that in the next two blog posts.
What’s in a name?
However, before we end, I want to offer a parting thought about the greater applicability of this metric, and suggest a name for it. While I’ve framed this entire discussion in terms of subtracting histograms from one another, the way you would likely do this in a computer is a little different. The distributive property tells us that $a(b-c) = ab - ac$. That means it doesn’t matter whether we subtract the two histograms first and multiply by the weight function afterwards, or vice versa.
This is actually really good news, because it means you could multiply the weight function times the histogram for one experimental configuration (e.g. gear set, talent combination, glyphs, etc.) to get a single number. Then you could repeat that process for any other gear set or configuration you want and get a new number. And you’ll get a unique number for each configuration that describes the smoothness of your damage intake under those particular conditions.
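As a sketch (using the same hypothetical weight function as before, with toy counts rather than real sim data), the per-configuration index and the scale factors are just:

```python
def smoothness_index(counts, centers, h=3.0):
    """Single smoothness index for one configuration; lower is smoother."""
    return sum(c * h ** (10 * (x - 1)) for x, c in zip(centers, counts))

# By the distributive property, subtracting two indices is identical to
# weighting the difference histogram directly
base     = [200, 20, 2]      # toy counts in bins centered at the values below
plus_hst = [195, 18, 1]      # hypothetical "+1000 haste" counts
centers  = [0.9, 1.1, 1.3]

scale_factor = smoothness_index(base, centers) - smoothness_index(plus_hst, centers)
direct = sum((b - p) * 3.0 ** (10 * (x - 1))
             for b, p, x in zip(base, plus_hst, centers))
# scale_factor and direct agree (up to floating-point rounding)
```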
There’s a bit of arbitrariness to this, in that the weight function is whatever we decided to come up with. However, that’s OK, since we’re not trying to give a measurable in-game number like DPS or DTPS. As an analogy, think about the stock market. The Dow Jones Industrial Average (DJIA) is a stock market index, which is just a weighted average of the stocks of 30 large companies. It’s used to reflect the performance of the entire stock market, but the companies chosen are pretty much arbitrary. Even if some sort of formula is used to choose them, that formula is essentially arbitrary, and there are other indexes that use different formulas (S&P 500, Russell 2000, etc.) and give slightly different impressions of the overall health of the market.
Another good analogy is your credit score. Your credit score is calculated by an algorithm, which is somewhat arbitrary and chosen to model creditworthiness. But your credit score isn’t an estimate of some measurable value. Having a credit score of 750 doesn’t tell you exactly what APR you’ll get on your mortgage or how large a loan you can take out, even though it affects those things. But it’s also very clear that a score of 750 is better than a score of 600. And there are multiple ways of calculating credit score with different ranges, all trying to convey the same rough information about your borrowing risk.
And that’s really what we’re doing if we come up with a unique number for each simulation configuration: producing an index. You sim your gear set and you get a number out that tells you how smooth your damage intake is. You can resim with a different gear set to see if it improves. In this case, that means the number goes down, because just like DTPS a lower number is better with this type of smoothness metric. And you can calculate scale factors, which is just a fancy way of saying “sim once with a baseline gear set, then sim again with +1000 haste, then with +1000 mast, etc., and subtract each of those indices from the baseline value.”
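The scale factor recipe in that last sentence can be sketched in a few lines of Python. Here `simulate_index()` is a made-up stand-in for a full SimC/MATLAB run, and the stat coefficients are arbitrary toy numbers chosen only to show the mechanics of “sim baseline, sim +1000 of each stat, subtract”:

```python
def simulate_index(gear):
    """Toy stand-in for a full sim; returns a smoothness index (lower = better).

    The coefficients below are invented for illustration only.
    """
    base = 10000.0
    return (base
            - 0.8 * gear.get("mastery", 0)
            - 0.6 * gear.get("haste", 0)
            - 0.2 * gear.get("stamina", 0))

baseline = {"haste": 12000, "mastery": 1500, "stamina": 28000}
index0 = simulate_index(baseline)

scale_factors = {}
for stat in ("haste", "mastery", "stamina"):
    bumped = dict(baseline)
    bumped[stat] += 1000
    # Positive scale factor = the stat lowered the index (improved smoothness).
    scale_factors[stat] = index0 - simulate_index(bumped)

# With this toy model, mastery gives the largest smoothness gain per 1000 rating.
assert scale_factors["mastery"] > scale_factors["haste"] > scale_factors["stamina"]
```

Note the subtraction is baseline minus bumped, so that a positive scale factor means “this stat made the index go down,” matching the golf-rules convention.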
In short, it’s exactly what an index should be: a solid number with a very clearly-defined calculation method that you could compare to any other configuration that uses the same boss mechanics. That’s really the only constraint here: the boss’ attacks need to be identical for two different indices to be comparable. You can’t compare an index generated with Hogger to one generated by Lei Shen and expect to get anything useful out of it. We’ll go into more detail on that point in the third post in this series, later this week.
But it will be clear that a result of 100 is a lot smoother than a result of 1000. It may not be clear exactly why until you look at the details – maybe one boss was Hogger and the other was Lei Shen, or maybe it was a gearing difference, or who knows what. Either way, it will be clear that the tank with the result of 100 wasn’t in much danger, while the tank clocking in at 1000 was. If we choose our overall normalization factor properly, we’ll be able to do this very accurately. For example, a smoothness value larger than 1000 would clearly be dangerous, while one below 500 wouldn’t be, or something like that. Again, we’ll talk about the details of that normalization process later this week.
The concept of the DJIA came up while Mel and I were discussing this, so my first thought was that we should call this the Theck-Meloree Industrial Smoothness Average as a bit of a joke. However, it struck me that there was a much more natural name that was less of a mouthful: the Theck-Meloree Index. Which, of course, would be abbreviated TMI. Which is not only amusing, but also eerily accurate based on the amount of background work we’re presenting here to develop it.
So yes, you heard it here first. We are going to get tanks to compare their TMIs. And it will be glorious.
Good read.
Incidentally I can’t wait for the first “/2 Tank LFR SoO, have achiev and TMI of X”.
It’ll take a while depending on how the metric is marketed and eventually understood by SimCraft users, but it will happen eventually. Not likely in this expansion though since it’ll take many months before this will be widespread. I’m just hoping people can be smart enough to realize that setting an arbitrary number for that isn’t going to work so well, particularly with how different tanks have different mechanics that make a number across all tanks untenable.
My biggest curiosity is still how Blizz will react to this kind of metric coming to all tanks eventually and what they’ll do with dodge and parry.
I’m actually really curious about one point you mentioned: how TMI will vary from tank to tank. I would not be surprised if there was some pretty serious variation based on class.
Regarding Blizzard’s response: I wouldn’t expect much at first. They may have their own internal smoothness metrics that they check while balancing tanks (or not – it’s clear they put a lot more emphasis on DTPS than smoothness based on their opinion of Dodge/Parry). But if the metric becomes widespread enough, maybe it will encourage them to give more weight to smoothness arguments.
Perhaps, however I’ve been playing long enough to realize that portions of the community seem to thrive on the touting of certain metrics as a means of deciding who is “good enough” to join in with them.
With the new FlexRaid option I wouldn’t be surprised at someone latching on to this as a way to decide on who can join it for a weekly server raid that they are running, without fully understanding what the term means, only that the lower you can get the better.
It would go the same way that “gear score” did in BC and Wrath, and “item level” does currently. The difference is that this concept is actually good at judging a tank’s theoretical durability, rather than being a one-size-fits-all metric for every class, which gives the number more weight as a representation of real data.
I look forward to getting to test it out in SimC myself on my gear and set up, but at the same time I will still chuckle to myself when I see a post like that in trade chat, when someone is looking for a tank in the future.
At least TMI is a simulation-based metric, meaning that it’s not something they can easily intuit from an addon like GearScore or read off of the armory. So unless they’re going to sim your character to find out your actual TMI, they’ll have no idea whether you’re lying or telling the truth.
That’s probably the best reason that TMI won’t take off as a PUG filter statistic. It’s just not as easy to verify as ilvl or GearScore.
I would hope that you are right, however I have been in a guild, for about a month, that had a “gear score” officer whose job in the guild was to verify that everyone was “qualified” to join a particular raid group.
Needless to say the guild didn’t last long, but I still wouldn’t put it past people to sim others’ characters to make sure they are up to their standards. The morality of doing that is up for discussion, but it isn’t out of the realm of possibility, and really isn’t too much of a time commitment, depending on how many people you are running with, and how often you get new people.
I guess time will tell.
I don’t think that simming characters for TMI will be anywhere likely to happen.
See, the TMI that you’d receive, would depend on the rotation and what not. If the tank uses a different rotation, or prefers Shield Block over Shield Barrier, or something, the TMI value has just become entirely useless.
@Thels: You do not expect that such a “gear score” officer would care about such nuances, right? In my eyes, that person is already disqualified simply by actively performing such a function in the guild.
The HDF weighting of bins would end up being pretty much synonymous with varying the bin size, wouldn’t it?
Personally I tend to think in terms of variable bin sizes. Anything under (e.g.) 0.5 is just one big lump of “meh”, up to 0.8 gets into “now i’m starting to pay attention”, et cetera.
This is one of those posts I’m going to have to reread a couple of times to digest =) – looking forward to more.
They’re actually different degrees of freedom. Even in your variable bin size example, the relative weight of the 0.5 bin (which could even be zero) to the 0.8 bin to the 0.9 bin and so on is determined by HDF (or a similar degree of freedom).
I have maybe an odd question. Where do I begin to learn how to understand this? I did trig and precalc in high school, then technical engineering (12T MOS) and radiation safety in the Army, which was mostly more advanced geometry, trig and calc, no algebra above high school that I can remember. This kind of math is way over my head, but still intriguing.
There’s nothing in this post in particular that requires more complicated math than trig/precalc. If you’re familiar with the exponential function (i.e. $e^x$) and the natural logarithm ($\ln(x)$), and understand the basics of bar charts (histograms) and sums, you’ve got the background required. There will be a tiny bit of calculus in parts 2 and 3, but nothing worth agonizing over.
I spent the afternoon on Wikipedia, which I’ve been using a lot to try and keep up, and reacquainted myself with natural log, along with a few tangents. Get it? Eh? Alright. What made it really click into place was realizing that in your graphs you were referring to damage taken as percent of health, not percent of health remaining after damage. Seems so obvious in retrospect.
Interestingly, I think, I’ve typed 5 different questions into this box, scrolling back and tabbing around to form them precisely, and just the act of doing so has led me to the answers.
I guess I’m realizing the decay of having been out of school for a decade, with only the Army’s notoriously subpar technical standards to meet. Seriously, outside of radiation safety (which was very important!), most of my job was basic construction surveying, finding the optimal moisture content of a soil sample, and slump tests on concrete. Not very brain-challenging tasks. Use it or lose it for sure.
Wikipedia has some nice stuff of course, but you might be able to get more out of intmath.com since it has more detailed explanations as well as a forum.
When Theck starts posting actual formulas, I usually just skim past them, because he could just as well have typed them in Chinese.
However, he does a good job of explaining what the formula is for. For example, the 1x multiplier at 100%, 0.5x at 90% and 2.0x at 110% makes sense to me. I understand the logic and reasoning behind it.
So I don’t think you have to understand the math, as long as you understand the logic that the math is supposed to represent.
@ Jack, thanks for the Intmath.com tip. I like it! While Wikipedia is very, very detailed and informational, it’s just not very educational by itself. I’ve always seen it as a good supplement to education. Intmath.com seems organized in a more educational, progressively advanced context.
@ Thels, I can’t personally disconnect understanding the math from understanding the logic the math represents. I think understanding one necessitates the other. This could be because I’ve always viewed math as basically an extension of language, and I always want to hear and understand everything someone says, not just part.
Either way, my brain certainly takes a deep breath of relief when it sees “Summary” or “Conclusion” in bold.
I don’t have much to add to the discussion beyond saying that this is a fascinating look into developing metrics for theorycrafting, and I really enjoyed it.
Great post Theck. Having seen your pre-implementation in SimC, the most interesting part was your elaboration of the health decade factor.
The good thing in SimC is that we can measure statistical noise, including that of the TMI metric. As discussed, my tests resulted in a pretty high variance. That would make the TMI more costly to create than other measurements ( dps, dtps ), but it should still be feasible with a bit of patience.
Bin sizes don’t matter in SimC, as you take the numbers directly from the distribution, without creating intermediate or merged histograms.
Looking forward to part two and three.
I ran some tests today with the latest trunk version, which includes your fix to my threading mistake, and the variances seemed very reasonable. I was getting very good estimates even with a 10k-iteration sim; with 50k iterations they were very stable:
http://www.sacredduty.net/wpcontent/uploads/2013/07/10k_iterations.png
http://www.sacredduty.net/wpcontent/uploads/2013/07/50k_iterations.png
This was with my testbed “standard boss,” which has only one action:
actions=auto_attack,damage=1000000,attack_speed=1.5
I’m going to do some fooling around to see how much (if any) magic damage I want to introduce into that. I also need to figure out how to invert the scale factors (since TMI uses “golf rules” – lower is better).
The comment about bin sizes is a spoiler for part 3.
Perhaps to make TMI get bigger you just do it as 1/TMI or something as the end result. Although maybe having it become lower isn’t such a bad idea, but it might need a name change (sorry!) to convey the notion that the character takes less damage or is less susceptible to spikes.
The other common tank metric uses golf rules too (DTPS). I don’t think it will be a big deal once people understand it.
Those scale factors look pretty stable yes. I did my tests with the default enemy settings ( including spell nuke and dot ) on the Paladin_Protection_T15N.simc profile, and that resulted in a bit more variance. I still think the variance of the TMI directly correlates to how dependent the TMI is on the tail end of the distribution.
Is there a special reason to have inverted scale factors? I tried to make sure that negative scale factors are properly displayed in the graphics, so that shouldn’t be a problem. And if lower is better for a metric, so be it. Export links, e.g. to wowhead, which assume more is better, are already inverted automatically in SimC (had the same problem with dtps).
Just for consistency’s sake. If DTPS already uses negative scale factors and export links are inverted appropriately, it’s probably fine asis.
Would it be possible to apply the TMI to a WoL parse? It would be interesting to compare performance against simulation, but is the sample size too low from a single fight?
Yes, it would. If you extracted all of your damage and selfhealing events from the log, you could perform the TMI calculation. That said, the sample size will be very small, so the error bounds on the estimate will be relatively large.
Still, it might be interesting to run it on the entire log of a series of 20+ wipes during progression just to see how useful it is.
Hey Theck, look at the 2nd and 3rd graphs in this post for me. In the 2nd graph, it looks like Mastery is totally missing from the first data point (leftmost, near 0.85). Is this because Mastery was 0? And then in the 3rd graph, it looks like you’ve ADDED like a thousand to the Haste value, as if Haste-Mastery was 2xxx - (-1xxx) = 3xxx. That is a RAW histogram, right? Cause it’s labeled with HDF=3.00, but even if it weren’t raw, the value should have decreased, not increased, since the weights are <1 below 100% of max HP… so I was totally confused about that. Is it from a different sim entirely, maybe?
If so, you may want to look at where you were showing the difference between the cutoffs of top X% of spike events, because there's the same graph with hdf=3.00 whereas all the others have 2.00… Yet in the 7% and 10%, the 3k+ spike is still there. I'm not really sure what it is, but something goofy is going on there.
Of note, forgot to mention, was concerned also because hdf=0.00 in graphs 1 and 2, then suddenly 3.00.
The “hdf=0” on those plots is irrelevant, because neither of them are weighted. I just hardcoded “h=0” into the title to remind myself they weren’t weighted. Apparently I forgot to do that for the binned bar plots (like #3), which is also a raw histogram. Sloppy title management on my part, sorry.
As for your other question: it’s not mastery that’s missing, it’s haste. And the reason it’s missing is because of the plot limits. There’s definitely a haste bar in the bin, but because MATLAB’s bar plot system puts one bar to the left of bin center and one to the right, it’s being cut off. That’s probably why you were confused – the large haste-mastery value is because there’s a large haste bar just to the left of the plot limits that’s being cut off.
Oh, you are totally right, ok. I considered that it might have just been cut off but I was reading the graph wrong so I wrongly eliminated that possibility. Derp!
Different question: Is there an intuitive reason why the +1k Haste and +1k Mastery sets are so different for certain bins, causing these edge effects? These aren’t totally different gear sets; they’re C/Ha sets with a bit of stats added. I’m thinking it’s simulation noise combined with the fact that the possible number of attack reductions is discrete, and they all reduce the attack by a fixed amount (unless SoI/SS “absorbs” were not completely eaten, but does that happen often?). e.g. looking at just one attack in an attack string, +Haste will more often reduce a 0.4 attack to a 0.25, but +Mastery reduces it to a 0.23, putting the end sum in a different bin (one which may be impossible/unlikely to get to with C/Ha). Certainly a 0.4 can occur in either case, so some bins should look very similar, but I guess some can look drastically different.
If true, this is possibly another good reason you should remove the histogram middleman as you do in part 3.
What do you think?
Some of it is certainly noise, but these are 10k-minute runs, so I don’t think that’s the dominant issue. I think it’s a combination of the histogram binning and the way mastery works. Boosting mastery by 1k increases your SotR mitigation by a bit less than 2%. As I said in the post, this tends to take that string of 4 attacks, at least one or two of which is SotR-mitigated, and reduce the overall damage by ~1%, which shifts that event to the left on the histogram.
This “shifting” of damage spikes is exactly the smoothing mechanism that we want to get from mastery, of course, but on the histogram that means we’re often shifting over a bin boundary. Since we’re dealing with discrete events, some spike sizes are more likely than others – this sim doesn’t vary the damage of attacks at all, so each attack is limited to full damage, 70% damage (blocked), whatever the SotRmitigated size is, and 70% of that SotRmitigated size. Sacred Shield absorbs and SoI heals do add some variation on that, but those too are fixed values, so in the end you just get permutations of those basic building blocks.
As a result, if we take for example the second figure (raw histogram closeup), the haste set is a lot more likely to land in the 0.95 bin than in the 0.93. But shifting some of that haste to mastery pushes a chunk of those events into the 0.93 bin and empties out the 0.95 bin, creating an imbalance. That’s the major reason that we see differences in certain bins.
And I agree, abandoning the histogram middleman gets rid of a lot of those issues. You’d still have the edge effects if you only considered events over an arbitrary threshold, but those effects are relatively small as long as you include enough of the histogram (and of course, in the end we chose to just keep them all).
Yeah, that’s what I was trying to say. Your response is more complete though 😀 Glad you agree. I don’t know what else might cause that. I agree that the arbitrary threshold is dangerous in any case.
Off-topic: BTW, Mistweaver Monk main; our prot pally tank forwarded me here since I’m his “mathy friend.” Interesting posts, and you frankly seem pretty brilliant. I’m finishing an M.S. in Electrical Engineering this coming semester. Do you actually have a PhD as indicated by your twitter name? If so, in what? Just curious
Yes, my Ph.D. is in Optics – i.e. laser physics, telecommunications and quantum communications systems, photon counting, that sort of stuff.
Cool! I’m studying Control Theory, which is pretty general in the sense that it’s used in all parts of Science/Tech/Eng/Math, so I wouldn’t be surprised if you knew stuff about it. Thus, I’m kind of interested how you answered the question of “When should a Prot Paladin hit SotR?” and I will probably check out your earlier blog posts about it soon. I noticed your graphs (at least for the C/Ha set?) have “finisher=SH1” so I assume you played around with a couple rotations.
Usually in Controls, we examine the problem of “How can we control our state to our liking using the most efficient input?”, assigning a cost to the input of some kind. But here, the amount of HoPo you have is also a state, and the input is whether or not to spend it (discrete input, not continuous), which is quite a bit different. Also, the main state (last x attacks summed) probably can’t really be expressed as a clean diff eq here. I’m curious if there are many similarities, as a result… Problems like that have come up in my classes, but usually they’re a bit contrived, and the standard tools don’t usually work.
(See post below first)
Actually I guess it wouldn’t be terribly hard to express the problem with difference equations (discrete time control problem). Continuous time with integrating delta functions at various time points seems unnecessarily complicated.
Er, above*. How do I edit posts :/ Sorry! I guess it still thinks I’m a guest?
I think I’m the only one that can edit posts, actually.
One of my colleagues has a Ph.D. in Control Systems (focusing on robotics), so I have a passing, if limited, familiarity with the concept.
The rationale behind the shifting queues can be found in this post:
http://www.sacredduty.net/2013/03/19/controlshiftentier/
While I didn’t cast it in the form of a differential or difference equation, you’re absolutely correct that it could be. We’re essentially evaluating the state of the system represented by your holy power and recent damage timeline, with a moderately complicated decision function.