## TC101: How Stats Are Calculated

Primary attribute calculations seem like they should be a pretty simple topic. So simple, in fact, that most players don’t even think about how they’re done. Most theorycrafters don’t either until they try to write a spreadsheet that models a character and notice that their math doesn’t work out.

I wonder how many times the following scenario has been re-enacted over the past few years:

Okay, my Paladin has 1455 base strength and 2378 strength from gear, for a total of 3833. With the 5% bonus for wearing plate armor, that should give me $3833\times 1.05=4024.7$. The 5% stats raid buff should raise that to $3833\times 1.05\times 1.05=4225.9$. My character sheet only gives an integer, so it should round that to 4226. Right?

4224 is not equal to 4226

Wait, what??

As it turns out, stat calculations are one of the more convoluted things in the game, and I suspect that (until now) few theorycrafters have modeled all the nuances with complete accuracy. Blizzard tosses a few floor() and round() functions into the mix at seemingly arbitrary places, which makes it tougher to reverse engineer. Over the past month or so, I’ve been collecting data from beta and working on determining exactly where and how the stat calculations are rounded.

This is a Theorycrafting 101 article because this process is a great example of the sort of thing I spoke about in part 1: starting with a basic model and adding complexity until all the details work. After we go over the formulas for calculating stats, we’ll go step-by-step through the process I used to test and determine the formulas.

Primary Attribute Formulas

Here’s how your character sheet attributes are calculated. First, we define some conditional values.

$\text{match} = \begin{cases}1.05 & \text{if armor matches class} \\ 1.00 & \text{otherwise} \end{cases}$

$\text{epicurean} = \begin{cases} 2 & \text{if pandaren} \\ 1 & \text{otherwise} \end{cases}$

$\text{alchemy} = \begin{cases} 2 & \text{if alchemist} \\ 1 & \text{otherwise} \end{cases}$

$\text{multiplier}$ is the total multiplier from buffs and other effects. So for example, if the only buff you have active is Blessing of Kings,

$\text{multiplier} = \begin{cases} 1.05 & \text{for STR/INT/AGI} \\ 1.00 & \text{otherwise} \end{cases}$

Similarly, Fortitude would be a $\text{multiplier}$ of 1.10, Guarded By The Light would give 1.25, and so on. Multiple effects are multiplicative, so a protection paladin with Fortitude active would have a stamina multiplier of $1.10\times 1.25=1.375$.

We then calculate the total base $B$ and gear $G$ contributions before multipliers:

$B = \text{race_base}+\text{class_base} + \text{heroic_presence} + \text{endurance}$

\begin{align} G = \text{gear_stat} &+ \text{round}[~\text{food_stat}~] \times \text{epicurean} \\ &+ \text{round}[~\text{flask_stat}~] \\ &+ \text{round}[~\text{potion_stat}~] \\ &+ \text{round}[~\text{trinket_proc_stat}~] \end{align}

And generate a “composite” value $C$ that incorporates the matching multiplier:

$C = \text{floor}[~G\times \text{match}~]+B\times \text{match}$

Finally, your character sheet mouseover tooltip reads:

Strength CS_Total ( CS_Base + CS_Bonus ),

with:

$\text{CS_Total} = \text{floor}[~C \times \text{multiplier}~ ]$
$\text{CS_Base} = \text{floor}[ ~B \times \text{match}~ ]$
$\text{CS_Bonus} = \text{CS_Total} - \text{CS_Base}$
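For readers who would rather see this as code, here is a minimal Python sketch of the formulas above. It is not Simulationcraft's actual implementation: the function and argument names are illustrative, exact `Fraction` arithmetic is used to sidestep floating-point error right at integer boundaries, and `round_half_up` assumes that exact ties round upward, which none of the tests in this post establish.

```python
from fractions import Fraction
from math import floor


def round_half_up(x):
    """Round to the nearest integer; ties rounding up is an assumption."""
    return floor(Fraction(x) + Fraction(1, 2))


def character_sheet(base, gear, match, multiplier, food=0, epicurean=1, buffs=()):
    """Return (CS_Total, CS_Base, CS_Bonus) for one primary attribute.

    base  -- B: race_base + class_base + heroic_presence + endurance
    gear  -- gear_stat only; food and other flat buffs are passed separately
    buffs -- flask, potion, and trinket proc values, each rounded on its own
    Pass non-integer values as strings (e.g. "1.05") so they stay exact.
    """
    match, multiplier = Fraction(match), Fraction(multiplier)
    B = Fraction(base)
    G = Fraction(gear) + round_half_up(food) * epicurean
    G += sum(round_half_up(b) for b in buffs)
    C = floor(G * match) + B * match          # composite value
    cs_total = floor(C * multiplier)
    cs_base = floor(B * match)
    return cs_total, cs_base, cs_total - cs_base


# Ret paladin with Kings: B=1455, G=2378, match=1.05, multiplier=1.05
print(character_sheet(1455, 2378, "1.05", "1.05"))  # (4224, 1527, 2697)
```

The printed tuple matches the 4224 total from the opening example, split into 1527 base and 2697 bonus strength.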

Building The Model

Now that we’ve got the math out of the way, let’s see how we determined it. First of all, let’s go back to our original example. If we log on to the beta PvP server and create a new level 100 Paladin, this is what we get:

When I grow up, I want to run a camel farm.

As you can see, the character has 1455 base strength and 2378 “bonus” strength. Since we haven’t chosen a spec yet, and we’re completely unbuffed, these values should properly reflect our total base strength and strength from gear, respectively. We can double-check this, because Celestalon gave us the full list of base stats for each class ($\text{class_base}$) as well as the racial base stat modifiers ($\text{race_base}$). A human’s $\text{race_base}=0$ and a paladin’s $\text{class_base}=1455$, so it’s quite clear that our character sheet’s giving us the correct base value. Likewise, you could go through and add up all the strength on each piece of gear to confirm that the sum is 2378. Together, that makes the 3833 total given on the character sheet.

In other words, so far we’ve got the following skeletal formulas:

$B = \text{class_base}+\text{race_base}$
$G = \text{gear_stat}$
$\text{CS_Base} = B$
$\text{CS_Bonus}=G$
$\text{CS_Total}=B+G$

Those aren’t final, of course – we’re going to be adding to them and correcting them as we go.

This is also useful because it means when we go to test other classes, we can look at the unbuffed values before we choose a spec to grab our $B$ and $\text{gear_stat}$ values.

Now let’s choose a spec. When we spec Retribution, we get the 5% increase to strength from the Armor Skills passive. Which gives us slightly different numbers:

We can respec her. We have the technology. We can make her …stronger…faster.

The only thing we’ve changed is to add a multiplier of 1.05 thanks to the armor specialization passive. Yet, as you might have guessed from our earlier example, our new value is not just 1.05 times our un-specced value of 3833. $3833\times 1.05 = 4024.7$, yet our character sheet reads 4023! Let’s see what’s going on here.

First, let’s consider the base value. We started with 1455, and $1455\times 1.05=1527.8$. The character sheet reads 1527, though, which tells us that the character sheet is taking the result of the calculation and applying a floor() function. In fact, this isn’t much of a surprise – it’s been known for a while that all of the values on the character sheet are floored rather than rounded. The game still uses the full-precision values when it does calculations though, so you don’t have to worry about stat points being “wasted” due to rounding/flooring.

Similarly, the 2378 strength from gear has become 2496 bonus strength. If we check the math, $2378\times 1.05 = 2496.9$. So the bonus strength is also being floored, not rounded. Our total strength is just the sum of the two floored values, which tells us that some of this flooring is happening before it’s displayed on the character sheet, otherwise we should have 4024 strength, as mentioned earlier. This is also useful information: since we know that the three character sheet values are linked through basic addition, we really only need to find correct formulas for two of them, and the third will fall into place automatically.

Now, none of this is really news. These details have been known for some time, since they’re very easy to stumble across. But this is how someone, at some point, had to go about determining them.

So now we can update our skeleton formulas slightly to incorporate the new details:

$B = \text{class_base}+\text{race_base}$
$G = \text{gear_stat}$
$\text{CS_Base} = \text{floor}[ ~B \times \text{match}~]$
$\text{CS_Bonus}=\text{floor}[~ G\times \text{match}~]$
$\text{CS_Total}=\text{CS_Base}+\text{CS_Bonus}$

The trick here is that there’s some ambiguity about those floor functions. For example, according to this, our $\text{CS_Total}$ could be expressed as:

$\text{CS_Total} = \text{floor}[~B\times \text{match}~] + \text{floor}[~G\times \text{match}~],$

but in reality, we’d get the same result from either of the following formulas as well:

$\text{CS_Total} = \text{floor}[~\text{floor}[~B\times \text{match}~] + G\times \text{match}~]$
$\text{CS_Total} = \text{floor}[~B\times \text{match} + \text{floor}[~G\times \text{match}~]~]$

So we don’t really know which is correct yet. The one thing we can rule out is having no floors.
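A quick way to see that ambiguity is to evaluate each placement with the Ret numbers. This is just a sketch using exact fractions; the dictionary labels are shorthand, with `m` standing for $\text{match}$.

```python
from fractions import Fraction
from math import floor

B, G, match = 1455, 2378, Fraction("1.05")

candidates = {
    "floor(B*m) + floor(G*m)": floor(B * match) + floor(G * match),
    "floor(floor(B*m) + G*m)": floor(floor(B * match) + G * match),
    "floor(B*m + floor(G*m))": floor(B * match + floor(G * match)),
    "floor(B*m + G*m)": floor(B * match + G * match),  # display floor only
}
for name, value in candidates.items():
    print(f"{name:26} -> {value}")
# The first three print 4023 (matching the character sheet); the last prints 4024.
```

All three placements with an inner floor agree with the sheet, so this data set alone cannot tell them apart.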

Incorporating Multipliers

Now let’s apply Blessing of Kings to see how that interacts with these formulas:

It’s good to be the King.

Our first guess might have been that Kings would work the same way as the matching multiplier. In other words, our character sheet base should be $1455\times 1.05\times 1.05=1604.1$, or 1604. But it’s clear that’s not how it works, because our base value hasn’t changed – it’s still 1527. Likewise, if we treat the bonus strength contribution the same way, we’d get $2378\times 1.05\times 1.05 = 2621.7$, which is far too low. Something else is happening here.

If we naively take our total strength before Kings (4023) and multiply by the Kings modifier, we get $4023\times 1.05=4224.2$, which is exactly right after applying a floor function. Interesting! This actually gives us a hint as to how to proceed, but I’m going to blithely ignore it for instructional purposes.

That detail does make it pretty clear that base strength is affected by Kings, though. But the game’s accounting adds that extra strength to the green bonus strength value rather than the white base strength value in the tooltips.

For the moment, let’s go with our earlier (and unstated) assumption that the game calculates things the way we have, such that “base” and “bonus” strength are calculated independently and then summed to get the total. We’ll find out shortly if this is a good assumption or not. The bonus strength value is then, roughly speaking,

$G\times \text{match}\times \text{multiplier} + B\times \text{match}\times (\text{multiplier}-1)$

with the caveat that there may be some floors going on in there. We can turn that “may” into a “must” by checking the math:

$2378\times 1.05 \times 1.05 + 1455\times 1.05\times 0.05 = 2698.1$

which should give us a bonus strength value of 2698, not 2697. So we know that in order for this formulation to be correct, there has to be a floor happening somewhere before we get to the value that the game floors to show on the character sheet. We can also rule out any uses of round() here, because that would always give 2698.

This is the same ambiguity I spoke of earlier – we knew that the character sheet values were being floored, but we weren’t 100% sure where. Including the Kings multiplier here has clarified that there needs to be at least one extra floor() floating around in our hypothetical formula, but doesn’t tell us exactly where. So we have to try them all, and see if we can rule any of them out.

There are four logical scenarios that use only one floor (in addition to the final floor used to display the value on the character sheet). They are:

$F1=\text{floor}[~G\times \text{match}~]\times \text{multiplier} + B\times \text{match}\times (\text{multiplier}-1)$
$F2=\text{floor}[~G\times \text{match}\times \text{multiplier}~] + B\times \text{match}\times (\text{multiplier}-1)$
$F3=G\times \text{match}\times \text{multiplier} + \text{floor}[~B\times \text{match}~]\times (\text{multiplier}-1)$
$F4=G\times \text{match}\times \text{multiplier} + \text{floor}[~B\times \text{match}\times (\text{multiplier}-1)~]$

If we put our data into this, with $G=2378$, $B=1455$, $\text{match}=1.05$, and $\text{multiplier}=1.05$, they give us the following results:

$F1=2697.19$
$F2=2697.39$
$F3=2698.10$
$F4=2697.75$

This rules out $F3$, because the final floor() on the character sheet would leave that as 2698, which is wrong. But this test can’t distinguish between $F1$, $F2$, and $F4$. Any of those could still be correct. Unfortunately, that’s all the information we can extract from our premade Ret paladin.

We learn a little more if we swap specs to protection. In protection spec, we get a 5% increase to stamina from the Armor Skills passive along with the 25% increase to stamina from Guarded By The Light. Starting with $B=890$ base stamina and $G=3250$ from gear, and using $\text{multiplier}=1.25$, the formulas give us:

$F1=4498.63$
$F2=4498.63$
$F3=4499.13$
$F4=4498.63$

And looking at the character sheet….

One of these things is not like the others.

Uh oh. The only one that worked was $F3$, which we’ve already ruled out with our ret paladin. So this tells us that none of those formulas are correct. And increasing the number of floors doesn’t help, because it just makes the values even smaller, which won’t satisfy the protection data. For example,

$F5=\text{floor}[~G\times \text{match}\times \text{mult}~]+\text{floor}[~B\times \text{match}~]\times (\text{mult}-1) = 4498.50$

Still no dice. We can’t add round functions either, because then the ret data isn’t satisfied. As it turns out, we’re going about this the wrong way. Our original hypothesis – that the game calculates the base and bonus values individually and sums them to get the total – has to be false.
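The elimination above can be replayed in a few lines of Python. This is a sketch under stated assumptions: exact fractions stand in for the game's arithmetic, and the observed protection bonus of 4499 is inferred from the fact that only $F3$ matched the character sheet rather than being quoted in the text.

```python
from fractions import Fraction
from math import floor


def bonus_candidates(B, G, match, mult):
    """Evaluate the hypothetical bonus-stat formulas F1..F5."""
    m, k = Fraction(match), Fraction(mult)
    return {
        "F1": floor(G * m) * k + B * m * (k - 1),
        "F2": floor(G * m * k) + B * m * (k - 1),
        "F3": G * m * k + floor(B * m) * (k - 1),
        "F4": G * m * k + floor(B * m * (k - 1)),
        "F5": floor(G * m * k) + floor(B * m) * (k - 1),
    }


# label: (B, G, match, multiplier, bonus value on the character sheet)
data = {
    "ret strength": (1455, 2378, "1.05", "1.05", 2697),
    "prot stamina": (890, 3250, "1.05", "1.25", 4499),  # inferred, see above
}

survivors = None
for label, (B, G, match, mult, observed) in data.items():
    values = bonus_candidates(B, G, match, mult)
    passing = {name for name, v in values.items() if floor(v) == observed}
    print(label, sorted(passing))
    survivors = passing if survivors is None else survivors & passing

print("consistent with both:", sorted(survivors))
# ret strength ['F1', 'F2', 'F4', 'F5']
# prot stamina ['F3']
# consistent with both: []
```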

This isn’t really news, because theorycrafters have been using an alternative formulation for years. But it’s a good example of coming up with a hypothesis, testing it, and ultimately ruling it out, which is something that happens all the time in theorycrafting. Now that we’ve done so, let’s see if we have more luck with another hypothesis.

A Change Of Approach

The traditional approach goes something like this: rather than trying to calculate base and bonus strength individually, let’s try deriving correct formulas for the base and total values. Then the bonus value will be determined using basic subtraction. As we’ll see, this approach is much more successful.

Just as we did for bonus strength, we’ll construct formulas for total strength using $B$, $G$, $\text{match}$, and $\text{multiplier}$. We also know there has to be at least one floor in there somewhere based on our armor matching modifier test.

There are six obvious ways to do this using one or two floor functions:

$T1 = \text{floor}[~G\times \text{match}~]\times \text{multiplier}+B\times \text{match}\times \text{multiplier}$
$T2 = \text{floor}[~G\times \text{match}\times \text{multiplier}~]+B\times \text{match}\times \text{multiplier}$
$T3 = G\times \text{match}\times \text{multiplier}+\text{floor}[~B\times \text{match}~]\times \text{multiplier}$
$T4 = G\times \text{match}\times \text{multiplier}+\text{floor}[~B\times \text{match}\times \text{multiplier}~]$
$T5 = \text{floor}[~G\times \text{match}~]\times \text{multiplier}+\text{floor}[~B\times \text{match}~]\times \text{multiplier}$
$T6 = \text{floor}[~G\times \text{match}\times \text{multiplier}~]+\text{floor}[~B\times \text{match}\times \text{multiplier}~]$

We could also come up with two more methods using two floors, but they would require that we be flooring before the $\text{multiplier}$ in one term but not in the other, which seems unlikely. If none of these hold up, we’ll revisit that idea.

Plugging in our ret paladin stats of $G=2378$, $B=1455$, $\text{match}=1.05$, and $\text{multiplier}=1.05$, we get the following results:

$T1=4224.94$
$T2=4225.14$
$T3=4225.10$
$T4=4225.75$
$T5=4224.15$
$T6=4225.00$

So right out of the gate, we can cross off four of the formulas (2, 3, 4, and 6). Only $T1$ and $T5$ give a value consistent with the character sheet. And there’s something notable about those two formulas: we can factor out $\text{multiplier}$ in each of them:

$T1 = \left (~ \text{floor}[~G\times \text{match}~]+B\times \text{match}~ \right )\times \text{multiplier}$

$T5 = \left (~ \text{floor}[~G\times \text{match}~]+\text{floor}[~B\times \text{match}~] ~ \right ) \times \text{multiplier}$

That observation will come in handy later. For now, we need to figure out which one of these two formulas is correct.

It’s worth noting that $T5$ is what’s frequently been used for calculating base stats in most theorycrafting works. This is what Simulationcraft has been using throughout MoP, for example. It’s pretty close, and generally gives you answers that are correct.

But as we noticed while working on the WoD code, on some rare occasions, it’s off by one. You can see why, as well: Let’s say that $B=919$, with $\text{match}$ and $\text{multiplier}$ both equal to 1.05. Then $B\times \text{match}=964.95$. If we just multiply by $\text{multiplier}$ like we do in formula $T1$, we’d get a subtotal of 1013.20. But if we floor that 964.95 before multiplying, as $T5$ does, we get 1012.20. The two formulas would give answers that differ by one, with $T1$ giving a slightly higher value than $T5$.

So to test this discrepancy, we need to find a character with just the right amount of stat bonus. This gets easier the higher $\text{multiplier}$ is, so let’s try with our prot paladin, whose $\text{multiplier}=1.3750$ for stamina once we apply the 10% stamina buff.  Again using $B=890$, $G=3250$, and our $\text{match}=1.05$, we get:

$T1 = 5976.44$
$T5 = 5975.75$

Demonstrating the “off-by-one” error. And what does the character sheet say?

Conclusive Evidence!

We’ve already shown earlier that this value has to be floored rather than rounded, so this removes any ambiguity. $T1$ is the only formula that’s still standing. We now know conclusively how our total stats are calculated, at least so far.
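The same kind of bookkeeping works for the six total-stat candidates. The sketch below assumes exact fraction arithmetic and takes the observed protection stamina to be 5976, which is inferred from $T1$ being the formula left standing rather than quoted directly.

```python
from fractions import Fraction
from math import floor


def total_candidates(B, G, match, mult):
    """Evaluate the hypothetical total-stat formulas T1..T6."""
    m, k = Fraction(match), Fraction(mult)
    return {
        "T1": floor(G * m) * k + B * m * k,
        "T2": floor(G * m * k) + B * m * k,
        "T3": G * m * k + floor(B * m) * k,
        "T4": G * m * k + floor(B * m * k),
        "T5": floor(G * m) * k + floor(B * m) * k,
        "T6": floor(G * m * k) + floor(B * m * k),
    }


ret = total_candidates(1455, 2378, "1.05", "1.05")    # sheet shows 4224
prot = total_candidates(890, 3250, "1.05", "1.375")   # sheet shows 5976 (inferred)

ret_ok = {name for name, v in ret.items() if floor(v) == 4224}
prot_ok = {name for name, v in prot.items() if floor(v) == 5976}

print(sorted(ret_ok))            # ['T1', 'T5']
print(sorted(ret_ok & prot_ok))  # ['T1']
```

Note that the protection data on its own would leave five candidates alive; it is only decisive in combination with the Ret data, which had already narrowed the field to $T1$ and $T5$.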

Since we can factor the $\text{multiplier}$ out, it makes some sense to define a “composite” subtotal $C$ to make the math a little easier. In other words, we define it such that

$C = \text{floor}[~G\times \text{match}~]+B\times \text{match}$,

and then our character sheet base and total values are just

$\text{CS_Base} = \text{floor}[~ B\times \text{match}~]$
$\text{CS_Total} = \text{floor}[~C\times \text{multiplier}~]$

And of course, $\text{CS_Bonus} = \text{CS_Total}-\text{CS_Base}$. This gives you the bulk of the formulation provided in the beginning of this post.

The Nitty-Gritty Details

We’re not finished yet, though. Because even after this, we were able to observe some odd off-by-one errors due to certain special effects. For example, how are flat stat-bonus buffs like potions, trinket procs, flasks, and food incorporated into this? What about racial effects like Endurance, Heroic Presence, and Epicurean? So I set out to do some more testing.

Endurance was the easiest to test. Consulting our handy tables, all of the classes have 890 base stamina, and taurens get a racial modifier of +1. So before we choose a spec, our tauren paladin test subject (lovingly named Testbeef) should have 891 stamina. Instead…

There’s some extra beef here…

Endurance gives 197.055 stamina at level 100. The tooltip is, as usual, floored, but we can get an exact value directly from the spell data using Simulationcraft’s spell_query tool:

SimC to the rescue.

It’s clear from this that Endurance is being counted as base stamina by the game (890+1+197 = 1088). In other words, we can update our formula for $B$:

$B = \text{race_base}+\text{class_base}+\text{endurance}$

And it’s fairly simple to confirm that all of the other formulas work out to accurate values as given.

Next up is Heroic Presence, so we create a draenei paladin. It isn’t hard to determine how this works:

Heroic base strength.

The tooltip for Heroic Presence reads 65, but spell_query reveals that the actual value is 65.25. Draenei get a 1-point racial strength modifier, so it’s pretty clear that our base strength is just 1455+1+65=1521, meaning that Heroic Presence is also added into $B$:

$B = \text{race_base}+\text{class_base}+\text{endurance}+\text{heroic_presence}$

There’s one caveat here, which is that we can’t know for sure from this data whether Endurance or Heroic Presence are being floored before they’re added to $B$. Both of them have low enough decimal values that it would be unlikely to matter and difficult to test. Heroic Presence’s value was 130.5 several beta builds ago, and I was able to confirm that it isn’t being rounded (though that was unlikely anyway). But I couldn’t rule out a floor(). To do so now, we’d need modifiers such that $0.25\times \text{match}\times \text{multiplier} > 1$ to test Heroic Presence (or $0.055\times \text{match}\times \text{multiplier}>1$ for Endurance), which we don’t have. So it probably doesn’t matter very much, but it’s worth noting in case we find a situation where we can explicitly test that.
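To put rough numbers on how far out of reach that is, take $\text{match}=1.05$ and solve each condition for the multiplier:

$\text{multiplier} > \frac{1}{0.25\times 1.05} \approx 3.81$ (Heroic Presence)
$\text{multiplier} > \frac{1}{0.055\times 1.05} \approx 17.3$ (Endurance)

Both are well beyond the largest multiplier used in these tests (1.375).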

Epicurean was the most interesting test, because first we had to figure out how stat buffs worked. For example, is the amount of stat given by food floored or rounded? It turns out that by being clever, we can test both at once.

First, we searched through to find some foods that would be useful to us. The two we ended up using were Serpent Brew of Serenity and Hearty Elekk Steak. Serpent Brew has an intellect bonus of 24 according to the buff it grants. However, if we check the spell data…

Sneaky Sneaky

… it apparently gives a buff of 23.816 intellect. Which is a strong case for flat stat bonus buffs being rounded, not floored, and making them the exception to just about everything else. We can double-check this by rolling a level 100 monk, using Zen Pilgrimage to visit Master Chang at the Peak of Serenity, and testing this:

I took off all of poor Foodtest’s gear and didn’t give her a spec to make this easier to see. I’m sort of a jerk like that.

This tells us that the buff is definitely rounded, not floored. Because there are only two ways to get 48 intellect from a 23.816-intellect buff on a pandaren, and neither of them involve a floor:

$F1 = \text{round}[~\text{food_stat}\times 2~]$
$F2 = \text{round}[~\text{food_stat}~]\times 2$

To figure out which one we have, we employ the new Hearty Elekk Steak, which is conveniently available from Savage Flaskataur, Esq. in Stormwind or Orgrimmar. This food is almost magical, in that the spell data says it grants a 187.49999 stamina buff. Even spell_query rounds this to 187.5, which led to a little confusion until we sussed out the issue. But what it means for us is that if $F1$ is correct, we’ll see a buff that’s 375 stamina on our pandaren, but if formula $F2$ is correct it will only be 374. And what is it?

The steaks are low.

So apparently, food buffs are rounded, and Epicurean is applied after the rounding occurs. It’s also fairly easy to test that these are affected by $\text{match}$ and $\text{multiplier}$ just like gear contributions are, so we can fold this directly into our definition of $G$:

$G = \text{gear_stat} + \text{round}[~\text{food_stat}~] \times \text{epicurean}$
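Here is a short Python sketch of why the Serpent Brew couldn't settle the order of operations but the Hearty Elekk Steak could. The `round_half_up` helper is an assumption about how exact ties behave; neither food value is a tie, so the comparison doesn't depend on it.

```python
from fractions import Fraction
from math import floor


def round_half_up(x):
    """Round to nearest integer (ties up is an assumption, unused here)."""
    return floor(Fraction(x) + Fraction(1, 2))


foods = {
    "Serpent Brew of Serenity": "23.816",
    "Hearty Elekk Steak": "187.49999",
}

results = {}
for name, value in foods.items():
    double_then_round = round_half_up(Fraction(value) * 2)  # F1
    round_then_double = round_half_up(value) * 2            # F2
    results[name] = (double_then_round, round_then_double)
    print(name, results[name])
# Serpent Brew of Serenity (48, 48)
# Hearty Elekk Steak (375, 374)
```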

Testing the others follows a similar process. Spell data tells us that a Flask of the Earth adds 170.926 stamina, but in-game both the tooltip and character sheet show that it grants 171 stamina. So flasks are clearly rounded as well. The Alchemy profession buff has already been turned off, so we can safely ignore that. A Potion of Mogu Power grants 455.793 strength according to spell data, but shows up as 456 in-game, suggesting that potions are also rounded. Testing with a few different trinket procs also showed that they were all rounded, not floored.

The one thing we haven’t shown yet is whether these effects are rounded individually or after-the-fact. The Epicurean data strongly suggests each is rounded independently, but a more rigorous test would be nice. To do so, we need a pair of buffs that give a different value when rounded separately than together – e.g. both ending between 0.25 and 0.49, or both ending in 0.5 or greater with fractional parts that sum to less than 1.5.

Digging through the spell data, we come up with Beer Basted Crocolisk, which grants 35.724 strength and stamina, and Flask of Steelskin, which grants 178.62 stamina. If we use both items, we should get 215 stamina if each is rounded individually, but only 214 stamina if they’re rounded after being added together. A few unlucky crocolisks later, we have our confirmation:

Only about four crocolisks were harmed during the filming of this test.
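The check behind this test is easy to script. In this sketch, `distinguishes` is a hypothetical helper that reports whether a given pair of buff values would tell the two rounding schemes apart; it uses exact fractions and assumes ties round up (no ties occur in these values).

```python
from fractions import Fraction
from math import floor


def round_half_up(x):
    """Round to nearest integer (ties up is an assumption)."""
    return floor(Fraction(x) + Fraction(1, 2))


def distinguishes(a, b):
    """True if rounding a and b separately differs from rounding their sum."""
    a, b = Fraction(a), Fraction(b)
    return round_half_up(a) + round_half_up(b) != round_half_up(a + b)


crocolisk, steelskin = Fraction("35.724"), Fraction("178.62")
individually = round_half_up(crocolisk) + round_half_up(steelskin)
together = round_half_up(crocolisk + steelskin)
print(individually, together)  # 215 214

# Flask of the Earth (170.926) paired with the same food would not have helped:
print(distinguishes("35.724", "170.926"))  # False
```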

This is the last piece of the puzzle, and finishes confirming the equations given in the beginning of this post.

Closing Thoughts

You can see that we’ve done a pretty exhaustive job of testing edge cases to make sure our formula for base stats works correctly in all circumstances. If we were content to be accurate to within one point of stat, we could have stopped very early on. But it was important to me that the stats Simulationcraft spits out match your character sheet all the time. Ultimately that lack of accuracy reflects poorly on the sim, even if “off-by-one” errors have no significant effect on the overall simulation results.

And as you’ve now seen, trying to cover all of those edge cases often takes some careful thought about what those cases are and how you can test them. Often it means ruling out all but one hypothesis, and sometimes you need very specific items or gear combinations to distinguish between different hypotheses. Most combinations of food+flask wouldn’t reveal the difference between the two rounding schemes proposed; we had to seek out a very specific pair based on the numerical constraints of the problem.

But that’s the sort of work theorycrafters do – wading through the minutiae to build models that are as accurate as possible. And sometimes, even an “easy” task like calculating player attributes ends up having a ton of details that you never thought about before.


## TC101: Testing Simulationcraft

In the last two installments, we talked about what it means to theorycraft and spent some time discussing experimental design. Today, we’re going to talk about how Simulationcraft fits into that picture.

Simulationcraft is a numerical model of the game and its mechanics. It’s a fairly powerful theorycrafting tool, much like a good spreadsheet, but significantly more flexible. The downside of that flexibility is that the learning curve is a little steeper than using a spreadsheet. And unfortunately, a lot of players don’t really understand how to use that tool properly, leading them to mistakenly conclude that the tool isn’t very good.

As a beginner theorycrafter, there are two primary ways Simulationcraft may fit into your work. The first is as a contributor helping to improve SimC’s modeling. You may find yourself performing in-game tests to determine mechanics, and then comparing those tests to similar experiments in Simulationcraft to verify that SimC has the mechanics coded properly. Note that this doesn’t require any knowledge of C++, just enough familiarity with the program to tweak a character profile or action priority list.

The second way it may fit into your work is the obvious (and more common) complement: taking advantage of that model by using it to discover new techniques and determine optimal play patterns. This could be tweaking action priority lists to find the best rotation for a given circumstance, testing different gear sets to find a “best in slot” arrangement, or estimating the true value of a glyph, talent, or set bonus. In other words, using the model to answer the sorts of specific questions that come up in the course of optimizing your character.

In this blog post, we’ll address comparing in-game experiments to Simulationcraft outputs. A tool is only useful if you can trust that it produces accurate results, and while that’s a good assumption for actively-maintained class modules, it may not be good for ones that have sat dormant for some time. In a future blog post, I’ll talk more about using Simulationcraft for discovery and optimization.

When we want to validate Simulationcraft results, what we’re really doing is designing and performing a pair of experiments. One is our in-game experiment, which gives us the measuring stick for our SimC output. If the SimC output deviates significantly from what we observe in-game, then something is pretty clearly wrong.

But we’re also designing a second experiment, which is the simulation itself. Just as we do with the in-game experiment, we have control over the gear, talents, glyphs, and other character properties for the simulation. We also have control over the experimental procedure by way of the action priority list. Simulationcraft takes care of the data collection for us, so we only need to worry about analysis.

If you’ve never used Simulationcraft, there’s a pretty good (if slightly out-of-date) Starter’s Guide on the wiki. As an aside, this is another thing that we could really use help with and doesn’t require coding knowledge: people dedicated to keeping that wiki up-to-date for the benefit of new users.

Dissection of a Simulationcraft Profile

As an example, let’s consider a particular character profile. The following profile or “simc file” is a mock-up of a tier 17 normal-mode profile from Simulationcraft’s Warlords of Draenor development branch. It uses the gear that a level 100 pre-made paladin has on beta, so it’s easy to make exactly this character for testing purposes.

paladin="Paladin_Protection_T17N"
level=100
race=blood_elf
role=tank
position=front
professions=Blacksmithing=600/Enchanting=600
talents=http://us.battle.net/wow/en/tool/talent-calculator#bZ!201121.
glyphs=focused_shield/alabaster_shield/divine_protection
spec=protection

# This default action priority list is automatically created based on your character.
# It is a attempt to provide you with a action list that is both simple and practicable,
# while resulting in a meaningful and good simulation. It may not result in the absolutely highest possible dps.
# Feel free to edit, adapt and improve it to your own needs.
# SimulationCraft is always looking for updates and improvements to the default action lists.

# Executed before combat begins. Accepts non-harmful actions only.

actions.precombat+=/food,type=chun_tian_spring_rolls
actions.precombat+=/seal_of_insight
actions.precombat+=/sacred_shield,if=talent.sacred_shield.enabled
# Snapshot raid buffed stats before combat begins and pre-potting is done.
actions.precombat+=/snapshot_stats
actions.precombat+=/mogu_power_potion

# Executed every time the actor is available.

actions=/auto_attack
actions+=/arcane_torrent
actions+=/holy_avenger,if=talent.holy_avenger.enabled
actions+=/divine_protection
actions+=/guardian_of_ancient_kings
actions+=/eternal_flame,if=talent.eternal_flame.enabled&(buff.eternal_flame.remains<2&buff.bastion_of_glory.react>2&(holy_power>=3|buff.divine_purpose.react|buff.bastion_of_power.react))
actions+=/eternal_flame,if=talent.eternal_flame.enabled&(buff.bastion_of_power.react&buff.bastion_of_glory.react>=5)
actions+=/shield_of_the_righteous,if=holy_power>=5|buff.divine_purpose.react|incoming_damage_1500ms>=health.max*0.3
actions+=/judgment
actions+=/avengers_shield
actions+=/sacred_shield,if=talent.sacred_shield.enabled&target.dot.sacred_shield.remains<5
actions+=/holy_wrath
actions+=/execution_sentence,if=talent.execution_sentence.enabled
actions+=/lights_hammer,if=talent.lights_hammer.enabled
actions+=/hammer_of_wrath
actions+=/consecration,if=target.debuff.flying.down&!ticking
actions+=/holy_prism,if=talent.holy_prism.enabled
actions+=/sacred_shield,if=talent.sacred_shield.enabled

# Gear Summary
# gear_strength=2407
# gear_stamina=3248
# gear_crit_rating=1242
# gear_haste_rating=370
# gear_mastery_rating=591
# gear_armor=4366
# gear_parry_rating=9
# gear_multistrike_rating=701
# gear_versatility_rating=224

If you’re new to Simulationcraft, it’s worth spending a few minutes discussing how profiles work. I’ll give a brief overview, but there is much more thorough documentation available on the Simulationcraft Wiki. A simc file is just a text file with the “.simc” extension – you can open it in your favorite text editor (I generally use Notepad++ on Windows, but the built-in Notepad application works just fine). Each line in the file tells the simulation one piece of information it needs to operate. For example,

paladin="Paladin_Protection_T17N"

tells the sim that we’re defining a new paladin (called an “actor” in SimC lingo) and we want to name him “Paladin_Protection_T17N.” If we wanted to, we could change that line to paladin="Bob" and the sim would work exactly the same, but our paladin would suddenly be named Bob. Likewise, subsequent lines tell the sim that Bob is a level 100 blood elf tank with Blacksmithing and Enchanting professions. It continues to specify talents, glyphs, and spec.

The lines that start with a pound sign (#) are called comments. These are lines that are for informational purposes only, to help explain what’s going on. The simulation skips over them entirely when interpreting the file. This also means that if we want to disable something in the profile, we can put a “#” before that line to make it invisible to the sim.

The next thing the profile specifies is the action priority list, or APL for short.  This is where we specify our experimental procedure, by defining what (and under what conditions) the player will cast. The first section of lines which start with “actions.precombat” define the things we’ll be doing before combat starts, like applying flasks and food, choosing a seal, and pre-potting. This section is only run once, at the beginning of the simulation.

The next section starting with “actions=/auto_attack” is the APL the sim uses during combat (also known as the “default” APL). You might note that the first line starts with “actions=” and the second with “actions+=”; this is an under-the-hood quirk related to C++ and the simulation internals, but it’s worth mentioning briefly. The line “actions=/auto_attack” defines a new text variable (known as a ‘string’ in computer science terminology) that contains “/auto_attack” and nothing else. In C++, “+=” is an operator that means “take the existing value of this variable and add whatever comes after to it.” So for example, in the pair of lines

x=2;
x+=3;

the first line assigns the value 2 to the variable $x$, and the second adds 3 to the value of $x$. After executing both lines, $x$ would contain the value 5.

When using += with strings, it just concatenates the two strings. So the two lines

actions=/auto_attack
actions+=/arcane_torrent

would leave an actions variable that contained /auto_attack/arcane_torrent. This is how SimC handles action priority lists – they’re just long strings of action names and conditions separated by slashes. The practical implication of this is that the very first action on the list has to be defined as actions=/action_name, otherwise the sim won’t know how to parse the input.
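To make the concatenation concrete, here is a minimal Python sketch of the idea. It is not Simulationcraft's actual parser (that lives in the C++ source); the helper name `build_apl` and its parsing rules are assumptions that simply mirror the description above: comment lines are skipped, `=` starts a new string, and `+=` appends to it.

```python
def build_apl(profile_text, key="actions"):
    """Toy model of how `actions=` / `actions+=` lines build one long string.

    Hypothetical helper for illustration only; not SimC's real parser.
    """
    value = None
    for raw in profile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith(key + "+="):
            if value is None:
                # Mirrors the text's warning; raising here is our own choice.
                raise ValueError("first line must use '=' rather than '+='")
            value += line[len(key) + 2:]
        elif line.startswith(key + "="):
            value = line[len(key) + 1:]
    return value


profile = """
# Executed every time the actor is available.
actions=/auto_attack
actions+=/arcane_torrent
"""

apl = build_apl(profile)
print(apl)
print(apl.strip("/").split("/"))  # the individual actions, in priority order
```

Splitting on the slashes at the end recovers the priority-ordered list of actions, which is all an APL really is.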

The final section of the profile defines the character’s gear, one slot at a time. You’ll note that for most of these, we just specify the slot (e.g. “head”) and set it equal to an item descriptor containing the name and item id. A normal profile would also include enchants or gems, but I’ve removed most of these since the pre-made gear doesn’t come with enchants or gems. We don’t need to tell it all of the item stats, as it will reconstruct those stats from the game data based on the item id.

Note that the name of the item isn’t important. We could call each of these items whatever we wanted. The sim will spit out a warning on the report if the names don’t match, but it will dutifully perform the simulation anyway assuming we know what we’re doing. I still recommend writing the item names in however, because the warning is quite useful when you accidentally make a typo in an item id (and thus aren’t using the item you thought you were!).

We can also override the stats on an item, or create an entirely fake item with whatever stats we want on it. One thing I’ll frequently do is abuse the “shirt” slot to tweak a character’s stat. If I want to give the character 10k more mastery and 5k haste rating, I might add a line like

shirt=thecks_shirt_of_haxx,stats=10000mastery_5000haste

to arbitrarily tweak the character’s stats.

Note that the “# Gear Summary” section below is completely irrelevant and unnecessary. Every line starts with a “#” so the simulation completely ignores it. This section is automatically generated, either by the script that puts together this profile or by the code that imports characters from the armory. You’re free to delete it if you don’t want it cluttering up the end of the character profile.

If it looks like a daunting task to put together all of that from scratch, you’re in luck. You can import your character from the armory and Simulationcraft will automatically generate your profile, along with a default action priority list. You can then go hacking away at it from there to make it fit your experiment, as we’ll do shortly. The Starter’s Guide explains how to do that.

However, if you’re on the PTR or Beta, you obviously can’t import from the armory. To help with that, I’ve written an addon that will generate a profile for your character in-game, which can then be copy/pasted into Simulationcraft. The addon is named, as you might guess, Simulationcraft. This is also useful if you want to test a bunch of configurations without having to log in and out repeatedly to update the armory; just change gear, type /simc, and copy/paste the new profile.

Back to Experiments

Now that we know what a SimC character profile looks like, let’s return to the topic at hand. Our profile is essentially the definition of our Simulationcraft “experiment.” Since we want to compare its results against the in-game experiment, the simulation input should model that experiment as closely as possible, and the constraints we placed on the in-game experiment carry over to the simulation input. Thus, all of our earlier discussion about experimental design is equally applicable to designing the simulation input.

For example, we want to try and minimize or eliminate dynamic effects that could compromise our results. We probably don’t want our strength to change during the test, so we wouldn’t be using potions. As such, our profile shouldn’t include pre-potting. We may decide to comment out that line of the profile, as well as any line in the combat APL which used a potion (if there was one). We could also just delete those lines if we’re sure we’ll never use them again – for example, if we’ve saved this as a separate copy somewhere and will only use it for this specific experiment.

Since Primal Gladiator’s Insignia of Victory has a strength proc, we probably don’t want to use it during our testing. So we’d comment that line out in the profile and remove it from our character during the in-game test, just to make sure it didn’t taint our results. The Dancing Steel enchant on the weapon similarly has to go (the premade doesn’t actually have enchanted weapons – I just added this to the profile to illustrate the point). Recall that we talked about making other gear changes in the previous blog post due to versatility on gear. Any other gear changes we make in-game should also be reflected in the profile we feed to SimC.

Likewise, we’re probably not going to bother using flasks or food in our in-game experiment just for convenience. Again, we should comment or remove those lines if that’s the case (and remember: if you remove or comment the first line of the list, you’ll need to change the new first line from actions.precombat+=/ to actions.precombat=/). However, note that there are cauldrons in Shattrath (Outland) on beta that give you full raid buffs and critical strike flask and food buffs. If you plan on using the cauldron, you’d want to modify these lines to reflect that. For reference, they would look something like this:

actions.precombat=/flask,type=greater_draenic_critical_strike_flask
actions.precombat+=/food,type=blackrock_barbecue

edit: It looks like paladins are bugged here and getting critical strike flask/food buffs regardless of spec. Other classes are getting a flask and food buff matching their spec’s secondary stat attunement. Thanks to Megan (@_poneria) for catching this.

Which brings us to another issue: raid buffs. On beta, the cauldrons let you apply the full suite of raid buffs. But you may not always have access to that – maybe you’re testing something on live servers, or just testing in an area that doesn’t have these cauldrons handy, or turning some of them off to specifically test the way one of those buffs interacts with something.

Simulationcraft is designed assuming you’re in a raid and you want all of those raid buffs, including Bloodlust/Heroism. If we want to disable them, we need to tell the simulation that. If you’re using the graphical user interface (GUI), you can toggle each buff on the Options -> Buffs/Debuffs pane. If you want to do it in the simc file, it only takes a single line of code:

optimal_raid=0

That line, usually placed between the character details (level/race/etc.) and the action lists, turns off all of the externally-provided raid buffs, including Bloodlust. You’ll still be able to use any that your class brings as long as you have it in the APL. For example, if we added blessing_of_kings to the precombat action list we’d get the benefit of the 5% stats buff, even if we set optimal_raid=0. Likewise, if we want to enable specific buffs, we can do so using overrides in the code or the checkboxes in the GUI.

By now, it should be clear that we’re going to have to go over the character profile with a fine-toothed comb to make sure it lines up as much as possible with our in-game test. Let’s say that for our in-game test, we’ve decided to attack a boss-level dummy with our level-100 pre-made character. We’ll only use auto-attacks, Crusader Strikes, and Judgments, while in protection spec and without any raid- or self-buffs. We won’t use any glyphs or talents that affect the damage of either spell, and we’ll un-equip our second trinket (which has a strength proc that we don’t want polluting our data).

Looking through the profile, there’s a lot of extra fluff in here that we don’t need. We’re not going to be using Holy Avenger during this test, because it changes the amount of damage Judgment does. Since we’re just testing the damage of a few abilities, we can remove everything not related to those abilities from the action priority list. We’ll also get rid of all of the precombat actions other than applying Seal of Insight, and turn off all external raid buffs with the optimal_raid flag.

There’s one more thing we need to change, though it isn’t obvious or intuitive. By default, Simulationcraft uses the average damage of an ability rather than making an actual damage roll. It does this mostly to save some time, because it executes a little faster. And in a normal simulation, where you’re making lots and lots of damage rolls and running for a few thousand or more iterations, using the average value instead of making individual damage rolls doesn’t have a significant effect on the statistics of the results.

However, for this particular experiment we care a lot about it, because we’re going to want to compare the minimum and maximum damage values of our in-game tests to the values the simulation predicts. So we have to add the line average_range=0 to the profile somewhere.
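To see why this matters, here is a toy Python model of the difference. It is only a sketch, not SimC's code: the function name `simulate_hits` is made up, and I'm assuming damage rolls are uniformly distributed between the minimum and maximum, which is an assumption rather than something taken from the game.

```python
import random


def simulate_hits(lo, hi, n, average_range, rng):
    """Return n hit values; use the range midpoint when average_range is true."""
    if average_range:
        return [(lo + hi) / 2] * n
    # Assumption: each roll is uniform between lo and hi.
    return [rng.uniform(lo, hi) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lo, hi = 2675, 3058     # an illustrative hit range

averaged = simulate_hits(lo, hi, 10000, True, rng)
rolled = simulate_hits(lo, hi, 10000, False, rng)

for label, hits in (("averaged", averaged), ("rolled", rolled)):
    mean = sum(hits) / len(hits)
    print(f"{label}: min={min(hits):.1f} max={max(hits):.1f} mean={mean:.1f}")
```

Both versions give essentially the same mean, but only the rolled version has a minimum and maximum worth comparing to a combat log.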

After doing all of that, Bob’s character profile looks like this:

paladin="Bob"
level=100
race=blood_elf
role=tank
position=front
professions=Blacksmithing=600/Enchanting=600
talents=http://us.battle.net/wow/en/tool/talent-calculator#bZ!201121.
glyphs=focused_shield/alabaster_shield/divine_protection
spec=protection

optimal_raid=0
average_range=0
iterations=50000

# Executed before combat begins. Accepts non-harmful actions only.

actions.precombat=/seal_of_insight
# Snapshot raid buffed stats before combat begins and pre-potting is done.
actions.precombat+=/snapshot_stats

# Executed every time the actor is available.

actions=/auto_attack
actions+=/crusader_strike
actions+=/judgment

off_hand=primal_gladiators_shield_wall,id=111221

Considerably shorter! Note that while I deleted many lines, I simply commented out the second trinket slot, in case I decided I wanted to test with that trinket later.

I’ve also added iterations=50000 to specify how many iterations I want to run (the default value is 1000). In practice, we may as well set our number of iterations high to improve our statistical knowledge of what the simulation is producing, even though we clearly don’t plan on logging several days worth of in-game testing. The more iterations we use, the more likely it is that we hit our extreme minimum and maximum values for each ability.

Now that we’ve got both experiments (in-game and simulation) nailed down, let’s perform both of them and analyze the results.

Collecting Data

The Simulationcraft output generated by this character profile is here. While your usual method of reading a SimC report probably involves spending some time looking at the sections that summarize the overall stats like DPS, HPS, and so on, we’re not that interested in those. We’re going to skip right down to the “Abilities” section, which looks like this:

The Simulationcraft report’s Abilities section. A veritable goldmine of information.

This section gives you a great breakdown of statistics for each ability. It tells you stuff like how much DPS or HPS that ability does, how many times it’s cast per iteration (“Execute”) and the average time between casts (“Interval”), the average hit and crit sizes as well as the average damage per cast (“Avg”), and so on. Most people have at least seen this section before, though you may not have seen the new pretty version (with icons!) that we’ve implemented for WoD.

What many people don’t know, but is crucial to you as a theorycrafter, is that we can get even more information. If you click on the ability’s name, it will expand that section to give you a lot more detail:

Expanding the ability entry gives loads of additional information.

This is a full stats breakdown for that ability. Of most relevance to us is the table that shows the statistics for each possible result of the action. By looking at the row labeled “hit” in the “Direct Results” column, we can see exactly how many of our casts were hits (79.47%) and their minimum, maximum, and mean values for the simulation overall (2675 to 3058 damage).  There’s also plenty of other information here that you might find useful, including a bunch of details about the spell data near the bottom of the expanded section.

If there’s interest, I may write another blog post in the future discussing what all of this stuff is, but for now let’s settle for being able to get our minimum and maximum values from the table. If we expand the sections for Judgment and melee, we find that Judgment’s hit damage ranges from 5523 to 5524, and our melee attacks hit for between 2384 and 2704.

Now let’s look at the results of the in-game test. I smacked around a raid boss target dummy for about five minutes to collect the following data set.  If you go to the “Damage Done” tab and mouse over the bars, you’ll see the breakdown by result type:

The Warcraft Logs ability damage breakdown tooltip.

Here we see that our minimum and maximum melee attacks hit for 2396 and 2702, respectively. We can extract similar limits for Judgment (5524-5525) and Crusader Strike (2685-3052). Now that we have the data we want, let’s analyze it.

Analyzing Data

We can summarize all of our relevant data in a quick table:

Damage Results, Hits Only

| Ability | Min (SimC) | Max (SimC) | Min (Game) | Max (Game) |
|---|---|---|---|---|
| CS | 2675 | 3058 | 2685 | 3052 |
| J | 5523 | 5524 | 5524 | 5525 |
| Melee | 2384 | 2704 | 2396 | 2702 |

The first thing to note here is that for CS and Melee, SimC gives lower minimum bounds and higher maximum bounds. That’s to be expected, because we ran the simulation for a long time, but our in-game test was pathetically short (about 5 minutes). With only 50-100 casts, we just haven’t taken enough in-game data to reasonably expect to hit the boundaries. But it’s good enough to illustrate the basic process.
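We can put a rough number on "not enough data." The sketch below assumes every integer damage value between the minimum and maximum is equally likely (my assumption, not a confirmed game mechanic) and asks how likely we are to see one specific value, such as the true minimum, at least once in a given number of hits. The function name `p_observe_value` is just for illustration.

```python
def p_observe_value(lo, hi, n_hits):
    """Chance that one specific integer value (e.g. the minimum) shows up
    at least once in n_hits rolls, assuming each integer in [lo, hi] is
    equally likely. The uniform assumption is ours."""
    k = hi - lo + 1
    return 1.0 - (1.0 - 1.0 / k) ** n_hits


# Crusader Strike's simulated hit range, at a few sample sizes
for n in (50, 100, 1000, 10000):
    print(n, round(p_observe_value(2675, 3058, n), 3))
```

With a range several hundred values wide, a few dozen casts will rarely land exactly on either boundary, while a long simulation almost certainly will.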

We’d be a bit surprised if our in-game maximum was higher than our simulation maximum, or likewise if the in-game minimum was lower than the simulation minimum. While this could happen, statistically speaking it’s very unlikely for a long sim. That would be a strong indicator that our formula (in SimC) is off somehow, and we’d need to design an in-game experiment to test that. For example, we might have to collect data from a few hundred CS casts at several different AP values so that we can determine the proper AP coefficient.
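That sanity check is easy to automate. Here is a small Python sketch that flags any ability whose in-game range pokes outside the simulated range, using the numbers from the table above. The dictionary layout and the name `out_of_bounds` are just illustrative choices.

```python
# (min, max) hit damage from the SimC report and from the in-game log
results = {
    "CS":    {"simc": (2675, 3058), "game": (2685, 3052)},
    "J":     {"simc": (5523, 5524), "game": (5524, 5525)},
    "Melee": {"simc": (2384, 2704), "game": (2396, 2702)},
}


def out_of_bounds(results):
    """Return abilities whose in-game range is not contained in the sim range."""
    flagged = []
    for name, r in results.items():
        sim_lo, sim_hi = r["simc"]
        game_lo, game_hi = r["game"]
        if game_lo < sim_lo or game_hi > sim_hi:
            flagged.append(name)
    return flagged


print(out_of_bounds(results))
```

Only Judgment gets flagged, because its in-game maximum exceeds the simulated maximum.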

You may have noticed that the Judgment data doesn’t quite agree. Judgment is easy because (again, at least for the moment) it doesn’t have a damage range. If the damage formula the game uses spits out 5524.3, it’ll generate damage values of 5524 and 5525. The game does a floor(result+random(0,1)) to determine how often it uses each, so we can also use the frequency of each result as a debugging tool. Our simulation contains a systematic error in that it’s always off by exactly 1 damage. This could be due to an errant AP or SP coefficient (though Simulationcraft is actually extracting those directly from Blizzard’s spell data) or an errant base damage value (Judgment’s spell data still indicates it has a base damage of 1), or something else entirely.
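Here is a short Python sketch of that floor(result+random(0,1)) rounding, to show why the frequency of each result is informative. The function name `roll_damage` is made up for illustration, and the random number generator is Python's, not the game's.

```python
import math
import random


def roll_damage(expected, rng):
    """floor(expected + U[0,1)): rounds up with probability equal to the
    fractional part of `expected`."""
    return math.floor(expected + rng.random())


rng = random.Random(1)  # fixed seed so the sketch is repeatable
expected = 5524.3       # the example value used above
n = 100000
rolls = [roll_damage(expected, rng) for _ in range(n)]
share_high = rolls.count(5525) / n
print(sorted(set(rolls)), round(share_high, 3))
```

Only two values ever appear, and the share of 5525 results converges on the fractional part (0.3 here). That means the ratio of the two results tells us the decimal part of the underlying damage value.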

One way to check is to do a hand-calculation. The spell data claims that the SP coefficient is 0.5021 and the AP coefficient is 0.6030, and that it does a base damage of 1. You can get all of this information from the game files using Simulationcraft’s spell_query function, shown below (command-line only):

Simulationcraft’s spell_query command and output for Judgment. The base damage and SP/AP coefficients are in Effect #1.

What we call the base value is actually really the “Scaled Value” in the spell data. The default way WoW calculates ability damage is to add the spell power and attack power contributions to the base damage and then apply multipliers, or

$${\rm damage} = ({\rm base\_damage} + {\rm SP\_coeff}*{\rm SP} + {\rm AP\_coeff}*{\rm AP}) * {\rm multipliers}.$$

Judgment is a rare spell that has both an SP coefficient and an AP coefficient – most spells only have one or the other. As for multipliers, we know that the Improved Judgment Draenor perk should boost the damage by 20%. Our versatility will also increase it by 1.72% based on the in-game tooltip (or by hand, 224 rating gives 224/130=1.7231% extra damage, or a multiplier of 1.017231). So if we want to calculate Judgment’s damage by hand, we could multiply all of that together appropriately:

$${\rm damage} = (1 + 0.5021*4095 + 0.6030*4095)*1.2*1.017231 = 5525.25$$

That’s curious. This formula suggests we should be seeing 5525-5526 damage, which is higher than either of our experimental observations. We’re pretty confident in the AP and SP coefficients though, as well as the multipliers that get tacked on. So something else must be going on. By the way, I didn’t just fabricate this error for the blog post – I actually ran into this while writing it up, and ended up spending about 30 minutes figuring out the answer. So you’re witnessing real theorycrafting happening (albeit with a slight time lag, of course).

At this point, we’d probably start trying things. I went into MATLAB and tried variations on that formula, particularly tweaking the way base damage is included since I suspected that to be the source of the error. It turns out that wasn’t the case, because no sane variation matched the damage range and the frequency of each result. Out of fifty casts, we have one 5524 result and forty-nine 5525 results, suggesting that we need to be getting something in the 5524.9ish region from our hand-calculation.
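The "5524.9ish" estimate comes straight from the observed frequencies. As a sketch, assuming the floor(result+random(0,1)) rounding described earlier, the fraction of casts that land on the higher value estimates the decimal part. The helper name `estimate_expected` is hypothetical.

```python
import math


def estimate_expected(low_value, n_low, n_high):
    """Estimate the unrounded damage from how often each of the two adjacent
    integer results was seen, assuming floor(x + U[0,1)) rounding."""
    n = n_low + n_high
    return low_value + n_high / n


n_low, n_high = 1, 49  # one 5524 result, forty-nine 5525 results
estimate = estimate_expected(5524, n_low, n_high)
print(f"{estimate:.2f}")  # 5524.98

# Rough binomial standard error on the fractional part
p = n_high / (n_low + n_high)
std_err = math.sqrt(p * (1 - p) / (n_low + n_high))
print(f"{std_err:.3f}")
```

With only fifty casts the uncertainty on the decimal part is a couple of hundredths, so this is a rough target rather than a precise value.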

Eventually I fired up Visual Studio and started debugging, which led me to notice that it was using 4094 AP during the damage calculations, even though it was reporting 4095 AP in the output. That accounts for the discrepancy between the SimC results and the in-game results, which is great, but it doesn’t explain why the hand-calculation doesn’t match.

However, it gave me a hint as to what was wrong. The character has 3616 strength, and thus starts with 3616 attack power before we apply the multiplier from our mastery. The 13.24% mastery we have increases attack power by that amount, so our net result should be

$${\rm Attack\ Power} = 3616*(1+0.1324) = 4094.7584$$

The character sheet is clearly rounding this up to 4095. Simulationcraft was applying a floor() function to turn it into 4094, at least for damage calculations. But neither of those give the observed damage range, as we’ve seen. The solution seems obvious here – what if attack power isn’t an integer? Let’s try that calculation one more time using the full decimal value of 4094.7584:

$${\rm damage} = (1 + 0.5021*4094.7584+ 0.6030*4094.7584)*1.2*1.017231 = 5524.9284$$

Aha! That perfectly fits the range we observed in-game. Most of the time, we’ll get 5525, but once in a rare while we’ll get 5524. In the experiment, that’s exactly what we observed. So not only have we validated Judgment’s damage formula, we’ve also discovered that our attack power and spell power values aren’t integers, they’re floating point values!
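If you'd rather not punch this into a calculator every time, the hand-calculation is only a few lines of Python. This is a sketch using the coefficients and multipliers quoted above. The constant and function names are my own, and I'm plugging the same value in for spell power and attack power, as the formulas above do.

```python
BASE_DAMAGE = 1
SP_COEFF = 0.5021
AP_COEFF = 0.6030
PERK_MULT = 1.2        # Improved Judgment
VERS_MULT = 1.017231   # 224 / 130 = 1.7231%, rounded as in the text


def judgment_damage(ap, sp):
    """Default damage formula: (base + SP part + AP part) * multipliers."""
    return (BASE_DAMAGE + SP_COEFF * sp + AP_COEFF * ap) * PERK_MULT * VERS_MULT


strength = 3616
mastery = 0.1324
ap_exact = strength * (1 + mastery)  # 4094.7584
ap_sheet = round(ap_exact)           # 4095, what the character sheet shows

for label, ap in (("sheet AP", ap_sheet), ("exact AP", ap_exact)):
    dmg = judgment_damage(ap, ap)    # same value used for SP and AP
    print(f"{label}: AP={ap:.4f} damage={dmg:.4f}")
```

The integer AP gives roughly 5525.25, while the unrounded AP gives roughly 5524.93, which is the value that matches the in-game data.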

Why is that important to you as a theorycrafter? Well, if you use the integer values your character sheet gives you, it means you’re reducing the precision of your estimates by rounding them to the ones digit. As a result, you wouldn’t trust any results you get to be accurate to any more than about $\pm 1$ damage. In all likelihood, your results might be off by one, just like our original hand-calculation was. In practice, there are ways to quantify this (for example, on a crit the error might increase to $\pm 2$ or $\pm 3$). But as a rough rule of thumb it’s good enough to know that you might be off by one or two in the digit you’re rounding.

More Complicated Testing

Of course, this was just a simple test of ability damage. You can do quite a lot more with Simulationcraft; it all comes down to tweaking the character profile to fit whatever situation you’re trying to test. Sometimes that might not even require an in-game test for comparison. For example, you might decide to enable the fixed_time flag and count the number of ability uses to see if haste is being taken advantage of properly in the simulation – something you could compare to a simple hand calculation. You could perform similar tests to validate the uptimes of certain buffs or effects.

On the other hand, sometimes you need a more complicated profile to test something like an interaction between two different abilities or effects. Often, that involves using conditionals on the action list. To illustrate that, let’s say we had a set bonus that gave us a chance on melee attack to proc a buff called “Super Judgment” which increased Judgment’s damage by 10%. We might want to know whether that bonus is multiplicative or additive with the Improved Judgment perk.

In case it’s not clear what that means, let’s say Judgment does $X$ damage before either effect. If the two effects are additive, then the total damage including both effects would be

$$T = X * (1 + 0.2 + 0.1) = 1.3*X.$$

If the two are multiplicative, then the total damage would be

$$T = X*(1+0.2)*(1+0.1) = 1.2*1.1*X = 1.32*X.$$
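The same comparison as a quick Python sketch, in case you want to try other bonus values. The function names and the 1000-damage starting value are purely illustrative.

```python
def additive(x, bonuses):
    """Bonuses are summed first, then applied once."""
    return x * (1 + sum(bonuses))


def multiplicative(x, bonuses):
    """Each bonus is applied as its own multiplier."""
    total = x
    for b in bonuses:
        total *= 1 + b
    return total


x = 1000.0            # hypothetical pre-bonus Judgment damage
bonuses = [0.2, 0.1]  # Improved Judgment and the hypothetical "Super Judgment"
print(round(additive(x, bonuses), 2), round(multiplicative(x, bonuses), 2))
```

The gap between the two models is only 2% of the pre-bonus damage, which is why a fixed-damage ability makes this so much easier to test than one with a wide damage range.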

Since Judgment appears to do fixed damage (at least, right now…) this would be pretty easy to test. If it suddenly got a damage range, then we’d need to take a bunch of data and determine which version is correct based on the minimum and maximum damage values that we observe, just like we did above for Crusader Strike and melee attacks.

If we want to find out whether Simulationcraft has this correct, we could just ask a developer. But it might be just as fast to run a test ourselves. With the APL,

actions=/auto_attack
actions+=/judgment,if=buff.super_judgment.react

we would limit ourselves to using Judgment only when the buff was active. The react in that statement just tells the sim to consider the player’s reaction time – in other words, the buff.super_judgment.react conditional evaluates to true if the buff has more than a few hundred milliseconds remaining.

Running the simulation for 50k or 100k iterations (which is relatively fast as long as you’re not doing anything fancy, like calculating stat weights) would give us pretty good maximum and minimum damage bounds that we could check against our in-game data.

Another neat trick that most players aren’t aware of is the “Sample Sequence” part of the report. It’s buried in the “Action Priority List” section, shown below:

A Simulationcraft report’s Action Priority List section.

This section tells you about the action priority list you’re using, but at the bottom you get a sample cast sequence for the player. This can get really ugly if your APL has lots of different spells, especially if some are off-GCD like Shield of the Righteous or Word of Glory. Nonetheless it’s a tool you can use to try and debug rotations. For our simple APL, it’s quite useful. We might expect a nice sequence of CS-J-E-CS-E-J-CS-E-E, where the E’s are all empty GCDs. In other words, the sequence of casts would be CS-J-CS-J-CS, or 34343 if we replace each abbreviation with the number on the action priority list. Since that sequence repeats, our sample sequence in the report should be an unending string that looks like 34343343433434334343.
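To see where that expected string comes from, here is a toy priority-list walker in Python. It is only a sketch, not how SimC schedules actions: time is counted in whole GCDs, and the cooldowns of 3 GCDs for Crusader Strike and 4 GCDs for Judgment are assumptions picked because they reproduce the CS-J-E-CS-E-J-CS-E-E pattern quoted above. Real cooldowns depend on haste.

```python
def sample_sequence(n_slots, priority):
    """Walk n_slots GCDs; in each, cast the first ready action on the list.

    priority: list of (marker, cooldown_in_gcds) in priority order.
    Empty GCDs are written as '-'. Toy model for illustration only.
    """
    ready_at = {marker: 0 for marker, _ in priority}
    out = []
    for slot in range(n_slots):
        for marker, cooldown in priority:
            if ready_at[marker] <= slot:
                out.append(marker)
                ready_at[marker] = slot + cooldown
                break
        else:
            out.append("-")  # nothing was ready: empty GCD
    return "".join(out)


# "3" = Crusader Strike, "4" = Judgment (their positions on the action list)
seq = sample_sequence(18, [("3", 3), ("4", 4)])
print(seq)                   # 34-3-43--34-3-43--
print(seq.replace("-", ""))  # 3434334343
```

Dropping the empty GCDs leaves the repeating 34343 block we expected to find in the report.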

If we look at what the sim produces, we get a single 2 in the front to indicate we’re starting our auto-attacks (in SimC we only cast this once at the beginning to turn them on). But after that, we get the sequence 3434343-34343-34343-3434343; not quite what we were expecting. This is something we might want to investigate, because it tells us that sometimes the simulation is casting Judgment instead of Crusader Strike when they are both available, in theory.

I also want to draw your attention to two other sections of the report that are useful to theorycrafters. The “Statistics & Data Analysis” section, shown below, gives you a thorough statistical breakdown of major encounter metrics like DPS, DTPS, TMI, and so on.

Bob’s Statistics & Data Analysis section.

Note that you can change the confidence intervals used by modifying the confidence option, as documented in the wiki. This section can be very useful if you want quantitative information about the distribution of the data across all iterations.

Finally, you may already know about the “Stats” section, which documents your character’s stats:

Bob’s character stats.

This section can be immensely useful when trying to sync up in-game results to simulations. Comparing these stats to your character sheet values is a good way to identify discrepancies between the profile you’re simulating and the character you’re using to perform your in-game testing. In fact, I’ve spent a fair amount of time comparing this table to the stats given on the character sheet in beta to make sure we’re doing all of those calculations properly. The process of doing that led to some interesting discoveries about primary stats (hint: they’re not integers either – more on that in a future blog post!).

How Not To Succeed In Theorycrafting

Obviously, as your test gets more complicated, so does your APL. Eventually, it may include entire rotations. Which brings us to one of the biggest mistakes that we see beginners make. They fire up Simulationcraft, import their character, hit simulate, and then immediately compare their results to their most recent week’s raid logs.

If you’ve been keeping up with this series of posts, you almost certainly recognize the error that was just made. Unfortunately, a lot of players don’t. And when the two don’t match very well, they decide that Simulationcraft must be in error and conclude that the tool is useless. I’d guess that the vast majority of people that tell me that Simulationcraft’s modeling isn’t very good are actually just using it wrong. Or in tech support speak, PEBKAC.

However, having recently graduated from the Theck School Of Designing Good Experiments, you know that to have any hope of comparing an in-game result to a simulation, the two need to be as similar as possible. And a real raid environment is very different than a simulation in which you smack Patchwerk around. There is no encounter in Siege of Orgrimmar that is well approximated by a simple Patchwerk-style encounter in Simulationcraft – they all have some component that makes the comparison a little suspect.

That certainly doesn’t mean the results are useless. We often glean insight from how a class performs in a Patchwerk encounter, and generalize that to apply it to real encounters. In some ways, a real encounter is a series of little Patchwerk sections interwoven with periods of movement, cleaving, and other mechanics. But it does mean that you generally won’t get the same DPS values when you compare a raid log to a Patchwerk simulation. Also note that you can do a lot more than just Patchwerk in SimC – there are a variety of different fight styles, and you can add your own custom raid events and customize the boss’s action priority list to try and mimic real boss encounters.

If you’re going to try to test a rotation, you want to stick to the same principles you would use for a more basic test. The rotation you perform in game needs to match the action priority list you set up as closely as possible, as do the character properties, gear, talents, buffs, and so on. This is one of the hardest things to test since it can be tricky to perform a flawless rotation for long enough to collect a sufficient amount of data. Making a few mistakes probably won’t completely invalidate your results, but keep in mind that it’s very easy to sneak some systematic or random error into your comparison via your actual in-game rotation.

And from the other side of things, it can pay to make sure the simulation is really doing what you think it is. For example, our simple CS/J rotation isn’t doing what we expect for some reason, and while it wasn’t very relevant in that test since we were only checking ability damages, it would be very relevant if we were trying to test a rotation. Before you try your in-game test, use output data like the Sample Sequence, ability interval times, and number of casts to make sure that your simulated rotation is what you’ll be replicating in-game.

Going Further

So far, this series has covered the bulk of the material necessary to start doing your own theorycrafting. There are lots of nitty-gritty details we could talk about, but I’m trying to write an introductory guide rather than an encyclopedia. I’m hoping to write a series of smaller blog posts over the course of the beta period tackling specific issues that highlight some of those details that you might not otherwise encounter.

The one big omission is what I’d call “high-level theorycrafting” in an analogy to “high-level languages” in programming. The name is a little misleading, in that it doesn’t imply particularly complicated or amazing work. Instead, it’s “high-level” because it glosses over a lot of the details and assumes the underlying tool is accurately handling those details.

To explain the etymology of that idea: C++ is one of many “high-level languages” because the person writing the code doesn’t have to worry about the ugly details of moving each bit of data from one memory location to another. By comparison, assembly (often loosely called “machine code”) is a “low-level language,” because you have to write out every single operation the processor performs. It’s tedious and difficult work, and not the sort of language you’d want to write an entire program in. Instead, we have a tool (called the compiler) that lets us write in a high-level language like C++, and translates that high-level code into low-level code for us.

What I’ve taught you so far is “low-level theorycrafting.” You now know how to move all the bits around, one by one. You can test the most basic interactions in the game, describe them mathematically, and confirm whether or not those mechanics are properly represented in Simulationcraft. This is some of the hardest work theorycrafting has to offer, but also some of the most basic and important work that needs to be done.

“High-level theorycrafting” is in many ways a lot easier. You fire up Simulationcraft and start tweaking the action priority list or gear set, and take notes on the simulation outputs. This is, in fact, how most people get their start in theorycrafting. There’s a fair chance that if you’ve read through this entire series of blog posts, you’ve already tried it. Maybe you ran your character through Simulationcraft twice with different trinkets to see which was better, or tweaked a line on an action priority list to see if it gave a DPS boost. All of that qualifies as high-level theorycrafting in my book.

The problem with starting at that level is that you’re not yet equipped to know whether you can trust your results. If I handed you a magical black box that was able to evaluate a bridge design and tell you whether it would fall down or not, and asked you to design a bridge, could you? You could fumble around with designs of bridges you’ve seen, and maybe even get the box to approve your design. But you’re relying on the box being correct, and you wouldn’t have the tools to determine whether it’s making a mistake. That’s why real bridge engineers start by learning basic physics concepts like forces and kinematics, and work their way up to being able to design entire bridges. (Aside: these “magical black boxes” really do exist in bridge engineering – they’re software packages that do a lot of complicated math/physics to evaluate designs, and like any software package sometimes they have bugs. That’s why you have several real (human) bridge engineers double- and triple-check the work before you start construction.)

That’s why we took the route we did through these posts. Because by building up your skills from the basics, you now have the knowledge and skills required to generalize to more complicated systems like rotations or gear sets. When you get results you don’t expect from your high-level work, you’ll be able to dig into the meat of the output and figure out the low-level reason why.

While there are certainly some tips and tricks that are helpful when doing “high-level theorycrafting” in Simulationcraft, you don’t really need them to get started. I’m not even certain they warrant an entire blog post, but I hope to put one together to discuss those ideas in more detail anyway.

I hope you’ve enjoyed this little tutorial, and more importantly I hope you’ve found it useful. As usual, I’m happy to entertain questions in the comments section if there’s anything you feel I’ve left out or want more information on. In addition, any suggestions you have for future installments of TC101 are most welcome.

Theorycrafting Resources

To end this series, I’d like to leave you with a few references you can go to for additional learning and/or help.

The Simulationcraft Wiki has a lot of information about how to tweak profiles to get what you want. We try to keep the wiki up-to-date, but the documentation often lags development a little bit. When in doubt, you can always fire up your favorite IRC client and hop into our IRC channel at irc.stratics.com (#simulationcraft) and ask for help. Many of the devs make a habit of being in there and providing assistance, especially to theorycrafters that are interested in helping contribute. Note that we do eat and sleep from time to time, so don’t be discouraged if you don’t get an answer instantly – you may just have to try again another time or day when people are there.

The Elitist Jerks forums are still a solid place for many classes. There have been complaints that the community is slowly dwindling, which may be true. I still post results there because the level of discourse is pretty high and posters tend to be pretty good at critical analysis. More importantly for the new theorycrafter, there’s a wealth of good posts and discussions from previous expansions to wade through that detail a lot of the game’s core systems. Some of that information is old, but much of it is still relevant, and the posts can be great examples of how to thoroughly research a topic and report your work.

The MMO-Champion forums are another place you can check for theorycrafting, though just like Elitist Jerks, it varies in quality on a class-to-class basis. The Icy Veins and Wowhead forums may have information, but in my experience they tend to focus more on class guides and advice than on theorycrafting discussion.

There are also a host of class- or role-specific sites. Tankspot is a good resource for tank-spec theorycrafting (especially warriors), as is Maintankadin for paladins, The Inconspicuous Bear for Guardian druids, How To Priest for priests, #Acherus chat for DKs, Altered Time for mages, and so on. I’m sure there are other sites for specific classes, but those are the ones I know off the top of my head. As a theorycrafter, you should probably already be aware of the major sites for your particular class.

Both Wowhead and MMO-Champion’s wowdb are databases of spell information that can be helpful if you know what you’re looking for. Both have useful features like a “Modified By” tab that tells you what other spells affect an ability, which can help track down undocumented effects or set bonus spells. Wowhead also has a neat Changelog that shows you how the tooltip has changed in each patch. But don’t forget, tooltips can lie!

WoWWiki and WoWpedia are both potential resources, though their information is frequently out of date. But they can still be quite useful for archival information, like how item stats are calculated (obviously changing in WoD, but…) or how spell resistances used to work.

There are also plenty of personal blogs that discuss theorycrafting topics. One of the more general ones is Hamlet’s blog, where he posts a mix of healing theorycrafting, critical/conceptual analysis of wow (especially beta) mechanics, and mathematical treatments of mechanics. For example, in my last blog post I linked to his discussion of how wow spells calculate damage range, and in the past he’s posted on topics like how HoT mechanics work, how specific trinkets work, and how to compute uptimes of proc-based buffs. Digging through his archives is a great way to learn a little about math and WoW mechanics.

There are far too many personal blogs to list all of them, so I won’t even attempt to try (that way I can’t accidentally miss someone and piss them off). Instead, if you have a blog where you talk about theorycrafting topics, please post a comment with a link to your blog and a brief description of what you do, particularly what class or classes you work on. As a theorycrafter, you should figure out which blogs cover material specific to your class and keep up with them. Taking some time to browse through their archives will probably teach you a lot as well.


## TC101: Experimental Design

In the previous post, we talked about what theorycrafting means and worked through a basic example of beginner theorycrafting. In this post, I want to go into a little more detail about the laboratory-based part of theorycrafting – in other words, designing and carrying out in-game “experiments” to test how mechanics work.

WoW Experiments

Luckily, “experiments” in WoW are pretty simple in a relative sense. While the entire system may be complicated, we generally have a good idea about how things work and what’s causally related (and not).

For example, I know when I press the key for Crusader Strike, my character will cast Crusader Strike on my target, if possible. I know that the damage it deals depends on a few factors: my weapon damage, my own stats and temporary buffs, any damage-increasing debuffs on the target, the target’s armor mitigation (which depends on its armor and both of our levels), and so on. Even if we don’t know exactly how those relationships work – that’s what we’re testing after all – we know that they exist or might exist.

Likewise, we can also quickly rule out a lot of variables. We don’t expect our Crusader Strike damage to depend on the time of day or the damage that another player is doing to a different target. This sounds silly, but it’s actually a pretty big deal. In real experiments (i.e. in a research laboratory), there are loads of external factors that can affect results, and we have to take great care to identify and eliminate (or at least minimize) those factors. WoW experiments are incredibly easy because we don’t have to do much of that at all.

To illustrate that thought, let me give you a real-life example. As an undergraduate, I spent one summer doing nuclear physics research at the University of Washington. One of the research groups there was making precise force measurements to test General Relativity. Their setup involved a very specially-designed arrangement of masses and a smaller (but still hefty) hanging mass oscillator driven by a small motor.

When they made their measurements, they found a deviation from what they expected. After hours and hours of brainstorming, adjustments, repeating the experiment, and what not, it was still there. They looked at every external factor they could think of that might affect the result, and nothing seemed to be the culprit. After a few months of this, the grad students were beginning to think that maybe they had made a breakthrough discovery.

Their advisor, however, wasn’t as convinced. He made them continue searching for the error. I think he even made them build a second copy of the experiment from scratch to cross-check the results. In any event, eventually they narrowed down the culprit to the motor. As it turned out, the one they had been sent did not meet the manufacturer’s specifications (which had been pored over and chosen very carefully for exactly this reason!), and was malfunctioning in a very subtle way that caused the anomaly they observed.

In WoW experiments, we rarely, if ever, need to worry about being influenced by factors that aren’t immediately obvious. Generally, we have a very limited set of variables to work with, so identifying and isolating problems is pretty easy.

Basic Experimental Design

Before performing any experiment, you should first make sure you can answer (or have at least tried to answer) all of these questions:

1. What am I trying to test (i.e. what question am I trying to answer)?
2. What am I going to vary (and how)?
3. What am I going to hold constant?
4. What am I going to measure (and how)?
5. How much data do I need to take?

The first is pretty obvious – it’s hard to perform an experiment if you don’t have a clue what it is you’re trying to determine. Using our example from the previous post, if my question is “How does Judgment’s damage vary with attack power,” then the obvious answer to (1) is that we’re going to test whether Judgment’s damage changes when we change our character’s attack power. So far so good.

Variables

Our question also gives us the answers to questions (2) and (3). We’re going to vary attack power, and we want to (ideally) keep everything else constant. Implicit in this is a central tenet of experimental design that you should try to adhere to as often as possible: only vary one thing. That one thing is called the independent variable, in this case our attack power. In an ideal world, we only ever have one independent variable, so that we know for sure that whatever change we see in the measurement is due to that variable.

For a concrete example of that, let’s say we have an ability that depends on both weapon damage and spell power. If we make our measurements in such a way that we’re changing both our weapon and our spell power, we have a giant mess. We’d have to untangle it to determine how much of the change in damage was due to the weapon and how much was due to the change in spell power. That task may not be impossible, but it usually significantly complicates our data analysis.

In some cases, that complication is unavoidable. For example, if you look back at the most recent diminishing returns post, you’ll see that I was performing surface fits to three-dimensional sets of data with two independent variables. I had to do that to properly and accurately fit the two constants of the diminishing returns equation simultaneously, and it only worked as well as it did because we already had a good idea what the formula looked like and what those constants were from prior experience and single-independent-variable experiments. In general, I wouldn’t recommend this technique to a beginner theorycrafter.

So, in our case our independent variable is attack power, and we’re going to keep all of the other potential variables constant. These constants are often called controls, though I prefer to call them “fixed variables” or “variables held constant” because they are variables, just ones you don’t want to vary. So our list of fixed variables includes crit, mastery, multistrike, versatility, etc.

This poses a potential problem for us, though. We haven’t yet answered the question, “how are we going to vary our attack power?” Normally, we would do this by putting on or taking off gear to change our strength. That seems pretty straightforward, but since gear has secondary stats, we’re also changing our crit, mastery, etc. at the same time!

Sometimes you can get around this by using certain temporary effects. For example, if we have several trinkets with varying amounts of attack power on them (and nothing else), we could swap them around to isolate attack power as the independent variable. But we aren’t always that lucky, so generally we’re going to need to make some compromises here.

We might know that some of these are irrelevant – crit, for example, won’t change our results as long as we’re filtering out or adjusting for crits. Likewise, we probably don’t care what our multistrike chance is as long as we’re ignoring multistrike results. In cases where we’re sure, we can be more lenient about letting those factors vary. Since we know that crit and multistrike are independent events, we can safely ignore them as long as we’re careful during our data collection (see below).

But sometimes we’re not sure about it – for example, we may not know whether mastery does or doesn’t affect the damage of Judgment. And for another example, Chaos Bolt damage does increase with crit. Since we don’t always know, it’s safer to try and keep everything else constant when possible.

As an experiment designer, your goal is to juggle these constraints. You’ll be searching for ways to isolate a particular variable (say, attack power) while keeping certain other variables constant. But you’ll also have to decide when it’s acceptable for another variable to change value during an experiment, sometimes by confirming that the variable is irrelevant to the test at hand. Often this means thinking critically about how you’ll design your experiment. For example, if you were testing ability damage, you could safely ignore hit rating, but only if you made sure you ignored misses when you tallied up your results.

When testing ability damage in previous expansions, we generally just took off gear to change attack power. This wasn’t a problem because hit, miss, crit, haste, dodge, and parry were all independent from the raw (non-crit) damage done by abilities when attacking a target dummy. Multistrike doesn’t appreciably complicate matters, but as you might guess, versatility is a big problem. Which means we have at least one serious constraint on our experiment: we want to use gear that doesn’t have any versatility. Again, if we didn’t have any other choice, we could use advanced techniques to get around this constraint, but it’s far simpler if we just adhere to the constraint.

This brings up another major constraint on most in-game experiments: we don’t want any procs that change our stats temporarily. Otherwise, we’ll have periods where our experimental conditions have changed, which will make the data difficult or impossible to analyze properly. So generally, we want to take off any trinkets or special gear that has stat-changing or damage-increasing procs, and we want to avoid using weapon enchants like Dancing Steel that give temporary buffs.

Measurements

Which finally brings us to question (4). We’re going to measure Judgment’s damage, obviously. The thing we’re measuring is called the dependent variable, because its value may depend on the independent variable. But we don’t just need to know what we’re measuring, we also need to make sure we know how we’re going to perform that measurement and what we’re going to do with the result.

For example, am I going to cast Judgment and write down the value listed in the combat log on a piece of paper? That might be fine if I only need one or two casts, but could quickly become cumbersome if I plan on collecting hundreds of casts worth of data.

More commonly, we’d turn on combat logging (via the /combatlog slash command) so we have a text record of everything. From there, we could upload that result to a parsing site like Warcraft Logs, import the log into MATLAB, write a quick script to scrape the log and convert to a CSV file we can open in Excel, or any number of other analysis methods.

Similarly, within those steps, are we just going to count normal Judgment hits and ignore crits and multistrikes entirely? Or are we going to try to use that extra data, say, by dividing each crit by two and each multistrike by 0.3 and using the adjusted values? The latter gets us more data faster, which could reduce the amount of time it takes. But it relies on two very specific assumptions: that crits do 2x as much damage as a normal hit and multistrikes do 0.3x. If either of those is wrong, for example because we have a crit meta gem on, or our spec’s multistrikes have a different modifier, then our data is polluted.
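As a rough illustration of that adjustment, here is a minimal Python sketch. The multipliers (2.0 for crits, 0.3 for multistrikes) are exactly the assumptions discussed above, and the event tuples are made-up placeholders rather than real log data; the function name `normalize_damage` is hypothetical.

```python
# Assumed multipliers -- verify both for your spec and gear before trusting them.
CRIT_MULTIPLIER = 2.0         # assumption: crits deal 2x a normal hit
MULTISTRIKE_MULTIPLIER = 0.3  # assumption: multistrikes deal 0.3x a normal hit


def normalize_damage(amount, is_crit=False, is_multistrike=False):
    """Convert a logged damage value to its equivalent normal-hit value."""
    if is_crit:
        amount /= CRIT_MULTIPLIER
    if is_multistrike:
        amount /= MULTISTRIKE_MULTIPLIER
    return amount


# Made-up entries for illustration only: (damage, is_crit, is_multistrike)
events = [
    (1000, False, False),
    (2000, True, False),
    (300, False, True),
]
normalized = [normalize_damage(d, c, m) for d, c, m in events]
print(normalized)
```

If the normalized values do not collapse onto the same number (or the same damage range), that is a hint that one of the assumed multipliers is wrong for your setup.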

Furthermore, we know we’re going to cast Judgment, but we haven’t specified how we’re going to cast it. Are we going to cast it on cooldown? Usually that’s the case, but sometimes we might have to wait longer to avoid another unwanted effect (think a single-target version of the Double Jeopardy glyph). Are we going to cast it on a single dummy, or tab-target around between different ones (for example, to test Double Jeopardy)? If so, must those dummies all be the same level? Multi-target considerations are obviously even more important when testing AoE abilities.

Are we only going to cast Judgment, or are we going to cast other things while we’re at it? Maybe we’ll do a single data collection run that combines multiple tests – say, simultaneously testing Judgment, Crusader Strike, and Avenger’s Shield damage. If we’re casting multiple things, are we sure they don’t interact at all?

All of these are questions you’ll need to consider when deciding on your experimental method or procedure. No matter what we decide to do with the data or how we decide to collect it, we should know the entire plan ahead of time to make sure we’re collecting the right thing. It can be incredibly frustrating to take several hours worth of data, and later during analysis find out that your measurements depend on another factor that you forgot to record.

This means that every time we swap gear, we might want to write down everything – the value of all potential variables (strength, agility, intellect, attack power, spell power, mastery, crit, multistrike, haste, versatility) – before we start taking data. That way, if we start analyzing our data and find an anomaly, we have some information we can use to determine what the problem was.

For example, maybe we accidentally removed a piece with versatility on it that we intended to keep on for the entire test. If we’ve been recording versatility before every new set of data, we might be able to catch that after-the-fact and be able to salvage that data set, or at least know why we need to exclude it. Without that knowledge, we might have to re-take the data.

Again, if you’re very certain that a particular stat doesn’t matter (haste, for example, in our case), you can skip recording it. In practice, I rarely record everything. By now, I’m familiar enough with what factors matter that I generally write down only the handful of things I care about. That sort of intuition will come with time, practice, and familiarity with the mechanics. But even I make mistakes, and end up re-taking data (versatility is especially bad in that regard, because I’m still not used to it), so for a beginner I’d recommend erring on the side of caution.

Data Collection

The next question we need to answer is how much data we need to collect. This will vary somewhat from experiment to experiment. For example, if we just want to know whether Judgment procs Seal of Truth, we might only need a single cast. But more often, we’ll need to invoke some statistics. In this section, we’ll give a brief overview of two common ways to use statistics to determine just how much data we need.

Unknown Proc Rate

For example, let’s say we’re trying to accurately determine the proc rate of Seal of Insight. We expect we’ll need to record a lot of auto-attacks and count the number that generate a proc. We can use statistics to figure out how many swings we need to make, at minimum, to feel confident in our result. That amount could be a few hundred swings or even several thousand depending on how accurately we want to know the proc rate.

Proc-based effects are usually modeled by a binomial distribution because they’re discrete events with two potential outcomes (proc or no proc), every proc chance is independent (usually), and the proc rate is constant (again, usually). Most of the time, we can use something called the Normal Approximation Interval to estimate the possible error of our measurements, which we can reverse-engineer to figure out the number of swings we need.

In short, thanks to the Central Limit Theorem we can approximate the error in our measurement of a proc chance $p$ with the following formula:

$$p \pm z\sqrt{\frac{p(1-p)}{N}}$$

where $N$ is the number of trials (in our case, swings) and $z$ is a constant that depends on how confident we want to be on the result. If you want to know how to calculate $z,$ read up on Standard normal distribution, but most of us just use one of several precalculated values. The most common one is to use $z =1.96$, which corresponds to a 95% confidence interval. Other common values are $z=2.58$ for a 99% confidence interval and $z=3.29$ for a 99.9% confidence interval. If you’re lazy (like me) or don’t feel like memorizing or looking up those numbers, you can use $z=2$ and $z=3$ as rough approximations of the 95% and 99+% confidence intervals.

The way this is normally used, you’d first run an experiment to collect data. So maybe we perform 100 swings and get a proc on 25 of them. We would then have $p=25/100=0.25$ and $N=100$, and our 95% confidence interval is $0.25 \pm 0.0849$, in other words from 16.51% to 33.49% – a pretty wide range. We’d get a narrower range if we performed $N=1000$ swings and got 250 procs; $p$ is still 0.25, but the 95% confidence interval shrinks to $\pm 0.0268$, or from 22.32% to 27.68%. Since we’re dividing by $\sqrt{N}$ in the formula, to increase the precision by another decimal place (factor of 10) we need to use 100 times as many trials.
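The arithmetic above is easy to script. The following is a small Python sketch of the Normal Approximation Interval; the function name `normal_approx_interval` is my own invention, not something from a library.

```python
import math


def normal_approx_interval(procs, swings, z=1.96):
    """Return (low, high) bounds of the normal-approximation interval."""
    p_hat = procs / swings
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / swings)
    return p_hat - half_width, p_hat + half_width


# The two cases worked through in the text.
for procs, swings in [(25, 100), (250, 1000)]:
    low, high = normal_approx_interval(procs, swings)
    print(f"N={swings}: {low:.4f} to {high:.4f}")
```

Passing a different `z` (for example 2.58 or 3.29) gives the 99% or 99.9% interval from the same data.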

And of course, we can also use this formula to figure out how many iterations we need to reach a certain precision. Let’s say we want to know the value to precision $P=\pm 0.001$. We can set $P$ equal to the term that describes the interval:

$$P = z \sqrt{\frac{p(1-p)}{N} }$$

and solve for $N$:

$$N = \frac{z^2 p(1-p)}{P^2}$$

So for example, let’s say we suspect the proc rate is $0.25$ (this would be our hypothesis). If we want to know the proc rate to a precision of $\pm 0.001$ with 95% confidence, we need $N=720300$ melee swings.
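Here is the same calculation as a short Python sketch, in case you want to try other precisions or confidence levels. The function name `required_trials` is hypothetical, and it returns the raw estimate from the formula; in practice you would round up to a whole number of swings.

```python
def required_trials(p_guess, precision, z=1.96):
    """Estimate N from N = z^2 * p * (1 - p) / P^2 (not rounded)."""
    return z**2 * p_guess * (1 - p_guess) / precision**2


# The example from the text: suspected 25% proc rate, +/- 0.001, 95% confidence.
print(round(required_trials(0.25, 0.001)))

# Relaxing the precision by a factor of 10 cuts the requirement by a factor of 100.
print(round(required_trials(0.25, 0.01)))
```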

There are two caveats here. First, this formula is only an approximation, which means it’s got a range of validity. In particular, it becomes poor if $p$ is very close to zero or one and breaks down entirely if it’s exactly zero or one (though in those cases, the behavior is usually clear enough that we don’t need this method anyway). The rule of thumb is that it gives good results as long as $pN>5$ and $(1-p)N>5$. Since this rule includes $N$, you can still use this approximation when $p$ is very small by increasing the number of trials $N$ to keep the product over 5.

The second caveat is very subtle. Technically speaking, if we find the 95% confidence interval to be $0.25 \pm 0.05$, or 20% to 30%, that does not mean that the true value of the proc rate (let’s call it $\mu$) has a 95% chance to be between 20% and 30%. Instead, it means that if we repeated the experiment 100 times and calculated a confidence interval for each one, we would expect about 95 of those confidence intervals to contain the true value $\mu$.

The Wikipedia article for confidence interval makes this distinction as clear as mud (though to be fair, it’s better than any other treatment of it I’ve read; $1-\alpha$ is their representation of confidence, so for a 95% confidence interval $\alpha=0.05$):

This does not mean there is 0.95 probability that the value of parameter μ is in the interval obtained by using the currently computed value of the sample mean.

Instead, every time the measurements are repeated, there will be another value for the mean X of the sample. In 95% of the cases μ will be between the endpoints calculated from this mean, but in 5% of the cases it will not be.

The calculated interval has fixed endpoints, where μ might be in between (or not). Thus this event has probability either 0 or 1. One cannot say: “with probability (1 − α) the parameter μ lies in the confidence interval.” One only knows that by repetition in 100(1 − α) % of the cases, μ will be in the calculated interval. In 100α% of the cases however it does not. And unfortunately one does not know in which of the cases this happens. That is (instead of using the term “probability”) why one can say: “with confidence level 100(1 − α) %, μ lies in the confidence interval.”

Got all that?

In practice, the distinction isn’t that important for us – it’s mostly a matter of semantics. If we were submitting our work to a scientific journal, we’d care, but for theorycrafting we can play a little fast and loose with our statistics. Just don’t tell the real statisticians. The key is to remember that you can run an experiment and get a confidence interval that doesn’t contain the value you’re looking for.

In fact, it’s almost certain that you will see that happen if you’re using 95% confidence intervals, just because a 5% chance is pretty high. That’s one in every 20 experiments. When that happens, you may need to take further measures. That may mean repeating the experiment, or it may mean using a tighter confidence interval. Sometimes it means that you reject your hypothesis, because the value really isn’t inside the confidence interval. This is where the critical thinking aspect comes in – you have to interpret the data and determine what it’s really telling you.
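If you want to see this for yourself, you can simulate it. The Python sketch below fakes a large number of experiments with a known "true" proc rate and counts how often the 95% interval actually contains it. Everything here is an assumption for illustration: the 25% proc rate, the 500 swings per experiment, and the function names.

```python
import math
import random


def interval_contains_truth(true_p, swings, z, rng):
    """Simulate one experiment; report whether its interval covers true_p."""
    procs = sum(rng.random() < true_p for _ in range(swings))
    p_hat = procs / swings
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / swings)
    return p_hat - half_width <= true_p <= p_hat + half_width


def coverage(true_p=0.25, swings=500, z=1.96, experiments=1000, seed=1):
    """Fraction of simulated experiments whose interval contains true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(
        interval_contains_truth(true_p, swings, z, rng)
        for _ in range(experiments)
    )
    return hits / experiments


print(coverage())
```

The printed fraction should land near 0.95, which means that roughly one simulated experiment in twenty produced an interval that missed the true value even though nothing was wrong with the "mechanic" being tested.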

Remember that you can always increase $z$ to calculate a more inclusive confidence interval, which doesn’t require taking extra data. Sometimes that will answer the question for you (“The result is outside the 99.9% confidence interval, the hypothesis is probably wrong, Seal of Insight’s proc rate is not 25%”). And on the other extreme, you can increase $N$ to reduce the size of the confidence interval if you’re trying to increase the precision of an estimate. Though that obviously means taking more data!

Unknown Proc Trigger

Sometimes, we just want to know if an effect can occur. For example, on beta I was doing some testing to see exactly how Sacred Shield benefits from multistrike. To test that, I just kept re-applying Sacred Shield until I saw it generate bubbles that didn’t match the baseline or crit values. A single multistrike should generate a shield that is 30% larger than the baseline one, so I just kept going until I observed one. Likewise, I wanted to know if a double-multistrike proc would generate a shield that was 60% larger, so I kept casting until I observed one of those as well.

This type of test, where a single positive result proves the hypothesis, can be very easy to perform if the chance is high. If I get a positive result very quickly, the test won’t take very long at all. But if the chance is low, you could be at it all day. And if the chance is actually zero (because your hypothesis is wrong), you could go forever without seeing the event you’re looking for. Again, statistics help us make the determination as to how many times we need to repeat the test before we can say with reasonable certainty that the event can’t happen.

Generally in this type of test, you already know the proc rate. For example, if I have 10% crit and I want to know if a particular ability can crit, I know that it should have either a 0% chance (if it can’t crit) or a 10% chance (if it can crit) to do so. So my $p$ here is 10%, or $p=0.10$. According to binomial statistics, if I perform the test $N$ times, the probability of getting exactly $k$ successes is calculated by:

$$Pr(k; N, p) = {N \choose k} p^k (1-p)^{ N-k},$$

where

$${N \choose k} = \frac{N!}{k!(N-k)!}.$$

This is relatively easy to evaluate in a calculator or computer for known values of $p$, $N$, and $k$. Many calculators have a “binopdf” function that will do the entire calculation; in others you may need to calculate the whole thing by hand.

So let’s say we perform the experiment. We cast the spell 100 times and don’t observe a single crit. The probability of that, according to our formula, is:

$$Pr(0; N, p) = {N \choose 0} p^0 (1-p)^N = (1-p)^N$$

Plugging in $N=100$ and $p=0.1$ we find that the probability is around $2.66\times 10^{-5}$, or around 0.00266%. Pretty unlikely, so we can probably safely assume that the ability can’t crit. Though again, there’s a 0.00266% chance we could be wrong!
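The same formula can be turned around to decide, before you start, how many casts you need. This is a minimal Python sketch; the function names and the 0.1% threshold in the example are my own choices, not anything official.

```python
import math


def prob_no_successes(p, n):
    """Probability of zero procs/crits in n attempts at chance p."""
    return (1 - p) ** n


def casts_needed(p, max_false_negative):
    """Smallest N for which (1 - p)**N <= max_false_negative."""
    return math.ceil(math.log(max_false_negative) / math.log(1 - p))


print(prob_no_successes(0.1, 100))  # the example worked through above
print(casts_needed(0.1, 0.001))     # casts needed for a 0.1% risk of being fooled
```

Solving $(1-p)^N \leq \epsilon$ for $N$ gives $N \geq \ln \epsilon / \ln(1-p)$, which is all `casts_needed` computes.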

A related problem is if we want to know the probability of getting up to a certain number of procs or crits. For example, if we have 10% crit, what is the probability of getting five or fewer crits in 100 casts? To do that, we’d have to calculate the separate probabilities of getting exactly 0, 1, 2, 3, 4, and 5 crits and sum them all up. Mathematically, that would be:

$$Pr(k\leq 5; N, p) = \sum_{k=0}^{5}Pr(k;N,p) = \sum_{k=0}^5 {N \choose k }p^k (1-p)^{N-k},$$

and at this point we’d probably want to employ a calculator or computer to do the heavy lifting for us. For our example, MATLAB gives us:

>> sum(binopdf(0:5,100,0.1))

ans =

0.0576

So we’d have a 5.76% chance of getting five or fewer crits in 100 casts with a 10% crit chance.
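If you don’t have MATLAB handy, the same sum is only a few lines of Python using the standard library (`math.comb` needs Python 3.8 or newer). The function names `binom_pmf` and `binom_cdf` are my own; this is a sketch of the formulas above, not a replacement for a proper statistics package.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials at chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k_max, n, p):
    """Probability of k_max or fewer successes in n trials."""
    return sum(binom_pmf(k, n, p) for k in range(k_max + 1))


# Five or fewer crits in 100 casts at 10% crit chance.
print(round(binom_cdf(5, 100, 0.1), 4))
```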

Ability Damage Tests

Since we started this post discussing a test related to the damage formula of Judgment, I want to make one final note about testing ability damage formulas. Normally, you can get this information from tooltips or datamining. But tooltips can lie, especially during a beta, and sometimes even on live servers. Word of Glory’s tooltip was wrong for almost the first half of Mists of Pandaria, for example. So it can be useful to perform tests to double check them. One of my first tasks every beta is to take data on every spell in our arsenal and attempt to fit the results to confirm whether the tooltips are correct.

Your first instinct when performing this type of experiment may be to cast the spell a few hundred times and record the average damage, and then repeat at various different attack power (or spell power) values. However, while that works, it’s not always the most accurate (or efficient) way. Hamlet wrote an excellent article about this topic earlier this week, and you should really go read it if you want to understand why an alternative method (which I’ll briefly outline below, since I already had it written) is advantageous.

Spells in WoW traditionally have had a base damage range (either by default, or based on weapon damage) and then some constant scaling with attack power and/or spell power. The base damage range was fixed and obeyed a uniform distribution, and accounted for all damage variation in the spell. So it was often more accurate to record the maximum, minimum, and average for each set of casts, and then attempt to fit the maximum and the minimum values separately.
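To make the "fit the max and min separately" idea concrete, here is a minimal Python sketch using an ordinary least-squares line fit. The numbers in `observations` are invented purely for illustration (they are not measurements of any real spell), and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented example data: attack power -> (lowest hit seen, highest hit seen)
observations = {1000: (650, 750), 2000: (1150, 1250), 3000: (1650, 1750)}

attack_power = sorted(observations)
min_fit = fit_line(attack_power, [observations[ap][0] for ap in attack_power])
max_fit = fit_line(attack_power, [observations[ap][1] for ap in attack_power])
print("min:", min_fit)
print("max:", max_fit)
```

In this made-up data set both fits share the same slope (the attack power coefficient), while the two intercepts give the bottom and top of the base damage range. If the slopes of your real fits disagree, that suggests the variation is not coming from a fixed base range.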

This was especially useful for abilities that did some percentage of weapon damage, because one could equip a weapon with a very small damage range (i.e. certain low-level common-quality weapons), at which point it might only take a handful of casts to cover the entire range.

I’ve used the past tense here, because they’ve changed how abilities work in Warlords. They no longer have any base damage values, which means that they’ve had to change the method they use to make spell damage vary from cast to cast. I don’t know what they’ve chosen to do about that, because I haven’t had time to test it thoroughly. In my limited time on beta, I’ve noted that some spells don’t appear to vary at all anymore, while others do. For example, Judgment and Exorcism both do the same damage every time they connect, while the healing done by Flash of Light and Word of Glory still varies from cast to cast. Abilities based on weapon damage, like Crusader Strike and Templar’s Verdict, vary according to their weapon damage range.

The ones that vary could just use a flat multiplicative effect, such that the spell always does $X \pm \alpha X$ for some value of $\alpha$. In other words, maybe it always does $\pm 10\%$ of the base damage. But it could also be some other method. I’m sure we’ll figure this out as beta goes on (if nobody else has yet), but just keep in mind that this slightly changes the procedure above. You’d still be matching the min and max values, of course, but you’d potentially be looking for scale factors that are, say, 10% larger or smaller than the expected mean value.
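If the variation really is a symmetric $X \pm \alpha X$ (and that is an assumption, not something I’ve confirmed), then the observed extremes give both $X$ and $\alpha$ directly. A minimal sketch, with made-up numbers and a hypothetical helper name:

```python
def estimate_spread(min_hit, max_hit):
    """Return (X, alpha), assuming hits span X*(1 - alpha) to X*(1 + alpha)."""
    mean = (max_hit + min_hit) / 2
    alpha = (max_hit - min_hit) / (max_hit + min_hit)
    return mean, alpha


# Made-up extremes from a set of casts at fixed spell power.
mean, alpha = estimate_spread(9000, 11000)
print(f"X = {mean:.0f}, alpha = {alpha:.1%}")
```

If $\alpha$ comes out the same at several different attack power or spell power values, the flat multiplicative model is at least consistent with the data.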

Coming Soon

That wraps up our primer on performing in-game experiments. We could talk in a lot more detail about any of these examples and identify other nuances, tips, and tricks. But this post, while dense, covers the basics one would need to set up and perform an in-game experiment. Most of what we would gain by going into more depth is improvements in accuracy and efficiency.

Obviously both of those are good things, but they’re not necessary for your first few attempts at in-game experimentation. In the future, I might write a few shorter articles that are more focused on the nuances involved in a particular kind of measurement, provided there’s interest in the topic. If you have something in particular you’d like me to write about in depth, please mention it in the comments.

In the next post, I want to look at how we can use the results of in-game experiments to check Simulationcraft results. Which also means designing and executing “experiments” in Simulationcraft that we can use for comparison. Many of the same basic ideas will apply, of course; for example, eliminating as many variables as possible and making sure you’ve collected enough data. But in this case, we’ll be applying those principles to designing action priority lists and interpreting reports.


## TC101: Intro to Theorycrafting

On more than a few occasions I’ve been asked some variation of the question, “How do I get started in theorycrafting?” Which is a tough question to answer, since there’s a variety of ways to get started depending on what you’re interested in and what talents or tools you have at your disposal. Someone proficient with spreadsheets might try to write one to model a rotation, for example. But one’s first foray into theorycrafting could be as simple as doing some “napkin math” to compare two talents.

For example, my own entry into the world of theorycrafting happened when I took somebody’s prot paladin spreadsheet and translated it into MATLAB code. I wanted to analyze variation with several different input variables (i.e. the oft-misused term “scaling”), which is something that spreadsheets are traditionally poor at doing. Translating the formulas in the spreadsheet into MATLAB code provided two advantages: full text code is generally easier to debug than spreadsheet formulas are, and MATLAB is designed to work with flexible arrays of data in ways that spreadsheets simply aren’t.

In the process of performing that translation I learned a lot about the way different spells interact, how some of the different game systems worked, and so on. In a lot of cases I corrected formulas that I discovered were in error, often because I explicitly tested the formula in-game to see if it was right. It was a slow but steady process of learning, testing, and refinement. And once it was done, the learning continued as I started to expand the sorts of questions that I wanted to answer with my code.

But when somebody asks, “How do I get started,” they’re not usually thinking about a specific problem. They’re thinking about making the transition from being a person that reads guides and follows the advice given to someone who discovers and creates that advice.

Sometimes, the person asking only has a vague understanding of what it means to “theorycraft.” Most players already know that theorycrafting produces numbers that can be used to evaluate performance and ask questions, of course. But what most players don’t know is how those numbers are produced from beginning to end.

That’s what I hope to clear up with this series of blog posts. And the first step is to make it clear exactly what the term “theorycrafting” means.

What IS Theorycrafting?

At its root, theorycrafting is a process called mathematical modeling. We’re trying to take some sort of system – in this case game mechanics – and describe it mathematically so that we can generate predictive results. As with most mathematical modeling, it’s also somewhat directed. In other words, we’re not just doing this for the hell of it; we’re trying to answer specific questions, so our model is built around having the versatility to be able to answer those questions.

Generally, that doesn’t happen by spontaneously creating a very complicated model that covers everything. It happens by creating a very simple model and then slowly refining it to include all of the complications necessary to make it accurate. In other words, you don’t start with a BMW. You start with a wheel, and maybe an axle. You put those together and start adding things, one by one, until you do have a BMW.

I had a very interesting conversation with Steve Chick a few weeks ago, during which he provided a great flow chart that more or less describes this process:

Flow chart that describes the process of learning. Source unknown.

This is the basic process of problem solving (and a core part of the scientific method), and it applies equally well to theorycrafting and model creation. Steve and I have differing opinions about the best advice for how the “learn more things somehow” part should be accomplished, but we agree completely on the process.

We have a question we’re trying to answer, like “How much DPS does Judgment provide,” and we’re attempting to break it down into smaller pieces that we can answer. The goal is to then put those pieces back together and come up with the answer to our original question.

Which means that in a broader sense, theorycrafting is also an exercise in problem solving. Even though they may not quite realize it, the player asking how to get started with theorycrafting is really asking how to obtain the tools necessary to start solving problems on their own. What I hope to provide with this series of blog posts is a little guidance on exactly how to develop and use those tools.

Theorycrafting 101

As an example, let’s say that is our question: “How much DPS does Judgment provide?” Let’s break that down as if we were complete newcomers to theorycrafting.

First, do we know what “DPS” stands for, and how to calculate it? You, as a seasoned WoW player, laugh at that question. But in reality, it’s not something I’d expect a random WoW player to know. Even if they knew it meant “damage per second,” knowing how to properly calculate it wouldn’t be guaranteed. You’d be surprised how many college-age students struggle with simple ratio metrics like velocity (“meters per second”), current (“charge per second” or “mass per second” depending on whether we’re talking about electricity or fluid flow), or efficiency measures (“miles per gallon”).

Let’s say we know the general concept – that we know we want to add up the damage we do in some period of time and then divide the total amount of damage by the length of time. How long a period do we use? Ten seconds? A minute? Ten minutes? An hour? The answer to that depends on not just accuracy, but the details of our rotation. If our rotation is a fixed, repeatable cycle (like CS-J-X-CS-X-J-CS-X-X) then we could plan on using one full cycle to give us the same precision as an infinite amount of time. But if it isn’t, we might have to decide what the cutoff is. Maybe we want to simulate 300 seconds of continuous combat, or maybe we only care about a 20-second window of a fight.

Once we decide on the time, we need to figure out how to calculate the total amount of damage done by Judgment in that time. Intuition tells us that will be the average damage of each cast times the average number of casts in our time window. Again, some of that depends on rotation (number of casts). But we also need to know how we calculate the damage done per cast. So we’ve broken the problem down into two smaller problems:

1. How much damage does Judgment do per cast?
2. What’s our rotation?
   • Determines number of casts and time interval (or equivalently, cast rate)

And we’ve come up with an equation:

$$\text{DPS} = ( \text{Damage Per Cast} \times \text{Number of Casts} ) / \text{Time}$$
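As a quick sanity check, here is that equation as a one-line function. The numbers in the example are made up purely for illustration.

```python
def dps(damage_per_cast, number_of_casts, time_seconds):
    """Average damage per second over the chosen time window."""
    return damage_per_cast * number_of_casts / time_seconds


# Made-up example: 5000 average damage per cast, 50 casts in 300 seconds.
example_dps = dps(5000, 50, 300)
print(f"{example_dps:.1f} DPS")
```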

And note that we haven’t gotten any farther than deciding how to calculate a relatively simple metric like DPS!

So now we try and answer each of those questions, and break them down further if we can’t. Let’s take #1 since it’s simpler – how much damage does Judgment do per cast? If we’re a complete newcomer to theorycrafting, we may not know any more than “we press a button and it does some damage.” So we need to figure out how to quantify that.

This is the part where Steve and I disagree, by the way. He suggests that you should test it and figure it out yourself. In other words, go into the lab (i.e. in game) and set up an “experiment” to measure that damage and figure out how the game is calculating it. And there are definitely advantages to this approach. Learning is often significantly aided by firsthand experience, which is why laboratory exercises are so common in the sciences. This is, in fact, the approach we’ll use for our example.

However, my first instinct is to look things up and see if someone’s done the hard work for me before. I know that I may learn something from the process of designing and carrying out an experiment, especially if I screw something up and have to re-do it (nothing aids learning like painful and/or time-consuming mistakes!). But I also know that it’ll probably be a lot faster to spend a few minutes googling. That may also be a generous way of saying “I’m lazy.”

So let’s say we want to set up this experiment. What are we going to test? Or, put another way, what factors change the damage of Judgment? First, we might already know (or guess) that it changes when our attack power changes. We might also wonder if it varies with spellpower. Maybe we’re not sure if it depends on weapon damage, or if it has a base damage value. We do know from experience that it does more damage when we get a critical strike, and that there are a few effects that boost its damage (Glyph of Double Jeopardy, Avenging Wrath, Holy Avenger). So we need to test all of those things, and in some cases how they interact (for example, is Avenging Wrath’s 20% boost multiplicative or additive with Holy Avenger?) before we can put them together.

In other words, we’ve just created a bunch of smaller questions to answer:

1. How much damage does Judgment do per cast?
   A. Does it vary with attack power (and if so, how)?
   B. Does it vary with spell power (and if so, how)?
   C. Does it have a base damage value?
   D. Does it depend on weapon damage (and if so, how)?
   E. How much more damage does it do on a critical strike?
   F. How often do we get a critical strike?
   G. How does Avenging Wrath affect the damage?
   H. How does the Glyph of Double Jeopardy affect the damage?
   I. How does Holy Avenger affect the damage?
   J. How do G, H, and I interact?

I’ve cheated a bit here and added (F) because I know there’s a hidden crit suppression against higher-level targets, but a new theorycrafter might not be aware of that fact. Similarly, they might skip test J because they’ve assumed (knowingly or not) that everything is multiplicative (it might be… or it might not be – Blizzard can be inconsistent on that from one effect to another). Both of those are errors that might not show up until a lot later (and with a lot more testing), which is one of the reasons I advocate doing a little reading first.
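To see why test J matters, it helps to write down what each assumption predicts before going in game. In this sketch the 20% for Avenging Wrath comes from the example above, but the 30% used for Holy Avenger is a placeholder I picked for illustration, and the function names are my own.

```python
def combined_multiplier(bonuses, multiplicative=True):
    """Total damage multiplier from several percentage bonuses."""
    if multiplicative:
        total = 1.0
        for bonus in bonuses:
            total *= 1.0 + bonus
        return total
    return 1.0 + sum(bonuses)


def closest_model(observed_ratio, bonuses):
    """Report which stacking rule better matches buffed/unbuffed damage."""
    mult_error = abs(observed_ratio - combined_multiplier(bonuses, True))
    add_error = abs(observed_ratio - combined_multiplier(bonuses, False))
    return "multiplicative" if mult_error < add_error else "additive"


AVENGING_WRATH = 0.20  # from the example in the text
HOLY_AVENGER = 0.30    # hypothetical placeholder value
bonuses = [AVENGING_WRATH, HOLY_AVENGER]

print(combined_multiplier(bonuses, True))   # multiplicative prediction
print(combined_multiplier(bonuses, False))  # additive prediction
print(closest_model(1.56, bonuses))
```

The two predictions differ by only a few percent, so the test needs a non-crit hit with no other damage variation to tell them apart cleanly.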

I’ve separated these out because each of these is going to require its own experiment (or at least, its own calculations). So we’ve now got a long list of things to test, each of which is a small component of how Judgment’s damage is calculated. Pretty much all of these are as low-level as we can get, so there’s no point in breaking them down further. They’re each things we can either answer directly (i.e. “A critical strike does 2x the damage”) or measure through experiment and analysis.

In the next blog post, I’ll talk in more detail about how we go about designing each of these experiments and putting the pieces together. For now though, I want to go back to the more abstract concept of putting the results together. Let’s say we perform some of these experiments and determine that (note that these are completely made up):

1. Judgment does 1000 base damage.
2. Weapon damage has no effect.
3. Every point of attack power adds 2 damage.
4. Every point of spell power adds 1 damage.
5. Crits do 2x damage, and
6. Crits occur with a probability equal to our character sheet crit chance.

So we have several small pieces we can put together. We know that ignoring crits, a Judgment will do on average about $1000 + 2\times AP + 1\times SP$ damage. To apply the crits, we note that when we don’t crit (a probability of $1-C$, where $C$ is our crit chance) we do 100% damage, and when we do crit (probability $C$) we do 200% damage. That gives us a factor of

$1.00 \times (1-C)+2.00 \times C = 1 - C + 2C = 1 + C$

So our average damage per Judgment is then

$$\text{Judgment damage} = (1000 + 2\times AP + 1\times SP) \times (1+C).$$
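Here is the same model as a short function, which is more or less what a spreadsheet cell or a line of simulation code would contain. Remember that the coefficients are the completely made-up ones from the list above; the example inputs are made up as well.

```python
def judgment_damage(attack_power, spell_power, crit_chance):
    """Average damage per Judgment under the made-up model in the text."""
    non_crit = 1000 + 2 * attack_power + 1 * spell_power
    # Non-crits (probability 1 - C) do 100%, crits (probability C) do 200%.
    crit_factor = 1.00 * (1 - crit_chance) + 2.00 * crit_chance
    return non_crit * crit_factor


# Made-up character: 5000 attack power, 2500 spell power, 25% crit.
average_hit = judgment_damage(5000, 2500, 0.25)
print(average_hit)
```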

And there we have it: our first model for Judgment damage. It’s not a complete model, obviously – we’d need to continue to refine it to account for all of the other effects that affect Judgment’s damage. And then we’d repeat this entire process for the rotation tree, and combine those results to create a model for DPS.

But that’s the essence of theorycrafting. Start with a simple model, and eventually add more detail and complexity until the model is as accurate as you need it to be. We’ll talk a little more about determining accuracy and tolerances in the next two installments.

Simulationcraft

Simulationcraft is, as you might expect, just a really big, complex numerical model. And it’s built up in exactly the same way that we built our model for Judgment damage. There are literally thousands of small moving parts within SimC taking care of each of the details that one might care about.

For example, there’s an entire system of functions to accurately calculate your hit, miss, dodge, parry, block, and crit chances against a target based on your combat ratings, the target’s base avoidance, block, and crit suppression values, and the level difference between the two of you. Another function takes all of that information and constructs attack tables and performs the rolls that determine whether you hit or miss, whether your attack is a critical strike or not, and whether the attack is blocked (provided it can be blocked at all!). All of that is done with pinpoint accuracy because we have a good understanding of how combat rolls work thanks to years of theorycrafting.

Likewise, in the paladin class module, there are special functions that handle things like Hand of Light damage, seal procs, Grand Crusader, and so on. Lots of little moving pieces that each handle one small detail, each one improving the accuracy of the model bit by bit.

Which brings us to another statement I see fairly frequently: “I’d like to contribute to Simulationcraft, but I don’t know C++.” It’s true that Simulationcraft is written in C++, and while the intent is that you don’t really need to know it to maintain a class module, in my experience our class modules simply aren’t user-friendly enough for that to be realistic.

However, not all contributions to Simulationcraft require coding knowledge. The great part about SimC is that it outputs a report that doesn’t require any programming experience to read and interpret. There are plenty of things that someone can do just by tweaking an action priority list and looking at how the output changes.

One way to think about it is that Simulationcraft has several layers. At the top, there’s the “theorycrafting layer,” where you only need the basic knowledge of how to manipulate action priority lists and read the reports the simulation generates. I call it the theorycrafting layer because this is where you try out new ideas for optimizing a character or compare simulation results to in-game testing to check for errors.

In the middle, there’s the mechanics layer. This is where the class module developers (i.e. coders) come in, because it’s the layer where the mechanics that we discover in-game get coded into the simulation. But even here, there’s room for non-coders, because we don’t always have class developers that are experts on each class. We have quite a few talented people writing code, but none of them may be experts on your class or spec. But if someone who is an expert on that spec can explain how the mechanics work to a developer, we can support that spec anyway.

At the bottom is the core layer, which is all of the under-the-hood subsystems that run the simulation. Things like how events are scheduled and executed and how (and what) data is stored. This layer really does require C++ knowledge, but we have several really dedicated devs that already take care of most of this stuff. While I’m sure they would love help, realistically the greater need is in the top two layers, since that’s the bulk of the work when we’re staring down a new expansion.

The point of all of this is that we don’t need a host of C++ gurus to help make SimC better for everyone. We need more people that can properly test and describe the mechanics to a coder, so that the coder can implement those features.

In other words, we need theorycrafters more than we need code monkeys.

Coming Soon

One goal of this series of blog posts is to give prospective theorycrafters a better idea of what they’re getting into. Another is to help them put together the basic toolbox they’ll need to actually start solving problems. Both of those aims are well served by showing actual examples of theorycrafting, like we did with Judgment in today’s post. Not coincidentally, this is exactly the same approach that most introductory textbooks take.

As you may have guessed, theorycrafting employs many of the basic techniques that any scientist would learn before going into a laboratory. So the next two blog posts will be focused on developing and understanding common experimental methods.

In the second part of this series, we’ll talk more about how to properly design in-game experiments to test and verify mechanics. Then, in the third part, we’ll focus on methods for comparing those in-game results to Simulationcraft results to check for consistency.


## Velvet Resolver

On Monday, Celestalon kicked off the official Alpha Theorycrafting season by posting a Theorycrafting Discussion thread on the forums. And he was kind enough to toss a meaty chunk of information our way about Resolve, the replacement for Vengeance.

Resolve: Increases your healing and absorption done to yourself, based on Stamina and damage taken (before avoidance and mitigation) in the last 10 sec.

In today’s post, I want to go over the mathy details about how Resolve works, how it differs from Vengeance, and how it may (or may not) fix some of the problems we’ve discussed in previous blog posts.

Mathemagic

Celestalon broke the formula up into two components: one from stamina and one from damage taken. But for completeness, I’m going to bolt them together into one formula for resolve $R$:

$$R =\frac{\rm Stamina}{250~\alpha} + 0.25\sum_i \frac{D_i}{\rm MaxHealth}\left ( \frac{2 ( 10-\Delta t_i )}{10} \right )$$

where $D_i$ is an individual damage event that occurred $\Delta t_i$ seconds ago, and $\alpha$ is a level-dependent constant, with $\alpha(100)=261$. The sum is carried out over all damaging events that have happened in the last 10 seconds.
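For anyone who would rather read code than sums, here is a minimal sketch of that formula. The function and argument names are my own, the example inputs are made up, and this is only my reading of the posted formula, not the game’s actual implementation.

```python
def resolve(stamina, alpha, max_health, damage_events):
    """Resolve as a fraction (0.5 means 50%).

    damage_events is a list of (raw_damage, seconds_ago) pairs.
    Events older than 10 seconds no longer contribute.
    """
    stamina_part = stamina / (250 * alpha)
    damage_part = 0.25 * sum(
        (damage / max_health) * (2 * (10 - seconds_ago) / 10)
        for damage, seconds_ago in damage_events
        if 0 <= seconds_ago <= 10
    )
    return stamina_part + damage_part


# Made-up level 100 tank: 5000 stamina, 300k max health,
# two 150k raw hits taken 1 and 3 seconds ago.
example = resolve(5000, 261, 300_000, [(150_000, 1.0), (150_000, 3.0)])
print(f"{example:.1%}")
```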

The first term in the equation is the stamina-based contribution, which is always active, even when out of combat. There’s a helpful buff in-game to alert you to this:

In-game tooltip for Resolve, out of combat.

My premade character has 1294 character sheet stamina, which after dividing by 250 and $\alpha(90)=67$, gives me 0.07725, or about 7.725% Resolve. It’s not clear at this point whether the tooltip is misleadingly rounding down to 7% (i.e. using floor instead of round) or whether Resolve is only affected by the stamina from gear. The Alpha servers went down as I was attempting to test this, so we’ll have to revisit it later. We’ve already been told that this will update dynamically with stamina buffs, so having Power Word: Fortitude buffed on you mid-combat will raise your Resolve.

Once you’re in combat and taking damage, the second term makes a contribution:

In-game tooltip for Resolve, during combat.

I’ve left this term in roughly the form Celestalon gave, even though it can obviously be simplified considerably by combining all of the constants, because this form does a better job of illustrating the behavior of the mechanic. Let’s ignore the sum for now, and just consider an isolated damage event that does $D$ damage:

$$0.25\times\frac{D}{\rm MaxHealth}\left ( \frac{2 ( 10-\Delta t )}{10} \right )$$

The 0.25 just moderates the amount of Resolve you get from damaging attacks. It’s a constant multiplicative factor that they will likely tweak to achieve the desired balance between baseline (stamina-based) Resolve and dynamic (damage-based) Resolve.

The factor of $D/{\rm MaxHealth}$ means that we’re normalizing the damage by our max health. So if we have 1000 health and take an attack that deals 1000 damage (remember, this is before mitigation), this term gives us a factor of 1. Avoided auto-attacks also count here, though instead of performing an actual damage roll the game just uses the mean value of the boss’s auto-attack damage. Again, nothing particularly complicated here, it just makes Resolve depend on the percentage of your health the attack would have removed rather than the raw damage amount. Also note that we’ve been told that dynamic health effects from temporary multipliers (e.g. Last Stand) aren’t included here, so we’re not punished for using temporary health-increasing cooldowns.

The term in parentheses is the most important part, though. In the instant the attack lands, $\Delta t=0$ and the term in parentheses evaluates to $2(10-0)/10 = 2.$ So that attack dealing 1000 damage to our 1000-health tank would give $0.25\times 1 \times 2 = 0.5,$ or 50% Resolve.

However, one second later, $\Delta t = 1$, so the term in parentheses is only $2(10-1)/10 = 1.8$, and the amount of resolve it grants is reduced to 45%. The amount of Resolve granted continues to linearly decrease as time passes, and by the time ten seconds have elapsed it’s reduced to zero. Each attack is treated independently, so to get our total Resolve from all damage taken we just have to add up the Resolve granted by every attack we’ve taken, hence the sum in my equation.

You may note that the time-average of the term in parentheses is 1, which is how we get the advertised “averages to ~Damage/MaxHealth” that Celestalon mentioned in the post. In that regard, he’s specifically referring to just the part within the sum, not the constant factor of 0.25 outside of it. So in total, your average Resolve contribution from damage is 25% of Damage/MaxHealth.
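For the skeptical, the time-average of the term in parentheses over its ten-second lifetime works out as follows:

$$\frac{1}{10}\int_0^{10} \frac{2(10-\Delta t)}{10}\, d(\Delta t) = \frac{1}{50}\left[ 10\,\Delta t - \frac{\Delta t^2}{2} \right]_0^{10} = \frac{100-50}{50} = 1$$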

Comparing to Vengeance

Mathematically speaking, there’s a world of difference between Resolve and Vengeance. First and foremost is the part we already knew: Resolve doesn’t grant any offensive benefit. We’ve talked about that a lot before, though, so it’s not territory worth re-treading.

Even in the defensive component though, there are major differences. Vengeance’s difference equation, if solved analytically, gives solutions that are exponentials. In other words, provided you were continuously taking damage (such that it didn’t fall off entirely), Vengeance would decay and adjust to your new damage intake rather smoothly. It also meant that damage taken at the very beginning of an encounter was still contributing some amount of Vengeance at the very end, again, assuming there was no interruption. And since it was only recalculated on a damage event, you could play some tricks with it, like taking a giant attack that gave you millions of Vengeance and then riding that wave for 20 seconds while your co-tank takes the boss.

Resolve does away with all of that. It flat-out says “look, the only thing that matters is the last 10 seconds.” The calculation doesn’t rely on a difference equation, meaning that when recalculating, it doesn’t care what your Resolve was at any time previously. And it forces a recalculation at fixed intervals, not just when you take damage. As a result, it’s much harder to game than Vengeance was.

Celestalon’s post also outlines a few other significant differences:

• No more ramp-up mechanism
• No taunt-transfer mechanism
• Resolve persists through shapeshifts
• Resolve only affects self-healing and self-absorbs

The lack of ramp-up and taunt-transfer mechanisms may at first seem like a problem. But in practice, I don’t think we’ll miss either of them. Both of these effects served offensive (i.e. threat) and defensive purposes, and it’s pretty clear that the offensive purposes are made irrelevant by definition here since Resolve won’t affect DPS/threat. The defensive purpose they served was to make sure you had some Vengeance to counter the boss’s first few hits, since Vengeance had a relatively slow ramp-up time but the boss’s attacks did not.

However, Resolve ramps up a lot faster than Vengeance does. Again, this is in part thanks to the fact that it isn’t governed by a difference equation. The other part is because it only cares about the last ten seconds.

To give you a visual representation of that, here’s a plot showing both Vengeance and Resolve for a player being attacked by a boss. The tank has 100 health and the boss swings for 30 raw damage every 1.5 seconds. Vengeance is shown in arbitrary units here since we’re not interested in the exact magnitude of the effect, just in its dynamic properties. I’ve also ignored the baseline (stamina-based) contribution to Resolve for the same reason.

As a final note, while the blog post says that Resolve is recalculated every second, it seemed like it was updating closer to every half-second when I fooled with it on alpha, so these plots use 0.5-second update intervals. Changing to 1-second intervals doesn’t significantly change the results (they just look a little more fragmented).
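If you want to reproduce the Resolve half of these plots yourself, a minimal sketch follows. It assumes the setup described above (100 health, 30 raw damage every 1.5 seconds, recalculation every 0.5 seconds), places the first hit at t = 2.5 seconds, and ignores the stamina term. The constant names are my own, and Vengeance is left out entirely.

```python
TICK = 0.5           # seconds between Resolve recalculations
WINDOW = 10.0        # seconds of damage history that count
MAX_HEALTH = 100
HIT_DAMAGE = 30
SWING_TICKS = 3      # 1.5-second swing timer, in 0.5-second ticks
FIRST_HIT_TICK = 5   # first hit lands at t = 2.5 seconds


def resolve_timeline(total_ticks):
    """Return a list of (time, damage-based Resolve) at every update."""
    timeline = []
    for tick in range(total_ticks + 1):
        now = tick * TICK
        total = 0.0
        # Every hit that has landed so far, most recent last.
        for hit_tick in range(FIRST_HIT_TICK, tick + 1, SWING_TICKS):
            age = now - hit_tick * TICK
            if age < WINDOW:
                total += (HIT_DAMAGE / MAX_HEALTH) * 2 * (WINDOW - age) / WINDOW
        timeline.append((now, 0.25 * total))
    return timeline


for t, r in resolve_timeline(40):
    print(f"t = {t:4.1f} s   Resolve from damage = {r:.4f}")
```

Working in integer ticks rather than floating-point seconds keeps the hit times and update times lined up exactly.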

Vengeance and Resolve timelines. Boss hits for 30% of tank health every 1.5 seconds, no variation.

The plot very clearly shows the 50% ramp-up mechanism and slow decay-like behavior of Vengeance. Note that while the ramp-up mechanism gets you to 50% of Vengeance’s overall value at the first hit (at t=2.5 seconds), Resolve hits this mark as soon as the second hit lands (at 4.0 seconds) despite not having any ramp-up mechanism.

Resolve also hits its steady-state value much more quickly than Vengeance does. By definition, Resolve gets there after about 10 seconds of combat (t=12.5 seconds). But with Vengeance, it takes upwards of 30-40 seconds to even approach the steady-state value thanks to the decay effect (again, a result of the difference equation used to calculate Vengeance). Since most fights involve tank swaps more frequently than this, it meant that you were consistently getting stronger the longer you tanked a boss. This in turn helped encourage the sort of “solo-tank things that should not be solo-tanked” behavior we saw in Mists.

This plot assumes a boss who does exactly 30 damage per swing, but in real encounters the boss’s damage varies. Both Vengeance and Resolve adapt to mimic that change in the tank’s damage intake, but as you could guess, Resolve adapts much more quickly. If we allow the boss to hit for a random amount between 20 and 40 damage:

Vengeance and Resolve timelines. Boss hits for 20%-40% of the tank’s hit points every 1.5 seconds.

You can certainly see the similar changes in both curves, but Resolve reacts quickly to each change while Vengeance changes rather slowly.

One thing you’ve probably noticed by now is that the Resolve plot looks very jagged (in physics, we might call this a “sawtooth wave”). This happens because of the linear decay built into the formula. It peaks in the instant you take the attack – or more accurately, in the instant that Resolve is recalculated after that attack. But then every time it’s recalculated it linearly decreases by a fixed percent. If the boss swings in 1.5-second intervals, then Resolve will zig-zag between its max value and 85% of its max value in the manner shown.

The more frequently the boss attacks, the smoother that zig-zag becomes; conversely, a boss with a long swing timer will cause a larger variation in Resolve. This is apparent if we adjust the boss’s swing timer in either direction:

Vengeance and Resolve timelines. Boss hits for 20-40 damage every 1.0 seconds.

Vengeance and Resolve timelines. Boss hits for 20-40 damage every 2.0 seconds.

It’s worth noting that every plot here has a new randomly-generated sequence of attacks, so don’t be surprised that the plots don’t have the same profile as the original. The key difference is the size of the zig-zag on the Resolve curve.
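To isolate the swing timer’s effect from the random damage rolls, you can compute the steady-state trough-to-peak ratio with identical hits. This is a rough sketch under my own assumptions: every hit does the same damage, hits land exactly on 0.5-second update ticks, and the function names are hypothetical.

```python
def decay_weight_sum(tick, swing_ticks, update_seconds, window):
    """Sum of the decay terms for all hits still inside the window."""
    total = 0.0
    for hit_tick in range(0, tick + 1, swing_ticks):
        age = (tick - hit_tick) * update_seconds
        if age < window:
            total += 2 * (window - age) / window
    return total


def trough_to_peak(swing_seconds, update_seconds=0.5, window=10.0):
    """Ratio of lowest to highest steady-state Resolve for identical hits."""
    swing_ticks = round(swing_seconds / update_seconds)
    start = 2 * round(window / update_seconds)  # skip the ramp-up period
    samples = [
        decay_weight_sum(tick, swing_ticks, update_seconds, window)
        for tick in range(start, start + 10 * swing_ticks)
    ]
    return min(samples) / max(samples)


for swing in (1.0, 1.5, 2.0):
    print(f"{swing:.1f} s swing timer: trough/peak = {trough_to_peak(swing):.3f}")
```

Under these assumptions the 1.5-second case comes out a little above 80%, in the same neighborhood as the rough 85% quoted earlier, and the ratio moves toward 1 as the swing timer shortens.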

I’ve also run simulations where the boss’ base damage is 50 rather than 30, but apart from the y-axis having larger numbers there’s no real difference:

Vengeance and Resolve timelines. Boss hits for 40-60 damage every 1.5 seconds.

Note that even a raw damage of 50% is pretty conservative for a boss – heroic bosses in Siege have frequently had raw damages that were larger than the player’s health. But it’s not clear if that will still be the case with the new tanking and healing paradigm that’s been unveiled for Warlords.

If we make the assumption that raw damage will be lower, then these rough estimates give us an idea of how large an effect Resolve will be. If we guess at a 5%-10% baseline value from stamina, these plots suggest that Resolve will end up being anywhere from a 50% to 200% modifier on our healing. In other words, it has the potential to double or triple our healing output with the current tuning numbers. Of course, it’s anyone’s guess as to whether those numbers are even remotely close to what they’ll end up being by the end of beta.

Is It Fixed Yet?

If you look back over our old blog posts, the vast majority of our criticisms of Vengeance had to do with its tie-in to damage output. Those have obviously been addressed, which leaves me worrying that I’ll have nothing to rant about for the next two or three years.

But regarding everything else, I think Resolve stands a fair chance of addressing our concerns. One of the major issues with Vengeance was the sheer magnitude of the effect – you could go from having 50k AP to 600k AP on certain bosses, which meant your abilities got up to 10x more effective. Even though that’s an extreme case, I regularly noted having over 300k AP during progression bosses, a factor of around 6x improvement. Resolve looks like it’ll tamp down on that some. Reasonable bosses are unlikely to grant a multiplier larger than 2x, which will be easier to balance around.

It hasn’t been mentioned specifically in Celestalon’s post, but I think it’s a reasonable guess that they will continue to disable Resolve gains from damage that could be avoided through better play (i.e. intentionally “standing in the bad”). If so, there will be little (if any) incentive to take excess damage to get more Resolve. Our sheer AP scaling on certain effects created situations where this was a net survivability gain with Vengeance, but the lower multiplier should make that impossible with Resolve.

While I still don’t think it needs to affect anything other than active mitigation abilities, the fact that it’s a multiplier affecting everything equally rather than a flat AP boost should make it easier to keep talents with different AP coefficients balanced (Eternal Flame and Sacred Shield, specifically). And we already know that Eternal Flame is losing its Bastion of Glory interaction, another change which will facilitate making both talents acceptable choices.

All in all, I think it’s a really good system, if slightly less transparent. It’s too soon to tell whether we’ll see any unexpected problems, of course, but the mechanic doesn’t have any glaring issues that stand out upon first examination (unlike Vengeance). I still have a few lingering concerns about steady-state threat stability between tanks (ironically, due to the removal of Vengeance), but that is the sort of thing which will become apparent fairly quickly during beta testing, and at any rate shouldn’t reflect on the performance of Resolve.

76 Comments

## Cumulative Loot

Earlier this week Blizzard published a Dev Watercooler describing the changes in raiding in Warlords of Draenor. I don’t think anything in this article was news, in that all of these changes had been announced at Blizzcon. The major addition was a detailed discussion of the rationale behind the changes.

But this post isn’t about dissecting that discussion – I agree with pretty much everything Ion wrote in regards to the “why” of the changes. Instead, I want to revisit a topic that we’ve touched on before: raiding, burnout, and loot.

The key points of the watercooler article that are relevant to us are these:

• LFR, normal, heroic, and mythic raids are on separate lockouts. In other words, you can run each one for loot each week.
• LFR, normal, and heroic are flexible-size loot-based lockouts, which means you can run them as many times per week as you like, but you’ll only get loot from the boss the first time you kill it on each difficulty.
• Mythic is a fixed-size boss-based lockout, meaning that it works just like MoP normal/heroic raid lockouts do. Once you kill a boss, you get an instance ID and you’re stuck with that instance ID all week.
• LFR will likely not contain set items and specific highly-sought-after trinkets in order to prevent heroic/mythic raiders from feeling like they need to run LFR.

Again, most of this is not news – the last bit is the only tidbit we didn’t already know last November. However, the watercooler triggered a lot of the same negative reactions that were elicited after the announcement at BlizzCon.

In particular, raiders complained that in order to remain competitive, they would feel pressure to clear the same instance several times a week on different difficulty levels to maximize loot income. This in turn contributes to higher burnout rates amongst those raiders and a less fun experience. Our own Anafielle has been one of the more vocal people involved in this debate, even as far back as the early days of LFR.

Why Should Blizzard Do Anything?

You could argue (and many people have) that this is a self-inflicted problem. That these hardcore players are victims of their own inability to set boundaries, and that they just have to learn to manage their time better. I don’t think that’s a reasonable response, because it glosses over a lot of subtleties about the differing motivations gamers have, how we approach games, and the behavioral psychology involved in playing a game. I also think it incorrectly assumes that this is an issue which only affects mythic raiders.

Some players simply cannot enjoy a game unless they feel they’re doing everything they can to advance their character. This isn’t a new phenomenon, and it isn’t limited to mythic raiders. I’ve known players who never stepped foot in a heroic (MoP) raid, but still felt this way about their character. It’s sort of the “type A personality” equivalent in gaming, and I think every raider has a little bit of that tendency in them. For some people, it’s the cause of the bulk of the satisfaction they get from a game.

You may recall that I’ve covered this topic once before, when flex raiding came out, so I won’t re-hash all of the arguments for why raider burnout is a legitimate concern. It’s also got strong similarities to the issues raiders had with the valor point grind before the introduction of heroic scenarios. Each of these activities adds a chunk of time that a raider can spend to further their character, raising the bar a little bit higher. And there’s a strong social incentive to do so in most cases. Perhaps your guild explicitly states that they expect it of you, or maybe peer pressure is enough because you don’t want to be “that guy” that’s letting down the team.

So rather than brushing the issue aside with an “it’s not my problem” response, it’s worth considering the situation with a critical eye and asking, “is there a good way to fix this?”

Cumulative Loot

The last time I touched on this topic, I laid out several potential systems that removed or reduced the incentive raiders had to run lower difficulty levels of the same raid. Some of them, like the increased ilvl gap between LFR and Normal, have already become a reality. But the one I want to dwell on today is a system I called the “Cumulative Loot System.”

The idea I’ve liked the most so far is one proposed by Thels. …. In short, when you kill a normal or heroic boss, you also automatically get your personal loot rolls for LFR and/or Flex.  You could imagine various permutations of how this would work; maybe a normal kill gives you your LFR roll, while a heroic kill gives you both LFR and Flex rolls.  But the simplest case is just that you get both rolls on any normal or heroic kill.

The basic premise behind this system is that if you can kill a boss on heroic mode, then the normal and LFR versions are obviously beneath your skill and gear level. There’s no challenge involved in doing so for your raid group anymore, it’s just an arbitrary time sink that’s probably not very much fun. But due to the way loot drops are structured, there may be a significant benefit to doing so thanks to set items and trinkets.

So instead of asking you to dump that time into the drudgery of another instance clear, it just gives you that loot when you kill the boss on heroic in addition to your usual heroic loot.

In other words, the system accepts that it is the game’s fault that it is providing an incentive for you to do busywork. It’s sort of like your professor writing an exam problem that’s a little too hard, and then giving you a bit of a curve to compensate. Not that I’ve ever done that. I’m just saying… some professors might have. At some point in history. Definitely not me though.

When I mentioned this idea on Twitter yesterday, it set off a flurry of retweets, favorites, and responses. So I felt it was worth clarifying some of the details in a place where I’m not limited to 140 characters at a time.

The idea is most succinctly explained via an example. Let’s say my raid group kills the new boss Ogre McOgreton on mythic difficulty. He drops mythic-quality loot just like usual for my raid leader to distribute.  However, at the same time, I get the option to automatically get the results of my personal loot rolls for that boss from heroic, normal, and LFR difficulties. Doing so “consumes” my loot lockout for that boss on each of those difficulties that week.

Note that this isn’t guaranteed loot, because it’s a personal roll. I’m not suggesting the boss drops X additional heroic-quality items for your raid leader to distribute. I’m suggesting that the game makes up to three extra loot rolls for you, using the personal loot system, for each of the lower difficulty levels. Sometimes you might get 3 items from those three rolls (one heroic quality, one normal quality, one LFR quality). Other times you’ll get no items (use your best sarcastic Pat Krane voice and say “Triple Gold! Thanks Blizz!”). But no matter what, your loot lockout for that boss is flagged so that you don’t need to run the lower difficulty levels.

Recall that in Warlords, LFR, normal, and heroic all use loot-based lockouts. So being locked to a certain boss doesn’t prohibit you from joining a new group and killing that boss on that same difficulty again, it just prevents you from getting loot from the boss a second time. So this system doesn’t prevent a player in a guild clearing mythic difficulty from joining his friend’s raid and helping out. It just removes the loot-based incentive to do so, provided that player opted to get their loot during their mythic raid.

It also doesn’t penalize guilds that want to clear a lower difficulty early in the week and attempt a harder difficulty later in the week like a shared lockout system (i.e. MoP normal/heroic) does. If you clear normal (WoD) quickly and decide to give heroic (WoD) a try, great – you’ll get better loot when you kill that first heroic boss. One way to think of it is as a one-way lockout system – it only locks you out of lower difficulties (after giving you the loot, of course!).
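The one-way lockout idea can be sketched in a few lines. This is only my reading of the proposal, not anything Blizzard has described: the function name `extra_rolls`, the `consumed` set, and the `opt_in` flag are all hypothetical, and actual loot rolls are left out entirely.

```python
# Difficulty tiers from lowest to highest, as named in the post.
TIERS = ["LFR", "normal", "heroic", "mythic"]

def extra_rolls(kill_tier, consumed, opt_in=True):
    """Return the lower tiers that grant a personal roll for this kill.

    `consumed` is the set of tiers whose weekly loot lockout for this boss
    is already used up. It is updated in place (hypothetical bookkeeping).
    """
    consumed.add(kill_tier)          # the kill itself uses that tier's lockout
    if not opt_in:                   # player declined the cumulative rolls
        return []
    lower = TIERS[:TIERS.index(kill_tier)]
    # Only tiers below the kill are affected: the lockout is one-way.
    granted = [t for t in reversed(lower) if t not in consumed]
    consumed.update(granted)
    return granted

week = set()
print(extra_rolls("mythic", week))   # rolls granted for the lower tiers
print(extra_rolls("normal", week))   # nothing left to grant this week
```

Clearing normal first and heroic later still works in this sketch: the heroic kill simply finds the normal and LFR lockouts already consumed and grants no duplicate rolls.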

There’s one significant modification suggested by Brian Packer that I think really makes the idea shine. He suggested that this system be integrated into garrisons via a follower mission. In other words, if I kill Ogre McOgreton on mythic, it unlocks a follower mission to “retrieve” my extra loot from LFR, normal and heroic. The next time I go back to my garrison, I can tell my follower to go “loot the body” or some such, and the next day he’ll return with my extra loot rolls.

This solves pretty much all of the major problems with the idea, most of which involved UI concerns like “how does this work with loot spec” and “how do I tell the game whether I want to use each roll or not.” It codifies the system as an optional thing rather than automatic, and the follower mission interface can offer a different mission for each difficulty level and loot spec. It also puts a nice linkage between raiding and garrisons without relying on raw power boosts or buffs, so it’s not in any way mandatory.

I’ll also note that it still leaves open the possibility of setting up multiple runs combining mains and alts to more effectively funnel loot to a group of main raiders. In theory, you might still get more efficient loot allocation pulling those sorts of tricks, because you can funnel everybody’s loot “rolls” to the people who need it. But Cumulative Loot does severely reduce the benefit of doing that, simply because the personal loot rolls are guaranteed to be for the spec you want. If you go the “funnel via alts” route, the boss could drop three bows in a raid with no hunters. It basically reduces the reward-to-time-spent ratio of having multiple alt runs to the point that it’s not even worth considering for guilds outside the top 10 or 20.

Summa Cumulative Laude

In the eight or so months since that last blog post I’ve discussed the idea with a fair number of people. The criticisms generally fall into one of two arguments, neither of which holds much merit in the Warlords raiding system.

The first criticism is that it means fewer mythic and heroic raiders will participate in LFR, and that those players are necessary to carry LFR groups kicking and screaming to their eventual loot drops. While this may be at least partially true in the Mists of Pandaria LFR design, the blog post by Watcher explicitly states that it is not the case in Warlords. LFR is being tuned around the expectation that those players are not present. Which also means it’s no longer a limitation to this sort of loot system.

The second and most common response has been, “but that just gives mythic raiders loot they didn’t earn.” But that argument is fundamentally flawed because it’s built on an incorrect assumption.

When you kill a Mythic boss, what do you “earn” exactly? What’s the appropriate reward for doing that? Higher-ilvl loot, obviously, but how much higher and how much more? We’ve seen various different iterations of this in WoW’s history, where killing hard-mode bosses rewarded more loot (Ulduar) and/or higher-ilvl loot (Ulduar and everything since). But the amount of extra loot has changed, as has the ilvl gap.

The truth is, the “amount” of those extra rewards is completely arbitrary. It’s whatever Blizzard decides it’s worth. They have an incentive to make it worth enough that people want to engage in all levels of content, of course. But whether you take the idealistic stance that they’re doing it to make the best game possible or the pessimistic stance that they just want to maximize subscriber numbers, either way, their choice is pretty much arbitrary. It is based more on relative ilvl gaps and power increases than on some nebulous idea of “mythic boss A is X% harder than heroic boss A, thus should give Y additional loot.” And in fact, as we’ve seen, one of the factors that goes into the determination of those ilvl values is how much incentive it gives players to run lower difficulty tiers!

When people suggest that Mythic raiders would be getting gear they “didn’t earn,” they’re making the implicit assumption that such a nebulous connection exists, when in reality, it’s an arbitrary reward. It’s a lot like complaining about having 10 levels per expansion instead of 5 because it’ll take so much longer to level. There’s an implicit (and incorrect) assumption in there that a “level” is some well-defined quantity of experience or time, rather than an amount set arbitrarily by Blizzard to ensure that reaching max level takes around 20 hours (or whatever their target is). And that incorrect assumption means the whole argument topples over under scrutiny.

Cumulat-usions?

I’m not suggesting that this is the only way to address raiders’ concerns of multiple loot lockouts. But out of all of the solutions I’ve seen, this one seems to have the most positives and fewest negatives. As I mentioned in August, it’s got plenty of additional benefits:

• Since everyone gets extra loot, it feels good.  It feels like a bonus, whereas the traditional shared difficulty lockout feels punitive and restrictive.
• It makes it clear that the real reward is time – specifically, time you don’t have to spend mindlessly clearing the same instance and can spend on other things.
• It eliminates worries about LFR or normal mode loot being attractive to mythic raiders, which means that LFR and normal loot can be significantly better (i.e. a smaller ilvl gap). That makes LFR and normal raiders happy.

It’s pretty rare to stumble across a system that works this well without any major downsides. And yet, here it is. I’d also like to point out that I shouldn’t get most of the credit for the idea. It first came up in discussions with Thels on maintankadin, and of course Brian Packer gets all the credit for the exceptional idea of tying it into garrisons.

What I like most about the system is that it’s intuitive. If I clear a challenge mode in time to get the gold achievement, I don’t need to go back and clear it again to get the silver one. It’s clear from my accomplishment that I can do that, so the game doesn’t ask me to go back and spend another 20 minutes proving it. There’s no reason raiding can’t work the same way.

From a skill perspective, it’s sort of like performing a track and field event. If I can clear a 4′ hurdle track, it’s pretty clear I can clear a 3′ one, or a 2′ one. There’s little point in making me re-run those to “prove” anything – it’s not a test of my skill at all at that point, it’s just another chunk of time I need to spend to clear a trivial hurdle (yes, I just used a hurdle metaphor in a hurdle analogy). Cumulative Loot just builds that into the reward structure. It says “sure, here’s the loot from all the lower difficulty levels that we know you can clear, great job on the mythic kill.”

Posted in Design, Raiding, Theck's Pounding Headaches | 78 Comments

## Leetsauced Podcast Appearances

Last evening, I had the pleasure of hanging out with Logan, Viktory, and Hi-Ya of the Leetsauced Podcast and recording an episode. We had a pretty wide-ranging discussion covering topics such as what separates WoW from other MMOs, content types and pacing, WoW as an e-sport, and some of the new changes in Warlords of Draenor.

Advance warning: we got kind of wordy, and the episode is apparently 3 and a half hours long even after editing. I’m not 100% sure how much of that is my fault, but if you want to listen to me ramble in a moderately-coherent fashion (staying up late and drinking at the same time doesn’t lend itself towards fully-coherent rambling), you can download episode 4.06 at their website or on iTunes.

During the show, Vik and Logan provided the first part of a code for a free copy of Saints Row: The Third on Steam. Here’s the second half:

-ZB3RV-BWDAK

Enjoy!

A little bird told me that the elusive Meloree will be on their next episode later this week. So if you want to know what he’s been up to, you may want to watch @Leetsauced for that episode.

Also, in case you missed it, version 547-4 of Simulationcraft is available for download, which contains a working implementation of TMI v2.0. Note that there are a few changes coming eventually in 547-5 to make it adhere more closely to the new spec, as well as innately distinguish between TMI and ETMI.

Posted in Uncategorized | 8 Comments

## (Re)-Building A Better Metric – Part II

In Part I, we talked about the criteria we wanted to satisfy to ensure that a metric was good, and briefly assessed the results of our beta test of the new version of TMI. The conclusion I came to after that testing was that, in short, it needed more work.

I don’t know that it’s entirely true to say that I went “back to the drawing board,” so much as I went back to my slew of equations and mulled over what I could tweak in them to fix the problems. To recap, the formula I was using was:

$$\large {\rm Beta\_TMI} = c_1 \ln \left [ 1 + \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ],$$

with $F=10$, $c_1=500$ and $c_2=e^{10}$.
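For readers who want to poke at the numbers, here is a minimal Python sketch of that formula. The function name `beta_tmi` and the two toy moving-average arrays are made up for illustration; only the formula and the constants come from the text.

```python
import math

def beta_tmi(ma, F=10.0, c1=500.0, c2=math.exp(10.0)):
    """Beta_TMI = c1 * ln(1 + (c2 / N) * sum(exp(F * (MA_i - 1))))."""
    n = len(ma)
    total = sum(math.exp(F * (x - 1.0)) for x in ma)
    return c1 * math.log(1.0 + (c2 / n) * total)

# Hypothetical inputs: MA values are fractions of player health.
quiet = [0.0] * 100            # no net damage in any window
spiky = [0.0] * 99 + [1.0]     # one window worth 100% of health

print(beta_tmi(quiet))         # equals c1 * ln(2): the "1+" keeps it positive
print(beta_tmi(spiky))
```

The quiet case shows the zero-bounding at work: with no damage at all, the bracket is 1 + 1, so the score bottoms out at $c_1 \ln 2$ rather than going negative.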

One of the problems I was running into was one of conflicting constraints. If you look back at the last blog post, you’ll see that constraint #6 was that the numbers had to stay reasonable. Mentally, I had converted this constraint to be “should have a fixed range of a few thousand,” possibly up to 10 or 20 thousand at a maximum. So I was rigidly trying to keep the score down around a few thousand.

But the obvious solution to the stat weight problem was to increase $c_1$, which increases the slope of the graph. That makes a small change in spike size a more significant change in TMI, and gives you larger stat weights. Multiply $c_1$ by ten, and your stat weights all get multiplied by 10. Seems simple enough.

Except that in the beta test, I got data with TMIs ranging from a few hundred to over 12 thousand. So if I multiply by ten, I’m looking at TMIs ranging from around a thousand to over 120 thousand, which is a much larger range. And a factor of ten still wouldn’t have fixed everything thanks to the “knee” in the graph, because if your TMI was on the really low end you could still get garbage stat weights.

It felt like the two constraints were at odds with one another. And both at odds with a third, somewhat self-imposed constraint, which is that I wanted to keep the zero-bounding effect that the “1+” in the brackets produced. Because without that, the score could go negative, which is odd. After all, what does it mean when your arbitrary FICO-like metric goes negative? Which just led back to more fussing over the fact that I was still pretty light on “meaning” in this metric to begin with.

It was a conversation with a colleague that led me to the solution. While discussing the stat weight issues, and how I could tweak the equation to fix them, he mentioned that he would rather have a metric with large numbers that had an obvious meaning than a nicely-constrained metric that didn’t. We were talking in terms of percentages of health, and it was only at that point that the answer hit me. Within a day of that conversation, I made all of the changes I needed to give TMI a meaning.

Asking The Right Question

As is often the case, the answer had been staring me in the face the entire time. I’ve been looking at this graph (in various different incarnations, with various different constants) for the last few months:

Simulated TMI data using the Beta_TMI formula. Red is the uniform damage case, blue is the single-spike case, and green is pseudo-random combat data.

What that conversation led me to realize was that I was asking the wrong question. I was trying to figure out what combination of constants I needed to keep the numbers “reasonable.” But my definition of “reasonable” was vague and arbitrary. So it’s no surprise that what I was getting out was also… vague and arbitrary.

What I should have been doing was trying to come up with a score that does a better job of communicating to the user how big those spikes were. Because that, by definition, would be “reasonable” no matter what size the numbers were.

In other words, the question I should have been asking was “how can I tweak this equation so that the number it spits out has a simple and intuitive relationship to the spike size, expressed in a scale that the user can not only easily understand, but easily remember?”

And the answer, which was clear after that conversation, was to use percent health.

To illustrate, let’s flip that graph around its diagonal, such that instead of plotting TMI vs. $MA_{\rm max}$, we’re plotting $MA_{\rm max}$ vs. TMI.

The same data, just plotted in reverse.

At a given TMI value, the $MA_{\rm max}$ values we get from the random combat simulation always fall below the blue single-spike line. In other words, at a TMI of X, you can confidently say that the maximum spike you will take is of size Y. It could be smaller, of course – you could take a few spikes that are a little smaller than Y and get the same score. But you can be absolutely sure it isn’t above Y.

So we just need to find a way to make the relationship between X and Y obvious, such that someone can look at a TMI of e.g. 20k and immediately know how large of a damage spike that is, as a percentage of their health.

We could use a one-to-one relationship, such that a TMI of 100 meant you were taking spikes that were 100% of your health. That would correspond to a slope of 100, or a $c_1$ of 10. But that would give us even smaller stat weights, which is a problem. We could literally end up with a plot in Simulationcraft where every single one of your stat weights was 0.00.

It would be nice to keep using factors of ten. Bumping it up to a slope of 1000 doesn’t work. That’s a $c_1$ of 100, which is still smaller than what we used in Beta_TMI. A slope of 10000, or a $c_1$ of 1000, is only a factor of two improvement over Beta_TMI, so our stat weights will still be sloppy.

But a slope of 100k… that might just work. A TMI of 100k would mean that your maximum spikes were around 100% of your health. If your TMI went up to 120k, you’d immediately know that the spikes are now about 120% of your health. Easy. Intuitive. Now we’re getting somewhere. The stat weights would also be 20x as large as they were for Beta_TMI, ensuring that we would get good unnormalized weights even with two decimal places of precision.

So, assuming we’re happy with that, it locks down our $c_1$ at $10^4$, so that every percentage of health corresponds to 1k TMI. Now we just have to look at the formula and figure out what else, if anything, needs to be changed.
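As a quick sanity check on that arithmetic: in the single-spike picture the score is linear in $MA_{\rm max}$ with slope $c_1 F$, so with $F=10$ and the Beta_TMI value of $c_1=500$ for comparison,

$$\large c_1 F = 10^4 \times 10 = 10^5, \qquad \Delta{\rm TMI}\Big|_{\Delta MA_{\rm max} = 0.01} = 10^5 \times 0.01 = 10^3, \qquad \frac{10^4}{500} = 20.$$

That is the 100k slope, the 1k of TMI per percent of health, and the 20x gain in stat weights quoted above.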

Narrowing the Field

The very first thing I did after coming to this realization is toss out the “1+” in the formula. While I liked zero-bounding when we were treating this metric like a FICO score, it suddenly has no relevance if the metric has a distinct and clear meaning. Removing it allows for negative TMI values, but those negative values actually mean something now! If you end up with a TMI of -10k, it means that you were out-healing your damage intake by so much that the largest “spike” you ever took was smaller than your incoming healing in that time window. It also tells you exactly how much smaller: 10% of your health. While it’s not a situation we’ll run into that often, I suspect, it actually has meaning. There’s no sense obscuring that information with zero-bounding.

Which just leaves the question of what to do with $c_2$. Let’s look at the equation after removing the “1+”:

$$\large {\rm TMI} = c_1 \ln \left [ \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ]$$

If we make the single-spike approximation, i.e. that we can replace the sum with a single $e^{F(MA_{\rm max}-1)}$, we get:

$$\large \begin{aligned} {\rm TMI_{SS}} &= c_1 (\ln c_2 - \ln N) + c_1 F (MA_{\rm max} - 1) \\ &= c_1 F MA_{\rm max} + c_1 ( \ln c_2 - \ln N - F ) \end{aligned}$$

just as before. Now that we’ve removed the “1+” from the formula, the single-spike approximation isn’t limited to large spikes anymore, so this is valid for any value of $\large MA_{\rm max}.$
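A small numerical comparison shows how close the single-spike line sits to the full formula when one window dominates. This is a sketch under assumed values: the function names and the toy array are hypothetical, and $c_2=e^F$ is borrowed from Beta_TMI purely to have a number to plug in.

```python
import math

def tmi_exact(ma, c1, c2, F):
    """Full formula without the '1+': c1 * ln((c2 / N) * sum(exp(F * (MA_i - 1))))."""
    n = len(ma)
    return c1 * math.log((c2 / n) * sum(math.exp(F * (x - 1.0)) for x in ma))

def tmi_single_spike(ma_max, n, c1, c2, F):
    """Keep only the largest term of the sum."""
    return c1 * F * ma_max + c1 * (math.log(c2) - math.log(n) - F)

c1, F = 1.0e4, 10.0
c2 = math.exp(F)                 # assumption: value carried over from Beta_TMI
ma = [0.2] * 99 + [1.2]          # one dominant spike among small windows

exact = tmi_exact(ma, c1, c2, F)
approx = tmi_single_spike(max(ma), len(ma), c1, c2, F)
print(exact, approx, exact - approx)
```

The approximation always sits at or below the exact value, since it drops positive terms from the sum. The gap shrinks as the largest spike pulls further ahead of the rest.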

Remember that in our single-spike approximation, $c_2$ controlled the y-intercept of the plot. And now that this y-intercept isn’t being artificially modified by zero-bounding, it actually has some meaning. It’s the value of $MA_{\rm max}$ at which our TMI is zero.

And given our convention that X*1000 TMI is a spike that’s X% of our health, a TMI of zero should mean that we take spikes that are 0% of our health. In other words, this should happen at $MA_{\rm max}=0$. So we want our y-intercept to be zero, or

$$\large c_1 ( \ln c_2 - \ln N - F ) = 0 .$$

Since $c_1$ can’t be zero, there’s only one way to accomplish this: $c_2 = N e^F.$ I was already using $e^F$ for $c_2$ in Beta_TMI, so this wasn’t totally unexpected. In fact, I figured out quite a while ago that the choice of $e^F$ for $c_2$ was equivalent to simplifying the term inside the sum:

$$\large \frac{e^F}{N}\sum_{i=1}^N e^{F(MA_i-1)} = \frac{1}{N}\sum_{i=1}^N e^{F\cdot MA_i}.$$
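The identity holds term by term, since the exponents simply add:

$$\large e^{F}\, e^{F(MA_i-1)} = e^{F + F\cdot MA_i - F} = e^{F\cdot MA_i}.$$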

Defining $c_2=Ne^F$ would also eliminate the $1/N$ factor in front of the sum. However, there’s a problem here: I don’t want to eliminate it. That $1/N$ is serving an important purpose: normalizing the metric for fight length. For example, let’s consider two simulations, one being three minutes long and the other five minutes long. We’ll assume the boss is identical in both cases, so the magnitude and frequency of spikes are identical. In theory, the metric should give you nearly identical results for both, because the amount of danger is identical. A fight that’s twice as long should have roughly twice as many large spikes, but they’re spread over twice as much time.

But a longer fight will have more terms in the sum for a particular bin size, and a shorter fight will have fewer terms. So the sum grows roughly in proportion to fight length. The $1/N$ cancels that effect because $N$ grows by the same factor. If we get rid of that $1/N$, then the longer fight will seem significantly more dangerous than the shorter one. In other words, it would cause the metric to vary significantly with fight length, which isn’t good.

So I decided to define $c_2$ slightly differently. Rather than $Ne^F$, I chose to use $N_0e^F$, where $N_0$ is a default fight length. This means that we’re normalizing the fight length to $N_0$ rather than eliminating the dependence entirely, which should mean much smaller fluctuations in the metric across a large range of fight lengths. Since the default fight length in SimC is 450 seconds, that seemed like an obvious choice for $N_0$.
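The effect of the two choices for $c_2$ is easy to see on a toy example. The sketch below repeats one made-up damage pattern for two fight lengths and scores it both ways. It assumes 1-second bins (so $N_0 = 450$) and $c_1 = 10^4$, and the function names are mine rather than anything from SimC.

```python
import math

C1, F, N0 = 1.0e4, 10.0, 450     # N0 = 450 assumes dt = 1 s bins

def tmi_unnormalized(ma):
    # c2 = N * e^F: the 1/N cancels, leaving a bare sum inside the log.
    return C1 * math.log(sum(math.exp(F * x) for x in ma))

def tmi_normalized(ma):
    # c2 = N0 * e^F: the sum is rescaled to the default fight length.
    return C1 * math.log((N0 / len(ma)) * sum(math.exp(F * x) for x in ma))

pattern = [0.2, 0.5, 0.8, 0.3]   # hypothetical repeating MA values
short = pattern * 75             # 300 bins
long_ = pattern * 150            # 600 bins, identical danger per second

print(tmi_unnormalized(long_) - tmi_unnormalized(short))  # about C1 * ln(2)
print(tmi_normalized(long_) - tmi_normalized(short))      # about 0
```

With identical danger per unit time, doubling the fight length adds $c_1 \ln 2 \approx 6.9$k to the unnormalized score and nothing to the normalized one.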

To illustrate that graphically, I fired up Visual Studio and coded the new metric into Simulationcraft, with and without the normalization. I then ran a character through for fight lengths ranging from 100s to 600s. Here are the results:

Comparison of normalized ($N_0/N$) and unnormalized versions of the TMI metric. Vertical axis is in thousands.

The difference is pretty clear. The version where $c_2=Ne^F$ varies from a little under 65k TMI to around 86k TMI. The normalized version where $c_2 = N_0e^F=450e^F$ varies much less, from about 80k to a little over 83k, with most of that variation happening for fights that are shorter than four minutes long (i.e. not that common). This version is stable enough that it should work well for combat log analysis sites, where we’d expect a wide variety of encounter lengths.

There was one final change I felt I should make, and it’s not to the formula per se, it’s to the definition of $MA$. If you recall from the last post, we defined it as follows:

$$\large MA_i = \frac{T_0}{T}\sum_{j=1}^{T / dt} D_{i+j-1} / H.$$

This definition normalizes for two things: player health (by dividing by $H$), and window size (by multiplying by $T_0/T$). The latter is the part I wanted to change.

The reason we originally multiplied by $T_0/T$ was to allow the user to specify a shorter time window $T$ over which to calculate spikes, for example in cases where you were getting a large heal every 5 seconds, but were fighting a boss who could kill you in 3 or 4 seconds in-between those heals. This normalization meant that it calculated the moving average over $T$-second intervals, but always scaled the total damage up to what it would be if that damage intake rate were sustained for $T_0$ seconds. Doing this kept the metric from varying significantly with window size, as we discussed last year.

But that particular normalization doesn’t make sense anymore now that the metric is representing a real quantity. If my TMI is a direct reflection of spike size, then I’d expect it to go up or down fairly significantly as I change the window size. If I take X damage in a 6-second time window, but only X/2 damage in a 3-second time window, then I want my TMI to drop by a factor of 2 when I drop the window size from 6 seconds to 3 seconds as well.

In other words, I want TMI to accurately reflect what percentage of my health I lose in the window I’m considering. If I want to analyze a 3-second window, then I want to know what percentage of my health the boss can take off in that 3 seconds, not how much he would take off if he had 6 seconds.

So we’re entirely eliminating the time-window normalization in the definition of $MA_i$. That seems to match people’s intuition for how the time-window control should work anyway (this topic has come up before, including in the comments of the Crowdsourcing TMI post), so it’s a win on multiple fronts.
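To make the difference concrete, here is a sketch of both definitions applied to a steady damage stream. Several things are assumptions on my part: $T_0 = 6$ seconds, 1-second bins, and the choice to only use full windows at the end of the array. The helper names are hypothetical.

```python
def moving_sum(damage, bins):
    """Sum of each run of `bins` consecutive entries (full windows only)."""
    return [sum(damage[i:i + bins]) for i in range(len(damage) - bins + 1)]

def ma_old(damage, health, T, dt=1.0, T0=6.0):
    """Old definition: window sum scaled by T0 / T, then divided by health."""
    bins = int(round(T / dt))
    return [(T0 / T) * s / health for s in moving_sum(damage, bins)]

def ma_new(damage, health, T, dt=1.0):
    """New definition: plain window sum divided by health."""
    bins = int(round(T / dt))
    return [s / health for s in moving_sum(damage, bins)]

damage = [10.0] * 20   # hypothetical: steady 10 damage per second
health = 100.0

print(max(ma_old(damage, health, 3.0)), max(ma_old(damage, health, 6.0)))
print(max(ma_new(damage, health, 3.0)), max(ma_new(damage, health, 6.0)))
```

Under the old definition the 3-second and 6-second windows report the same value, while the new one reports half as much for the shorter window, which is exactly the behavior described above.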

Bringing it all Together

Now, we have all the pieces we need to construct a formal definition for TMI v2.0. I’ll update the TMI Standard Reference Document with the rigorous details, but since we’ve already discussed many of them, I’m only going to summarize it here. Assume we start with an array $D$ containing the damage we take in every time bin of size $dt$, and the player has health $H$.

The moving average array is now defined as

$$\large MA_i = \frac{1}{H}\sum_{j=1}^{T / dt} D_{i+j-1}.$$

In other words, it’s the array in which each element is the $T$-second moving sum of damage taken, normalized to player health $H$.

We then take this array and use it to calculate TMI as follows:

$$\large {\rm TMI} = 10^4 \ln \left [ \frac{N_0}{N}\sum_{i=1}^N e^{10 MA_i} \right ] ,$$

where $N$ is the length of the $MA$ array, or equivalently the fight length divided by $dt$, and $N_0=450/dt$ is the “default” array size corresponding to a fight length of 450 seconds.
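Here’s a small Python sketch of that formula, again just for illustration and not the actual SimC implementation. The function name `tmi` and the test data are assumptions; the data mimics a 450-second fight with one 80% spike and heavy over-healing everywhere else.

```python
import math


def tmi(ma, dt=1.0, default_fight_length=450.0):
    """TMI 2.0 from a moving-average array already normalized to health."""
    n = len(ma)                      # N, length of the MA array
    n0 = default_fight_length / dt   # N_0 = 450 / dt
    total = sum(math.exp(10.0 * x) for x in ma)
    return 1e4 * math.log(n0 / n * total)


# Hypothetical single-spike fight: 449 heavily over-healed windows (large
# negative MA) plus one window where we lose 80% of our health.
ma = [-5.0] * 449 + [0.8]
print(round(tmi(ma)))
# 80000
```

In this single-spike case with $N=N_0$, the logarithm just undoes the exponential, so an 80% spike comes out as a TMI of 80k.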

But Does It Work?

To illustrate how this works, let’s look at some examples using Simulationcraft. I coded the new formula into my local copy and ran some tests. Here are two reports, both against the T16H25 boss, using my own character and the T16H Protection Warrior profile:

The very first thing I looked at was the stat weights:

Stat weights generated with Theck using TMI 2.0

Much, much better. This was with 25k iterations, but even 10k iterations gave us reasonable (if noisy) stat weights. The error bars here are all pretty reasonable, and it wouldn’t be hard to increase the precision by bumping it up to 50k iterations if we wanted to. The warrior profile’s stat weights are similarly high-precision.

We could also look at the TMI distribution:

TMI distribution for Theck using TMI 2.0

Again, much nicer looking than before. We’re still getting a bit of skew here, but that mostly has to do with being slightly overgeared for the boss definition. The warrior profile exhibits even stronger skew, but tests run with characters of lower gear levels (and thus higher average TMI values) show very little skew.

I also wanted to see exactly how well the TMI value reflected maximum spike size, and what (if any) difference there was. So you may have noticed that I’ve enhanced the tanking section of the SimC report a little bit by adding some new columns:

Updated tanking section of the SimC report, including information about spike size.

In short, SimC now also records the “Maximum Spike Damage,” or MSD, for each iteration and calculates the maximum, minimum, and mean MSD value. It reports this information in units of “percentage of player health” right alongside the DTPS and TMI information that you’re used to getting. Lest the multiple “max” modifiers be confusing: the MSD for one iteration is the biggest spike you take that iteration, and the “MSD Max” is the largest spike you take out of all iterations.

You may be wondering, at this point, if this isn’t all superfluous. If I can code SimC to report the biggest spike, why wouldn’t we want to use that directly? What does TMI add that we can’t get from MSD?

The answer is continuity. MSD uses a max() function to isolate the absolute biggest spike in each iteration. Which is fine, but often misleading. For example, let’s consider two different tanks, one of which takes a single spike that’s 90% of their health, and another that takes one 90% spike and three or four 89% spikes. Assume nothing else in the encounter is remotely threatening them. Their MSD values will be identical, because it ignores all but the largest spike. But it’s clear that the second tank is in more danger, because he’s taking a large spike more frequently, and the TMI value will accurately reflect that.
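The two-tank example is easy to reproduce with a few lines of Python. This is a sketch with made-up arrays, not real combat data; `tmi` and `msd` are illustrative helpers, and the "quiet" windows are modeled as a large negative value to represent excess healing.

```python
import math


def tmi(ma, dt=1.0, default_fight_length=450.0):
    """TMI 2.0 from a health-normalized moving-average array."""
    n = len(ma)
    n0 = default_fight_length / dt
    return 1e4 * math.log(n0 / n * sum(math.exp(10.0 * x) for x in ma))


def msd(ma):
    """Maximum spike damage for one iteration: just the largest element."""
    return max(ma)


quiet = -5.0  # assumption: nothing else in the fight is remotely threatening
tank_a = [quiet] * 449 + [0.90]               # one 90% spike
tank_b = [quiet] * 445 + [0.90] + [0.89] * 4  # one 90% spike, four 89% spikes

print(msd(tank_a), msd(tank_b))  # 0.9 0.9
print(round(tmi(tank_a)))        # 90000
print(round(tmi(tank_b)))        # noticeably higher, roughly 105k
```

The max() throws away the four extra spikes, while the sum of exponentials inside TMI keeps them.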

That continuity also translates into generating better and more reliable stat weights. A stat that reduces the frequency of 90% spikes without eliminating them would be given a garbage stat weight if we tried to scale over MSD, because MSD doesn’t retain any information about frequency. However, we know that stats like hit and expertise are strong partly because they reduce spike frequency. TMI reflects that accurately while MSD simply can’t.

MSD is still useful though, in that having both TMI and MSD gives us additional information about our spike patterns. It also gives us a convenient way to compare the two to see how TMI works.

First, take a look at the TMI Max and MSD Max values. You’ll notice they mimic each other pretty well: MSD Max is 150.3%, TMI Max is 151.7k. This makes sense for the extreme case because that’s when all the planets align to create your worst-case scenario, which is rare. It won’t happen multiple times per fight, so it’s a situation where you have one giant spike that dominates the score, much like our single-spike approximation. And in that approximation, TMI is roughly equal to the largest spike size, just like it should be.

Comparing the mean TMI value (just “TMI” on the table) to the MSD mean shows a little bit of a gap: MSD Mean is 69.5%, TMI mean is 82.8k. The TMI is about 13k above where you’d expect it to be based on the single-spike model. That’s because of spike frequency. You wouldn’t normally expect to take one giant spike in an encounter and nothing else; the more common case is to take several spikes of similar magnitude over that 450 seconds. If we’re taking 3-4 of those spikes, then that’s going to raise the TMI value a little bit compared to the situation where we only take one. That’s exactly what’s happening here.

Mathematically, if we take $n$ spikes of similar size, the sum inside the logarithm is about $n$ times as large, so we expect the TMI to be about $10^4\ln(n)$ higher than in the single-spike case. In this simulation, the gap is about 13.3k, meaning that $\ln(n)\approx 1.33$, or $n\approx 3.8.$ In other words, on average we’re taking about 3.8 spikes every 450 seconds, each of which is about 69.5% of our health. That’s pretty useful information – in fact, I may add it to the table in the future if people would like SimC to calculate it for them.

You can see that the gap grows considerably for the minimum TMI and MSD values. The MSD Min is only about 31% while the minimum TMI is ~66k. Again, this comes down to frequency. Large spikes tend to be infrequent due to statistics, as they require a failure to avoid any one of multiple attacks. But as we eliminate those (either by gearing, or in this case, by lucky RNG on one iteration) we’re left with smaller, more frequent spikes. In the extreme limit, you could imagine a scenario where you alternated between taking a full hit and avoiding every second attack, in which case you’d have loads of really tiny spikes. So what we’re seeing at this end of the distribution is a gap of about 35k, which corresponds to roughly $n\approx 33$ small spikes in the low-TMI iterations.

This behavior also has a more subtle, but rather important meaning. TMI is really good at prioritizing large spikes and giving you stat weights that preferentially eliminate them. Once you eliminate those spikes, it automatically shifts to prioritizing the next-biggest spikes, and so on. If you smooth your damage intake sufficiently that you’re taking a lot of moderately-sized spikes, it naturally tries to reduce the frequency of those spikes. In other words, if you’ve successfully eliminated the danger of isolated spikes, it automatically starts optimizing you for DTPS. So it seamlessly fuses spike mitigation and DTPS into a metric that shifts the goalposts based on your biggest concern, as determined by the combat data.

A lot of those ideas can be seen graphically, as well. Here’s a plot showing data generated with my own character pitted against the T16H25 boss. We’re plotting MSD (which I was originally calling “Max Moving Average”) against the reported TMI score. To generate this plot, I used a variety of window sizes. At each window size, I recorded the minimum, mean, and maximum TMI and MSD values. The dotted line is the expected relationship, i.e. 100k TMI = 100% max health.

MSD vs. TMI for Theck against the T16H25 boss.

Generally speaking, as we increase or decrease the window size, the MSD and TMI should similarly increase or decrease. That’s certainly happening for the maximum MSD and TMI values, which should be expected. And in that limit, we see that TMI and MSD mostly agree and lie close to the dotted line.

However, the mean values show a much smaller spread, and the minimum values show almost no spread. It turns out that this is the fault of EF’s crazy scaling. A paladin in this level of gear is basically self-sufficient against the T16H25 boss, so changing the window size doesn’t have a large effect unless we consider the most extreme cases. If we’re out-healing the boss, then a longer window won’t cause a noticeable increase in damage intake or spike size. At the very low end, where the minimum TMI & MSD values show up, we’re basically plotting window-edge effects.

The results look a lot cleaner if we consider a player that’s undergeared for the boss (and of a class that doesn’t have a strong self-healing mechanic, like a warrior):

MSD vs. TMI for a sample warrior against the T16H25 boss.

This is one of the warriors who submitted multiple data sets for the beta test. He’s got an average ilvl of 517, which is well below what would be needed to comfortably survive the 25H boss. As a result, his TMI values are fairly high, with even the smallest values being over 200k. As you can see, though, all of the values cluster nicely around the equivalence line, meaning that the TMI value is a very good representation of his expected spike size. Also note that the colors are more evenly distributed on this plot. That’s because the window size adjustment is working properly here. The lowest values are from simulations with a window size of 2 seconds, while the largest ones are using a window size of 10 seconds. And the data is pretty linear: double the window size, and you double the MSD and TMI.

Report Card

So this final version of the metric seems to be hitting all the right notes. Let’s get our checklist out and grade it on each of the criteria we set out to satisfy.

1. Accurately representing danger: Pass. There’s really no difference between this version and the beta version in this category. If anything, this may be a bit better since it no longer has the “knee” obfuscating danger for smaller spikes.

2. Work seamlessly: Pass. Apart from coding the metric into SimC, it took no additional tweaks to get it to work properly with the default plotting and analysis tools.

3. Generate useful stat weights: Pass. The stat weights are being generated properly and to sufficient precision to identify differences between the stats, without having to normalize. It will generate useful stat weights even in low-damage regimes thanks to the removal of the “knee,” and it automatically adapts to generate DTPS-like results when you’ve done all you can for smoothing. Massive improvement in this category.

4. Useful statistics: Pass. Again, not much difference between this version and Beta_TMI, at least in this category.

5. Easily interpreted: Pass. This is the most important improvement. If I get a TMI score of 80k, I immediately know that I’m in danger of taking spikes that are up to 80% of my health. I don’t need to do any mental math to figure it out, just replace a “k” with a “%” and I’m there. No need to look back to a blog post or remember a funny conversion factor. As long as I know what TMI is, I know what it means.

6. Numbers should be reasonable: Pass. While the numbers aren’t technically small, I think it’s fair to say that they’re reasonable. After Mists, everyone is comfortable working in thousands (“I do 400k DPS and have 500k health”), so I don’t think the nomenclature will be confusing. The biggest issue with the original TMI was that it varied wildly by orders of magnitude due to small changes, which can’t happen in this new form. Going from 75k to 125k has a clear and obvious meaning, and won’t throw anyone for a loop, unlike going from 75k to 18.3M (an equivalent change in Old_TMI).

I’ll admit that I may be a little biased when it comes to grading my own metric, but I don’t think you can argue that I’m being unfairly kind in any of these categories. I set up clear expectations for what I wanted in each category, and made sure the metric met them. If it hadn’t, you probably wouldn’t be reading about it, because I’d have tossed it like Beta_TMI and continued working on it until I found a version that did.

But keep in mind that this doesn’t mean the metric is flawless. It just means that we haven’t discovered what (if any) its flaws are yet. As the logging sites get on-board with the new metric and implement it, we’ll be able to look for differences between real-world performance and Simulationcraft results and identify the causes. And if we do find problems, we’ll adjust it as necessary to fix them.

Looking Forward

It shouldn’t be much of a surprise that I’m very happy with TMI 2.0. It finally has a solid meaning, and will be far simpler to explain to players discovering it for the first time. It’s a vast improvement over the original version of the metric in so many ways that it’s hard to even compare the two.

And by giving the metric a clear meaning, we’ve opened up a number of new possible applications. For example, let’s say you sim your character and get a TMI of 85k. You and your healers now know they need to be prepared for you to take a spike that’s around 85% of your health at any given moment. Which leads directly into the question, “how much healing do I need to ensure survival?”

If your healer is a druid, you might consider how many Rejuvenation ticks you can rely on in a 6-second window and how much healing that will be. If it’s 20% of your health, then you (and your healer!) immediately have an estimate of how much on-demand healer throughput you’ll need to keep you safe. Or if you have multiple HoTs, and they sum up to about 50% of your health in that time window, your healers know that as long as they keep you HoT-ted up, they can spend their GCDs elsewhere and just spot-heal you when you hit 50% health.

In other words, TMI may be a tanking metric, but it’s got the potential to have a meaning for (and be useful to) your healers as well.

Extend this idea even further: TMI was originally defined as only including self-healing effects, not external heals. The new definition can be much looser, because it still has a meaning if you include external heals. Adding a healer to your simulation may reduce your TMI, but the end result is still meaningful because it tells you how large a spike you took with a healer focusing on you.

Likewise, a combat logging site might report your regular TMI and an “ETMI” or Effective TMI, which includes outside healing. And that ETMI would tell you something slightly different – what was the biggest spike you took and survived (or not!) on that pull. If your ETMI is less than 50k you’re never really in much danger. If your ETMI is pushing 90k or 100k (and you didn’t die), it means you’re getting awfully close to dying at least a few times in that encounter, which may warrant some investigation. You could then analyze your own logs and your healers’ logs to figure out why that’s happening and determine ways to improve it.

I’m really excited to see where this goes over the next few months. For now, though, I’m going to focus on getting the foundations in place. I’ve already coded the new metric into Simulationcraft, so as of the next release (547-3) all TMI calculations will use the new formula.

I also plan on working with both WarcraftLogs and AskMrRobot, both of whom have expressed an interest in implementing TMI, to get it up and running on their logging sites. And I’ll be updating the standard reference document shortly with a rigorous definition of the standard to facilitate that.


## (Re)-Building A Better Metric – Part I

A few weeks ago, I posted a request for data to test out a new implementation of TMI. This follow-up post took longer than expected, for a number of reasons. A busy semester, wedding planning, and the Diablo 3 expansion were all contributing factors.

However, the most important factor is that the testing uncovered a few weaknesses that I felt were significant enough to warrant fixing. So I went back to the math and worked on revising it, in the hopes of hitting on something that was better. And I’m happy to say that I think that I’ve succeeded in that endeavor, to the point that I feel TMI 2.0 will become an incredibly useful tool for tanks to evaluate their performance.

But before I get to the new (and likely final) implementation, I think it’s worth talking about the data. After all, many of you were generous enough to take the time to run simulations for me and submit it, so I think I owe you a better explanation of what that data accomplished than “Theck changed stuff.”

To do that with sufficient rigor, though, I need to start from the beginning. If you recall, about nine months ago I laid out a series of posts entitled “The Making of a Metric,” which explained the thought process involved in designing TMI. Without re-hashing all of those posts, we were trying to quantify our qualitative analysis of damage histograms in table form. And most of the analysis and discussion in those posts centered around the numerical aspects of the metric. For a few examples, we discussed:

• How we thought that a spike that’s 10% larger should be worth $h$ times as much in the metric (the “cost function” for those that are familiar with control theory or similar fields)
• The problem of edge effects that were caused by attempting to apply a finite cut-off or minimum spike size
• What normalization conditions should be applied to keep the metric stable across a variety of situations

and so on.

However, none of that discussion addressed what would eventually be a crucial (and in the case of our beta test results, deciding) factor: what makes a good metric? I was intently focused on the mathematics of the problem at the time, and more or less assumed that if the math worked well then the metric would be a good one.

Suffice it to say, this assumption was pretty wrong.

What Does Make a Good Metric?

So when I sat down late last year to start thinking about how I would revise the metric, I approached from a very different direction. I made a list of constraints that I felt a good metric would satisfy, which I could then apply to anything I came up with to see if it “passed.” This is that list:

1. First and foremost, the metric should accurately represent the threat of damage spikes. That actually encompasses several mini-constraints, most of which are numerical ones.
• For example it should take into account spike magnitude and spike frequency, because it’s more dangerous to take three or four spikes of size X than it is to take one spike of size X.
• It should filter the data somehow, such that the biggest spikes are worth considerably more than smaller ones are.
• However, it also can’t filter so strongly that it ignores ten spikes that were 120% of your health just because you took one spike of 121%.
• The combination of those three points means that it has to filter continuously (i.e. smoothly), so we can’t use max() or min() functions.

In short, this is basically the numerical constraints that I applied to build the original version of TMI. Ideally, I would like it to continue generating the same quality of results, but tweak the numbers to change the presentation.

2. It should work seamlessly in programs like Simulationcraft and on sites like World of Logs, Warcraft Logs, and AMR’s new combat log analysis tool. Working in Simcraft is obvious. That was one major reason I joined the SimC dev team. But wanting it to be useful on logging sites is a broader constraint – it means that it needs to work in a very wide range of situations, including every boss fight that Blizzard throws at us. If it’s only useful on Patchwerk under simulated conditions, it’s probably not general enough to mean anything.

This also means that it should work with SimC’s default settings. I want to have to do as little messing around with SimC’s internals as possible. This will come up again, so I want to mention it explicitly here.

3. It should generate useful stat weights when used in Simulationcraft. One of the primary goals of the original metric was to be able to quantify how useful different stats were. If the metric produces garbage stat weights, it’s a garbage metric.

4. Similarly, it should produce useful statistics. Another major drawback of the old version was that the TMI distributions were highly skewed thanks to the exponential nature of the metric. That meant that the distribution in no way represented a normal distribution, which made certain statistical measures mostly useless. A new version should (hopefully) fix that.

5. It should be easily interpreted. Ideally, someone should be able to look at the number it produces and immediately be able to infer a meaning. Good, bad, or otherwise, you shouldn’t need to go to a blog post to look up what it means to have a TMI of 50k.

I was never very happy with this part of the original metric. The meaning wasn’t entirely clear, because it was an arbitrary number. You’d have to read (and remember) the blog post to know that a factor of 3 corresponded to taking spikes that were 10% of your health larger (i.e. 80% of your health to 90% of your health should triple your TMI).

6. Ideally, the numbers should be reasonable. This was arguably the biggest failing of the original version of TMI, and something that Wrathblood and I have argued about a lot. While it’s nice mathematically that a bigger spike creates an exponentially worse value, the majority of players do not think in orders of magnitude.

I may have no problem understanding a TMI going up from 50 thousand to over 1 million as a moderate change, because I’ve been trained to work with quantities that vary like that in a scientific context. But the average user hasn’t been trained that way, and thus saw that as an enormous difference. Much larger than going from 2.5k to 50k, even though it is an equivalent change in spike size.

The size of the change was part of the original goal, of course – to emphasize the fact that it was significantly worse to take a larger spike. But that’s not how the average user interpreted it. Instead, their initial reaction was to assume that the metric was broken. Because surely they hadn’t suddenly gotten 20 times worse just by bumping the boss up from normal to heroic. Right? Well, that’s exactly what the metric was saying, and should have been saying, when their spike size went up by ~28% of their health. But the message wasn’t getting across.

In retrospect, I think I know why, and it was tied to item #5. The meaning of the metric wasn’t entirely clear. At least to someone who hadn’t gotten down and dirty with the math behind the metric. So instead, they assumed the metric was in error, or faulty, or something else.

Those were the six major constraints I set out to abide by in my revisions. Pretty much anything else I could come up with was covered by one or more of those, either explicitly or implicitly.

Now, with this rubric, we can take a look at the results of the beta test and see how the original revision of the metric performed. But first, I want to talk briefly about the formula I chose to use for those that are interested. Fair warning, the next section is fairly mathy – if you don’t care about those details, you may want to skip to the “Beta Test Results” section.

Beta Test Formula

Let’s first assume we have an array $D$ containing the damage taken in each time bin of width $dt$. I’m going to leave $dt$ general, but if it helps you visualize it just pretend that $dt=1$, so this array is just the damage you take in every one-second period of an encounter. We construct a $T$-second moving average array of that data just as we did in the original definition of the metric:

$$\large MA_i = \frac{T_0}{T}\sum_{j=1}^{T / dt} D_{i+j-1} / H$$

The new array $MA$ created by that definition is essentially just the moving average of the total damage you take in each $T_0$-second period, normalized to your current health $H$. By default I’ve been using $T_0=6$ as the standard window size. Again, nothing about this part changed, it’s still the same array of damage taken in each six-second period for the entire encounter.

If you recall, the old formula took this array and performed the following operation:

$$\large {\rm Old\_TMI} = \frac{C}{N} \sum_{i=1}^N e^{10\ln{3} ( MA_i - 1 ) } = \frac{C}{N}\sum_{i=1}^N 3^{10(MA_i-1)}$$

where $C$ was some mess of normalization and scaling constants, and $N$ was the length of the $MA$ array.

This formed the basis of the metric – the bigger the spike was, the larger $MA$ would be, and the larger $3^{10(MA_i-1)}$ would be. Due to the exponential nature of this function, large spikes would be worth a lot more than small ones, and one really large spike would be worth considerably more than lots of very little ones.

The formula that I programmed into Simulationcraft for the beta test was this:

$$\large {\rm Beta\_TMI} = c_1 \ln \left [ 1 + \frac{c_2}{N} \sum_{i=1}^N e^{F(MA_i-1)} \right ]$$

where the constants ended up being $F=10$, $c_1=500$ and $c_2=e^{10}$. Let’s discuss exactly how this differs from ${\rm Old\_TMI}$.

It should be clear that what we have is roughly

$$\large {\rm Beta\_TMI} \approx c_1 \ln \left [ 1 + \chi {\rm Old\_TMI} \right ]$$

where $\chi$ is some scaling constant. That statement is only approximate, however, because ${\rm Old\_TMI}$ used a slightly different exponential factor in the sum. In the old version, we summed a bunch of terms that looked like this:

$$\large e^{10\ln 3 (MA_i-1)} = 3^{10(MA_i-1)},$$

while in the new one we’re raising $e$ to the $F(MA_i - 1)$ power:

$$\large e^{F(MA_i-1)}.$$

In other words, the constant $F$ is our “filtering power,” just as $10\ln 3$ was our filtering power in ${\rm Old\_TMI}$. The filtering power is a little bit arbitrary, and after playing with the numbers I felt that there wasn’t enough of a difference to warrant complicating the formula. By choosing $F=10$, a change of 0.1 (10% of your health) in $MA_i$ increases the value of the exponential by a factor of $e\approx 2.718.$ For comparison, in ${\rm Old\_TMI}$ increasing the spike size by 10% increased the value of the exponential by a factor of 3. So we’re not filtering out weaker attacks quite as strongly as before, but again, the difference isn’t that significant. The main advantage to doing this is simplifying the formula, that’s about it.
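If you want to see how the two filtering powers compare for larger differences in spike size, a few lines of Python will do it. This is only an illustrative sketch; `weight_ratio` is a hypothetical helper, not something from SimC.

```python
import math


def weight_ratio(delta_ma, base_factor):
    """Factor by which one term of the sum grows when MA_i rises by delta_ma.

    base_factor is the growth per 0.1 (10% of health): 3 for Old_TMI,
    e for the beta formula with F = 10.
    """
    return base_factor ** (10.0 * delta_ma)


for delta in (0.1, 0.3, 0.5):
    old = weight_ratio(delta, 3.0)
    new = weight_ratio(delta, math.e)
    print(f"+{delta:.0%} of health: Old_TMI x{old:.1f}, beta x{new:.1f}")
```

For a 10% difference the two factors are 3 and about 2.7, matching the numbers above; the gap between them widens as the difference in spike size grows.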

So with that caveat, what we’re doing with the new formula is taking a natural logarithm of something close to ${\rm Old\_TMI}$. For those that aren’t aware, a logarithm is an operation that extracts the exponent from a number in a specific way. Taking the log of “base $b$” of the number $b^a$ gives you $a$, or

$$\large \log_b \left ( b^a \right ) = a$$

There are a few logarithms that show up frequently in math. For example, when working in powers of ten, you might use the logarithm “base-10,” or $\log_{10}$, also known as the “common logarithm.” If what you’re doing uses powers of $e$, then the “natural logarithm” or “base-$e$” log ($\log_{e}$) might be more appropriate. Binary logarithms (“base-2” or $\log_2$) are also common, showing up in many areas of computer science and numerical analysis.

In this case, we’re using the natural logarithm $\log_e$, which can be written $\log$ or $\ln$ depending on which textbook or website you’re reading. I’m using $\ln$ because it’s unambiguous; some books will use $\log$ to represent the common log and others will use it to represent the natural log, but nobody uses $\ln$ to represent anything but the natural log.

To figure out how this new formula behaves, let’s consider a few special cases. First, let’s consider the limit where the sum in the equation comes out to be zero, or at least very small compared to $1/c_2$. This might happen if you were generating so much healing that your maximum spike never got close to threatening your life. In other words, if your ${\rm Old\_TMI}$ was really really small. In that situation, the second term is essentially zero, and we have

$$\large {\rm Beta\_TMI} \approx c_1 \ln \left [ 1 + 0 \right ] = 0,$$

because $\ln 1 = 0$. In other words, adding one to the result of the sum before taking the log zero-bounds the metric, so that we’ll never get a negative value. This was a feature of the old formula just due to its definition, and something I sort of liked, so I wanted to keep it. It has a side effect of introducing a “knee” in the formula, the meaning of which will be clearer in a few minutes when we look at a graph.
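For anyone who wants to play with the beta formula, here’s a minimal Python sketch of it. This isn’t the code that went into Simulationcraft, just a direct transcription of the equation using the constants quoted above; the function name `beta_tmi` and the test arrays are assumptions for illustration.

```python
import math


def beta_tmi(ma, F=10.0, c1=500.0, c2=math.exp(10.0)):
    """Beta_TMI = c1 * ln(1 + (c2 / N) * sum(exp(F * (MA_i - 1))))."""
    n = len(ma)
    total = sum(math.exp(F * (x - 1.0)) for x in ma)
    # log1p(x) computes ln(1 + x) accurately even when x is tiny.
    return c1 * math.log1p(c2 / n * total)


# Hypothetical fight where healing massively outpaces damage everywhere:
# every MA element is a large negative number.
overhealed = [-5.0] * 450
print(beta_tmi(overhealed))  # effectively zero, and never negative

# Hypothetical fight where every window costs exactly 100% of health.
print(round(beta_tmi([1.0] * 450)))  # 5000
```

The first case shows the zero bound in action: no matter how negative the $MA$ values get, the result bottoms out at zero.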

But before we do so, I want to consider two other cases. First, let’s assume we have an encounter where we take only a single huge spike, and no damage the rest of the time. We’ll approximate this by saying that all but one element of the $MA$ array is a large negative number (indicating a large excess of healing), and that there’s one big positive element representing our huge spike. In that case, we can approximate our sum of exponentials as follows:

$$\large \sum_{i=1}^N e^{F(MA_i-1)} \approx e^{F(MA_{\rm max}-1)}.$$

Let’s also make one more assumption, which is that this spike is large enough that $c_2 e^{F(MA_{\rm max}-1)}/N \gg 1$, so that we can neglect the first term in the argument of the logarithm. If we use these assumptions in the equation for ${\rm Beta\_TMI}$ and call this the “Single-Spike” scenario, we have the following result:

$$\large {\rm Beta\_TMI_{SS}} \approx c_1\ln\left [ \frac{c_2}{N} e^{F(MA_{\rm max}-1)} \right ] = c_1\left ( \ln c_2 - \ln N \right ) + c_1 F \left ( MA_{\rm max} - 1 \right ),$$

where I’ve made use of two properties of logarithms, namely that $\log(ab)=\log(a)+\log(b)$ and that $\log(a/c) = \log(a)-\log(c)$. We can put this in a slightly more convenient form by grouping terms:

$$\large {\rm Beta\_TMI_{SS}} \approx c_1 F MA_{\rm max} + c_1 \left ( \ln c_2 - \ln N - F \right )$$

This form vaguely resembles $y=mx+b,$ a formula you may be familiar with. And putting it in that form makes the effects of the constants $c_1$ and $c_2$ a little clearer.

We’re generally interested in how the metric scales with $MA_{\rm max}$, which is a direct measurement of maximum spike size. It’s clear from this form that ${\rm Beta\_TMI_{SS}}$ is linear in $MA_{\rm max}$, with a slope equal to $c_1 F$. So for a given filtering strength $F$, the constant $c_1$ determines how many “points” of ${\rm Beta\_TMI}$ you gain by taking a larger spike. Since $F=10$, $c_1$ is the number of points that corresponds to a spike that’s 10% of your health larger.

So if your biggest spike goes up from 130% of your health to 140% of your health, your ${\rm Beta\_TMI}$ goes up by $c_1$. Note that this isn’t a factor of $c_1$, it’s an additive amount. If you go from 130% to 150%, you’d go up by $2c_1$ rather than $c_1^2$.
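You can check that additive behavior numerically with a short Python sketch. As before, `beta_tmi` and `single_spike` are illustrative helpers rather than SimC code, and the 450-element array with heavily over-healed "quiet" windows is an assumption standing in for the single-spike scenario.

```python
import math


def beta_tmi(ma, F=10.0, c1=500.0, c2=math.exp(10.0)):
    """Beta_TMI = c1 * ln(1 + (c2 / N) * sum(exp(F * (MA_i - 1))))."""
    n = len(ma)
    total = sum(math.exp(F * (x - 1.0)) for x in ma)
    return c1 * math.log1p(c2 / n * total)


def single_spike(ma_max, n=450, quiet=-5.0):
    """One big spike; every other window is heavily over-healed."""
    return [quiet] * (n - 1) + [ma_max]


base = beta_tmi(single_spike(1.3))
step_10 = beta_tmi(single_spike(1.4)) - base  # 130% -> 140%
step_20 = beta_tmi(single_spike(1.5)) - base  # 130% -> 150%
print(round(step_10), round(step_20))
# 500 1000
```

The steps are not exactly $c_1$ and $2c_1$ because of the “1+” inside the logarithm, but at these spike sizes the difference is a fraction of a point.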

This was the point of taking the logarithm of the old version of TMI. It takes a metric that scales exponentially and turns it into one that’s linear in the variable of interest, $MA_{\rm max}$. If done right, this should keep the numbers “reasonable,” insofar as you shouldn’t get a TMI that suddenly jumps by 2 or 3 orders of magnitude by tweaking one thing. The downside is that it masks the actual danger – your score doesn’t go up by a factor of X to indicate that something is X times as dangerous.

Once you have $F$ and $c_1$, the remaining constant $c_2$ controls your y-intercept, and is essentially a way to add a constant amount to the entire curve. It doesn’t affect the slope of the result, it just raises or lowers all TMI values by $\approx c_1 \ln c_2$.

The other case I want to consider before going forward is one in which you’re taking uniform damage. In other words, every element of $MA$ is the same, and equal to $MA_{\rm max}$. In that case, the sum becomes

$$\large \sum_{i=1}^N e^{F(MA_i-1)} = \sum_{i=1}^N e^{F(MA_{\rm max}-1)} = Ne^{F(MA_{\rm max}-1)}.$$

In this case, the $N$’s cancel and we have

$$\large {\rm Beta\_TMI_{UF}} = c_1 \ln \left [ 1 + c_2 e^{F(MA_{\rm max}-1)} \right ]$$

If we make the same assumption that the second term in brackets is much larger than one, this is approximately

$$\large {\rm Beta\_TMI_{UF}}\approx c_1\ln c_2 + c_1 \left [ F (MA_{\rm max}-1)\right ],$$

or in $y=mx+b$ form:

$$\large {\rm Beta\_TMI_{UF}} \approx c_1 F MA_{\rm max} + c_1 (\ln c_2 - F ).$$

The difference between the uniform case and the single-spike case is just a constant offset of $c_1 \ln N$. So we get all the same behavior as the single-spike case, just with a slightly higher number. The uniform and single-spike cases are the extremes, so we expect real combat data to fall somewhere in-between them.
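The size of that offset is easy to confirm with a quick sketch. The code below is illustrative only: `beta_tmi` is my transcription of the formula, and the 450-element arrays with a 150% spike are hypothetical values chosen to sit well inside the linear regime.

```python
import math


def beta_tmi(ma, F=10.0, c1=500.0, c2=math.exp(10.0)):
    """Beta_TMI = c1 * ln(1 + (c2 / N) * sum(exp(F * (MA_i - 1))))."""
    n = len(ma)
    total = sum(math.exp(F * (x - 1.0)) for x in ma)
    return c1 * math.log1p(c2 / n * total)


n = 450
ma_max = 1.5

uniform = beta_tmi([ma_max] * n)                # every window is a 150% spike
single = beta_tmi([-5.0] * (n - 1) + [ma_max])  # only one window is

offset = uniform - single
print(round(offset), round(500.0 * math.log(n)))
# 3055 3055
```

So for a 450-element array, real combat data should land somewhere in a band about 3000 points wide, between the single-spike and uniform values.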

On a graph, this would look something like the following:

Simulated TMI data using the Beta_TMI formula. Red is the uniform damage case, blue is the single-spike case, and green is pseudo-random combat data.

This is a plot of ${\rm Beta\_TMI}$ against $MA_{\rm max}$ for some simulated data that shows how the new metric behaves as you crank up the maximum spike the player takes. The red curve is what we get in the uniform case, where every element of $MA$ is identical. The blue curve is the single-spike case, where we only have one large element in $MA$. The green dots are fake combat data, in which each attack can be randomly avoided or blocked to introduce variance.

The first thing to note is that when $MA_{\rm max}$ is very large, the blue and red curves are both linear, as advertised. Likewise, the green dots always fall between those two curves, though they tend to cluster near the single-spike line. In real combat, you’re going to avoid or block a fair number of attacks, and the randomness of those processes eliminates the majority of cases where you take four full hits in a 6-second window.

You can also see the “knee” in the graph I was talking about earlier. At an $MA_{\rm max}$ of around 0.6, the blue curve starts, well, curving. It’s no longer linear, because we’ve transitioned into a regime where the “1+” is no longer negligible, and we can’t ignore it. The red curve has a similar knee, but it occurs closer to zero (as intended, based on the choice of $c_2$). As you get closer to the knee, the metric shallows out, meaning that changes in spike size have less of an effect on the result. This makes some intuitive sense, in that it’s not as useful to reduce spikes that are already below the danger threshold.
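To make the location of the knee a little more concrete, here is a short worked estimate. It assumes the metric has the form ${\rm Beta\_TMI} = c_1 \ln [ 1 + \frac{c_2}{N}\sum_i e^{F(MA_i-1)} ]$, which is what the uniform-damage simplification relies on, and it treats the knee as the point where the exponential term is the same size as the “1+”. That is a rough convention, not a sharp boundary. For the single-spike case,

$$\large \frac{c_2}{N} e^{F(MA_{\rm knee}-1)} \approx 1 \quad \Rightarrow \quad MA_{\rm knee} \approx 1 - \frac{\ln (c_2/N)}{F},$$

while for the uniform case the $N$ cancels and

$$\large c_2 e^{F(MA_{\rm knee}-1)} \approx 1 \quad \Rightarrow \quad MA_{\rm knee} \approx 1 - \frac{\ln c_2}{F}.$$

Under that assumption the uniform knee sits lower than the single-spike knee by $\ln N / F$, which is consistent with the red curve bending closer to zero than the blue one.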

The constants $c_1$ and $c_2$ were chosen mostly by tweaking this graph. I wanted the values to be “reasonable,” so I was aiming for values between around 1000 and 10000. The basic idea was that if you were taking 100% of your health in damage, your TMI value would fall between about 2000 and 2500, and then scale up (or down) from there in increments of 500 for every 10% of health increase in maximum spike size.

So that’s the beta version of the metric. Now let’s look at the results of the beta test, and see why I decided to go back to the drawing board instead of rubber-stamping ${\rm Beta\_TMI}$.

Beta Test Results

The spreadsheet containing the data is now public, and you can access it at this link, though I’ve embedded it below:

The data we have doesn’t include the moving average arrays used to generate the TMI value, so we can’t make a plot like the one I have above. We can generate a lot of other graphs, though, and trust me, I did. I plotted more or less everything that I thought could give me relevant information about its performance. I could show you histograms and scatter plots that break down the submissions by average ilvl, TMI, Boss, class, stat weight. But while I had to sift through all of those graphs, I’m not sure it’s a productive use of time to dissect each of them here.

Instead, let’s look at a few of the more significant ones. First, let’s look at Beta_TMI vs ilvl for all classes against the T16N10 boss:

Beta_TMI vs. ilvl for all classes, T16N10 boss.

The T16N10 boss had the highest response rate from all classes. The general trend here is obvious – as ilvl goes up, Beta_TMI goes down, indicating that you’re more survivable against this boss. Working as intended. The range of values isn’t all that surprising given that not all of the Simcraft modules are as well-refined as the paladin and warrior ones. But at least on this plot the metric appears to be working fine.

If we want to see how a single class looks against different bosses, we can. For example, for warriors:

Beta_TMI vs. ilvl for warriors, all bosses.

Again, the trends are pretty clear. Improving gear reduces TMI, as it should. Some of these data points come from players that tested against several different bosses in the same gear set, and those also give the expected result – crank up the boss, and the TMI goes up.

Another neat advantage is that the statistics finally work well. Previously, if you ran a simulation for 25k iterations, you’d get an ${\rm Old\_TMI}$ distribution plot that looked like this:

TMI distribution using the old definition of TMI.

And this was an ideal case. It was far more common to have a really huge maximum spike, such that the entire distribution was basically one bin at the extreme end of the plot. It also meant that the metrics Simulationcraft reported (like “TMI Error”) were basically meaningless. However, with the Beta_TMI definition, that same plot looks like this:

TMI distribution generated using the Beta_TMI definition.

This looks a whole lot more like a normal distribution, and as a result works much more seamlessly with the standard error metrics we’re used to using and reporting.
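As a rough illustration of why the logarithm helps here, the Python sketch below compares an exponential metric with its log-scale counterpart on purely synthetic data. Nothing in it comes from Simulationcraft: the normal distribution, its parameters, and the mean-to-median ratio used as a skew check are all assumptions chosen for the example.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic per-iteration scores on a log scale; NOT real SimC output.
log_scores = [rng.gauss(8.0, 1.5) for _ in range(10_000)]

# The same draws on an exponential scale, analogous to the old metric.
exp_scores = [math.exp(x) for x in log_scores]

def mean_to_median(values):
    """Ratio of mean to median; close to 1 for a symmetric distribution."""
    return statistics.fmean(values) / statistics.median(values)

ratio_exp = mean_to_median(exp_scores)
ratio_log = mean_to_median(log_scores)

print(f"exponential metric, mean/median: {ratio_exp:.2f}")
print(f"log-scale metric, mean/median:   {ratio_log:.2f}")
```

On the exponential scale a handful of extreme iterations drag the mean far above the median, so a mean and standard error describe the tail rather than the typical result. On the log scale the two agree closely, which is the situation the usual error estimates are built for.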

So on the surface, this all appears to be working well. Unfortunately, when we look at stat weights, we run into some trouble, because a lot of them looked something like this:

Example stat weights generated using Beta_TMI

The problem here should be obvious, in that this doesn’t tell us a whole lot about how these stats are performing. Rounding to two decimal places means we lose a lot of granularity.

Now, to be fair, they aren’t all this bad. This tended to happen more frequently with players that were overgeared for the boss they were simming. In other words, on players that were nearing the “knee” in the graph. But enough of the stat weights turned out like this for me to consider it a legitimate problem.

Note that Simcraft only rounds the stat weights for the plot and tables. Internally, it keeps much higher precision. As a result, the normalized stat weights looked fairly good. But by default, it plots the un-normalized ones.
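Here is a small Python sketch of that effect. The weights are hypothetical magnitudes invented for the example, and normalization is assumed to mean dividing by one reference stat's weight at full precision before rounding for display.

```python
# Hypothetical un-normalized stat weights (magnitudes only); not real SimC output.
raw = {"Stamina": 0.0141, "Mastery": 0.0096, "Armor": 0.0118, "Haste": 0.0052}

# What a two-decimal display shows for the raw values.
displayed_raw = {stat: round(w, 2) for stat, w in raw.items()}

# Normalize at full precision by a reference stat, then round for display.
reference = raw["Stamina"]
displayed_norm = {stat: round(w / reference, 2) for stat, w in raw.items()}

print(displayed_raw)
print(displayed_norm)
```

Every raw weight collapses to 0.01 once rounded, so the display can't rank the stats, while the normalized weights stay distinct at the same two decimal places.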

I could fix this by forcing SimC to plot normalized stat weights if it’s scaling over TMI, but this comes into conflict with goal #2. Ideally, I’d like it to work well with the defaults, so that I don’t have to add little bits of code all over SimC just to get it to work at all.

And more to the point, this is a more pervasive issue. If health is really doubling in Warlords, and the healing model really is changing, we may start caring about smaller spikes than before. It isn’t good for the metric to be muting the stat weights in those regions.

In fact, now seems like as good a time as any to go through our checklist and grade this version of the metric. So let’s do that.

1. Accurately representing danger: Pass. At least insofar as more dangerous spikes give a higher TMI, it’s doing its job. We could debate whether the linear scaling is truly representative (and Meloree and I have), but the fact of the matter is that we tried that with version 1, and it led to confusion rather than clarity. So linear it is.

2. Work seamlessly: Eh…. There’s nothing in SimC that prevents it from working, and it’s vastly improved in this category compared to the first version of the metric because the default statistical analysis tools work on it. But the stat weights really need to be fixed one way or another, which either means tweaking SimC to treat my metric as a special snowflake, or changing the metric. Not super happy about that, so it’s on the border between passing and failing. If I were assigning letter grades, it would be a C-. The original metric would flat-out fail this category.

3. Generate useful stat weights: Eh…. Again, it’s generating numeric stat weights that work, but only after you normalize them. I’m not sure if the fault really lies in this category, but at the same time if the metric generated larger numbers to begin with, we wouldn’t have this problem.

4. Useful statistics: Pass. This is one category where the new version is universally better.

5. Easily interpreted: Fail. If someone looks at a TMI score of 4500, can they quickly figure out that it means they’re taking spikes that are around 135% to 150% of their health? Not unless they go back and look up the blog post, or have memorized that 100% of health is around 2000 TMI, and each 10% of health is about 500 TMI.

In fact, I’d go as far as to say that this is very little improvement over the original in terms of ease of understanding. The linearity is nice, and the numbers are “reasonable,” but the link between the value and the meaning is still pretty arbitrary and vague.

6. Numbers should be reasonable: Pass. At the very least, taking the logarithm makes the numbers easier to follow.
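To see what the mental arithmetic in item 5 actually involves, here is a minimal Python sketch that inverts the rule of thumb (100% of health is roughly 2000 to 2500 TMI, and each further 10% of health adds about 500). The function name and default arguments are hypothetical, and the estimate only makes sense in the linear regime well above the knee.

```python
def spike_range_from_beta_tmi(tmi, per_10_percent=500.0,
                              base_low=2000.0, base_high=2500.0):
    """Estimate the largest spike, as a percentage of health, from a Beta_TMI score.

    Assumes 100% of health maps to somewhere between base_low and base_high,
    and that every additional 10% of health adds per_10_percent to the score.
    Returns a (low, high) pair of percentages.
    """
    high = 100.0 + 10.0 * (tmi - base_low) / per_10_percent
    low = 100.0 + 10.0 * (tmi - base_high) / per_10_percent
    return low, high

low, high = spike_range_from_beta_tmi(4500)
print(f"largest spike is roughly {low:.0f}% to {high:.0f}% of health")
```

For a score of 4500 this gives about 140% to 150% of health, which lands inside the rough range quoted in item 5. Needing a lookup function to read the score at all is exactly the problem.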

All in all, that scorecard isn’t very inspiring. This may be an improvement over the original in several areas, but it’s still not what I’d call a great metric. Generating nice stat weights is important, and it’s not doing a great job of that, but that could be fixed with a few workarounds. But failing at #5 is the real kicker. We rationalized that away in version 1 by treating this like a FICO score, an arbitrary number that reflects your survivability. But the more time I spent trying to refine the metric, the more certain I became that this was a fairly significant flaw.

To make TMI more useful, it needs to be more understandable. Period. And it was only after a discussion with a friend about the stat weight problem that the solution to the “understandability” problem became clear.

In Part II, I’ll discuss that solution and lay out the formal definition of the new metric, as well as some analysis of how it works and why.


## Crowdsourcing TMI

As I’ve mentioned a few times on Twitter already, I’ve been working on refining the formula used to calculate the Theck-Meloree Index. The current version certainly works, or at least, gives me the numerical effects that I initially wanted. But in the 6+ months since we defined the metric, we’ve learned a lot more about the quirks involved with having a raw exponential metric. Several of which are more rooted in psychology than mathematics!

(If you’re keeping score, that’s Wrathblood – 1, Theck – 0)

In any event, after a bit of playing with the possibilities I’ve finally decided how I want to modify the formula. It’s all but finished, in fact. The only thing left to do is fine-tune some constants, which I think I’ve done sufficiently well already. But the only good way to test that is to generate a lot of data and see if it’s working the way I want it to.

Which normally would be fine, but there are a few issues with doing that myself.

• It takes a long time to generate the amount of data I’m looking for. Think several hundred simulations, each with 25k iterations, and calculating 10 scale factors.
• I want to test it on a variety of gear sets. Again, it takes a lot of time to put together gear sets, and I don’t really want to troll the armory looking for random players to import.
• I want to test it on all five tanking classes (or at least, the ones that SimC supports). Again, short of trolling the armory, it would be a formidable task to find an appropriate number of players to get a proper sample. And would take a long time.
• I want to test it against multiple different TMI bosses… so multiply all of those time investments by a factor of four or five.

I’m busy enough as it is with all sorts of other projects (*cough* and Diablo III), not to mention my job, that it’s not feasible for me to generate all of this data myself. Unless you want to wait for the new TMI definition until December. Of 2017.

I could just release the new metric into the wild, of course. I’m pretty sure it’s functioning properly, after all. But I’d much rather be able to do some rigorous testing of it in case there are weird problems that I didn’t anticipate.

This is where you come in. Instead of running several hundred simulations myself, I’m asking each of you to run a simulation or two for me. Basically, you could consider this the public beta test of TMI 2.0.

How To Contribute Data

I’ve coded the new TMI definition into Simulationcraft, and it’s available as an option in version 547-2. By default, it will calculate stat weights using the old formula. However, you can enable the new formula with the argument new_tmi=1.

You can do this by adding that line to the Simulate tab as shown in the screenshot below:

Add “new_tmi=1” after your character definition to enable the new formula.

The results page will then report TMI as calculated using the new formula.

If every reader of the blog runs their own character through the sim, I will have a veritable sea of data to swim through (as in, many thousands of simulations). I’m not that optimistic about a 100% reader-to-data-submission conversion rate, so if you can run your character several times with different options (i.e. against different TMI bosses), that’s even better.

Here are the basic guidelines that I’m looking for in submissions:

• 25000 iterations
• Standard Patchwerk fight (these should all be SimC defaults)
    • Length: 450
    • Vary Length: 20%
    • Style: Patchwerk
    • Level: Raid Boss
    • Target Race: humanoid
    • Num Enemies: 1
    • Challenge Mode: Disabled
• Standard Player settings (again, defaults)
    • World Lag: Low
    • Player Skill: Elite
• Scale Factors (make sure you choose to scale over “tmi”)
    • Strength or Agility (depending on your class)
    • Stamina
    • Expertise
    • Hit
    • Crit
    • Haste
    • Mastery
    • Armor
    • Dodge
    • Parry

All of these options can be found on either the Options->Globals tab or the Options->Scaling tab. First, a quick look at the Globals tab:

SimC’s Options -> Global tab.

You can see that I have all of the settings at default here. The only two I want you to play with are the TMI Standard Boss and the TMI Window.

For the TMI Boss, pick one (or more) ilvl-appropriate bosses. For example, if you’re in heroic T16 gear, then you shouldn’t bother simming against the T15 bosses at all, and probably not against T16N10. Stick to T16H bosses or the 17Q boss. Please do not use “custom” – that will pit you against Fluffy Pillow, who is not so fluffy anymore now that he learned how to perform melee and spell nukes.

For TMI Window, the standard is six seconds. Feel free to leave it at that if you’re getting reasonable results. If you get really weird-looking stat weights or your TMI is below, say, 1000, consider dropping this a little, maybe to four seconds. Please submit the wonky stat weight data anyway, because that’s also useful to me, but then submit the (hopefully) normal-looking data you get using the lower TMI window.

On to the Scaling tab:

SimC’s Options -> Scaling tab.

As you can see, I’ve checked all of the stats I’m interested in. If you’re a druid or monk tank, please check the Agility box too (in that case you can skip Strength if you want to). Above all though, make sure you’ve chosen to scale over “tmi.” I can’t stress this enough, because if you scale over DPS you’ll get scale factors that are useless to me, and it will just mean I have to spend time filtering the data to eliminate those useless data points.

Once you’ve completed the simulation, you can enter the data in the form below. Please also attach the html results (which you can get by using the “Save” button at the bottom right of the results pane in SimC) using the “Upload” button at the bottom of the form. I’m requesting the html so that I can sanity-check the data and figure out what’s happening with outliers, so you can’t submit data without first attaching that file.

There’s no limit to how many times you can submit data, so you can run several different characters through the simulation if you want to. In fact, that’s encouraged, because data from undergeared alts is just as valuable to me (if not more) as data from overgeared mains. And of course, you can run a character against several different TMI bosses and submit each result separately.

Just don’t keep re-submitting a single simulation result multiple times, because each submission after the first would be useless for obvious reasons.

If the embedded form below isn’t working for you for some reason, you can also access it directly via this link: http://goo.gl/SY36xu. Note that you’ll have to reload the page (or open the link in another new tab) in order to use the submission form again.

Thanks in advance for your help! Depending on how quickly the data comes rolling in, I may be able to have this all wrapped up as early as next week.

As soon as that’s done, I’ll be making a much longer post detailing what changes I’ve made, why I’ve made them, and how the new formula works, including simulated data that I used to develop the metric and actual data from this exercise.

Posted in Theck's Pounding Headaches, Theorycrafting