TC101: Experimental Design

In the previous post, we talked about what theorycrafting means and worked through a basic example of beginner theorycrafting. In this post, I want to go into a little more detail about the laboratory-based part of theorycrafting – in other words, designing and carrying out in-game “experiments” to test how mechanics work.

WoW Experiments

Luckily, “experiments” in WoW are pretty simple in a relative sense. While the entire system may be complicated, we generally have a good idea about how things work and what’s causally related (and not).

For example, I know when I press the key for Crusader Strike, my character will cast Crusader Strike on my target, if possible. I know that the damage it deals depends on a few factors: my weapon damage, my own stats and temporary buffs, any damage-increasing debuffs on the target, the target’s armor mitigation (which depends on its armor and both of our levels), and so on. Even if we don’t know exactly how those relationships work – that’s what we’re testing after all – we know that they exist or might exist.

Likewise, we can also quickly rule out a lot of variables. We don’t expect our Crusader Strike damage to depend on the time of day or the damage that another player is doing to a different target. This sounds silly, but it’s actually a pretty big deal. In real experiments (i.e. in a research laboratory), there are loads of external factors that can affect results, and we have to take great care to identify and eliminate (or at least minimize) those factors. WoW experiments are incredibly easy because we don’t have to do much of that at all.

To illustrate that thought, let me give you a real-life example. As an undergraduate, I spent one summer doing nuclear physics research at the University of Washington. One of the research groups there was making precise force measurements to test General Relativity. Their setup involved a very specially-designed arrangement of masses and a smaller (but still hefty) hanging mass oscillator driven by a small motor.

When they made their measurements, they found a deviation from what they expected. After hours and hours of brainstorming, adjustments, repeating the experiment, and what not, it was still there. They looked at every external factor they could think of that might affect the result, and nothing seemed to be the culprit. After a few months of this, the grad students were beginning to think that maybe they had made a breakthrough discovery.

Their advisor, however, wasn’t as convinced. He made them continue searching for the error. I think he even made them build a second copy of the experiment from scratch to cross-check the results. In any event, eventually they narrowed down the culprit to the motor. As it turned out, the one they had been sent did not meet the manufacturer’s specifications (which had been pored over and chosen very carefully for exactly this reason!), and was malfunctioning in a very subtle way that caused the anomaly they observed.

In WoW experiments, we rarely, if ever, need to worry about being influenced by factors that aren’t immediately obvious. Generally, we have a very limited set of variables to work with, so identifying and isolating problems is pretty easy.

Basic Experimental Design

Before performing any experiment, you should first make sure you can answer (or have at least tried to answer) all of these questions:

  1. What am I trying to test (i.e. what question am I trying to answer)?
  2. What am I going to vary (and how)?
  3. What am I going to hold constant?
  4. What am I going to measure (and how)?
  5. How much data do I need to take?

The first is pretty obvious – it’s hard to perform an experiment if you don’t have a clue what it is you’re trying to determine. Using our example from the previous post, if my question is “How does Judgment’s damage vary with attack power,” then the obvious answer to (1) is that we’re going to test whether Judgment’s damage changes when we change our character’s attack power. So far so good.

Variables

Our question also gives us the answers to questions (2) and (3). We’re going to vary attack power, and we want to (ideally) keep everything else constant. Implicit in this is a central tenet of experimental design that you should try to adhere to as often as possible: only vary one thing. That one thing is called the independent variable, in this case our attack power. In an ideal world, we only ever have one independent variable, so that we know for sure that whatever change we see in the measurement is due to that variable.

For a concrete example of that, let’s say we have an ability that depends on both weapon damage and spell power. If we make our measurements in such a way that we’re changing both our weapon and our spell power, we have a giant mess. We’d have to untangle it to determine how much of the change in damage was due to the weapon and how much was due to the change in spell power. That task may not be impossible, but it usually significantly complicates our data analysis.

In some cases, that complication is unavoidable. For example, if you look back at the most recent diminishing returns post, you’ll see that I was performing surface fits to three-dimensional sets of data with two independent variables. I had to do that to properly and accurately fit the two constants of the diminishing returns equation simultaneously, and it only worked as well as it did because we already had a good idea what the formula looked like and what those constants were from prior experience and single-independent-variable experiments. In general, I wouldn’t recommend this technique to a beginner theorycrafter.

So, in our case our independent variable is attack power, and we’re going to keep all of the other potential variables constant. These constants are often called controls, though I prefer to call them “fixed variables” or “variables held constant” because they are variables, just ones you don’t want to vary. So our list of fixed variables includes crit, mastery, multistrike, versatility, etc.

This poses a potential problem for us, though. We haven’t yet answered the question, “how are we going to vary our attack power?” Normally, we would do this by putting on or taking off gear to change our strength. That seems pretty straightforward, but since gear has secondary stats, we’re also changing our crit, mastery, etc. at the same time!

Sometimes you can get around this by using certain temporary effects. For example, if we have several trinkets with varying amounts of attack power on them (and nothing else), we could swap them around to isolate attack power as the independent variable. But we aren’t always that lucky, so generally we’re going to need to make some compromises here.

We might know that some of these are irrelevant – crit, for example, won’t change our results as long as we’re filtering out or adjusting for crits. Likewise, we probably don’t care what our multistrike chance is as long as we’re ignoring multistrike results. In cases where we’re sure, we can be more lenient about letting those factors vary. Since we know that crit and multistrike are independent events, we can safely ignore them as long as we’re careful during our data collection (see below).

But sometimes we’re not sure about it – for example, we may not know whether mastery does or doesn’t affect the damage of Judgment. And for another example, Chaos Bolt damage does increase with crit. Since we don’t always know, it’s safer to try and keep everything else constant when possible.

As an experiment designer, your goal is to juggle these constraints. You’ll be searching for ways to isolate a particular variable (say, attack power) while keeping certain other variables constant. But you’ll also have to decide when it’s acceptable for another variable to change value during an experiment, sometimes by confirming that the variable is irrelevant to the test at hand. Often this means thinking critically about how you’ll design your experiment. For example, if you were testing ability damage, you could safely ignore hit rating, but only if you made sure you ignored misses when you tallied up your results.

When testing ability damage in previous expansions, we generally just took off gear to change attack power. This wasn’t a problem because hit, miss, crit, haste, dodge, and parry were all independent from the raw (non-crit) damage done by abilities when attacking a target dummy. Multistrike doesn’t appreciably complicate matters, but as you might guess, versatility is a big problem. Which means we have at least one serious constraint on our experiment: we want to use gear that doesn’t have any versatility. Again, if we didn’t have any other choice, we could use advanced techniques to get around this constraint, but it’s far simpler if we just adhere to the constraint.

This brings up another major constraint on most in-game experiments: we don’t want any procs that change our stats temporarily. Otherwise, we’ll have periods where our experimental conditions have changed, which will make the data difficult or impossible to analyze properly. So generally, we want to take off any trinkets or special gear that has stat-changing or damage-increasing procs, and we want to avoid using weapon enchants like Dancing Steel that give temporary buffs.

Measurements

Which finally brings us to question (4). We’re going to measure Judgment’s damage, obviously. The thing we’re measuring is called the dependent variable, because its value may depend on the independent variable. But we don’t just need to know what we’re measuring, we also need to make sure we know how we’re going to perform that measurement and what we’re going to do with the result.

For example, am I going to cast Judgment and write down the value listed in the combat log on a piece of paper? That might be fine if I only need one or two casts, but could quickly become cumbersome if I plan on collecting hundreds of casts’ worth of data.

More commonly, we’d turn on combat logging (via the /combatlog slash command) so we have a text record of everything. From there, we could upload that result to a parsing site like Warcraft Logs, import the log into MATLAB, write a quick script to scrape the log and convert to a CSV file we can open in Excel, or any number of other analysis methods.

Similarly, within those steps, are we just going to count normal Judgment hits and ignore crits and multistrikes entirely? Or are we going to try to use that extra data, say, by dividing each crit by two and each multistrike by 0.3 and using the adjusted values? The latter gets us more data faster, which could reduce the amount of time it takes. But it relies on two very specific assumptions: that crits do 2x as much damage as a normal hit and multistrikes do 0.3x. If either of those is wrong, for example because we have a crit meta gem on, or our spec’s multistrikes have a different modifier, then our data is polluted.
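As a rough sketch of the second approach, here is how the adjustment might look in Python. The record layout and the names `CRIT_MULTIPLIER`, `MULTISTRIKE_MULTIPLIER`, and `normalize_hit` are purely illustrative (they are not part of any log format), and the two constants are exactly the assumptions behind the divide-by-two and divide-by-0.3 adjustments described above.

```python
# Assumed multipliers relative to a normal hit: crits 2x, multistrikes 0.3x.
# If either assumption is wrong for your spec or gear, the output is polluted.
CRIT_MULTIPLIER = 2.0
MULTISTRIKE_MULTIPLIER = 0.3


def normalize_hit(amount, hit_type):
    """Convert one logged damage value to its equivalent normal-hit value."""
    if hit_type == "normal":
        return amount
    if hit_type == "crit":
        return amount / CRIT_MULTIPLIER
    if hit_type == "multistrike":
        return amount / MULTISTRIKE_MULTIPLIER
    raise ValueError(f"unknown hit type: {hit_type!r}")


# Hypothetical sample: (damage, hit type) pairs copied out of a log by hand.
sample = [(1000, "normal"), (2000, "crit"), (300, "multistrike")]
normalized = [normalize_hit(amount, kind) for amount, kind in sample]
print(normalized)
```

If the normalized values from crits or multistrikes don’t line up with the normal hits, that’s a hint that one of the two multipliers is wrong for your setup.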

Furthermore, we know we’re going to cast Judgment, but we haven’t specified how we’re going to cast it. Are we going to cast it on cooldown? Usually that’s the case, but sometimes we might have to wait longer to avoid another unwanted effect (think a single-target version of the Double Jeopardy glyph). Are we going to cast it on a single dummy, or tab-target around between different ones (for example, to test Double Jeopardy)? If so, must those dummies all be the same level? Multi-target considerations are obviously even more important when testing AoE abilities.

Are we only going to cast Judgment, or are we going to cast other things while we’re at it? Maybe we’ll do a single data collection run that combines multiple tests – say, simultaneously measuring Judgment, Crusader Strike, and Avenger’s Shield damage. If we’re casting multiple things, are we sure they don’t interact at all?

All of these are questions you’ll need to consider when deciding on your experimental method or procedure. No matter what we decide to do with the data or how we decide to collect it, we should know the entire plan ahead of time to make sure we’re collecting the right thing. It can be incredibly frustrating to take several hours’ worth of data, and later during analysis find out that your measurements depend on another factor that you forgot to record.

This means that every time we swap gear, we might want to write down everything – the value of all potential variables (strength, agility, intellect, attack power, spell power, mastery, crit, multistrike, haste, versatility) – before we start taking data. That way, if we start analyzing our data and find an anomaly, we have some information we can use to determine what the problem was.

For example, maybe we accidentally removed a piece with versatility on it that we intended to keep on for the entire test. If we’ve been recording versatility before every new set of data, we might be able to catch that after-the-fact and be able to salvage that data set, or at least know why we need to exclude it. Without that knowledge, we might have to re-take the data.
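If you keep those notes in a machine-readable form, this kind of after-the-fact check is easy to automate. The sketch below is one possible way to do it in Python; the stat names, the numbers, and the helper `changed_stats` are all made up for illustration.

```python
def changed_stats(before, after):
    """Return the names of stats whose values differ between two snapshots."""
    names = set(before) | set(after)
    return {name for name in names if before.get(name) != after.get(name)}


# Hypothetical snapshots written down before two data-collection runs.
run_1 = {"attack_power": 5000, "crit": 800, "mastery": 600, "versatility": 150}
run_2 = {"attack_power": 4600, "crit": 800, "mastery": 600, "versatility": 0}

# Attack power is the independent variable, so it is allowed to change.
unexpected = changed_stats(run_1, run_2) - {"attack_power"}
print(sorted(unexpected))  # any stat listed here changed unintentionally
```

Here the check would flag versatility, which tells us the second run either needs to be excluded or corrected.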

Again, if you’re very certain that a particular stat doesn’t matter (haste, for example, in our case), you can skip recording it. In practice, I rarely record everything. By now, I’m familiar enough with what factors matter that I generally write down only the handful of things I care about. That sort of intuition will come with time, practice, and familiarity with the mechanics. But even I make mistakes, and end up re-taking data (versatility is especially bad in that regard, because I’m still not used to it), so for a beginner I’d recommend erring on the side of caution.

Data Collection

The next question we need to answer is how much data we need to collect. This will vary somewhat from experiment to experiment. For example, if we just want to know whether Judgment procs Seal of Truth, we might only need a single cast. But more often, we’ll need to invoke some statistics. In this section, we’ll give a brief overview of two common ways to use statistics to determine just how much data we need.

Unknown Proc Rate

For example, let’s say we’re trying to accurately determine the proc rate of Seal of Insight. We expect we’ll need to record a lot of auto-attacks and count the number that generate a proc. We can use statistics to figure out how many swings we need to make, at minimum, to feel confident in our result. That amount could be a few hundred swings or even several thousand depending on how accurately we want to know the proc rate.

Proc-based effects are usually modeled by a binomial distribution because they’re discrete events with two potential outcomes (proc or no proc), every proc chance is independent (usually), and the proc rate is constant (again, usually). Most of the time, we can use something called the Normal Approximation Interval to estimate the possible error of our measurements, which we can reverse-engineer to figure out the number of swings we need.

In short, thanks to the Central Limit Theorem we can approximate the error in our measurement of a proc chance $p$ with the following formula:

$$ p \pm z\sqrt{\frac{p(1-p)}{N}}$$

where $N$ is the number of trials (in our case, swings) and $z$ is a constant that depends on how confident we want to be in the result. If you want to know how to calculate $z,$ read up on the standard normal distribution, but most of us just use one of several precalculated values. The most common one is to use $z =1.96$, which corresponds to a 95% confidence interval. Other common values are $z=2.58$ for a 99% confidence interval and $z=3.29$ for a 99.9% confidence interval. If you’re lazy (like me) or don’t feel like memorizing or looking up those numbers, you can use $z=2$ and $z=3$ as rough approximations of the 95% and 99+% confidence intervals.
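If you’d rather compute $z$ than look it up, a short Python sketch using the standard library’s `statistics.NormalDist` (Python 3.8 or later) is shown below. The function name `z_for_confidence` is just an illustrative choice.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z value for a confidence level such as 0.95."""
    tail = (1.0 - confidence) / 2.0  # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.95, 0.99, 0.999):
    print(level, round(z_for_confidence(level), 2))
```

This should reproduce the precalculated values quoted above to two decimal places.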

The way this is normally used, you’d first run an experiment to collect data. So maybe we perform 100 swings and get a proc on 25 of them. We would then have $p=25/100=0.25$ and $N=100$, and our 95% confidence interval is $0.25 \pm 0.0849$, in other words from 16.51% to 33.49% – a pretty wide range. We’d get a narrower range if we performed $N=1000$ swings and got 250 procs; $p$ is still 0.25, but the 95% confidence interval shrinks to $\pm 0.0268$, or from 22.32% to 27.68%. Since we’re dividing by $\sqrt{N}$ in the formula, to increase the precision by another decimal place (factor of 10) we need to use 100 times as many trials.
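The arithmetic in that example is simple enough to script. Here is a minimal Python sketch of the normal approximation interval; `proc_rate_interval` is an illustrative name, and the default $z=1.96$ assumes a 95% confidence interval.

```python
import math


def proc_rate_interval(procs, swings, z=1.96):
    """Normal-approximation confidence interval for a measured proc rate."""
    p = procs / swings
    half_width = z * math.sqrt(p * (1.0 - p) / swings)
    return p - half_width, p + half_width


for procs, swings in ((25, 100), (250, 1000)):
    low, high = proc_rate_interval(procs, swings)
    print(f"{procs}/{swings}: {low:.4f} to {high:.4f}")
```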

And of course, we can also use this formula to figure out how many iterations we need to reach a certain precision. Let’s say we want to know the value to precision $P=\pm 0.001$. We can set $P$ equal to the term that describes the interval:

$$ P = z \sqrt{\frac{p(1-p)}{N} }$$

and solve for $N$:

$$ N = \frac{z^2 p(1-p)}{P^2}$$

So for example, let’s say we suspect the proc rate is $0.25$ (this would be our hypothesis). If we want to know the proc rate to a precision of $\pm 0.001$ with 95% confidence, we need $N=720300$ melee swings.
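The same calculation as a small Python sketch, in case you want to try other proc rates or precisions. The name `swings_needed` is illustrative, and the result is only as good as the guess you feed in for the proc rate.

```python
import math


def swings_needed(expected_rate, precision, z=1.96):
    """Trials needed so the interval half-width is at most `precision`."""
    n = z * z * expected_rate * (1.0 - expected_rate) / precision ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n, 6))


print(swings_needed(0.25, 0.001))  # precision of +/- 0.1%
print(swings_needed(0.25, 0.01))   # precision of +/- 1%
```

Relaxing the precision by a factor of 10 cuts the required number of swings by a factor of 100, which is usually the difference between an afternoon and a week of hitting a target dummy.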

There are two caveats here. First, this formula is only an approximation, which means it’s got a range of validity. In particular, it becomes poor if $p$ is very close to zero or one and breaks down entirely if it’s exactly zero or one (though in those cases, the behavior is usually clear enough that we don’t need this method anyway). The rule of thumb is that it gives good results as long as $pN>5$ and $(1-p)N>5$. Since this rule includes $N$, you can still use this approximation when $p$ is very small by increasing the number of trials $N$ to keep the product over 5.
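The rule of thumb is easy to check before committing to a long data-collection session. A tiny Python sketch (the helper name `normal_approx_ok` is made up for illustration):

```python
def normal_approx_ok(p, n, threshold=5):
    """Rule of thumb: expected procs and expected non-procs both exceed 5."""
    return p * n > threshold and (1.0 - p) * n > threshold


# A 1% proc rate fails the check at 100 swings but passes at 1000.
print(normal_approx_ok(0.01, 100))
print(normal_approx_ok(0.01, 1000))
```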

The second caveat is very subtle. Technically speaking, if we find the 95% confidence interval to be $0.25 \pm 0.05$, or 20% to 30%, that does not mean that the true value of the proc rate (let’s call it $\mu$) has a 95% chance to be between 20% and 30%. Instead, it means that if we repeat the experiment 100 times and calculate a confidence interval for each repetition, we expect about 95 of those confidence intervals to contain the true value $\mu$.

The Wikipedia article for confidence interval makes this distinction as clear as mud (though to be fair, it’s better than any other treatment of it I’ve read; $1-\alpha$ is their representation of confidence, so for a 95% confidence interval $\alpha=0.05$):

This does not mean there is 0.95 probability that the value of parameter μ is in the interval obtained by using the currently computed value of the sample mean.

Instead, every time the measurements are repeated, there will be another value for the mean X of the sample. In 95% of the cases μ will be between the endpoints calculated from this mean, but in 5% of the cases it will not be.

The calculated interval has fixed endpoints, where μ might be in between (or not). Thus this event has probability either 0 or 1. One cannot say: “with probability (1 − α) the parameter μ lies in the confidence interval.” One only knows that by repetition in 100(1 − α) % of the cases, μ will be in the calculated interval. In 100α% of the cases however it does not. And unfortunately one does not know in which of the cases this happens. That is (instead of using the term “probability”) why one can say: “with confidence level 100(1 − α) %, μ lies in the confidence interval.”

Got all that?

In practice, the distinction isn’t that important for us – it’s mostly a matter of semantics. If we were submitting our work to a scientific journal, we’d care, but for theorycrafting we can be a little fast and loose with our statistics. Just don’t tell the real statisticians. The key is to remember that you can run an experiment and get a confidence interval that doesn’t contain the value you’re looking for.

In fact, it’s almost certain that you will see that happen if you’re using 95% confidence intervals, just because a 5% chance is pretty high. That’s one in every 20 experiments. When that happens, you may need to take further measures. That may mean repeating the experiment, or it may mean using a tighter confidence interval. Sometimes it means that you reject your hypothesis, because the value really isn’t inside the confidence interval. This is where the critical thinking aspect comes in – you have to interpret the data and determine what it’s really telling you.
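You can convince yourself of that one-in-20 miss rate with a quick simulation. The Python sketch below fakes many experiments with a known proc rate and counts how often the 95% interval contains it. Everything here (the function names, the 25% rate, 1000 swings per experiment, 2000 experiments, the fixed seed) is an arbitrary choice for illustration, not a measurement.

```python
import math
import random


def interval_contains(true_rate, swings, z, rng):
    """Simulate one experiment; report whether its interval covers true_rate."""
    procs = sum(rng.random() < true_rate for _ in range(swings))
    p = procs / swings
    half_width = z * math.sqrt(p * (1.0 - p) / swings)
    return p - half_width <= true_rate <= p + half_width


def coverage(true_rate=0.25, swings=1000, experiments=2000, z=1.96, seed=1):
    """Fraction of simulated experiments whose interval covers true_rate."""
    rng = random.Random(seed)
    hits = sum(interval_contains(true_rate, swings, z, rng)
               for _ in range(experiments))
    return hits / experiments


print(coverage())  # expected to land near 0.95, not exactly on it
```

Roughly 5% of the simulated experiments produce an interval that misses the true value, even though nothing went wrong in any of them.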

Remember that you can always increase $z$ to calculate a more inclusive confidence interval, which doesn’t require taking extra data. Sometimes that will answer the question for you (“The result is outside the 99.9% confidence interval, the hypothesis is probably wrong, Seal of Insight’s proc rate is not 25%”). And on the other extreme, you can increase $N$ to reduce the size of the confidence interval if you’re trying to increase the precision of an estimate. Though that obviously means taking more data!

Unknown Proc Trigger

Sometimes, we just want to know if an effect can occur. For example, on beta I was doing some testing to see exactly how Sacred Shield benefits from multistrike. To test that, I just kept re-applying Sacred Shield until I saw it generate bubbles that didn’t match the baseline or crit values. A single multistrike should generate a shield that is 30% larger than the baseline one, so I just kept going until I observed one. Likewise, I wanted to know if a double-multistrike proc would generate a shield that was 60% larger, so I kept casting until I observed one of those as well.

This type of test, where a single positive result proves the hypothesis, can be very easy to perform if the chance is high. If I get a positive result very quickly, the test won’t take very long at all. But if the chance is low, you could be at it all day. And if the chance is actually zero (because your hypothesis is wrong), you could go forever without seeing the event you’re looking for. Again, statistics help us make the determination as to how many times we need to repeat the test before we can say with reasonable certainty that the event can’t happen.

Generally in this type of test, you already know the proc rate. For example, if I have 10% crit and I want to know if a particular ability can crit, I know that it should have either a 0% chance (if it can’t crit) or a 10% chance (if it can crit) to do so. So my $p$ here is 10%, or $p=0.10$. According to binomial statistics, if I perform the test $N$ times, the probability of getting exactly $k$ successes is calculated by:

$$Pr(k; N, p) = {N \choose k} p^k (1-p)^{ N-k},$$

where

$$ {N \choose k} = \frac{N!}{k!(N-k)!}.$$

This is relatively easy to evaluate in a calculator or computer for known values of $p$, $N$, and $k$. Many calculators have a “binopdf” function that will do the entire calculation; in others you may need to calculate the whole thing by hand.

So let’s say we perform the experiment. We cast the spell 100 times and don’t observe a single crit. The probability of that, according to our formula, is:

$$Pr(0; N, p) = {N \choose 0} p^0 (1-p)^N = (1-p)^N$$

Plugging in $N=100$ and $p=0.1$ we find that the probability is around $2.66\times 10^{-5}$, or around 0.00266%. Pretty unlikely, so we can probably safely assume that the ability can’t crit. Though again, there’s a 0.00266% chance we could be wrong!
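Turning that formula around also tells us how long to keep going before giving up: solving $(1-p)^N \leq \epsilon$ for $N$ gives $N \geq \ln\epsilon / \ln(1-p)$, where $\epsilon$ is the chance of a false negative we’re willing to accept. A minimal Python sketch follows; the function names are illustrative.

```python
import math


def prob_no_successes(p, n):
    """Chance of seeing zero successes in n tries if the true chance is p."""
    return (1.0 - p) ** n


def tries_to_rule_out(p, max_false_negative):
    """Smallest n for which zero successes has probability <= the given limit."""
    return math.ceil(math.log(max_false_negative) / math.log(1.0 - p))


print(prob_no_successes(0.10, 100))
print(tries_to_rule_out(0.10, 0.05))
```

With a 10% crit chance, a few dozen crit-free casts already make “it can crit” look doubtful, and 100 casts make it very unlikely.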

A related problem is if we want to know the probability of getting up to a certain number of procs or crits. For example, if we have 10% crit, what is the probability of getting five or fewer crits in 100 casts? To do that, we’d have to calculate the separate probabilities of getting exactly 0, 1, 2, 3, 4, and 5 crits and sum them all up. Mathematically, that would be:

$$ P(k\leq 5; N, p) = \sum_{k=0}^{5}P(k;N,p) = \sum_{k=0}^5 {N \choose k }p^k (1-p)^{N-k},$$

and at this point we’d probably want to employ a calculator or computer to do the heavy lifting for us. For our example, MATLAB gives us:

>> sum(binopdf(0:5,100,0.1))

ans =

 0.0576

So we’d have a 5.76% chance of getting five or fewer crits in 100 casts with a 10% crit chance.
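If you don’t have MATLAB handy, the same sum is only a few lines in Python using `math.comb` from the standard library (Python 3.8 or later). The helper names `binom_pmf` and `binom_cdf` are my own for this sketch.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_cdf(k_max, n, p):
    """Probability of k_max or fewer successes in n independent trials."""
    return sum(binom_pmf(k, n, p) for k in range(k_max + 1))


print(round(binom_cdf(5, 100, 0.1), 4))
```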

Ability Damage Tests

Since we started this post discussing a test related to the damage formula of Judgment, I want to make one final note about testing ability damage formulas. Normally, you can get this information from tooltips or datamining. But tooltips can lie, especially during a beta, and sometimes even on live servers. Word of Glory’s tooltip was wrong for almost the first half of Mists of Pandaria, for example. So it can be useful to perform tests to double check them. One of my first tasks every beta is to take data on every spell in our arsenal and attempt to fit the results to confirm whether the tooltips are correct.

Your first instinct when performing this type of experiment may be to cast the spell a few hundred times and record the average damage, and then repeat at various different attack power (or spell power) values. However, while that works, it’s not always the most accurate (or efficient) way. Hamlet wrote an excellent article about this topic earlier this week, and you should really go read it if you want to understand why an alternative method (which I’ll briefly outline below, since I already had it written) is advantageous.

Spells in WoW traditionally have had a base damage range (either by default, or based on weapon damage) and then some constant scaling with attack power and/or spell power. The base damage range was fixed and obeyed a uniform distribution, and accounted for all damage variation in the spell. So it was often more accurate to record the maximum, minimum, and average for each set of casts, and then attempt to fit the maximum and the minimum values separately.
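To make that concrete, here is a minimal Python sketch of fitting the minimum and maximum separately with an ordinary least-squares line. The numbers are made up purely for illustration, and `fit_line` is a hand-rolled helper rather than anything from a particular library. Under the model described above, both fits should share the same slope (the attack power coefficient), and the two intercepts should give the ends of the base damage range.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up numbers for illustration only: attack power for each gear set,
# and the lowest and highest non-crit hit observed at that attack power.
attack_power = [1000, 2000, 3000, 4000]
min_damage = [600, 1100, 1600, 2100]
max_damage = [700, 1200, 1700, 2200]

min_slope, min_base = fit_line(attack_power, min_damage)
max_slope, max_base = fit_line(attack_power, max_damage)
print(min_slope, min_base)
print(max_slope, max_base)
```

If the two slopes disagree noticeably, that usually means either you haven’t taken enough casts to see the true minimum and maximum, or the spell doesn’t follow the assumed form.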

This was especially useful for abilities that did some percentage of weapon damage, because one could equip a weapon with a very small damage range (i.e. certain low-level common-quality weapons), at which point it might only take a handful of casts to cover the entire range.

I’ve used the past tense here, because they’ve changed how abilities work in Warlords. They no longer have any base damage values, which means that they’ve had to change the method they use to make spell damage vary from cast to cast. I don’t know what they’ve chosen to do about that, because I haven’t had time to test it thoroughly. In my limited time on beta, I’ve noted that some spells don’t appear to vary at all anymore, while others do. For example, Judgment and Exorcism both do the same damage every time they connect, while the healing done by Flash of Light and Word of Glory still varies from cast to cast. Abilities based on weapon damage, like Crusader Strike and Templar’s Verdict, vary according to their weapon damage range.

The ones that vary could just use a flat multiplicative effect, such that the spell always does $X \pm \alpha X$ for some value of $\alpha$. In other words, maybe it always does $\pm 10\%$ of the base damage. But it could also be some other method. I’m sure we’ll figure this out as beta goes on (if nobody else has yet), but just keep in mind that this slightly changes the procedure above. You’d still be matching the min and max values, of course, but you’d potentially be looking for scale factors that are, say, 10% larger or smaller than the expected mean value.
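If the flat multiplicative guess is right, the observed extremes are $X(1-\alpha)$ and $X(1+\alpha)$, so $X$ is their midpoint and $\alpha = (\text{max}-\text{min})/(\text{max}+\text{min})$. A minimal Python sketch of that bookkeeping is below. It assumes this particular variation model, which hasn’t been confirmed, and assumes you’ve taken enough casts to actually see both extremes; the function name and sample numbers are illustrative.

```python
def estimate_variation(min_hit, max_hit):
    """Recover the mean X and fractional spread alpha, assuming X*(1 +/- alpha)."""
    mean = (max_hit + min_hit) / 2.0
    alpha = (max_hit - min_hit) / (max_hit + min_hit)
    return mean, alpha


# Hypothetical observed extremes from one set of casts.
print(estimate_variation(900, 1100))
```

If $\alpha$ comes out the same at several different attack power values, that supports the multiplicative model; if it shrinks as attack power goes up, something closer to the old fixed base range is probably at work.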

Coming Soon

That wraps up our primer on performing in-game experiments. We could talk in a lot more detail about any of these examples and identify other nuances, tips, and tricks. But this post, while dense, covers the basics one would need to set up and perform an in-game experiment. Most of what we would gain by going into more depth is improvements in accuracy and efficiency.

Obviously both of those are good things, but they’re not necessary for your first few attempts at in-game experimentation. In the future, I might write a few shorter articles that are more focused on the nuances involved in a particular kind of measurement, provided there’s interest in the topic. If you have something in particular you’d like me to write about in depth, please mention it in the comments.

In the next post, I want to look at how we can use the results of in-game experiments to check Simulationcraft results. Which also means designing and executing “experiments” in Simulationcraft that we can use for comparison. Many of the same basic ideas will apply, of course; for example, eliminating as many variables as possible and making sure you’ve collected enough data. But in this case, we’ll be applying those principles to designing action priority lists and interpreting reports.

This entry was posted in Theck's Pounding Headaches, Theorycrafting, Uncategorized.

12 Responses to TC101: Experimental Design

  1. Hamlet says:

    The possibility I’m worried about is that we’ll look at the new random damage mechanic and find that it uses a normal distribution :P .

  2. Tempus says:

    This is really interesting, breaking it down into something understandable. It’s cool seeing how this ties into how scientists collect data to learn about the world.

  3. Capstone says:

    http://us.battle.net/wow/en/forum/topic/13087818929?page=19#373

    Seems likely they would use something similar for player spells as well…

    • Theck says:

      As Sofie said, that system is a particular replacement for pet battles. That doesn’t mean they can’t use it for player abilities, but traditionally player abilities have used a uniform distribution. The simplest change for Blizzard would be to just redefine the damage range and use the existing damage calculation code.

  4. Sofie says:

    Typo: “I know that it should have either a 0% chance (if it can’t crit) or a 10% chance (if it can crit) to do so. So my p here is 25%, or p=0.25.”

    Capstone, those battle pet abilities are different. They were originally 80-95% hit abilities, which were changed to 100% because missing had too big impact. But they still wanted to keep them different from the original 100% hit abilities, so they added that variance.

  5. Pingback: TC101: Testing Simulationcraft | Sacred Duty

  6. ironshield says:

    At Blizzcon this year try and corner as many drunken Devs as you can lay your hands on and suggest they add a stat-sink trinket that will just reduce the amount of a selectable stat by a selectable amount. No use to anyone as it reduces stats, but would make your experiments much easier. You might get it in via alcohol fuelled “inception” ;)

  7. jeremiah says:

    Hi, I am loving these articles. I love warcraft, I teach discrete math, and I gear many examples towards game design.

    I think you have a small typo here:
    “So we’d have a 5.76% chance of getting less than five crits in 100 casts with a 10% crit chance.”
    In keeping with the math immediately preceding, I believe you meant “So we’d have a 5.76% chance of getting five or less crits in 100 casts with a 10% crit chance.”

    Thanks for the great read!

    -Jeremiah
