Elric
Posts: 213
« on: October 09, 2004, 09:09:08 pm »
Margin of Error and Joe Random vs. Joe Awesome
With the presidential campaign going on right now, political polling is incredibly widespread. One of the most difficult parts of a poll to interpret is the margin of error. Knowing how the margin of error works is quite important to understanding the results of a poll.
Determining how good a Magic deck is works very much like taking a political poll. You play a number of games, of which you win a given percentage. The higher percentage a deck wins, the better it is. However, the win percentage of a deck is subject to the same error that any other repeated test is. Winning 6 out of 10 games does not tell you that the match is 60-40, or even roughly 60-40.
This margin of error is the reason it is so hard to convince anyone that a deck is good. Using statistics, you can construct a confidence interval (say, at the 95% level) for a deck's win percentage. That is, if you played the matchup an infinite number of times instead of the n times you just played it (same decks each time, no learning curve), what interval can you be 95% confident the true win percentage falls into?
When you have played n games, this interval is the win percentage you measured (over the n games), plus or minus 0.98/sqrt(n). So when you play 40 games and you want a 95% confidence interval, your margin of error is 15.5% (that is, 95% of the time the true win percentage would be within 15.5% above or below what you measured).
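To make that concrete, here is a minimal Python sketch of the calculation (the function name margin_of_error is my own label for illustration, not anything from the post):

# Minimal sketch (Python): 95% margin of error for a measured win rate,
# using the conservative bound from the post: plus or minus 0.98 / sqrt(n).
import math

def margin_of_error(n_games: int) -> float:
    """Half-width of a 95% confidence interval after n_games games."""
    return 0.98 / math.sqrt(n_games)

# 40 games: margin of error is about 15.5%, as in the text.
print(f"40 games: +/- {margin_of_error(40):.1%}")
# Winning 6 of 10 games: 60% +/- about 31%, so the true win rate could
# plausibly be anywhere from roughly 29% to 91%.
print(f"10 games: +/- {margin_of_error(10):.1%}")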
Let's suppose that you know exactly what the metagame will be, either that it is a single deck or that it is a set probability of a number of decks. Then you try to design a deck that beats these decks the majority of the time. To be 95% confident that a deck has a favorable matchup against the field, you need to play n games such that your win percentage during those games is greater than or equal to 1/2 + 0.98/sqrt(n). After 100 games your win percentage would have to be 60% for the matchup to be a statistically significant amount in your favor.
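Here is a similar sketch for the win rate you would need before calling a matchup favorable at the 95% level (required_win_rate is just an illustrative name, and the game counts are arbitrary examples):

# Minimal sketch (Python): measured win rate needed over n games before a
# matchup is statistically favorable at the 95% level (1/2 + 0.98 / sqrt(n)).
import math

def required_win_rate(n_games: int) -> float:
    """Win rate needed to be 95% confident the matchup is favorable."""
    return 0.5 + 0.98 / math.sqrt(n_games)

for n in (20, 40, 100, 400):
    print(f"{n:4d} games: need to win at least {required_win_rate(n):.1%}")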
This tells you why so many promising ideas don't amount to much, even if an idea does well initially in testing. The deck that tests the best for you is the deck that you have the highest probability of having overestimated. It's still the deck most worth playing on the basis of that testing alone.
Here's where the difference between Joe Random and Joe Awesome comes in. If someone goes to a large local tournament (but not a national tournament like GenCon) and does well with a new deck type, how that result is received depends on whether they are Joe Random or Joe Awesome.
If you are Joe Random, you are going to have a hard time convincing anyone that your results are replicable. Whether or not your deck is good, there will never be enough data points to tell (the tournaments you play in are a completely insignificant set of data, unless you never lose a game). In addition, metagames are not identical and static. There is no reason to assume that availability of cards, choice of decks, or playskill is constant in other areas, or even in the same area a month from now.
Now, let's suppose you are Joe Awesome. You build a deck and go do well with it. Your seal of approval and good finish, while statistically irrelevant, are taken as a sign of the deck being good. Your decision to play the deck might be evidence it is good, since you presumably tested it extensively before taking it to a tournament. Other people test the deck and try to improve it. Some of the changes will be open to debate, and people will play different configurations of the deck. Most importantly, your deck will be evaluated. If it is good, you will get the credit.
I don't have any definitive statements to make on this issue, but I think it is important to note that what you can observe directly about a deck in tournaments is not significant unless the deck is widely played for a relatively extended period of time. As such, other factors, from intuition to the identity of the players, will be used to identify meaningful results. As views on Fish (especially Standstill) show, you ignore results at your own peril.
I also don't want this to come across as a critique of players who win tournaments. The Atog Lord's Waterbury Control Slaver deck, for example, had a lot of good, new card choices that worked out very well. If you had to pick a best deck in Vintage based solely on knowing that he won Waterbury with Control Slaver, you would probably pick Control Slaver. Due to a high margin of error, though, a single tournament win cannot distinguish a dominant deck.
Additionally, this analysis leaves out all kinds of factors, especially playskill. As I've opined before, Rich benefited from being one of the few people who can play Control Slaver, a particularly difficult deck, very well.
Here is the math behind the margin of error calculation. You should probably only read it if you like statistics. I'm sure lots of people on TMD know how a statistics problem like this would work. However, no one ever says "Don't post win percentages" and then gives an explanation based on the margin of error.
The explanation would go like this: even if you are testing a matchup accurately under the exact same circumstances as everyone else, your results are subject to a high margin of error, so you shouldn't assume that your measurements are particularly accurate. In addition, there is a selection bias: people with unusual results are more likely to post them, making posted results less useful.
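As a rough illustration of that selection bias, here is a toy Python simulation; the specific numbers (a true 50-50 matchup, 10 games per tester, and a 7-win cutoff for posting) are assumptions of mine, not anything from the original post:

# Toy simulation (Python): assume the true matchup is exactly 50-50 and many
# people each play 10 games, but only those who win 7 or more bother to post.
import random

random.seed(0)
TRUE_WIN_RATE = 0.5
GAMES_PER_TESTER = 10
POST_CUTOFF = 7          # only "impressive" records get posted (assumed)
N_TESTERS = 10_000

posted = []
for _ in range(N_TESTERS):
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(GAMES_PER_TESTER))
    if wins >= POST_CUTOFF:
        posted.append(wins)

avg_posted = sum(posted) / (len(posted) * GAMES_PER_TESTER)
print(f"True win rate: {TRUE_WIN_RATE:.0%}")
# Typically comes out around 74%, even though the real matchup is 50-50.
print(f"Average win rate among posted results: {avg_posted:.0%}")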
Math: You have n independent, identically distributed trials, where you win with a set probability Pactual each time. This probability Pactual is unknown, so you have to estimate it. Estimate it with the measured win proportion (number of wins / total number of games played), Pmeasured. It has mean Pactual and variance Pactual*(1-Pactual)/n.
For large n, the distribution of Pmeasured is approximately normal (bell-curve) with mean Pactual and variance Pactual*(1-Pactual)/n. We don't know Pactual, but since Pactual*(1-Pactual) <= 1/4, just assume the equality. This is the highest the variance can get (it corresponds to Pactual = 1/2; otherwise the variance would be lower), but since we don't know the actual value, assuming the variance to be higher rather than lower errs on the safe side.
Then variance = 1/(4n). Standard deviation = sqrt(variance) = 1/[2*sqrt(n)]. A 95% confidence interval is your measured average plus or minus 1.96 times the standard deviation (this is a standard result for normal distributions). So the 95% confidence interval is your measured average plus or minus 0.98/sqrt(n).
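For comparison, here is a small Python sketch contrasting the conservative margin 0.98/sqrt(n) (from the worst-case variance 1/(4n)) with the margin you would get by plugging the measured win rate directly into 1.96*sqrt(p*(1-p)/n); the example values n = 100 and p = 0.6 are arbitrary:

# Sketch (Python): conservative margin of error vs. plug-in margin of error.
import math

def conservative_margin(n: int) -> float:
    # Worst-case variance 1/(4n), i.e. 0.98 / sqrt(n)
    return 1.96 * math.sqrt(0.25 / n)

def plug_in_margin(p_measured: float, n: int) -> float:
    # Uses the measured win rate instead of the worst case
    return 1.96 * math.sqrt(p_measured * (1 - p_measured) / n)

n, p = 100, 0.6
print(f"Conservative: +/- {conservative_margin(n):.1%}")   # about 9.8%
print(f"Plug-in (p=0.6): +/- {plug_in_margin(p, n):.1%}")  # about 9.6%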
Edit: fixed formatting problems.