Elric
Posts: 213
« on: October 09, 2004, 09:09:08 pm »
Margin of Error and Joe Random vs. Joe Awesome
With the presidential campaign going on right now, political polling is incredibly widespread. One of the most difficult parts of a poll to interpret is the margin of error. Knowing how the margin of error works is quite important to understanding the results of a poll.
Determining how good a Magic deck is works very much like taking a political poll. You play a number of games, of which you win a given percentage. The higher percentage a deck wins, the better it is. However, the win percentage of a deck is subject to the same error that any other repeated test is. Winning 6 out of 10 games does not tell you that the match is 60-40, or even roughly 60-40.
This margin of error is the reason it is so hard to convince anyone that a deck is good. Using statistics, you can construct a confidence interval (say, at the 95% level) for a deck's win percentage. That is, if you played the matchup an infinite number of times instead of the n times you just played it (same decks each time, no learning curve), what interval can you be 95% confident the true win percentage falls into?
When you have played n games, this interval is the win percentage you measured (over the n games), plus or minus 0.98/sqrt(n). So when you play 40 games and you want a 95% confidence interval, your margin of error is 15.5% (that is, 95% of the time the true win percentage would be within 15.5% above or below what you measured).
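To make that concrete, here is a minimal Python sketch of the calculation (the function name margin_of_error is my own label for illustration, not anything from the post):

# Minimal sketch (Python): 95% margin of error for a measured win rate,
# using the conservative bound from the post: plus or minus 0.98 / sqrt(n).
import math

def margin_of_error(n_games: int) -> float:
    """Half-width of a 95% confidence interval after n_games games."""
    return 0.98 / math.sqrt(n_games)

# 40 games: margin of error is about 15.5%, as in the text.
print(f"40 games: +/- {margin_of_error(40):.1%}")
# Winning 6 of 10 games: 60% +/- about 31%, so the true win rate could
# plausibly be anywhere from roughly 29% to 91%.
print(f"10 games: +/- {margin_of_error(10):.1%}")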
Let's suppose that you know exactly what the metagame will be, either that it is a single deck or that it is a set probability of a number of decks. Then you try to design a deck that beats these decks the majority of the time. To be 95% confident that a deck has a favorable matchup against the field, you need to play n games such that your win percentage during those games is greater than or equal to 1/2 + 0.98/sqrt(n). After 100 games your win percentage would have to be 60% for the matchup to be a statistically significant amount in your favor.
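Here is a similar sketch for the win rate you would need before calling a matchup favorable at the 95% level (required_win_rate is just an illustrative name, and the game counts are arbitrary examples):

# Minimal sketch (Python): measured win rate needed over n games before a
# matchup is statistically favorable at the 95% level (1/2 + 0.98 / sqrt(n)).
import math

def required_win_rate(n_games: int) -> float:
    """Win rate needed to be 95% confident the matchup is favorable."""
    return 0.5 + 0.98 / math.sqrt(n_games)

for n in (20, 40, 100, 400):
    print(f"{n:4d} games: need to win at least {required_win_rate(n):.1%}")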
This tells you why so many promising ideas don't amount to much, even if an idea does well initially in testing. The deck that tests the best for you is the deck that you have the highest probability of having overestimated. It's still the deck most worth playing on the basis of that testing alone.
Here's where the difference between Joe Random and Joe Awesome comes in. If someone goes to a large local tournament (but not a national tournament like GenCon) and does well with a new deck type, how that result is received depends on whether they are Joe Random or Joe Awesome.
If you are Joe Random, you are going to have a hard time convincing anyone that your results are replicable. Whether or not your deck is good, there will never be enough data points to tell (the tournaments you play in are a completely insignificant set of data, unless you never lose a game). In addition, metagames are not identical and static. There is no reason to assume that availability of cards, choice of decks, or playskill is constant in other areas, or even in the same area a month from now.
Now, let's suppose you are Joe Awesome. You build a deck and go do well with it. Your seal of approval and good finish, while statistically irrelevant, are taken as a sign of the deck being good. Your decision to play the deck might be evidence it is good, since you presumably tested it extensively before taking it to a tournament. Other people test the deck and try to improve it. Some of the changes will be open to debate, and people will play different configurations of the deck. Most importantly, your deck will be evaluated. If it is good, you will get the credit.
I don't have any definitive statements to make on this issue, but I think it is important to note that what you can observe directly about a deck in tournaments is not significant unless the deck is widely played for a relatively extended period of time. As such, other factors, from intuition to the identity of the players, will be used to identify meaningful results. As views on Fish (especially Standstill) show, you ignore results at your own peril.
I also don't want this to come across as a critique of players who win tournaments. The Atog Lord's Waterbury Control Slaver deck, for example, had a lot of good, new card choices that worked out very well. If you had to pick a best deck in Vintage based solely on knowing that he won Waterbury with Control Slaver, you would probably pick Control Slaver. Due to a high margin of error, though, a single tournament win cannot distinguish a dominant deck.
Additionally, this analysis leaves out all kinds of factors, especially playskill. As I've opined before, Rich benefited from being one of the few people who can play Control Slaver, a particularly difficult deck, very well.
Here is the math behind the margin of error calculation. You should probably only read it if you like statistics. I'm sure lots of people on TMD know how a statistics problem like this would work. However, no one ever says "Don't post win percentages" and then gives an explanation based on the margin of error.
The explanation would go like this: even if you are testing a matchup accurately under the exact same circumstances as everyone else, your results are subject to a high margin of error, so you shouldn't assume that your measurements are particularly accurate. In addition, there is a selection bias: people with unusual results are more likely to post them, making posted results less useful.
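As a rough illustration of that selection bias, here is a toy Python simulation; the specific numbers (a true 50-50 matchup, 10 games per tester, and a 7-win cutoff for posting) are assumptions of mine, not anything from the original post:

# Toy simulation (Python): assume the true matchup is exactly 50-50 and many
# people each play 10 games, but only those who win 7 or more bother to post.
import random

random.seed(0)
TRUE_WIN_RATE = 0.5
GAMES_PER_TESTER = 10
POST_CUTOFF = 7          # only "impressive" records get posted (assumed)
N_TESTERS = 10_000

posted = []
for _ in range(N_TESTERS):
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(GAMES_PER_TESTER))
    if wins >= POST_CUTOFF:
        posted.append(wins)

avg_posted = sum(posted) / (len(posted) * GAMES_PER_TESTER)
print(f"True win rate: {TRUE_WIN_RATE:.0%}")
# Typically comes out around 74%, even though the real matchup is 50-50.
print(f"Average win rate among posted results: {avg_posted:.0%}")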
Math: You have n independent, identically distributed trials, where you win with a set probability Pactual each time. This probability Pactual is unknown, so you have to estimate it. Estimate it with the measured win proportion (number of wins / total number of games played), Pmeasured. It has mean Pactual and variance Pactual*(1-Pactual)/n.
For large n, the distribution of Pmeasured is approximately normal (bell-curve) with mean Pactual and variance Pactual*(1-Pactual)/n. We don't know Pactual, but since Pactual*(1-Pactual) <= 1/4, just assume the equality. This is the highest the variance can get (it corresponds to Pactual = 1/2; otherwise the variance would be lower), but since we don't know the actual value, assuming the variance to be higher rather than lower errs on the safe side.
Then variance = 1/(4n). Standard deviation = sqrt(variance) = 1/[2*sqrt(n)]. A 95% confidence interval is your measured average plus or minus 1.96 times the standard deviation (this is a standard result for normal distributions). So the 95% confidence interval is your measured average plus or minus 0.98/sqrt(n).
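For comparison, here is a small Python sketch contrasting the conservative margin 0.98/sqrt(n) (from the worst-case variance 1/(4n)) with the margin you would get by plugging the measured win rate directly into 1.96*sqrt(p*(1-p)/n); the example values n = 100 and p = 0.6 are arbitrary:

# Sketch (Python): conservative margin of error vs. plug-in margin of error.
import math

def conservative_margin(n: int) -> float:
    # Worst-case variance 1/(4n), i.e. 0.98 / sqrt(n)
    return 1.96 * math.sqrt(0.25 / n)

def plug_in_margin(p_measured: float, n: int) -> float:
    # Uses the measured win rate instead of the worst case
    return 1.96 * math.sqrt(p_measured * (1 - p_measured) / n)

n, p = 100, 0.6
print(f"Conservative: +/- {conservative_margin(n):.1%}")   # about 9.8%
print(f"Plug-in (p=0.6): +/- {plug_in_margin(p, n):.1%}")  # about 9.6%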
Edit: fixed formatting problems.