Statistical significance: Is my new bot really better than the previous one?


Before using a formula for statistical significance, we need to get a sense of what it is. We’ll thus compare two extreme cases.

Case 1

You run 10 matches between your two bots and the new one wins 7 of them. Wow, it’s a 70 % win rate, amazing, the new bot is really better!

Oh, really?

Flip 10 fair coins and count heads as wins: you have a ≈ 17.2 % probability of reaching or exceeding that 70 % win rate. In other words, it’s even more probable than rolling a 6 on a fair die (≈ 16.7 %), which is not an impressive feat, to say the least. This 70 % win rate is thus statistically insignificant; the new bot may in fact be worse than the old one!
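The 17.2 % figure can be checked by counting outcomes directly. The snippet below is a minimal sketch, not part of the original article; the function name `tail_probability` is made up for illustration. It counts how many of the 2^10 equally likely head/tail sequences contain at least 7 heads.

```python
from math import comb

def tail_probability(wins: int, matches: int) -> float:
    """Probability that `matches` fair coin flips give at least `wins` heads."""
    # comb(matches, k) counts the sequences with exactly k heads.
    favourable = sum(comb(matches, k) for k in range(wins, matches + 1))
    return favourable / 2 ** matches

print(f"{tail_probability(7, 10):.4f}")  # 176 / 1024, prints 0.1719
print(f"{1 / 6:.4f}")                    # rolling a 6 on a fair die, prints 0.1667
```

This brute-force count is fine for 10 matches but becomes very slow for millions of matches.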

Case 2

You run 1 million matches and the new bot wins 503,500 of them. Meh, the win rate is a tiny 50.35 %, so close to 50 % that the new bot is in fact not better…

Oh, really?

Flip 1 million fair coins (or 1000 times 1000 coins) and count heads as wins (by the way, it’ll take you roughly a month): you have only one chance in 776 billion of reaching or exceeding that 50.35 % win rate. In other words, if every human being on the planet flipped a million coins, there would be only a 1 % chance that someone (you can’t predict who) reaches or exceeds that 50.35 % win rate, and a 99 % chance that nobody on Earth does. Yet your new bot did it on its first try! The new bot may be only slightly better than the old one, but it’s undoubtedly better: the difference is statistically significant.
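The "1 % chance that someone on Earth succeeds" claim follows from the one-in-776-billion figure. Here is a rough sketch of that arithmetic, assuming a world population of about 7.8 billion (the article does not state the number it used) and independent attempts; the names `P_SINGLE`, `POPULATION` and `chance_someone_succeeds` are illustrative only.

```python
from math import expm1, log1p

P_SINGLE = 1 / 776e9   # chance quoted in the text for one person's million flips
POPULATION = 7.8e9     # assumption: rough world population, not given in the text

def chance_someone_succeeds(p: float, people: float) -> float:
    """Probability that at least one of `people` independent tries succeeds."""
    # Computes 1 - (1 - p) ** people; log1p and expm1 keep precision when p is tiny.
    return -expm1(people * log1p(-p))

print(f"{chance_someone_succeeds(P_SINGLE, POPULATION):.1%}")  # about 1 %
```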

Comparing cases

As you can see, the win rate alone isn’t enough to decide whether the new bot is actually better than the old one: you must take the total number of matches into account too. The win rate and the total number of matches together let you calculate the probability that mere blind luck could explain the observed result (17.2 % and 1/776ᴇ9 respectively in the cases above), assuming that the two bots are in fact equivalent. That probability is called the one-tailed p-value. In practice, however, the p-value is often costly to calculate, so we’ll calculate a test statistic instead: in a given context, the higher the test statistic, the lower the p-value, and thus the higher the statistical significance.
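To make the link between a test statistic and the p-value concrete, here is a minimal sketch using one common choice, the z-score of the normal approximation with a continuity correction. This is an assumption for illustration: the text above does not say which test statistic it has in mind, and the names `z_score` and `approx_p_value` are hypothetical.

```python
from math import erfc, sqrt

def z_score(wins: int, matches: int) -> float:
    """Standard deviations by which `wins` exceeds the matches / 2 expected from equal bots."""
    mean = matches / 2
    std_dev = sqrt(matches) / 2            # sqrt(n * 0.5 * 0.5)
    return (wins - 0.5 - mean) / std_dev   # the 0.5 is a continuity correction

def approx_p_value(z: float) -> float:
    """One-tailed p-value under the normal approximation."""
    return 0.5 * erfc(z / sqrt(2))

for wins, matches in [(7, 10), (503_500, 1_000_000)]:
    z = z_score(wins, matches)
    print(f"{wins}/{matches}: z = {z:.3f}, p ~ {approx_p_value(z):.3g}")
```

Case 1 yields a z-score a little below 1 and Case 2 a z-score of about 7, and the approximate p-values land close to the 17.2 % and 1/776ᴇ9 quoted above: the higher statistic goes with the lower p-value.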
