To use the formula for statistical significance correctly and efficiently, several features must or should be implemented.
The following features must be implemented for the statistical significance to be correct.
Games rarely treat both players fairly: most are biased in favor of one of the players, and some are even first-player-win or second-player-win games. So, if the new bot is always player 1 or always player 2, you will mostly measure the statistical significance of the inherent asymmetry between the two players due to the game rules, hiding the actual difference between the bots’ performances. To mitigate this issue, you must make the new bot (measured against the old bot) alternate between player 1 and player 2 (hard-coded alternation, or a random starting position with a 50-50 chance).
Statistical significance is measured by gathering a lot of data, but if you reuse the same data again and again, you’ll increase |T| artificially even though no actual information is added, so the statistical significance will be wrong. Different games in a run and different runs must be independent (new seed each time, even if one of the bots has stochastic parts). This does not mean that an already played seed must never be used again: it can happen, but only by pure chance.
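The hard-coded alternation can be sketched as follows. This is only an illustration: `play_game` is a hypothetical stand-in for your own match runner (here it fakes a game where the first player wins 60% of the time), and the return convention (0, 1 or None) is an assumption.

```python
import random


def play_game(first, second, seed):
    """Hypothetical stand-in for a real match runner.

    Returns 0 if `first` wins, 1 if `second` wins, None for a draw.
    This fake version lets the first player win 60% of the time
    to mimic a game that is biased by its rules.
    """
    rng = random.Random(seed)
    return 0 if rng.random() < 0.6 else 1


def run_alternating(new_bot, old_bot, seeds):
    """Play one game per seed, swapping sides after every game."""
    wins = draws = losses = 0  # counted from the new bot's point of view
    for i, seed in enumerate(seeds):
        new_is_first = (i % 2 == 0)  # even games: new bot is player 1
        if new_is_first:
            result = play_game(new_bot, old_bot, seed)
        else:
            result = play_game(old_bot, new_bot, seed)
        if result is None:
            draws += 1
        elif (result == 0) == new_is_first:
            wins += 1  # the winning side is the side the new bot played
        else:
            losses += 1
    return wins, draws, losses
```

With two identical bots, the first-player advantage is then split between wins and losses instead of landing entirely on one side.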
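One possible way to get a new seed each time is to draw it from the operating system’s entropy source rather than from a fixed list. The sketch below assumes your game accepts an integer seed; the 63-bit width and the function names are arbitrary choices for illustration.

```python
import secrets


def fresh_seed():
    """Draw a 63-bit seed from the operating system's entropy source."""
    return secrets.randbits(63)


def seeds_for_run(n_games):
    """One independently drawn seed per game.

    A seed may repeat, within a run or across runs, but only by chance.
    """
    return [fresh_seed() for _ in range(n_games)]
```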
The following features should be implemented for the use of statistical significance to be efficient.
Don’t wait for the end of the run to display T and ρ: if you have to stop your run early, you’ll lose all the gathered data. It is better to display T and ρ in real time, i.e., after each completed game. If you can display them graphically to visualize the trends of T and ρ, even better!
Note: The expected value of T is proportional to N (the number of matches played), so the data points should lie roughly on a straight line in the plot.
In addition, you could visualize the triplets (W, D, L) and (α₀, N−2α₀, α₀) (see the proof of the formula for statistical significance): this is less useful than T and ρ but would be a nice addition.
In case you must abort a run before it gathers enough data, allow runs to start with a non-zero (W, D, L) triplet from a previous run.
Warning: The data from the previous run must have been gathered under the same conditions (same bot versions, same game version), but the new run must be independent (don’t launch the new run with the same seeds as the previous one).
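Resuming from a previous triplet can be as simple as giving the tally optional starting values. The class below is a minimal sketch; the name `Tally` and the 'W'/'D'/'L' outcome codes are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """(W, D, L) triplet from the new bot's point of view."""
    wins: int = 0
    draws: int = 0
    losses: int = 0

    def record(self, outcome):
        """Add one game; `outcome` is 'W', 'D' or 'L'."""
        if outcome == "W":
            self.wins += 1
        elif outcome == "D":
            self.draws += 1
        elif outcome == "L":
            self.losses += 1
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")

    @property
    def games(self):
        """Total number of games counted so far (N)."""
        return self.wins + self.draws + self.losses


# A new run starts from zero by default...
fresh = Tally()
# ...or from the triplet of an aborted run (made-up numbers, not real results).
resumed = Tally(wins=120, draws=15, losses=98)
```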
If a match does not take all your available cores, you can speed up a run by making it multi-threaded. Just remember to synchronize on the (W, D, L) triplet when updating it.
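The synchronization can be sketched with a lock around the triplet. Everything here is illustrative: `SharedTally`, `run_parallel` and the `play_one` callback are hypothetical names, and `play_one(i)` is assumed to return 'W', 'D' or 'L' for game number i. Note that in CPython, threads only bring a speed-up if the match itself runs outside the interpreter, for example as a subprocess.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedTally:
    """(W, D, L) triplet that several threads can update safely."""

    def __init__(self, wins=0, draws=0, losses=0):
        self._lock = threading.Lock()
        self.wins, self.draws, self.losses = wins, draws, losses

    def record(self, outcome):
        # The lock makes each update atomic with respect to other threads.
        with self._lock:
            if outcome == "W":
                self.wins += 1
            elif outcome == "D":
                self.draws += 1
            else:
                self.losses += 1

    def snapshot(self):
        # Read all three counters under the lock so they are consistent.
        with self._lock:
            return (self.wins, self.draws, self.losses)


def run_parallel(play_one, n_games, n_workers, tally):
    """Run `n_games` matches on `n_workers` threads and return the triplet."""
    def job(i):
        tally.record(play_one(i))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Consuming the iterator re-raises any exception from a worker.
        list(pool.map(job, range(n_games)))
    return tally.snapshot()
```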
If you measure that bot B is significantly better than bot A, and then that bot C is better than bot B, you should check that C is better than A too and that you haven’t ended up in a Rock-Paper-Scissors situation. Ideally, you should test any new bot against all its previous versions, not only the last one or two, but that would require a lot of resources.
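If you keep a record of which bot was measured as significantly better than which, a Rock-Paper-Scissors situation can be detected automatically. The sketch below assumes that record is a set of (winner, loser) pairs; the function name and the data layout are choices made for this example, and it only looks for cycles of three bots.

```python
from itertools import permutations


def find_three_cycles(beats):
    """Return every Rock-Paper-Scissors triple in `beats`.

    `beats` is a set of (winner, loser) pairs. A triple (x, y, z) is
    reported when x beats y, y beats z and z beats x. Each cycle is
    listed once, starting from its alphabetically smallest bot.
    """
    bots = sorted({bot for pair in beats for bot in pair})
    cycles = []
    for x, y, z in permutations(bots, 3):
        if (x == min(x, y, z)
                and (x, y) in beats
                and (y, z) in beats
                and (z, x) in beats):
            cycles.append((x, y, z))
    return cycles


# B beats A and C beats B, but A beats C: a cycle.
print(find_three_cycles({("B", "A"), ("C", "B"), ("A", "C")}))
# B beats A, C beats B and C beats A: no cycle.
print(find_three_cycles({("B", "A"), ("C", "B"), ("C", "A")}))
```

The brute-force search over all triples is fine for a handful of bot versions; with many versions a proper graph cycle search would scale better.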