
AlphaZero like implementation for Oware Abapa game (Codingame)

Marchete

Glossary

These are simplified descriptions; search other AlphaZero papers for more formal explanations. I've simplified the terms to reflect exactly how I'm using them, so they won't be accurate enough for an ML expert:

  • Neural Network/NN: It's just a vector<vector<weights,bias>>. You feed it some inputs in the 0.0..1.0 range and compute input * weight + bias = prediction (a big simplification). For the bot the NN is just a big function approximator. It's very similar to a heuristic score function, where you apply some coefficients to data from the gamestate (in CSB it could be angle to checkpoint, speed, checkpoints passed, etc.), except that the NN does all the parameter tuning and function definition by itself. See the forward-pass sketch after this list.
  • Dense layer / MLP Layer: One of those vector<weights,bias> entries.
  • One-hot: A vector with all zeros except one value set to 1.0. For inputs, one-hot encoding means the following: neural networks expect input values to behave like a gradient, where 0.0 is worse than 1.0. If you have a cell with 3 possible states (empty, white pawn, black pawn) you shouldn't encode it as a single input with values 0.0, 0.5 and 1.0: a black pawn isn't 2x the value of a white pawn. So that single cell should be encoded as a one-hot vector with 3 possible activations: Empty [1.0,0.0,0.0], White [0.0,1.0,0.0], Black [0.0,0.0,1.0]. See the encoding sketch after this list.
  • Value: Given a gamestate, it's a score evaluating the winning chances for the current player. It's similar to the Eval() function of heuristic bots, but in AlphaZero that function is replaced by the output of an NN prediction. Value is limited to the -1.0..1.0 range.
  • Reward/Z: The endgame Value. It should be -1.0 if the player loses, 1.0 for a win and 0.0 for a draw. But I've tweaked this to express different win qualities: in my opinion a faster win, or a win by a big margin, deserves a better score. So I give 0.8 as the minimal win score, then add a bonus for turns and for score difference, preferring the turns bonus over the score difference. The loss score is handled the same way. See the reward sketch after this list.
  • meanScore/Q: In MCTS each node has a sumValue, accumulated from the backpropagation of scores. If you divide sumValue/visits you get a mean value.
  • Policy: Given a gamestate and the legal moves for the current player, it's a vector of probabilities over those legal moves; it's like giving the moves scores, a probability distribution. E.g. imagine you can do three different actions [Rock,Paper,Scissors]; the policy could be [0.333,0.333,0.333]. In this case all moves will be tested equally, and MCTS will give each move a similar amount of visits. But if you use [0.1,0.1,0.8] as the policy, the MCTS search will explore the "Scissors" move much more. A policy should always sum to 1.0.
  • NN model: Definition of the NN structure: input size, number of Dense layers, and output layers. The CGZero design needs one input layer (the gamestate) and two outputs (value and policy). Some layers are shared between the two outputs. See [1] for a possible diagram.
  • Dirichlet noise: It's noise for the policy part. AlphaZero uses it during self-play to create diversity, otherwise the NN would always play exactly the same game over and over (it's deterministic). You pass the legal move count V and other parameters (epsilon and alpha), and the Dirichlet noise creates a random vector of size V that sums to 1.0. With the parameter called epsilon it mixes the policy and the noise: final_policy = policy * (1.0 - epsilon) + noise * epsilon. This noise doesn't change the policy sum, it's just a redistribution. As an example, if the [Rock,Paper,Scissors] policy was [0.1,0.1,0.8], I'd create a Dirichlet noise of size 3, which could be [0.3,0.4,0.3]. With an epsilon of 0.2 it ends up as [0.14,0.16,0.7]. It still sums to 1.0. Alpha in the Dirichlet noise seems to control how different the values of the vector can be; I use something around 1.0 and 1.5. See the noise-mixing sketch after this list.
  • sample: A single gamestate + Value + Policy + Count. While training I use count = 1 (I average Value and Policy).
  • y_true: An array with value and policy from samples. This is what I want as the NN output.
  • y_pred: Predicted array of values and policies. It's the NN output, and after training it should be similar to y_true.
  • loss and loss function: While you are training a NN you need to know how different y_true and y_pred are, because the training needs some numerical value (gradient) to direct it. As training goes on you'll see the loss value decrease, which means the NN is predicting better. See the loss sketch after this list.
  • cross-entropy loss: A loss calculation for policies; it gives a single number for the whole vector. Careful here: some cross-entropy implementations are targeted at one-hot policies, and in general AlphaZero policies aren't one-hot.
  • mean squared loss: A loss calculation for values. It's squared because that seems to work better for the learning process.
  • hyperparameter: Some coefficient that changes the NN learning behavior.
  • learning rate / LR: Hyperparameter that controls how fast the NN learns. In a simplified way, it's how much the NN weights change to reduce the losses. It's good to have a "big" learning rate at the start (like 0.01 or similar) and then reduce it after some training to improve convergence.
  • epoch: I use it as 'training iterations on the current sample subset'. The training process can iterate N times over the samples to learn. This is what I define as EPOCH.
  • batch size: Training steps are done in batches of samples. This defines the number of samples per training step. I don't know what a good value for this is.
  • Temperature: After reviewing it, I think AZ does the following: for the first 30 turns the move selection is done with a weighted random probability based on visits. The more visits, the higher the chance to pick that move. The visits can be tweaked with pow(childVisits, 1.0/temperature). In the github code it wasn't implemented, but it seemed logical to me, so I added it to my code in place of my random move selection (which wasn't weighted). I already had a weighted random selection from RN_Explorer, I just reused it. See the move-selection sketch after this list.
  • cpuct: This is the exploration parameter for the MCTS search. Original MCTS uses the UCB1 formula with an exploration parameter C. In AlphaZero the formula is replaced by U_child = Q_child + cpuct * P_child * sqrt(visitsParent) / (1 + visitsChild). Q_child is the Q for that child (meanValue), and P_child is the policy value for that child (not a vector, but a single value from the parent's policy vector). As you see, cpuct is similar to C in the original UCB1. But guess what? It isn't. According to [2] the cpuct value grows as the search progresses! In theory cpuct should be between 1.0 and 4.0. I have absolutely no idea how to tune it. I don't know if I need a constant, or if I should change it based on visit count, game turn, or MCTS depth (e.g. maybe I should reduce exploration after depth 2 in the MCTS tree to focus more on the best moves after some exploration at lower depths). It's another obscure hyperparameter to me. After reviewing the AZ pseudocode, the cpuct value changes too little with visits; in CG it may affect the first turn (high visit count) but not by a lot. See the PUCT selection sketch after this list.
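
To make the input * weight + bias idea concrete, here is a minimal sketch of a dense-layer forward pass. The Layer struct, the weight layout and the tanh activation are illustrative assumptions, not the exact code of this bot.

```cpp
#include <cmath>
#include <vector>

// Hypothetical dense layer: one weight row per output neuron, plus one bias each.
struct Layer {
    std::vector<std::vector<float>> weights; // [outputs][inputs]
    std::vector<float> bias;                 // [outputs]
};

// Forward pass of one dense layer: output = activation(input * weight + bias).
std::vector<float> forward(const Layer& l, const std::vector<float>& input) {
    std::vector<float> out(l.bias.size());
    for (size_t o = 0; o < l.bias.size(); ++o) {
        float sum = l.bias[o];
        for (size_t i = 0; i < input.size(); ++i)
            sum += input[i] * l.weights[o][i];
        out[o] = std::tanh(sum); // squashes the result into -1.0..1.0
    }
    return out;
}
```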
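
A possible one-hot encoding of a single cell with three states, as described in the One-hot entry. The CellState enum and the flat input buffer are illustrative assumptions.

```cpp
#include <vector>

enum CellState { EMPTY = 0, WHITE = 1, BLACK = 2 }; // illustrative cell states

// Appends 3 inputs per cell; exactly one of them is 1.0.
void encodeCell(CellState s, std::vector<float>& inputs) {
    inputs.push_back(s == EMPTY ? 1.0f : 0.0f);
    inputs.push_back(s == WHITE ? 1.0f : 0.0f);
    inputs.push_back(s == BLACK ? 1.0f : 0.0f);
}
```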
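
A sketch of the tweaked Reward/Z described above: a 0.8 minimal win value plus bonuses for a fast win and a big score margin. The exact bonus weights and the normalization are made-up placeholders; only the 0.8 base and the preference of turns over score difference come from the text.

```cpp
#include <cstdlib>

// Hypothetical endgame reward with "win quality": base 0.8, up to +0.15 for speed
// and up to +0.05 for score margin (turns bonus weighted higher than margin).
float reward(bool won, bool draw, int turns, int maxTurns, int scoreDiff, int maxScoreDiff) {
    if (draw) return 0.0f;
    float turnBonus  = 0.15f * (1.0f - (float)turns / (float)maxTurns);
    float scoreBonus = 0.05f * (float)std::abs(scoreDiff) / (float)maxScoreDiff;
    float quality = 0.8f + turnBonus + scoreBonus; // between 0.8 and 1.0
    return won ? quality : -quality;               // loss mirrors the win score
}
```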
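
A sketch of the Dirichlet noise mixing used during self-play. The Dirichlet vector is sampled through gamma draws and normalized so it sums to 1.0, then blended with epsilon; alpha around 1.0-1.5 and epsilon 0.2 are the values mentioned above.

```cpp
#include <random>
#include <vector>

// Mixes Dirichlet noise into the root policy: p = p * (1 - eps) + noise * eps.
void addDirichletNoise(std::vector<float>& policy, float alpha, float epsilon, std::mt19937& rng) {
    std::gamma_distribution<float> gamma(alpha, 1.0f);
    std::vector<float> noise(policy.size());
    float sum = 0.0f;
    for (float& n : noise) { n = gamma(rng); sum += n; }
    for (size_t i = 0; i < policy.size(); ++i) {
        noise[i] /= sum; // normalized noise sums to 1.0
        policy[i] = policy[i] * (1.0f - epsilon) + noise[i] * epsilon;
    }
}
// Example: a policy of {0.1, 0.1, 0.8} stays a valid distribution summing to 1.0 after mixing.
```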
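
A sketch of the two losses for a single sample: mean squared error for the value head and cross-entropy for the (non one-hot) policy head. Adding both with equal weight is an assumption; the real training script may weight them differently.

```cpp
#include <cmath>
#include <vector>

// Mean squared error between predicted and target value.
float valueLoss(float vPred, float vTrue) {
    float d = vPred - vTrue;
    return d * d;
}

// Cross-entropy between a target policy (not necessarily one-hot) and a predicted one.
float policyLoss(const std::vector<float>& pTrue, const std::vector<float>& pPred) {
    float loss = 0.0f;
    for (size_t i = 0; i < pTrue.size(); ++i)
        loss -= pTrue[i] * std::log(pPred[i] + 1e-8f); // small epsilon avoids log(0)
    return loss;
}

// Total loss for one sample, assuming both heads are weighted equally.
float totalLoss(float vPred, float vTrue,
                const std::vector<float>& pTrue, const std::vector<float>& pPred) {
    return valueLoss(vPred, vTrue) + policyLoss(pTrue, pPred);
}
```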
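
A sketch of the temperature-based move selection at the root: each child is weighted by pow(visits, 1.0/temperature) and a move is picked at random in proportion to that weight. The Child struct is an illustrative stand-in for the real MCTS node.

```cpp
#include <cmath>
#include <random>
#include <vector>

struct Child { int move; int visits; }; // illustrative MCTS child summary

// Picks a root move with probability proportional to visits^(1/temperature).
// temperature = 1.0 reproduces the raw visit distribution; lower values are greedier.
int pickMove(const std::vector<Child>& children, float temperature, std::mt19937& rng) {
    std::vector<double> weights;
    for (const Child& c : children)
        weights.push_back(std::pow((double)c.visits, 1.0 / temperature));
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return children[dist(rng)].move;
}
```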
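
A sketch of the PUCT child selection from the cpuct entry, using a constant cpuct. The Node fields are illustrative; the formula is U = Q + cpuct * P * sqrt(parentVisits) / (1 + childVisits).

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Node {
    float sumValue = 0.0f;       // sum of backpropagated values
    int   visits   = 0;
    float policy   = 0.0f;       // P for this child, one entry of the parent's policy vector
    std::vector<Node> children;
};

// Returns the index of the child maximizing U = Q + cpuct * P * sqrt(Np) / (1 + Nc).
int selectChild(const Node& parent, float cpuct) {
    int best = -1;
    float bestU = -std::numeric_limits<float>::infinity();
    float sqrtParent = std::sqrt((float)parent.visits);
    for (size_t i = 0; i < parent.children.size(); ++i) {
        const Node& c = parent.children[i];
        float q = (c.visits > 0) ? c.sumValue / c.visits : 0.0f; // meanScore/Q
        float u = q + cpuct * c.policy * sqrtParent / (1.0f + c.visits);
        if (u > bestU) { bestU = u; best = (int)i; }
    }
    return best;
}
```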

[1] Simple NN model

[2] https://lczero.org/blog/2018/12/alphazero-paper-and-lc0-v0191/
