XOR example is a toy problem in machine learning community, a hello world for introducing neural networks. It means you have to build and train the neural network so that given 2 inputs it will output what a XOR function would output (at least close to it).
This isn't math heavy explanatory tutorial, there are plenty of them out there. Just google "neural network xor example", and you'll get for example 1 and 2 with detailed explanation. I assume you have some vague knowledge of neural networks and try to write a simple one. This article is just a bunch of simple python scripts that implement neural networks. No numpy or other libraries are used, so they should be easily translatable to other languages.
Codingame doesn't have NN specific libraries. For Codingame people mostly train locally using NN libraries, and just copy the learned weights and implement the forward propagation themselves in their language.
All the scripts use stochastic gradient descent to train the neural network, one data row at a time, so no need for matrix tranpositions, which would be required for mini-batch. The loss function is mean squared error.
This is the simplest script, an implementation of
Here the neural network is just a bunch of loosely written variables.
epoch 1000 mean squared error: 0.24950644703013328 epoch 2000 mean squared error: 0.21033216889786982 epoch 3000 mean squared error: 0.06970183505211733 epoch 4000 mean squared error: 0.010193355003568688 epoch 5000 mean squared error: 0.004621174723851184 epoch 6000 mean squared error: 0.002889802406873843 epoch 7000 mean squared error: 0.0020756513163735827 epoch 8000 mean squared error: 0.001609116960177383 epoch 9000 mean squared error: 0.0013089350608506563 epoch 10000 mean squared error: 0.0011005064043563537 for input [0, 0] expected 0 predicted 0.02906 which is correct for input [0, 1] expected 1 predicted 0.9684 which is correct for input [1, 0] expected 1 predicted 0.9684 which is correct for input [1, 1] expected 0 predicted 0.03951 which is correct
Your mileage may vary. Sometimes this simple net will diverge and output for all inputs the 0.666..., or it would need more iterations to train. It's normal as it is more sensitive to starting random weights than more complex models. NN libraries suffer from that too, but they can mitigate it by smarter weights initialization (1, 2, 3).
You can play around with learning rate (alpha) and see how it affects the speed of learning. It is one of the most important hyperparameters in machine learning world. For this example, the orders of around 0.1 is used. In real world applications, 0.001, 0.0001 or even less are used, along with some decay rate. See 1, 2, 3 for details.
This one is more flexible. The variables are stored in array and the for loops more resemble the matrix operations. With HIDDEN = 3, it behaves the same as the first script.
1000 mean squared error: 0.25001130384302916 2000 mean squared error: 0.24998178999223236 3000 mean squared error: 0.2498882299395278 4000 mean squared error: 0.2487285310117196 5000 mean squared error: 0.19301651904542888 6000 mean squared error: 0.06884957172680736 7000 mean squared error: 0.011780981900514784 8000 mean squared error: 0.005242561964051928 9000 mean squared error: 0.0032150923746817523 10000 mean squared error: 0.0022766661560815666 for input [0, 0] expected 0 predicted 0.02494 which is correct for input [0, 1] expected 1 predicted 0.9528 which is correct for input [1, 0] expected 1 predicted 0.9563 which is correct for input [1, 1] expected 0 predicted 0.06596 which is correct
You can play around with the number of hidden units.
This one is exactly as second script, except the activation function for hidden layer is tanh, instead of sigmoid. Also, sigmoid_prime is replaced by tanh_prime in hidden layer during training.
1000 mean squared error: 0.0061729073060459915 2000 mean squared error: 0.0015754154512582549 3000 mean squared error: 0.0008510341785755552 4000 mean squared error: 0.0005731795112170564 5000 mean squared error: 0.00042899519927730044 6000 mean squared error: 0.0003414967710025598 7000 mean squared error: 0.0002830319449863964 8000 mean squared error: 0.00024133098714384138 9000 mean squared error: 0.00021014923444340877 10000 mean squared error: 0.0001859852279128196 for input [0, 0] expected 0 predicted 0.01346 which is correct for input [0, 1] expected 1 predicted 0.9888 which is correct for input [1, 0] expected 1 predicted 0.9862 which is correct for input [1, 1] expected 0 predicted 0.01573 which is correct
See how faster and better it converges than only sigmoid one.
The common choice for activation function is relu. Has plenty advantages, but for me the most important one is speed. In game tree search, doing bunch of relus is much faster than doing bunch of tanhs.
For exercise, replace tanh and tanh_prime with relu and relu_prime accordingly and see how does it affect training. I remark that relu seems more sensitive to weights initialization, in the simplest xor example it diverges more often.
def relu(x): return max(0,x) def relu_prime(x): if x <= 0: return 0 else: return 1
Let's add momentum next to the weights. Here instead of updating weights directly, we update the momentum and then the weights according to the momentum.
1000 mean squared error: 0.00044142174881360634 2000 mean squared error: 0.00020205770098357436 3000 mean squared error: 0.00013037626876152424 4000 mean squared error: 9.606031869410035e-05 5000 mean squared error: 7.598012945270696e-05 6000 mean squared error: 6.281017989283364e-05 7000 mean squared error: 5.351418734233088e-05 8000 mean squared error: 4.660449366354482e-05 9000 mean squared error: 4.1269082890237664e-05 10000 mean squared error: 3.702491046845857e-05 for input [0, 0] expected 0 predicted 0.002033 which is correct for input [0, 1] expected 1 predicted 0.9932 which is correct for input [1, 0] expected 1 predicted 0.9932 which is correct for input [1, 1] expected 0 predicted 0.007203 which is correct
New hyperparamater has been introduced, commonly called lambda. It says how much the past updates should affect the current update. You can play with it, the usual values are within [0.8,0.99] range. It is very important hyperparameter, next to learning rate.
See how much faster does it converge than the version without momentum. Note that X is 1000 iterations instead of 10000.
Stochastic gradient descent with momentum is a standard baseline training algorithm. There are various optimizations proposed with Adam being the preferred choice. Using some formula they update individual momentum, learning rates and possibly other things. Depending on the problem they may converge much faster than standard SGD.
I wrote this article because in the beginning I struggled to understand how to train neural networks. Most tutorials introduce heavy math, then use python libs with few lines and call it a day.
import nnlib import mnist nn = nnlib.NeuralNetwork() nn.learn(mnist) nn.predict(mnist) # yay we have trained our first neural network completely by ourselves!
I wouldn't learn much from that. I am dumb as the NNs I create, I learn by examples. I need many many similar, slightly increasing in difficulty examples to grasp the new concept. Of course the neural network world is much bigger than that, but hopefully the scripts I provided will give you some insight into NNs and encourage you to learn more about them.
Then after grasping the basic concepts, I would use the libraries. Most common are pytorch and keras/tensorflow. No need to implement Adam or other optimization, smarter weights initialization, regularization etc. And it's easy to prototype and test various NN architectures (multiple hiddden layers, different units) to see which one is the best.
I'd like to thank Wontomino for providing some feedback on this article.