
# Deep Learning From Scratch - Theory and Implementation

DanielSabinasz

## Perceptrons

### A motivating example

Perceptrons are a miniature form of neural network and a basic building block of more complex architectures. Before going into the details, let's motivate them by an example. Assume that we are given a dataset consisting of 100 points in the plane. Half of the points are red and half of the points are blue.

As we can see, the red points are centered at $(-2, -2)$ and the blue points are centered at $(2, 2)$. Now, having seen this data, we can ask ourselves whether there is a way to determine if a point should be red or blue. For example, if someone asks us what the color of the point $(3, 2)$ should be, we'd best respond with blue. Even though this point was not part of the data we have seen, we can infer this since it is located in the blue region of the space.

But what is the general rule to determine if a point is more likely to be blue than red? As it turns out, we can draw a line that nicely separates the space into a red region and a blue region:

We can implicitly represent this line using a weight vector $w$ and a bias $b$. The line then corresponds to the set of points $x$ where

$$w^T x + b = 0$$

In the case above, we have $w = (1, 1)^T$ and $b = 0$. Now, in order to test whether a point $x$ is blue or red, we just have to check whether it is above or below the line. This can be achieved by checking the sign of $w^T x + b$. If it is positive, then $x$ is above the line. If it is negative, then $x$ is below the line. Let's perform this test for our example point $x = (3, 2)^T$:

$$w^T x + b = 1 \cdot 3 + 1 \cdot 2 + 0 = 5$$

Since 5 > 0, we know that the point is above the line and, therefore, should be classified as blue.

### Perceptron definition

In general terms, a classifier is a function $\hat{c}: \mathbb{R}^d \rightarrow \{1, 2, \ldots, C\}$ that maps a point onto one of $C$ classes. A binary classifier is a classifier where $C = 2$, i.e. we have two classes. A perceptron with weight $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ is a binary classifier where

$$\hat{c}(x) = \begin{cases} 1 & \text{if } w^T x + b \geq 0 \\ 2 & \text{if } w^T x + b < 0 \end{cases}$$

The hyperplane $w^T x + b = 0$ partitions $\mathbb{R}^d$ into two half-spaces, each corresponding to one of the two classes. In the 2-dimensional example above, the partitioning is along a line. In general, the partitioning is along a $(d-1)$-dimensional hyperplane.
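As a minimal sketch, the decision rule above can be written in a few lines of NumPy (the function name `classify` is just for illustration):

```python
import numpy as np

def classify(x, w, b):
    """Return class 1 ('blue') if w^T x + b >= 0, else class 2 ('red')."""
    return 1 if np.dot(w, x) + b >= 0 else 2

print(classify(np.array([3, 2]), np.array([1, 1]), 0))    # 1: blue
print(classify(np.array([-2, -2]), np.array([1, 1]), 0))  # 2: red
```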

### From classes to probabilities

Depending on the application, we may be interested not only in determining the most likely class of a point, but also in the probability with which it belongs to that class. Note that the higher the value of $w^T x + b$, the higher is the point's distance to the separating line and, therefore, the higher is our confidence that it belongs to the blue class. But this value can be arbitrarily high. In order to turn it into a probability, we need to "squash" the values to lie between 0 and 1. One way to do this is by applying the sigmoid function $\sigma$:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Let's take a look at what the sigmoid function looks like:

As we can see, the sigmoid function assigns a probability of 0.5 to values where $w^T x + b = 0$ (i.e. points on the line), asymptotes towards 1 as $w^T x + b$ grows, and towards 0 as it decreases, which is exactly what we want.

Let's now define the sigmoid function as an operation, since we'll need it later:
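A sketch of what such an operation might look like. It assumes the `Operation` base class introduced earlier in the series; a minimal stand-in is included here so the snippet runs on its own:

```python
import numpy as np

class Operation:
    """Minimal stand-in for the series' Operation base class
    (assumption: it simply remembers its input nodes)."""
    def __init__(self, input_nodes=None):
        self.input_nodes = input_nodes or []

class sigmoid(Operation):
    """Returns the sigmoid of its input, element-wise."""
    def __init__(self, a):
        super().__init__([a])

    def compute(self, a_value):
        return 1 / (1 + np.exp(-a_value))
```

For example, `sigmoid(None).compute(0)` evaluates to 0.5, matching the point where $w^T x + b = 0$.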

The entire computational graph of the perceptron now looks as follows:

#### Example

Using what we have learned, we can now build a perceptron for the red/blue example in Python.

Let's use this perceptron to compute the probability that $(3, 2)^T$ is a blue point:
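A plain-NumPy sketch of this computation (the series builds its own `Graph`/`Session` machinery, which is assumed elsewhere and not reproduced here):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

w = np.array([1, 1])          # weights from the example
b = 0                         # bias from the example
x = np.array([3, 2])          # the point to classify

p_blue = sigmoid(w @ x + b)   # sigmoid(5)
print(p_blue)                 # ~0.9933: very likely blue
```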

### Multi-class perceptron

So far, we have used the perceptron as a binary classifier, telling us the probability $p$ that a point belongs to one of two classes. The probability of belonging to the respective other class is then given by $1 - p$. Generally, however, we have more than two classes. For example, when classifying an image, there may be numerous output classes (dog, chair, human, house, ...). We can extend the perceptron to compute multiple output probabilities.

Let $C$ denote the number of output classes. Instead of a weight vector $w \in \mathbb{R}^d$, we introduce a weight matrix $W \in \mathbb{R}^{d \times C}$. Each column of the weight matrix contains the weights of a separate linear classifier - one for each class. Instead of the dot product $w^T x$, we compute $xW$ (regarding $x$ as a $1 \times d$ row vector), which returns a vector in $\mathbb{R}^C$, each of whose entries can be seen as the output of the dot product for a different column of the weight matrix. To this, we add a bias vector $b \in \mathbb{R}^C$, containing a distinct bias for each output class. This then yields a vector in $\mathbb{R}^C$ containing a score for each of the $C$ classes.

While this procedure may seem complicated, the matrix multiplication actually just performs multiple linear classifications in parallel, one for each of the $C$ classes - each one with its own separating line, given by a weight vector (one column of $W$) and a bias (one entry of $b$).

#### Softmax

While the original perceptron yielded a single scalar value that we squashed through a sigmoid to obtain a probability between 0 and 1, the multi-class perceptron yields a vector $a \in \mathbb{R}^C$. The higher the $i$-th entry of $a$, the higher is our confidence that the input point belongs to the $i$-th class. We would like to turn $a$ into a vector of probabilities, such that the probability for every class lies between 0 and 1 and the probabilities for all classes sum up to 1.

A common way to do this is to use the softmax function, which is a generalization of the sigmoid to multiple output classes:

$$\sigma(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{C} e^{a_j}}$$
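A sketch of the softmax as a standalone function. Subtracting the maximum score before exponentiating is a standard numerical-stability trick; it leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    """Turn a score vector into probabilities that sum to 1."""
    e = np.exp(a - np.max(a))   # stability: shift scores by their maximum
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # largest score -> largest probability
```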

#### Batch computation

The matrix form allows us to feed in more than one point at a time. That is, instead of a single point $x$, we can feed in a matrix $X \in \mathbb{R}^{N \times d}$ containing one point per row (i.e. $N$ rows of $d$-dimensional points). We refer to such a matrix as a batch. Instead of $xW$, we compute $XW$. This returns an $N \times C$ matrix, each of whose rows contains $xW$ for one point $x$. To each row, we add a bias vector $b$, which is now a $1 \times C$ row vector. The whole procedure thus computes a function $f: \mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{N \times C}$ with $f(X) = \sigma(XW + b)$, where $\sigma$ denotes the softmax applied row-wise. The computational graph looks as follows:

#### Example

Let's now generalize our red/blue perceptron to allow for batch computation and multiple output classes.
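A minimal NumPy sketch of the idea, assuming (as in the running example) blue points centered at $(2, 2)$, a first weight column equal to the blue classifier $w = (1, 1)^T$, and a second column equal to its negation for red. The sampled data and parameter values are illustrative assumptions, not taken verbatim from the series:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(10, 2) + 2   # batch: 10 points near the blue center (2, 2)

W = np.array([[1, -1],           # column 0: blue classifier, column 1: red
              [1, -1]])
b = np.array([0, 0])

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))  # row-wise, numerically stable
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(X @ W + b)       # shape (10, 2): one probability row per point
print(probs)                     # left column (blue) close to 1 for every row
```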

Since the first 10 points in our data are all blue, the perceptron outputs high probabilities for blue (left column) and low probabilities for red (right column), as expected.   