# Deep Learning From Scratch - Theory and Implementation

## Perceptrons

### A motivating example

Perceptrons are a miniature form of neural network and a basic building block of more complex architectures. Before going into the details, let's motivate them by an example. Assume that we are given a dataset consisting of 100 points in the plane. Half of the points are red and half of the points are blue.

As we can see, the red points are centered at $(-2,-2)$ and the blue points are centered at $(2,2)$. Now, having seen this data, we can ask ourselves whether there is a way to determine if a point should be red or blue. For example, if someone asks us what the color of the point $(3,2)$ should be, we'd best respond with blue. Even though this point was not part of the data we have seen, we can infer this since it is located in the blue region of the space.
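A dataset like the one described can be sketched with NumPy. The text only fixes the number of points and the two centers, so the Gaussian shape, the unit variance, the seed, and the names `red_points` and `blue_points` are assumptions made for illustration:

```python
import numpy as np

# Assumption: each color is drawn from a unit-variance Gaussian around its center.
rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

red_points = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))  # centered at (-2, -2)
blue_points = rng.normal(loc=2.0, scale=1.0, size=(50, 2))  # centered at (2, 2)

# The empirical means should lie close to the two centers.
print(red_points.mean(axis=0))
print(blue_points.mean(axis=0))
```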

But what is the general rule to determine if a point is more likely to be blue than red? Apparently, we can draw a line $y=-x$ that nicely separates the space into a red region and a blue region:

We can implicitly represent this line using a **weight vector** $w$ and a **bias** $b$. The line then corresponds to the set of points $x$ where

$$w^T x + b = 0.$$

In the case above, we have $w=(1,1)^T$ and $b=0$. Now, in order to test whether a point is blue or red, we just have to check whether it is above or below the line. This can be achieved by checking the sign of $w^T x+b$. If it is positive, then $x$ is above the line. If it is negative, then $x$ is below the line. Let's perform this test for our example point $(3,2)^T$:

$$w^T x + b = (1,1)\,(3,2)^T + 0 = 3 + 2 = 5$$

Since 5 > 0, we know that the point is above the line and, therefore, should be classified as blue.
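The same sign test can be written in a few lines of NumPy. This is only a sketch; the variable names are chosen for illustration:

```python
import numpy as np

w = np.array([1.0, 1.0])  # weight vector of the line y = -x
b = 0.0                   # bias
x = np.array([3.0, 2.0])  # the example point

score = w @ x + b  # w^T x + b
label = "blue" if score > 0 else "red"
print(score, label)  # 5.0 blue
```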

### Perceptron definition

In general terms, a **classifier** is a function $\hat{c}:\mathbb{R}^d\to\{1,2,\ldots,C\}$ that maps a point onto one of $C$ classes. A **binary classifier** is a classifier where $C=2$, i.e. we have two classes. A **perceptron** with weight $w\in\mathbb{R}^d$ and bias $b\in\mathbb{R}$ is a binary classifier where

$$\hat{c}(x) = \begin{cases} 1, & \text{if } w^T x + b \geq 0 \\ 2, & \text{otherwise} \end{cases}$$

$\hat{c}$ partitions $\mathbb{R}^d$ into two half-spaces, each corresponding to one of the two classes. In the 2-dimensional example above, the partitioning is along a line. In general, the partitioning is along a $(d-1)$-dimensional hyperplane.
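As a minimal sketch, the perceptron can be written as a plain Python function. The name `perceptron_classify` is hypothetical, and the labeling convention (class 1 for the half-space where $w^T x + b \geq 0$, class 2 otherwise) is an assumption:

```python
import numpy as np

def perceptron_classify(x, w, b):
    """Return 1 if w^T x + b >= 0 and 2 otherwise (assumed labeling convention)."""
    return 1 if np.dot(w, x) + b >= 0 else 2

w = np.array([1.0, 1.0])
b = 0.0
print(perceptron_classify(np.array([3.0, 2.0]), w, b))    # 1
print(perceptron_classify(np.array([-3.0, -2.0]), w, b))  # 2
```

Because the function only uses a dot product, it works unchanged for points of any dimension $d$.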

### From classes to probabilities

Depending on the application, we may be interested not only in determining the most likely class of a point, but also in the probability with which it belongs to that class. Note that the higher the value of $w^T x+b$, the farther the point lies from the separating line on the blue side and, therefore, the higher our confidence that it belongs to the blue class. But this value can be arbitrarily high. In order to turn it into a probability, we need to "squash" it to lie between 0 and 1. One way to do this is by applying the **sigmoid** function $\sigma$:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Let's take a look at what the sigmoid function looks like:

As we can see, the sigmoid function assigns a probability of 0.5 to values where $w^T x+b=0$ (i.e. points on the line) and asymptotes towards 1 the higher the value of $w^T x+b$ becomes, and towards 0 the lower it becomes, which is exactly what we want.

Let's now define the sigmoid function as an operation, since we'll need it later:
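A minimal sketch of such an operation, written as a plain NumPy function rather than as a node in a computational graph (the function name and this simplification are assumptions):

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid: 1 / (1 + exp(-a))."""
    a = np.asarray(a, dtype=float)
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(0.0))  # 0.5, the value assigned to points on the line
```

Since `np.exp` is applied element-wise, the same function also accepts a whole array of values at once.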

The entire computational graph of the perceptron now looks as follows:

#### Example

Using what we have learned, we can now build a perceptron for the red/blue example in Python.

Let's use this perceptron to compute the probability that $(3,2)^T$ is a blue point:
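The following self-contained NumPy sketch builds the red/blue perceptron and evaluates it at this point. It assumes the weights $w=(1,1)^T$ and $b=0$ from the motivating example; the helper name `blue_probability` is hypothetical:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Weights of the separating line y = -x from the motivating example.
w = np.array([1.0, 1.0])
b = 0.0

def blue_probability(x):
    """Probability that x is blue: sigmoid(w^T x + b)."""
    return sigmoid(w @ np.asarray(x, dtype=float) + b)

p = blue_probability([3.0, 2.0])
print(round(float(p), 4))  # 0.9933
```

The probability of the point being red is then `1 - p`.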

### Multi-class perceptron

So far, we have used the perceptron as a binary classifier, telling us the probability $p$ that a point $x$ belongs to one of two classes. The probability of $x$ belonging to the respective other class is then given by $1-p$. Generally, however, we have more than two classes. For example, when classifying an image, there may be numerous output classes (dog, chair, human, house, ...). We can extend the perceptron to compute multiple output probabilities.

Let $C$ denote the number of output classes. Instead of a weight vector $w$, we introduce a weight matrix $W\in\mathbb{R}^{d\times C}$. Each column of the weight matrix contains the weights of a separate linear classifier - one for each class. Instead of the dot product $w^T x$, we compute $x\,W$ (treating $x$ as a row vector), which returns a vector in $\mathbb{R}^C$, each of whose entries can be seen as the output of the dot product for a different column of the weight matrix. To this, we add a bias vector $b\in\mathbb{R}^C$, containing a distinct bias for each output class. This yields a vector in $\mathbb{R}^C$ containing one score for each of the $C$ classes, which still has to be turned into probabilities.

While this procedure may seem complicated, the matrix multiplication actually just performs multiple linear classifications in parallel, one for each of the $C$ classes - each one with its own separating line, given by a weight vector (one column of $W$) and a bias (one entry of $b$).

#### Softmax

While the original perceptron yielded a single scalar value that we squashed through a sigmoid to obtain a probability between 0 and 1, the multi-class perceptron yields a vector $a\in\mathbb{R}^C$. The higher the $i$-th entry of $a$, the higher our confidence that the input point belongs to the $i$-th class. We would like to turn $a$ into a vector of probabilities, such that the probability for every class lies between 0 and 1 and the probabilities for all classes sum up to 1.

A common way to do this is to use the **softmax function**, which is a generalization of the sigmoid to multiple output classes:

$$\sigma(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{C} e^{a_j}}$$
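A minimal NumPy sketch of the softmax follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition and does not change the result; the function name is an assumption:

```python
import numpy as np

def softmax(a):
    """Softmax along the last axis: exp(a_i) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    # Shifting by the maximum leaves the result unchanged but avoids overflow.
    shifted = a - a.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

probs = softmax([5.0, -5.0])
print(probs.sum())  # sums to 1 up to floating-point rounding
```

For two classes with scores $a$ and $0$, the first softmax output equals the sigmoid of $a$, which is the sense in which softmax generalizes the sigmoid.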

#### Batch computation

The matrix form allows us to feed in more than one point at a time. That is, instead of a single point $x$, we could feed in a matrix $X\in\mathbb{R}^{N\times d}$ containing one point per row (i.e. $N$ rows of $d$-dimensional points). We refer to such a matrix as a **batch**. Instead of $xW$, we compute $XW$. This returns an $N\times C$ matrix, each of whose rows contains $xW$ for one point $x$. To each row, we add the bias vector $b$, which is now a $1\times C$ row vector. The whole procedure thus computes a function $f:\mathbb{R}^{N\times d}\to\mathbb{R}^{N\times C}$ where $f(X)=\sigma(XW+b)$, with the softmax $\sigma$ applied to each row. The computational graph looks as follows:

#### Example

Let's now generalize our red/blue perceptron to allow for batch computation and multiple output classes.
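The sketch below uses plain NumPy rather than a computational graph. The data generation (unit-variance Gaussians, fixed seed), the choice of $W$ with one column per class (blue first, red second), and all names are assumptions made for illustration:

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = a - a.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Assumed data: 50 blue points around (2, 2) followed by 50 red points around (-2, -2).
rng = np.random.default_rng(seed=0)
blue_points = rng.normal(loc=2.0, size=(50, 2))
red_points = rng.normal(loc=-2.0, size=(50, 2))
X = np.concatenate([blue_points, red_points])  # batch of shape (100, 2)

# One column per class: the left column scores "blue", the right column scores "red".
W = np.array([[1.0, -1.0],
              [1.0, -1.0]])
b = np.array([0.0, 0.0])  # broadcast over all rows of X @ W

def perceptron(X):
    """Batch multi-class perceptron: softmax(XW + b), one row of probabilities per point."""
    return softmax(X @ W + b)

probabilities = perceptron(X)
print(probabilities[:10])  # probabilities for the first 10 (blue) points
```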

Since the first 10 points in our data are all blue, the perceptron outputs high probabilities for blue (left column) and low probabilities for red (right column), as expected.