Neural networks 101
Epoch 1 - Neural Networks
Welcome to the first article in our series on neural networks. In this post, we will explore the fundamental concepts of neural networks and explain them in a clear, logical way. Let's dive in!
The Neuron: A Building Block
Let's start with understanding what a neuron is in the context of neural networks. In simple terms, a neuron takes inputs and produces an output based on a calculated value.
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2F4cf72f24858f8aa5d868b47e86c69a8fa150adfe-280x138.png%3Fw%3D280%26auto%3Dformat&w=3840&q=75)
Usually this calculated value is mathematically written as:
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2F64d56ec2aff51318df4db5c43a17c068fc34b4d1-1268x338.png%3Fw%3D1268%26auto%3Dformat&w=3840&q=75)
Alternatively, we can rewrite the equation by moving the threshold to the other side of the equals sign and calling it a bias:
output = 1 if w·x + b > 0, and output = 0 if w·x + b ≤ 0

Here, w represents the weights, b represents the bias, and x represents the inputs, all of which can be represented as matrices or vectors. The neuron computes the weighted sum of its inputs plus the bias, w·x + b. If this value is greater than 0, the neuron "fires" and produces an output of 1; if it is less than or equal to 0, the neuron produces an output of 0. In essence, such a neuron can be in one of two states: 1 or 0.
To introduce non-linearity and enable continuous outputs, an activation function, such as the sigmoid or ReLU function, is applied to the neuron.
Historically, the sigmoid function was the standard choice of activation function, but more recently the ReLU function has become the default because it often trains faster and performs better. Applying an activation function makes the neuron's output a smooth, continuous function of its inputs rather than a discontinuous step.
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2F820df18d1e8e6121d1e06e8308d0d9f0200a1dc1-726x292.png%3Fw%3D726%26auto%3Dformat&w=3840&q=75)
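To make this concrete, here is a minimal NumPy sketch of a single neuron, first with the hard 0/1 threshold and then with sigmoid and ReLU activations. The function names and example values are ours, purely for illustration:

```python
import numpy as np

def step_neuron(x, w, b):
    """Perceptron-style neuron: fires (outputs 1) when w·x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """ReLU activation: passes positive values through, zeroes out the rest."""
    return np.maximum(0, z)

def sigmoid_neuron(x, w, b):
    """Neuron with a continuous output instead of a hard 0/1 step."""
    return sigmoid(np.dot(w, x) + b)

# Example: two inputs, arbitrary weights and bias
x = np.array([0.5, 0.8])
w = np.array([0.4, -0.6])
b = 0.1
print(step_neuron(x, w, b))     # either 0 or 1
print(sigmoid_neuron(x, w, b))  # a value between 0 and 1
```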
Using these neurons, one can easily implement basic logic gates such as AND, OR, NOT and NAND. For example, we can make a NAND gate by taking weights of -2 and -2 and an overall bias of 3. The following figure illustrates how such neurons can be connected to form a network that performs addition using NAND gates:
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2F077dd0170e1164ddd8757351a3a1a8f6d8232242-471x209.png%3Fw%3D471%26auto%3Dformat&w=3840&q=75)
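As a quick sanity check, here is a small Python sketch (the helper name is ours) verifying that weights of -2 and -2 with a bias of 3 do indeed behave like a NAND gate:

```python
import numpy as np

def nand_neuron(x1, x2):
    """Perceptron with weights -2, -2 and bias 3, as described above."""
    w = np.array([-2, -2])
    b = 3
    return 1 if np.dot(w, np.array([x1, x2])) + b > 0 else 0

# Reproduce the NAND truth table
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", nand_neuron(x1, x2))
# 0 0 -> 1
# 0 1 -> 1
# 1 0 -> 1
# 1 1 -> 0
```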
If a neuron can be configured to act as a basic logic gate, and NAND gates alone can be combined to build any logic circuit, then it follows that networks of neurons can in principle perform any computation. But that is not the reason neural networks are powerful. The real reason is that, unlike circuits we wire up by hand, these neurons can learn: their weights and biases can be adjusted automatically based on training data. Hence, a network can essentially 'learn' its parameters (often millions or even billions of them) so that it closely models the given training data.
So much for neurons. Now let us see how neurons are typically connected to form a complete neural network.
The Network
Neurons are connected to form a network, which is organized into layers. Every neural network consists of the following layers:
- Input layer
- Hidden layer(s)
- Output layer
The input layer holds the input values, such as the pixel intensities in an image-recognition task. The hidden layer(s) capture the complexity of the model; having too few or too many hidden layers can lead to underfitting or overfitting the training data. The size of the output layer depends on the number of desired outputs: for example, classifying handwritten digits 0 through 9 would require 10 neurons in the output layer.
Determining the appropriate number of neurons and hidden layers depends on the data, training set, and the network's purpose. It often involves heuristics, intuition, and trial-and-error to achieve optimal performance. Additionally, depending on the problem, different network architectures like Convolutional Neural Networks, Recurrent Neural Networks, or Attention Models may be used.
The Training and Testing Datasets
For the purpose of explanation, let's consider the classic machine learning problem of handwritten digit classification using the MNIST dataset. More details of the dataset can be found here. Our goal is to build a binary classifier that determines whether an image represents the number 7 or not.
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2Fc4633c79b2ddf840a943c8cf307ab8fc5fc42485-353x370.png%3Fw%3D353%26auto%3Dformat&w=3840&q=75)
Let us also define our neural network as follows:
![Image](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F5aunlj4m%2Fproduction%2Ff08dcbb665e86fecfe2dba2737f1f37721625b83-868x734.png%3Fw%3D868%26auto%3Dformat&w=3840&q=75)
Since our neural network requires weights and biases, we define two parameters: w and b. Here, w is a 784x1 column vector of weights (one weight per input pixel, as we will see below), and b is just a single real number (a scalar bias).
The MNIST dataset consists of greyscale images of handwritten digits, each 28x28 pixels. An image can therefore be represented as a 28x28 matrix, where each entry of the matrix is the intensity of the corresponding pixel.
To feed this data into our neural network, we need to flatten the image matrix. Flattening means converting the 28x28 matrix into a single column vector of dimensions 784x1 (since 28x28 = 784). This flattened vector is the input to the input layer, which has exactly one neuron per pixel (i.e. 784 neurons).
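As a minimal sketch of this step (assuming the image is already loaded as a NumPy array; the variable names are ours):

```python
import numpy as np

# A single 28x28 greyscale image (here just random pixel intensities)
image = np.random.rand(28, 28)

# Flatten it into a 784x1 column vector for the input layer
x = image.reshape(784, 1)

print(image.shape)  # (28, 28)
print(x.shape)      # (784, 1)
```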
Okay, that works for just one training image. But the dataset contains 60,000 images, so how do we represent all of them? We use another matrix representation, which we'll call X. Each flattened image becomes one column, so the dataset is essentially a 784xm matrix, where m is the number of training examples (60,000 if you take the entire training set).
Similarly, the training set contains the output labels, which tell us whether each picture is a 7 or not. In matrix form, the labels make up a 1xm row vector Y. So, mathematically speaking,

X = [ x₁  x₂  ...  xₘ ]   (a 784xm matrix)

and

Y = [ y₁  y₂  ...  yₘ ]   (a 1xm matrix),

where xᵢ is the i-th flattened image and yᵢ is 1 if that image is a 7 and 0 otherwise.
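Here is a rough sketch of how such matrices might be assembled; the images and labels below are random placeholders, used only to show the shapes:

```python
import numpy as np

m = 5  # a handful of placeholder examples; the real training set has 60,000

# Placeholder data: m random 28x28 "images" and random 0/1 labels
images = np.random.rand(m, 28, 28)
labels = np.random.randint(0, 2, size=m)

# Flatten each image into a column and stack the columns side by side
X = images.reshape(m, 784).T   # shape (784, m)
Y = labels.reshape(1, m)       # shape (1, m)

print(X.shape, Y.shape)  # (784, 5) (1, 5)
```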
So far this is the neural network architecture we have:
- Input layer: Consists of 784 neurons, corresponding to a flattened image of size 28x28.
- Output layer: Consists of a single neuron, representing the probability of the input image being a 7.
Predictions
The output of the neural network is mathematically represented as:

z = w^T x + b

and

ŷ = σ(z)
Here, z represents the weighted sum of the inputs plus the bias, w^T is the transpose of the weight vector w, x is our flattened image vector, b is the bias, and σ() is the activation function (such as the sigmoid function) that introduces non-linearity into the output. ŷ represents the predicted output, which is the probability of the input image being a 7.
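A minimal sketch of this forward computation, reusing the shapes defined earlier (the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(x, w, b):
    """Forward pass of the single-neuron classifier: y_hat = sigmoid(w^T x + b)."""
    z = np.dot(w.T, x) + b      # weighted sum of inputs plus bias
    y_hat = sigmoid(z)          # probability that the image is a 7
    return y_hat

# Example with zero parameters and a random flattened image
w = np.zeros((784, 1))
b = 0.0
x = np.random.rand(784, 1)
print(predict(x, w, b))  # 0.5 when all weights and the bias are zero
```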
Loss Function
The loss function is used to measure the amount of error in the predicted value compared to the actual value.
One such function is the mean squared error:

L(ŷ, y) = (ŷ - y)²
Although the mean squared error gives a reasonable estimate of the error, we cannot use it here because, combined with the sigmoid activation, it makes the cost function non-convex (i.e. it has many local minima, which makes it difficult to find the global minimum). Another commonly used loss function for binary classification problems is the log loss (also known as binary cross-entropy). It calculates the logarithmic loss between the predicted probabilities and the true labels; averaged over the training set, it gives:

J(w, b) = -(1/m) Σᵢ [ yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ) ]
Here, J(w, b) represents the cost function, m is the number of training examples, yᵢ represents the true label for the i-th example, and ŷᵢ represents the predicted probability for the i-th example.
Cost Function
The cost function is simple: it is defined as the average of the loss function over all the training examples, i.e. J(w, b) = (1/m) Σᵢ L(ŷᵢ, yᵢ). For the log loss, this gives exactly the expression for J(w, b) shown above.
So now you can see that we wish to choose the parameters w and b such that the cost function is minimized. This is done by algorithms such as Gradient Descent. After the learning phase, the learned parameters w and b are then used by the model to make predictions.
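A sketch of the cross-entropy cost computed over a whole batch, following the shapes of X and Y above (the helper name and the clipping trick to avoid log(0) are our additions):

```python
import numpy as np

def compute_cost(Y_hat, Y):
    """Binary cross-entropy cost, averaged over all m training examples."""
    m = Y.shape[1]
    # Keep predictions away from exactly 0 and 1 so that log() stays finite
    Y_hat = np.clip(Y_hat, 1e-12, 1 - 1e-12)
    cost = -(1 / m) * np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return cost

# Example: nearly perfect predictions give a cost close to 0
Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.99, 0.01, 0.98]])
print(compute_cost(Y_hat, Y))  # small positive number close to 0
```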
Gradient Descent and Backpropagation
To find the minimum of the cost function J(w, b), we use optimization algorithms like Gradient Descent. The process involves two steps: a forward pass and backpropagation.
In the forward pass, the inputs are passed through the neural network using the current parameters w and b, and the final prediction is obtained. This prediction may or may not be accurate.
After the forward pass, the backpropagation step begins. It computes the gradients of the cost function with respect to the parameters, and the weights and biases are then adjusted in the opposite direction of those gradients, which updates the parameters so as to reduce the loss. The adjustments are scaled by the learning rate (alpha):

w := w - α ∂J/∂w

and

b := b - α ∂J/∂b
The forward pass and backpropagation steps are repeated iteratively for a desired number of epochs or until convergence is achieved. In each epoch, the weights and biases are adjusted in the direction of the steepest descent, proportional to the learning rate.
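Putting the pieces together, a single gradient-descent update for our one-neuron model could be sketched as follows; the gradient formulas used here are the standard ones for logistic regression, and their derivation is left for the next article:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(X, Y, w, b, alpha):
    """One forward pass plus one parameter update for the single-neuron classifier.

    X: (784, m) inputs, Y: (1, m) labels, w: (784, 1) weights, b: scalar bias.
    """
    m = X.shape[1]

    # Forward pass: predictions for all m examples at once
    Y_hat = sigmoid(np.dot(w.T, X) + b)   # shape (1, m)

    # Gradients of the cross-entropy cost with respect to w and b
    dZ = Y_hat - Y                        # shape (1, m)
    dw = (1 / m) * np.dot(X, dZ.T)        # shape (784, 1)
    db = (1 / m) * np.sum(dZ)             # scalar

    # Move against the gradient, scaled by the learning rate
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

# Example usage with the placeholder X, Y defined earlier:
# w, b = np.zeros((784, 1)), 0.0
# for epoch in range(100):
#     w, b = gradient_descent_step(X, Y, w, b, alpha=0.1)
```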
Conclusion
This concludes the explanation of the training and testing datasets, as well as the mathematical representations and concepts related to our handwritten digits classification problem using neural networks.
In the next article of the series, we will delve into the details of the Gradient Descent algorithm and derive equations for the backpropagation step.
Stay tuned for more insights into neural networks in our Neural Networks 101 series!
About Preetham
Hi, I'm Preetham, a student pursuing a Bachelor's degree in Artificial Intelligence and Machine Learning. As an aspiring data scientist, I'm passionate about exploring the cutting-edge of machine learning and AI research. I believe that the most effective way to learn is by teaching others, which is why I decided to start this blog series.