Neural Networks 101 - Epoch 2

Preetham


· 4 min read


In the previous article we saw the basic mathematical representation of neurons. In this article we shall derive the equations for gradient descent. Let’s get started.


The Equations


We have the following equations:

For Forward Pass

z = w^T X + b

\hat{y} = \sigma(z)

L(y,\hat{y}) = -\left[ y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \right]

For Back Propagation

dw_n = \frac{\partial L}{\partial w_n}

db = \frac{\partial L}{\partial b}

w_n = w_n - \alpha\, dw_n

b = b - \alpha\, db

During the forward pass, the following steps take place (a short NumPy sketch follows the list):

  1. The weights are multiplied with the input values (i.e. x) and the bias b is then added. The point to note here is that the weights and the inputs are vectors of a certain dimension, say n, so the transpose of the weight vector is taken to make the multiplication dimensionally consistent. (Alternatively, the operation can be seen as the dot product of w and x.)
  2. The sigmoid function is applied to the calculated value to ensure that the value that is returned by the neuron lies between 0 and 1. This output is taken as the predicted value.
  3. The loss function measures the error between the predicted value and the actual label of that particular training example.
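To make these three steps concrete, here is a minimal NumPy sketch of the forward pass for a single training example; the variable names and values are made up purely for illustration, assuming x and w are n-dimensional column vectors.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy single training example: n = 3 features (all values are illustrative)
x = np.array([[0.5], [1.2], [-0.3]])   # input vector, shape (3, 1)
w = np.zeros((3, 1))                   # weight vector, shape (3, 1)
b = 0.0                                # bias (scalar)
y = 1                                  # true label

z = np.dot(w.T, x) + b                 # step 1: z = w^T x + b, shape (1, 1)
y_hat = sigmoid(z)                     # step 2: squash z into (0, 1)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # step 3: cross-entropy loss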

During back propagation, the following steps take place:

  1. The derivative dw_n is calculated for each dimension n of the weight vector, and db for the bias, using the first two equations.
  2. Then the weights and the bias are updated according to the third and fourth equations (a tiny sketch of this update follows the list). The α (alpha) in those equations is the learning rate, which defines how big of a ‘step’ the model takes in the direction of the decreasing slope.
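As a minimal sketch of that update step, assuming the gradients dw and db have already been computed (and with an arbitrary, illustrative learning rate):

alpha = 0.01          # learning rate (illustrative choice)
w = w - alpha * dw    # update every weight component at once
b = b - alpha * db    # update the bias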

This is just:

  • One step of gradient descent
  • On one training example
  • Using only one neuron

Yupp, it already looks complex in its most basic form.

Deriving the partial derivatives

The back propagation step involves the calculation of partial derivatives. Computing these derivatives requires applying the chain rule in reverse order (hence the name back propagation). Applying the chain rule to the partial derivatives that appear in the equations mentioned above, we get:

\frac{\partial L}{\partial w_n} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w_n}

\frac{\partial L}{\partial w_n} = -\left[ \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right] \times \hat{y}(1-\hat{y}) \times x_n

On solving, we get:

\frac{\partial L}{\partial w_n} = (\hat{y} - y)\, x_n
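For completeness, here is the algebra that the “on solving” step compresses: combine the bracketed terms over the common denominator \hat{y}(1-\hat{y}), which then cancels with the middle factor:

-\left[ \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right] = -\frac{y(1-\hat{y}) - (1-y)\hat{y}}{\hat{y}(1-\hat{y})} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}

\frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \times \hat{y}(1-\hat{y}) \times x_n = (\hat{y}-y)\, x_n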

Similarly, for the bias

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial b}

\frac{\partial L}{\partial b} = -\left[ \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \right] \times \hat{y}(1-\hat{y}) \times 1

On solving, we get:

\frac{\partial L}{\partial b} = (\hat{y} - y)

Now since,

\frac{\partial L}{\partial z} = (\hat{y} - y)

We can rewrite the above equations as:

\frac{\partial L}{\partial w_n} = dz \cdot x_n

\frac{\partial L}{\partial b} = dz

Where,

dz = \frac{\partial L}{\partial z} = (\hat{y} - y)

Now that we have the algorithm and have successfully derived the equations for gradient descent, we can write a simple piece of Python code to perform a single step of gradient descent. However, since these equations calculate the gradient for just one training example, we need another function to essentially ‘average out’ the gradients over all the training examples. This is where the cost function comes in.
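Before introducing the cost function, here is a minimal sketch of a single gradient-descent step on one training example, reusing the toy variables x, w, b, y and the sigmoid helper from the earlier forward-pass snippet; the learning rate value is again an arbitrary illustrative choice.

alpha = 0.1                     # learning rate (illustrative)

# Forward pass for one example
z = np.dot(w.T, x) + b
y_hat = sigmoid(z)

# Backward pass using the derived formulas
dz = (y_hat - y).item()         # dL/dz = (y_hat - y), as a scalar
dw = dz * x                     # dL/dw_n = dz * x_n, for every component n at once
db = dz                         # dL/db = dz

# One step of gradient descent
w = w - alpha * dw
b = b - alpha * db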

Cost Function

The cost function is defined as the average of the loss function over all the training examples:

J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L\left(y^{(i)}, \hat{y}^{(i)}\right)

where m is the number of training examples and the superscript (i) denotes the i-th training example.
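As a sketch, this cost can be computed for a whole training set in one vectorized expression. The toy matrix X below (one column per example) and labels Y are made-up values, while w, b and sigmoid are reused from the earlier snippet:

# Toy training set: n = 3 features, m = 4 examples (one column each)
X = np.array([[ 0.5,  1.0, -1.5,  0.2],
              [ 1.2, -0.7,  0.3,  0.9],
              [-0.3,  0.8,  0.1, -1.1]])
Y = np.array([[1, 0, 0, 1]])            # labels, shape (1, 4)

m = X.shape[1]
Z = np.dot(w.T, X) + b                  # pre-activations, shape (1, m)
Y_hat = sigmoid(Z)                      # predictions, shape (1, m)
cost = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m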

Using the cost function adds only one additional step to the existing algorithm: we need to average the gradients calculated over all the training examples.

(dw_n)_{avg} = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}\, x_n^{(i)}

(db)_{avg} = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}

w_n = w_n - \alpha\, (dw_n)_{avg}

b = b - \alpha\, (db)_{avg}

Now with this we have all the necessary equations to be able to implement one step of gradient descent.
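The Python code below computes these per-component averages in vectorized form. Assuming X is the n × m matrix whose columns are the training examples and dz is the 1 × m row of per-example errors \hat{y}^{(i)} - y^{(i)}, the averaged gradients can be written compactly as:

dw = \frac{1}{m}\, X\, dz^{T} \qquad db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}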

Python pseudocode

Using our knowledge of back propagation, we can now write a simple piece of Python code to perform gradient descent.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for i in range(no_of_epochs):
    # Forward pass: X has shape (n, m), w has shape (n, 1), b is a scalar
    z = np.dot(w.T, X) + b
    y_hat = sigmoid(z)

    # Backward pass: average the gradients over all m examples
    dz = y_hat - Y
    dw = np.dot(X, dz.T) / m
    db = np.sum(dz) / m

    # Update the parameters with learning rate a
    w = w - a * dw
    b = b - a * db

With each epoch, the parameters w and b are adjusted (or learned) by performing gradient descent with a step size scaled by a. If we were to calculate the cost at each epoch, we would notice that its value decreases with each epoch. Hence, as we increase the number of epochs, the model parameters are adjusted a larger number of times.

And that’s it! In just a handful of lines of code we have implemented the gradient descent algorithm in Python (however, it is pretty much useless as it deals with only one neuron).
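If you want to run the loop end to end, here is a minimal, hypothetical setup on a toy dataset; the data, learning rate and epoch count are arbitrary illustrative choices, not part of the derivation above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy dataset: n = 2 features, m = 100 examples (columns of X)
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # a simple linearly separable rule

n, m = X.shape
w = np.zeros((n, 1))      # initial weights
b = 0.0                   # initial bias
a = 0.1                   # learning rate
no_of_epochs = 1000

for i in range(no_of_epochs):
    z = np.dot(w.T, X) + b
    y_hat = sigmoid(z)
    dz = y_hat - Y
    dw = np.dot(X, dz.T) / m
    db = np.sum(dz) / m
    w = w - a * dw
    b = b - a * db
    if i % 100 == 0:
        y_clip = np.clip(y_hat, 1e-10, 1 - 1e-10)   # avoid log(0)
        cost = -np.sum(Y * np.log(y_clip) + (1 - Y) * np.log(1 - y_clip)) / m
        print(f"epoch {i}: cost = {cost:.4f}")

The printed cost should shrink steadily from one report to the next, which is exactly the behaviour described above.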

Conclusion

With this, we have reached the end of the second article. In this article we have seen how the partial derivatives for gradient descent are derived. We also extended our algorithm to accommodate more than just one training example. In the upcoming articles, we will continue to improve upon our Python code and extend it to handle more complex neural networks.

Until next time…

Preetham

