Ahmad Saeed

For simplicity, consider a single-layer neural network with one neuron and one input. The neuron has the following components:

Input: $x$

Weight: $W$

Bias: $b$

Function: $z=Wx+b$

Activation: $ReLU(z) = \max(0, z)$

Output: $y = ReLU(z)$

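The typical feed-forward process works as follows: the neuron takes the input $x$, computes $z = Wx + b$, applies the activation, and produces the output $y = ReLU(z)$.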
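The feed-forward step for this one-neuron network can be sketched in a few lines of Python. This is only a minimal illustration: the function names `relu` and `forward` and the numeric values are assumptions chosen for the example, not part of any library.

```python
def relu(z: float) -> float:
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def forward(x: float, W: float, b: float) -> tuple[float, float]:
    """Return (z, y) for one neuron with one input."""
    z = W * x + b   # linear step: z = Wx + b
    y = relu(z)     # activation: y = ReLU(z)
    return z, y


# Hypothetical values, chosen only for illustration.
z, y = forward(x=2.0, W=0.5, b=0.5)
print(z, y)          # 1.5 1.5 (z > 0, so ReLU passes it through)

z_neg, y_neg = forward(x=2.0, W=-1.0, b=0.5)
print(z_neg, y_neg)  # -1.5 0.0 (z <= 0, so ReLU outputs 0)
```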

Once we get the output $y$, we compare it with the actual label $y_{true}$, which is the value we want to predict. This is done with a loss function: $L = \frac{1}{2}(y-y_{true})^2$. Once the loss is found, we backpropagate gradients to update the weight and bias so that the loss decreases. The gradient we need is $\frac{dL}{dW}$; with a single weight, the gradient is just a derivative. Because $L$ depends on $W$ only through $y$ and $z$, we use the chain rule: we take the derivative of each step and multiply them:

$\frac{dL}{dW} = \frac{dL}{dy} \cdot \frac{dy}{dz} \cdot \frac{dz}{dW}$, where $\frac{dL}{dy} = y - y_{true}$ and $\frac{dz}{dW} = x$. The middle term $\frac{dy}{dz}$ is the derivative of $ReLU$: it is 1 if $z > 0$ and, by convention, 0 if $z \le 0$. So if $z > 0$ the neuron is active and the gradient passes through; if $z \le 0$ the gradient is zero and nothing is learned from that input. A neuron that stays at $z \le 0$ for every input is said to have died. The bias $b$ shifts $z$, so a suitable bias can reduce the risk of the neuron dying.
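To make the chain rule concrete, here is a small sketch that computes each factor separately and checks the result against a finite-difference estimate. The helper names (`gradients`, `numeric_dL_dW`) and the input values are assumptions made for illustration, and the ReLU derivative at $z = 0$ is taken to be 0, which is a common convention rather than the only choice.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def loss(y: float, y_true: float) -> float:
    """Squared-error loss from the text: L = 1/2 (y - y_true)^2."""
    return 0.5 * (y - y_true) ** 2


def gradients(x: float, W: float, b: float, y_true: float) -> tuple[float, float]:
    """Return (dL/dW, dL/db) using the chain rule."""
    z = W * x + b
    y = relu(z)
    dL_dy = y - y_true                 # derivative of the loss w.r.t. y
    dy_dz = 1.0 if z > 0 else 0.0      # derivative of ReLU (0 at z <= 0 by convention)
    dz_dW = x                          # derivative of z = Wx + b w.r.t. W
    dz_db = 1.0                        # derivative of z = Wx + b w.r.t. b
    return dL_dy * dy_dz * dz_dW, dL_dy * dy_dz * dz_db


def numeric_dL_dW(x: float, W: float, b: float, y_true: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of dL/dW, used only as a sanity check."""
    up = loss(relu((W + eps) * x + b), y_true)
    down = loss(relu((W - eps) * x + b), y_true)
    return (up - down) / (2 * eps)


# Hypothetical example values.
dL_dW, dL_db = gradients(x=2.0, W=0.5, b=0.5, y_true=1.0)
print(dL_dW, dL_db)  # 1.0 0.5
print(abs(dL_dW - numeric_dL_dW(2.0, 0.5, 0.5, 1.0)) < 1e-5)  # True
```

With `W=-1.0` instead, $z$ is negative, `dy_dz` is 0, and both gradients vanish, which is the "gradient doesn't pass" case described above.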

After the gradients have been calculated, we update the weight by moving in the direction opposite to the gradient: $W_{new} = W_{old} - n \cdot \frac{dL}{dW}$, where $n$ is the learning rate, usually a small number that keeps each update from taking too large a step. The bias is updated the same way: $b_{new} = b_{old} - n \cdot \frac{dL}{db}$, where $\frac{dL}{db}$ comes from the same chain rule with $\frac{dz}{db} = 1$. This step of updating the trainable parameters (weights and biases) using the computed gradients is called Gradient Descent.
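The update rule translates almost directly into code. In this sketch the function name `gradient_descent_step` and the argument `learning_rate` (the $n$ in the formulas) are illustrative names, and the gradient values passed in are made up for the example.

```python
def gradient_descent_step(
    W: float, b: float, dL_dW: float, dL_db: float, learning_rate: float = 0.1
) -> tuple[float, float]:
    """Move W and b a small step opposite to their gradients."""
    W_new = W - learning_rate * dL_dW
    b_new = b - learning_rate * dL_db
    return W_new, b_new


# Hypothetical starting parameters and gradients.
W, b = 0.5, 0.5
W, b = gradient_descent_step(W, b, dL_dW=1.0, dL_db=0.5, learning_rate=0.1)
print(W, b)  # both values are slightly smaller than before the step
```

A positive gradient means increasing the parameter would increase the loss, so the step decreases it; a negative gradient does the opposite.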

This process of feed-forward and backpropagation repeats until the loss converges to a minimum. One iteration of going forward through the inputs and back through the gradients is called a Full Pass. The main goal is to minimize the loss as much as possible. In general, the lower the loss, the more closely the model's predictions match the labels it was trained on.
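Putting the pieces together, the loop below repeats the forward step, the gradient computation, and the update over a tiny dataset. Everything specific here is an assumption for illustration: the data follows $y = 2x + 1$, the starting values, learning rate, and number of passes are arbitrary, and the parameters are updated after every example rather than once per pass.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def train(xs, ys, W=0.5, b=0.0, learning_rate=0.05, passes=1000):
    """Fit one ReLU neuron with per-example gradient descent."""
    for _ in range(passes):
        for x, y_true in zip(xs, ys):
            # Feed-forward
            z = W * x + b
            y = relu(z)
            # Backpropagation (chain rule)
            dL_dy = y - y_true
            dy_dz = 1.0 if z > 0 else 0.0
            dL_dW = dL_dy * dy_dz * x
            dL_db = dL_dy * dy_dz
            # Gradient descent update
            W -= learning_rate * dL_dW
            b -= learning_rate * dL_db
    return W, b


def mean_loss(xs, ys, W, b):
    """Average of 1/2 (y - y_true)^2 over the dataset."""
    return sum(0.5 * (relu(W * x + b) - t) ** 2 for x, t in zip(xs, ys)) / len(xs)


# Hypothetical toy data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print("loss before:", mean_loss(xs, ys, 0.5, 0.0))
W, b = train(xs, ys)
print("loss after: ", mean_loss(xs, ys, W, b))
print("W, b:", round(W, 3), round(b, 3))  # should end up close to 2 and 1
```

Because all inputs and targets in this toy data are positive, $z$ stays above zero throughout training, so the neuron never dies here; with other data or a worse starting point that is not guaranteed.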

This was just a simple single-layer, single-neuron neural network. There are far more complex ones, with more neurons and more layers to capture nuanced patterns in data. The good thing is that the structure is mostly the same, just with more neurons stacked together, so understanding more complex neural network architectures becomes a little easier.

How does a Neural Network work?