 # Moving Below the Surface (3): Gradient Descent — William

This week, we are going to talk about gradient descent. The beauty of an artificial neural network is that it uses a very simple algorithm, gradient descent, to optimize itself and bring down the error of the system.

Gradient descent can quickly find a local minimum of a function. In machine learning, this method is applied to the error function $E$ (sometimes also called the cost function) to find the point of minimum error, which is exactly what we are looking for when training the system.

To visualize this, let us assume the set of example data shown below. Our task is to find the line of best fit $y = mx + b$, performing a simple linear regression. Given the data, we want to determine the coefficients $m$ and $b$ that best represent the points on the graph. To do this, we define a measure of "representativeness": the error function $E(m, b)$, which takes the two coefficients as independent variables. In this case, we will use the average of the squared differences as our error function. Essentially, we take the square of each difference between the value predicted by our best-fit line and the actual value, and average over all $N$ data points $(x_i, y_i)$ in the dataset. Expressed mathematically, the error function looks like this:

$$E(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2$$

Now, with the help of MATLAB, we can plot the error function as a 3-dimensional surface. We can see straight away that the global minimum lies at $m = 5$ and $b = 3$. Gradient descent offers a systematic way to find a local minimum (the global one, if you are lucky) without knowing in advance where it is. Of course, gradient descent has its own limitations, but we will defer that discussion to later posts.

The goal of the algorithm is to find the minimum point $(m^*, b^*)$ starting from a random point $(m_0, b_0)$. Recall from multivariable calculus that the gradient of a function at a given point is the vector formed by the partial derivatives of the function along each axis, i.e.

$$\nabla E(m, b) = \left( \frac{\partial E}{\partial m}, \frac{\partial E}{\partial b} \right)$$

The gradient vector points in the direction in which the function increases most steeply. Therefore, to approach the minimum, we move in the opposite, or negative, direction of the gradient vector. A more formal way to express this is the following:

$$m_{j+1} = m_j - \gamma \frac{\partial E}{\partial m}(m_j, b_j), \qquad b_{j+1} = b_j - \gamma \frac{\partial E}{\partial b}(m_j, b_j)$$

In these update equations, $(m_j, b_j)$ is the point we start from at each step, while $(m_{j+1}, b_{j+1})$ is the next one. The parameter $\gamma$ is the learning rate, which controls the size of each step.
Again, we will leave how to find a suitable learning rate for later posts.
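As a concrete sketch of the error function, here is a minimal Python/NumPy version (Python rather than MATLAB, and with a small made-up dataset lying exactly on $y = 5x + 3$, since the post's actual data points are not reproduced here):

```python
import numpy as np

def error(m, b, x, y):
    """Mean squared error E(m, b) of the line y = m*x + b over the data."""
    return np.mean((y - (m * x + b)) ** 2)

# Illustrative data lying exactly on y = 5x + 3, so E(5, 3) is zero.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 5 * x + 3

print(error(5.0, 3.0, x, y))  # exactly at the minimum
print(error(0.0, 0.0, x, y))  # far from the minimum, large error
```

Because the data sits exactly on the line, the error vanishes at $(m, b) = (5, 3)$ and grows as we move away from it, which is the surface the 3-D plot visualizes.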

The challenge of this method comes when we find the partial derivatives of the error function with respect to the coefficients. In this case, the two derivatives of our error function are the following:

$$\frac{\partial E}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} x_i \left( (m x_i + b) - y_i \right), \qquad \frac{\partial E}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \left( (m x_i + b) - y_i \right)$$

Note that the average of squared differences is relatively easy to differentiate compared to other error functions.

The following is a single run of the gradient descent method over 20 steps with a learning rate of 0.01, and we can see that the algorithm approaches the true minimum quite well. We will explore more machine learning topics next week. Stay tuned!
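Such a run can be sketched end to end in a few lines of Python/NumPy (again with made-up data on $y = 5x + 3$; the exact trajectory depends on the dataset, so treat this as an illustration rather than a reproduction of the plot above):

```python
import numpy as np

def gradient_descent(x, y, m0=0.0, b0=0.0, lr=0.01, steps=20):
    """Minimize the mean squared error of y = m*x + b by gradient descent."""
    m, b = m0, b0
    for _ in range(steps):
        residual = (m * x + b) - y
        m -= lr * 2 * np.mean(residual * x)   # step against dE/dm
        b -= lr * 2 * np.mean(residual)       # step against dE/db
    return m, b

# Toy data on the line y = 5x + 3; starting from the origin, 20 steps
# at lr = 0.01 already move (m, b) toward the minimum.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 5 * x + 3
m, b = gradient_descent(x, y)
print(m, b)
```

With a small enough learning rate, the error shrinks at every step; with too large a rate the iterates can overshoot and diverge, which is part of the learning-rate discussion promised for later posts.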
