In lesson 1, we were introduced to the basics of linear regression in a univariate context.
Now in lesson 2, we introduce models that take multiple input features (multivariate).
We also cover the Normal equation, mean normalisation, and feature scaling.
For multivariate linear regression with n features, the hypothesis for one training example is h(x) = theta_0 * x_0 + theta_1 * x_1 + ... + theta_n * x_n. We now have multiple x features, and multiple theta weights.
We can convert this hypothesis into matrix form as h(x) = theta^T * x. Note that this is for ONE training example.
The above is more concisely expressed in this vectorised formula for the hypothesis
Summary of the info in the previous list
Notice that there is no x_0 term in the original hypothesis. For convenience we define x_0 = 1 (so that theta_0 * x_0 = theta_0, the intercept), which lets us write the hypothesis as a single matrix multiplication.
Vectorised Hypothesis
With multiple training examples, we use the vectorised formula h = X * theta.
Now with multiple features and multiple training examples, you construct the matrix X and the vector theta like this.
Note: Each row of the X matrix represents one training example.
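As a concrete sketch (invented numbers, plain Python rather than the course's Octave): build the design matrix by prepending x_0 = 1 to every example, then apply the hypothesis row by row.

```python
# Hypothetical training set: 3 examples, 2 features each (numbers invented).
data = [[2.0, 3.0],
        [1.0, 5.0],
        [4.0, 1.0]]

# Prepend x_0 = 1 to every training example to form the design matrix X.
X = [[1.0] + row for row in data]

theta = [0.5, 2.0, -1.0]  # [theta_0, theta_1, theta_2], picked arbitrarily

def hypothesis(x_row, theta):
    # h(x) = theta^T * x, written as an explicit dot product
    return sum(t * x for t, x in zip(theta, x_row))

predictions = [hypothesis(row, theta) for row in X]
```

Each prediction is one `theta^T * x` product, i.e. one row of `X * theta`.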
Recall that the cost function over all training examples is J(theta) = (1/(2m)) * sum over i of (h(x(i)) - y(i))^2. (Up to the factor of one half, this is the "mean squared error".)
We can replace h(x) now with the vectorised form in the previous list. This is the form for multiple features, and multiple training examples.
Vectorised Cost Function - Therefore, the vectorised version of the cost function becomes J(theta) = (1/(2m)) * (X*theta - y)^T * (X*theta - y), where X*theta is the hypothesis.
Remembering that X * Theta is the vectorised hypothesis
Now, because we have arranged theta, X and y as matrices, we can translate the hypothesis into vectorised code
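One way to sketch that translation in plain Python (loops standing in for Octave's matrix operations; the data below is invented):

```python
def cost(X, theta, y):
    # J(theta) = (1/(2m)) * sum((X*theta - y)^2), the vectorised cost written as loops
    m = len(y)
    residuals = [sum(t * x for t, x in zip(theta, row)) - yi
                 for row, yi in zip(X, y)]
    return sum(r * r for r in residuals) / (2 * m)

# Toy data (invented): X already includes the x_0 = 1 column.
X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
```

With theta = [0, 1] the hypothesis matches y exactly, so the cost is 0.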
The general form for gradient descent
Gradient Descent is a process that lets you "descend" down the cost function in order to find the minimum/optimal theta values of the cost function
Gradient descent algorithm. Repeat until convergence, for j = 0 to n (the number of features): theta_j := theta_j - alpha * (1/m) * sum over i of (h(x(i)) - y(i)) * x_j(i), updating every theta_j simultaneously. Here theta is a vector of weights, each x(i) is a vector of features, and y is the vector of targets.
Breaking down the above into its separate components for each theta. Similar to the univariate case, but now we compute the descent step for each theta_j in the cost function (remember that for each theta_j you take the partial derivative of J with respect to theta_j).
Remember also that this section of the update rule is the partial derivative of J with respect to theta_j
The vectorised form of gradient descent is theta := theta - (alpha/m) * X^T * (X*theta - y).
Where X is the design matrix, and theta and y are vectors. Alpha is the learning rate and is a scalar. And m is the number of training examples.
Gradient Descent - Vectorized!
A single gradient descent step in Matlab code. We perform this for each gradient step iteration. X is a matrix; the hypothesis, theta, and y are vectors.
Full code example, with iteration loop
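A plain-Python sketch of the same loop (the course uses Octave; the data here is invented, fitting y = 1 + x on three points):

```python
def gradient_descent(X, y, alpha, iters):
    # Repeat: theta := theta - (alpha/m) * X^T * (X*theta - y)
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        preds = [sum(t * x for t, x in zip(theta, row)) for row in X]
        errors = [p - yi for p, yi in zip(preds, y)]
        # grad[j] is the partial derivative of J with respect to theta_j
        grad = [sum(errors[i] * X[i][j] for i in range(m)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Invented data: y = 1 + x exactly, so theta should converge to about [1, 1].
theta = gradient_descent([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]],
                         [2.0, 3.0, 4.0], alpha=0.1, iters=2000)
```

Note that every theta_j is updated from the same `errors`, i.e. simultaneously.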
The learning rate is a scalar value, alpha
You need to choose an appropriate value of the learning rate alpha to tune gradient descent, to make sure that convergence happens (ie the cost eventually stops decreasing by non-trivial amounts)
If you set the learning rate too small, convergence will be slow. If you set it too high, it may not converge at all
To make sure gradient descent is working, plot the cost J(theta) against the number of iterations, and make sure it is decreasing. If it decreases by less than some sufficiently small value (say 10^-3) in an iteration, we can declare that gradient descent has converged.
Summarising learning rate guidelines for alpha
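One way to sketch that convergence check in plain Python (the 10^-3 threshold from above; the data is invented):

```python
def descend_with_history(X, y, alpha, tol=1e-3, max_iters=10000):
    # Record J(theta) each iteration; stop once the decrease falls below tol.
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    history = []
    prev_J = float('inf')
    for _ in range(max_iters):
        errors = [sum(t * x for t, x in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        J = sum(e * e for e in errors) / (2 * m)
        history.append(J)
        if prev_J - J < tol:
            break  # declare convergence
        prev_J = J
        grad = [sum(errors[i] * X[i][j] for i in range(m)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta, history

theta, history = descend_with_history(
    [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [2.0, 3.0, 4.0], alpha=0.1)
```

In practice you would plot `history` against the iteration count; here it simply shows J decreasing until the stopping rule fires.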
The equivalent in Octave of X^T (X transpose) is X' (prime). So in Octave, to compute ((X^T * X)^-1) * X^T * y we can write pinv(X' * X) * X' * y
Gradient Descent vs Normal Equation
Summary
Normal equation: solves directly for the theta that minimises the cost function J
Basically, set the partial derivative of J(theta) with respect to theta_j to 0 for every j, and solve for theta; this gives theta = (X^T * X)^-1 * X^T * y
Example ...
If for one training example ...
Matrix X - To calculate the design matrix X ...
Matrix Y - To calculate the matrix Y (it's simply the vector of all results y(i) in the training set)
In summary, for X, and Y ....
X is the design matrix
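The whole normal-equation calculation can be sketched in plain Python; the 2x2 inverse is hand-coded because this example has one feature plus the x_0 column (numbers invented):

```python
def normal_equation(X, y):
    # theta = (X^T X)^-1 X^T y, for a design matrix with exactly two columns.
    a = sum(row[0] * row[0] for row in X)          # (X^T X)[0][0]
    b = sum(row[0] * row[1] for row in X)          # (X^T X)[0][1] == [1][0]
    c = sum(row[1] * row[1] for row in X)          # (X^T X)[1][1]
    d = sum(row[0] * yi for row, yi in zip(X, y))  # (X^T y)[0]
    e = sum(row[1] * yi for row, yi in zip(X, y))  # (X^T y)[1]
    det = a * c - b * b
    # 2x2 inverse applied to (X^T y)
    return [(c * d - b * e) / det, (a * e - b * d) / det]

# Invented data: y = 1 + x exactly, so theta comes out [1, 1] with no iteration.
theta = normal_equation([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [2.0, 3.0, 4.0])
```

Unlike gradient descent, there is no learning rate and no loop; the cost here is inverting X^T X, which Octave's `pinv` handles in the general case.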
Dividing by the (max - min) value should get x into approximately the range -1 < x < 1
We use feature scaling so gradient descent takes fewer steps to converge
Polynomial Regression. If you choose your features in this way (e.g. x, x^2, x^3), then feature scaling becomes very important, because the feature ranges quickly become very different.
As an example, you can combine features to simplify a hypothesis expression
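For example (invented numbers): cubing a feature blows up its range relative to the original, which is why scaling matters so much for polynomial features.

```python
sizes = [1.0, 2.0, 3.0]  # invented raw feature values

# Polynomial features x, x^2, x^3 - their ranges diverge quickly.
poly = [[s, s ** 2, s ** 3] for s in sizes]

# Scale each column into roughly [-1, 1] by dividing by its max absolute value.
maxes = [max(abs(row[j]) for row in poly) for j in range(3)]
scaled = [[row[j] / maxes[j] for j in range(3)] for row in poly]
```

After scaling, all three columns sit in a comparable range, so one learning rate works for all of them.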
Summary:
Take (x - average of x) / (max - min)
... in other words
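In plain Python, that normalisation step looks like this (values invented):

```python
def mean_normalise(xs):
    # (x - mean) / (max - min) for every value in the feature column
    mu = sum(xs) / len(xs)
    rng = max(xs) - min(xs)
    return [(x - mu) / rng for x in xs]

normalised = mean_normalise([1.0, 2.0, 3.0])
```

The result is centred on 0 with a range of about 1, which is what gradient descent wants.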