Neural networks have been around for decades. But only recently, with the rise of big data and the availability of ever-increasing computational power, have we really started to see a lot of exciting progress in this branch of machine learning.
Many of the most groundbreaking advances in machine learning over the past decade, from computer vision to NLP, can be attributed to the rise of neural networks, and in particular deep learning.
Neural networks roughly model the "gating" functionality of biological neurons in the brain. In machine learning, we represent these neurons as "activation" units. In a computer neural network, these activation units are arranged in a number of different "layers". At its core, each activation unit in the network takes its inputs, computes a weighted sum of them, runs that sum through some sort of non-linear activation function (such as a sigmoid or ReLU), and produces an output value which is then passed to the next layer in the network.
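To make the idea of an activation unit concrete, here is a minimal sketch in Python (standard library only, rather than the Matlab used later in the course). The function names `sigmoid`, `relu` and `activation_unit` and the example weights are illustrative assumptions, not taken from the course material.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out negatives."""
    return max(0.0, z)

def activation_unit(inputs, weights, activation=sigmoid):
    """Weighted sum of the inputs, passed through a non-linear activation."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

# Hypothetical inputs and weights, chosen so the results are easy to check.
print(activation_unit([1.0, 2.0], [0.5, -0.25]))       # sigmoid(0.0) = 0.5
print(activation_unit([1.0, 2.0], [0.5, -1.0], relu))  # relu(-1.5) = 0.0
```

The output of one unit becomes one of the inputs to every unit in the next layer.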
By arranging large numbers of these layers and activation units together, we can construct a neural network that takes some sort of labelled data as its input and learns to make relevant predictions.
Adding more layers to a neural network can result in more accurate predictions, particularly when plenty of training data is available. The term deep learning refers to neural networks that have multiple layers of activation units.
This post is a theoretical introduction to neural nets, and is a set of summarised notes from lesson 4 of Andrew Ng's Stanford Machine Learning course on Coursera. We start by learning how to represent neural networks in terms of math and code. We cover the structure of a basic neural net and what a hypothesis function looks like in a neural net, and we begin to translate some of the theory into Matlab code.
A simplified representation of the above looks like this:
A neural network is composed of many neurons. Neurons are arranged into layers; each layer computes a linear combination of its inputs and then applies a non-linear activation. The internal layers of a neural network are called the "hidden" layers.
Now suppose we have a network with one hidden layer:
For the first activation unit, its value is obtained by applying the sigmoid function to the weighted sum of its inputs:
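Written out in the course's notation, and assuming three input features plus the bias input x0 = 1, the first hidden unit would look roughly like this (a reconstruction, with g being the sigmoid function):

```latex
a_1^{(2)} = g\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1
          + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right),
\qquad g(z) = \frac{1}{1 + e^{-z}}
```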
Note that for that single activation unit (a), we can collect the thetas into a vector and the X features into a vector.
The values of all activation units (a) can then be obtained like so:
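As a rough sketch of the same calculation in code, the Python below computes every unit in a layer one row of weights at a time. The helper name `layer_activations` and all weight values are hypothetical, chosen only so that the results are easy to verify by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_activations(theta, inputs):
    """Activations of the next layer.

    theta:  one row of weights per target unit; column 0 multiplies the bias.
    inputs: the source layer's values WITHOUT the bias unit.
    """
    with_bias = [1.0] + list(inputs)  # prepend the bias unit x0 = 1
    return [sigmoid(sum(w * x for w, x in zip(row, with_bias)))
            for row in theta]

# Hypothetical weights: 3 hidden units, each with 1 bias weight + 3 input weights.
theta1 = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
# 1 output unit, fed by the bias plus the 3 hidden units.
theta2 = [[0.0, 1.0, 1.0, 1.0]]

a2 = layer_activations(theta1, [0.0, 0.0, 0.0])  # every hidden unit sees z = 0
h = layer_activations(theta2, a2)                # hypothesis: sigmoid(1.5)

print(a2)  # [0.5, 0.5, 0.5]
print(h)
```

Each row of a Theta matrix holds the weights for one target unit, which is why the same helper works for both the hidden layer and the output layer.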
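The dimensions of these matrices of weights are determined as follows: if a network has s_j units in layer j and s_(j+1) units in layer j+1, then Theta^j has s_(j+1) rows and (s_j + 1) columns.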
The +1 comes from the addition in Theta^j of the "bias nodes", x0 and Theta0^j.
Note: Knowing the dimensions of the Theta matrix is important. When you are using matrix multiplication with neural nets, it is useful to know the order in which to apply the Theta matrix.
The number of rows of each Theta matrix corresponds to the number of "target" activation units.
The number of columns of each Theta matrix corresponds to the number of "source" input units, plus one for the bias unit.
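The row and column rules above can be captured in a couple of lines. This is only a sketch: the helper names `theta_shape` and `network_shapes` and the layer sizes in the examples are assumptions for illustration.

```python
def theta_shape(source_units, target_units):
    """Shape (rows, columns) of the weight matrix between two adjacent layers.

    Rows:    one per "target" activation unit in layer j+1.
    Columns: one per "source" unit in layer j, plus one for the bias node.
    """
    return (target_units, source_units + 1)

def network_shapes(layer_sizes):
    """Shapes of every Theta matrix for a list of layer sizes (bias excluded)."""
    return [theta_shape(s, t) for s, t in zip(layer_sizes, layer_sizes[1:])]

print(theta_shape(2, 4))               # (4, 3)
print(network_shapes([400, 25, 10]))   # [(25, 401), (10, 26)]
```

Checking these shapes before multiplying is a quick way to catch a Theta matrix applied in the wrong order.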
Activation of unit i, in layer j
Notation for activation units
Each layer gets its own matrix of weights, Theta^j: the matrix of weights controlling the function mapping from layer j to layer j+1.
Reiterating how to obtain values of the activation units.
Now we define a new variable "z" that holds the weighted sum of the inputs and theta weights, i.e. everything that is passed into the sigmoid function g.
In other words, for layer j=2 and node k, the variable z will be:
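In the course's notation, and assuming n input features with bias input x0 = 1, this would read approximately as follows (a reconstruction of the formula):

```latex
z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n,
\qquad a_k^{(2)} = g\left(z_k^{(2)}\right)
```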
Writing x and z^j as vectors gives us this:
Now we can express z for a general layer j as:
Therefore we can also express z^(j+1) like this:
Therefore, for the activation units a of layer j, we can apply the sigmoid function g "element-wise" to the vector z^j:
Which gives us the hypothesis:
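Collected in one place, the vectorised steps can be written as below. This is a reconstruction in the course's notation, assuming a^(1) = x, that a bias unit a_0 = 1 is added to each layer's activations before the next multiplication, and that layer j+1 is the output layer.

```latex
\begin{aligned}
z^{(j)}     &= \Theta^{(j-1)} a^{(j-1)} \\
a^{(j)}     &= g\left(z^{(j)}\right) \\
z^{(j+1)}   &= \Theta^{(j)} a^{(j)} \\
h_\Theta(x) &= a^{(j+1)} = g\left(z^{(j+1)}\right)
\end{aligned}
```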
For example, to compute the a^(2) layer:
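A minimal Python sketch of this step, using only the standard library in place of the course's Matlab, might look like the following. The `matvec` helper, the inputs and the weights are all hypothetical, picked so the numbers can be checked by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Matrix-vector product: one dot product per row of the matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical values: 2 input features, 3 hidden units.
x = [2.0, -1.0]
theta1 = [[0.0, 1.0, 2.0],    # shape 3 x (2 + 1); column 0 is the bias weight
          [1.0, 0.0, 1.0],
          [-1.0, 0.5, 0.0]]

a1 = [1.0] + x                  # a(1) = x with the bias unit prepended
z2 = matvec(theta1, a1)         # z(2) = Theta(1) * a(1)
a2 = [sigmoid(z) for z in z2]   # a(2) = g(z(2)), applied element-wise

print(z2)  # [0.0, 0.0, 0.0]
print(a2)  # [0.5, 0.5, 0.5]
```

To continue to the next layer, a bias unit would be prepended to `a2` before multiplying by the next Theta matrix.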
See the Coursera summary for a step-by-step run-through of how to vectorise a neural net: https://www.coursera.org/learn/machine-learning/supplement/YlEVx/model-representation-ii
The above code:
Calculating training set accuracy
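As a sketch of the idea, training set accuracy is simply the percentage of examples whose predicted label matches the true label. The function name `accuracy_percent` and the example labels below are assumptions for illustration.

```python
def accuracy_percent(predictions, labels):
    """Percentage of examples whose predicted label equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)

# Hypothetical predictions against hypothetical true labels: 3 of 4 match.
print(accuracy_percent([1, 2, 3, 3], [1, 2, 3, 4]))  # 75.0
```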
Predicting by passing in a single training example.
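A hedged sketch of what prediction for one example could look like in Python: propagate the example through every layer, then pick the output unit with the highest activation. The names `forward` and `predict` and the tiny 2-2-2 network are hypothetical. Note that this sketch returns a 0-based class index, whereas Matlab code would typically use 1-based labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Propagate one example through every layer; returns the output layer."""
    activations = list(x)
    for theta in thetas:
        with_bias = [1.0] + activations  # add the bias unit for this layer
        activations = [sigmoid(sum(w * v for w, v in zip(row, with_bias)))
                       for row in theta]
    return activations

def predict(thetas, x):
    """Index (0-based) of the output unit with the highest activation."""
    output = forward(thetas, x)
    return max(range(len(output)), key=output.__getitem__)

# Hypothetical 2-2-2 network: the hidden layer copies the inputs through a
# sigmoid, and each output unit prefers one hidden unit over the other.
thetas = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],
]

print(predict(thetas, [3.0, -3.0]))  # 0
print(predict(thetas, [-3.0, 3.0]))  # 1
```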
Neural nets learn their own features: it's as if the neural network, instead of being constrained to feed the raw features x1, x2, x3 to logistic regression, gets to learn its own features, a1, a2, a3, to feed into the logistic regression. Depending on the parameters it chooses for Theta 1, it can learn some pretty interesting and complex features, and you can therefore end up with a better hypothesis than if you were constrained to use the raw features x1, x2, x3, or constrained to choose polynomial terms of x1, x2, x3, and so on.