In this post I will explain in relatively simple terms what a neural network is in the context of machine learning. I’ll sprinkle in some machine learning terminology along the way with links to resources you can learn more from.
We’ll start with a simple statement:
A neural network takes input data and transforms it to a desired output format. It can be trained to do this in a way that reflects the real world.
And now we break it down.
What is the input data?
Numbers, essentially. These numbers can be in various forms and represent various things. An image as input data might be a 3-D array of red, green, and blue color channel values:
A three-dimensional array representation of an image with blue, green, and red color channels.
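To make the idea concrete, here is a minimal sketch of a tiny image stored as a three-dimensional array, using plain nested Python lists. The pixel values and the `shape` helper are made up for illustration; real projects would typically use an array library instead.

```python
# A tiny 2x2 RGB "image": height x width x 3 color channels.
# Pixel values are hypothetical; each channel is an integer from 0 to 255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # row 1: a blue pixel, a white pixel
]

def shape(array):
    """Return the dimensions of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)

print(shape(image))  # (2, 2, 3)
print(image[0][1])   # [0, 255, 0], the green pixel
```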
Input data could also be one-dimensional rows of a spreadsheet, database table, or other table-like structure:
| UserId | Hours | Heart Rate (BPM) | Temperature (F) |
A table row that can be treated as one-dimensional array (vector).
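As a rough sketch, a single table row can be turned into a vector like this. The column names come from the table header above, but the values are invented, and leaving out `UserId` is my own assumption (an identifier is usually a label rather than a measurement).

```python
# Hypothetical row; only the column names come from the table above.
row = {"UserId": 17, "Hours": 7.5, "Heart Rate (BPM)": 62, "Temperature (F)": 98.6}

# Assumption: the identifier is left out because it is a label, not a measurement.
feature_names = ["Hours", "Heart Rate (BPM)", "Temperature (F)"]
vector = [float(row[name]) for name in feature_names]

print(vector)  # [7.5, 62.0, 98.6]
```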
What is the desired output format?
Again, it’s just numbers, but more specifically numbers that answer the problem you’re trying to solve. By “format” I just mean the shape of the numbers (array dimensionality and size). If your input data consists of images of different breeds of dogs and you want to know what kind of dog appears in each image, your desired output format could be a number that corresponds to a breed (you would assign each breed a number before training). If you also want to know the location of the dog in the image, the output could then be the breed and the pixel coordinates of the top left and bottom right corners of the bounding box containing the dog (a vector of 5 numbers).
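Here is a small sketch of what reading such a 5-number output could look like. The breed numbering, the coordinate order (x before y), and the output values are all assumptions made for illustration.

```python
# Hypothetical breed numbering, assigned before training.
breed_to_index = {"beagle": 0, "poodle": 1, "corgi": 2}
index_to_breed = {index: breed for breed, index in breed_to_index.items()}

# Hypothetical network output: breed number, then the top-left (x, y)
# and bottom-right (x, y) pixel coordinates of the bounding box.
output = [0, 34, 50, 180, 210]

breed = index_to_breed[output[0]]
top_left = (output[1], output[2])
bottom_right = (output[3], output[4])

print(breed, top_left, bottom_right)  # beagle (34, 50) (180, 210)
```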
How does the neural network transform the input data?
There are several operations (or layers) commonly used to transform data in neural networks. Some operations are incredibly simple, like ReLU, which replaces every input value that is less than zero with zero. Some are more complex, like convolution layers, which could, for example, slide a small 2D array called a kernel across a 2D input array and compute a weighted sum of the input values under it at each position, similar to a weighted average. Regardless of complexity, all of the layers take numbers in and spit numbers out. Again, these numbers are arrays of varying dimensions.
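Both operations fit in a few lines of plain Python. This is a simplified sketch, not how a real library implements them: the convolution here uses no padding and moves one position at a time, and the input and kernel values are made up.

```python
def relu(values):
    """Replace every negative value with zero."""
    return [max(0.0, v) for v in values]

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, one step at a time)
    and take a weighted sum of the values under it at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

print(relu([-2.0, 0.5, -0.1, 3.0]))  # [0.0, 0.5, 0.0, 3.0]

image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[0.25, 0.25],
          [0.25, 0.25]]  # equal weights: a plain average of each 2x2 patch
print(convolve2d(image, kernel))  # [[3.0, 4.0], [6.0, 7.0]]
```

Notice that the 3x3 input becomes a 2x2 output: numbers in, numbers out, with a different shape.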
This is an example of how a layer might transform its input. As you can see the shape changes from (100, 100, 3) to (50, 50, 3). The output array of each layer is called its activations.
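The text doesn’t say which layer produces this particular shape change, so as an assumption, here is a sketch of one operation that would: 2x2 max pooling, which keeps only the largest value in each 2x2 block of every color channel. The input values are arbitrary placeholders.

```python
def max_pool_2x2(image):
    """Keep the largest value in each 2x2 block, separately per channel."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [
            [
                max(image[r][c][ch], image[r][c + 1][ch],
                    image[r + 1][c][ch], image[r + 1][c + 1][ch])
                for ch in range(channels)
            ]
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]

# Placeholder 100 x 100 x 3 input; the values are arbitrary.
image = [[[(r + c + ch) % 256 for ch in range(3)] for c in range(100)]
         for r in range(100)]
activations = max_pool_2x2(image)

print(len(image), len(image[0]), len(image[0][0]))                    # 100 100 3
print(len(activations), len(activations[0]), len(activations[0][0]))  # 50 50 3
```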
The sequence of operations is what comprises the internals of the neural network (often called the “hidden layers”). FYI, the use of neural networks is considered “deep learning” when there are a bunch of hidden layers.
A key part of the layers with regard to training is the weights. Some layers have one or more arrays of numbers called weights (their dimensionality depends on the type of layer) that they use to perform their transformation. For example, the kernel in the convolutional layer I mentioned earlier would be included in its weights. The values of these weights can be changed during training as we’ll see later. All the weights across the layers of the network are what allow it to store information and “learn.”
The above basic matrix calculation demonstrates how a weight matrix might be used to transform the input in a neural network layer (this specific example is part of what occurs inside a type of neural network layer called a fully-connected layer). For reference, the input matrix above could represent pixels of an image (black and white in this case since the matrix doesn’t have a third dimension for color channels), or more likely, the output activations of some layer before it.
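In code, that kind of calculation is a matrix multiplication. The sketch below uses made-up input and weight values rather than the numbers from the figure, and it covers only the multiplication step of a fully-connected layer.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

inputs = [[1.0, 2.0, 3.0]]   # 1 sample with 3 input values (hypothetical)
weights = [[0.5, -1.0],      # 3 x 2 weight matrix (hypothetical)
           [0.0, 2.0],
           [1.0, 0.5]]

output = matmul(inputs, weights)
print(output)  # [[3.5, 4.5]]
```

The weight matrix decides the output shape: 3 values in, 2 values out.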
So how do you know what hidden layers to use and how to hook them up? To build your network, you need some understanding of the purpose that the different kinds of layers serve and how they might help you extract the important features from your input data and eventually output an answer in the desired form. That said, there are popular neural network architectures that deal with certain input data types and solve particular kinds of problems well, so you often don’t have to start from scratch. For example, image recognition is often done using a ResNet architecture.
How do we train?
To start with, the neural network, a.k.a. a series of matrix operations, doesn’t know how to do much of anything. All of the weights throughout its layers comprise a “brain” of sorts – a means of storing learned information. But those weights start out as random numbers, so this brain has no idea what it’s doing. The network needs to learn what the values of these weights should be.
During training we pass in input data samples to the network a few at a time. For each sample we pass in, we get an output. How do we take that output and learn from it?
Training relies on an error function, which is a measure of how much the output (what the neural network produces when fed an input sample) differs from the actual answer. This “measure” is of course just a number. Error functions, also called loss functions, come in many different forms, but they all tell you how wrong the model is with respect to the training data.
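One common error function is mean squared error, picked here only as an example: it averages the squared differences between the output and the actual answer. The numbers are made up.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between output and answer."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(mean_squared_error([2.0, 4.0], [1.0, 2.0]))            # 2.5
```

A perfect output gives an error of zero, and the error grows as the output drifts from the answer.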
The need for an “actual answer” brings up an important point I skipped over when introducing input data. For each sample of input data you train with, you must have a corresponding correct output data sample. For instance, each picture of a dog in the training data should be labeled with a breed (and possibly coordinates of the bounding box if that’s what you’re interested in). Providing the model with an input data sample and an answer is what makes this approach supervised learning. The network uses these sample and answer pairs to “learn.” That’s the “real world” piece of our original statement about neural networks. We tell the model how the real world behaves through the sample and answer pairs we train it on, and it learns how to emulate them.
So now we know what the model uses to learn, but how does the model use it? As you feed training samples into the neural network, each of the weights in the layers is slightly incremented or decremented to decrease the value of the error function. How do we know whether increasing or decreasing a specific weight will increase or decrease the error? This is possible because the error function and the layer operations are differentiable. In other words, you can take the error function’s derivative with respect to a weight and determine whether increasing or decreasing that weight will decrease or increase the error.
If you’ve tuned out at this point because “derivative” is too much calculus you don’t remember, just think of the learning process as tweaking each of the weights up and down a small amount and keeping the “tweak” that decreases the error. That’s the basic idea behind stochastic gradient descent and backpropagation: scary machine learning terms that describe a relatively simple process.
Little by little, sample by sample, this is how the network learns. Provide an input sample to the network, transform it to the output format, calculate the error based on what the output was supposed to be, calculate derivatives (backpropagation), increment or decrement each of the neural network weights, and repeat. Slightly simplified, but this is the basic idea of how neural networks are trained.
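Here is that loop for the smallest possible “network”: a single weight that multiplies its input. Everything in this sketch is an assumption chosen to keep the math readable: the training pairs, the squared-error function, the starting weight, and the learning rate. With one weight the derivative can be written by hand, so there is no real backpropagation through layers here.

```python
# Training pairs (input, correct answer) that follow the rule answer = 3 * input.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

weight = 0.0          # arbitrary starting value standing in for a random one
learning_rate = 0.05  # how big each adjustment is

for epoch in range(20):
    for x, target in samples:
        output = weight * x                    # transform the input
        error = (output - target) ** 2         # how wrong was the output?
        gradient = 2 * (output - target) * x   # derivative of error w.r.t. weight
        weight -= learning_rate * gradient     # nudge the weight to lower the error

print(round(weight, 3))  # 3.0
```

The weight starts out knowing nothing and ends up at the value that makes the error as small as possible.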
Once all of the training input data samples have been passed through the neural network once, a training epoch has completed. Training usually involves several epochs, and after each one the samples are usually shuffled.
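The bookkeeping around epochs might look like the sketch below. The sample data, the batch size, and the number of epochs are made up, and the actual training step is left as a comment. Shuffling at the start of each epoch is a small variation that has the same effect between epochs.

```python
import random

samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
batch_size = 2            # "a few at a time"
num_epochs = 3
rng = random.Random(0)    # fixed seed so the run is repeatable

samples_seen = 0
for epoch in range(num_epochs):
    rng.shuffle(samples)  # new sample order for every epoch
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # A real training step (transform, error, weight update) would go here.
        samples_seen += len(batch)

print(samples_seen)  # 12
```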
This whole training process is automated (the grunt work math is buried in tools like TensorFlow, PyTorch, etc.), but it’s not foolproof. You won’t get a quality model by just training any neural network on any data. Here are a few of the challenges that machine learning engineers face to get a worthwhile model:
- Choosing the right neural network architecture and adjusting it if need be (swapping out layers, introducing additional layers, etc.)
- Providing the model with quality training data to learn from (problems such as missing data or too little data lead to poor models)
- Tuning a myriad of training parameters and model parameters (the training process has tunable values such as the learning rate, and many layers have tunable values of their own)
- Changing the training schedule (how many epochs, varying the learning rate as training proceeds, temporarily “freezing” layers so their weights don’t get updated)
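As one example of the last point, varying the learning rate as training proceeds can be as simple as a function of the epoch number. The function below is a hypothetical “step decay” schedule with made-up defaults, not a recommendation.

```python
def learning_rate_for_epoch(epoch, initial_rate=0.1, decay=0.5, step=10):
    """Multiply the learning rate by `decay` once every `step` epochs."""
    return initial_rate * decay ** (epoch // step)

print(learning_rate_for_epoch(0))   # 0.1
print(learning_rate_for_epoch(10))  # 0.05
print(learning_rate_for_epoch(25))  # 0.025
```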
After the error is satisfactorily minimized, the neural network can be put to the test on real world data. If it performs well on real world input data it has never seen before, we say that the model generalizes well. If it performs poorly, the model has probably overfit to the training data. In the example of recognizing dog breeds, if the only Beagle images the model saw during training were those of Beagles panting, an overfit model may not be able to generalize to images of Beagles with closed mouths. The model has “learned” that if a dog isn’t panting it’s not a Beagle, but that’s a detail in the training images we didn’t necessarily want it to pick up on. It has overfit and learned too much from the training data, reducing its ability to generalize to the vast variety of Beagle images in the real world. We can’t be expected to train it with pictures of Beagles in every possible orientation, so we want it to learn enough about Beagles to distinguish them from other breeds, but not too much.
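A deliberately extreme sketch of the difference: one “model” below simply memorizes its training pairs, while the other has picked up the underlying rule. Both models and all of the data are invented for illustration. The point is that the error on the training data alone can’t tell them apart.

```python
train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
unseen = [(4.0, 12.0), (5.0, 15.0)]   # data held back from training

memorized = dict(train)

def memorizing_model(x):
    """Extreme overfitting: knows only the exact inputs it has seen."""
    return memorized.get(x, 0.0)

def general_model(x):
    """Has learned the underlying rule instead of the individual samples."""
    return 3.0 * x

def average_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(average_error(memorizing_model, train))   # 0.0
print(average_error(memorizing_model, unseen))  # 184.5
print(average_error(general_model, train))      # 0.0
print(average_error(general_model, unseen))     # 0.0
```

This is why a model is checked against data it has never seen before anyone trusts it.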
The beauty of neural networks is that the right architecture with the right training and model parameters will learn the important features and interactions from the input data on its own. It’s not magic; it’s just math, and there is certainly work required to tune neural networks correctly. However, manually pinpointing the important features and interactions in a data set and translating them into an algorithm is considerably more work than tuning some parameters and letting a neural network do its thing.
This is by no means everything you need to know about neural networks, but it hopefully provides a handhold for someone interested in how they work. There are a lot of details that were not covered here and I encourage you to dive deeper into the types of neural network architectures, what they’re used for, and how the training process works.