Machine learning can be an intimidating field to get into. If you want to dive right in, I’m providing a brief, fairly high-level overview of how machine learning works. I’ll also cover a simple image recognition project using TensorFlow that you can get up and running with a little bit of Python knowledge.
To begin, let’s discuss some of the basics. Machine learning is largely practiced through the use of neural networks, which are composed of nodes that take an input and produce an output. These inputs and outputs are generally n-dimensional arrays. For example, an image provided as input to a neural network would be represented as a 3-dimensional array with dimensions (width, height, number of color channels). Each node can be thought of as an operation that manipulates the data that comes into it. Operations can be as simple as addition or multiplication or as complicated as a multidimensional convolution. Each node also has variables (n-dimensional arrays that are often referred to as “weights”) that it uses in its operation to transform the input. These weights are the neural network’s “brain.”

During training of a classification neural network, the weights are incrementally changed based on an error function, using an algorithm like stochastic gradient descent, which attempts to minimize the error. The error function is a node that takes the neural network’s output value and the known correct output value (which is available during training) and calculates the error, often something like cross-entropy. After the training data set has been sent through the neural network many times and the error has been minimized by altering the weights, features of the training data are “remembered” in the weights.

After training, when the network is given new data it has not seen before, it applies what it learned about the training data to this new data. In a classification neural network, this means detecting familiar features of the data and assigning probabilities that it belongs to each of the known classes. This is a high-level overview and isn’t quite the complete story, but hopefully it provides some insight into how a neural network functions.
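To make the softmax/cross-entropy idea above concrete, here is a minimal NumPy sketch (this is illustrative only, not code from the TensorFlow script we’ll use later):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def cross_entropy(probs, true_index):
    # The error is the negative log-probability assigned to the correct class
    return -np.log(probs[true_index])

# Toy example: raw scores from a 3-class classifier, with class 0 correct
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
loss = cross_entropy(probs, true_index=0)

# For softmax + cross-entropy, the gradient with respect to the logits is
# simply (probs - one_hot); gradient descent follows this to reduce the error
one_hot = np.array([1.0, 0.0, 0.0])
grad = probs - one_hot
```

Note that `grad[0]` is negative, meaning a gradient descent step would push the score of the correct class up, exactly the “incrementally changing the weights to minimize the error” behavior described above.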
So now let’s actually do some machine learning with a simple script that uses TensorFlow to detect bunnies in images. For the sake of brevity, I will leave it to the reader to get set up with Python and TensorFlow and to become familiar with how TensorFlow is structured. Additionally, there is a wealth of example TensorFlow code available in the models GitHub repo. The classify_image.py file from this repo will be the basis for my bunny-finding script. The classify_image.py script takes an image file, loads a pre-trained neural network trained on the ImageNet data set, and classifies the image, outputting the top 5 probabilities and the corresponding image categories. I will discuss some of the code in this script and explain how it can be repurposed to find bunnies.
This neural network (called the Inception-V3 model) has been pre-trained on the entire ImageNet database, a computationally intensive task that can take days depending on the resources available. Using the GraphDef file that classify_image.py downloads, the nodes of the Inception neural network and the trained weights associated with each node can be loaded, bypassing the intensive training step. In four lines of code, you have a trained graph capable of classifying images across 1,000 categories.
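Those four lines look roughly like the following sketch, using the TensorFlow 1.x API that classify_image.py was written against (the model path is an assumption about where you extracted the downloaded zip file):

```python
def load_inception_graph(model_path='classify_image_graph_def.pb'):
    """Load the pre-trained Inception-V3 GraphDef into the default graph.

    Mirrors the graph-loading step in classify_image.py; the .pb filename
    here is assumed to be the one unzipped from the downloaded archive.
    """
    import tensorflow as tf  # TF 1.x API, as used by classify_image.py

    with tf.gfile.FastGFile(model_path, 'rb') as f:
        # Parse the serialized GraphDef (nodes plus trained weights) ...
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        # ... and import it into the default graph, skipping training entirely
        tf.import_graph_def(graph_def, name='')
```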
Now a TensorFlow session can be started and inference run on an image. First, provide the image input tensor with a string containing the raw JPEG image data. Tensors in the loaded graph can be accessed by name; the two that need to be accessed here are ‘softmax:0’ and ‘DecodeJpeg/contents:0’.
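A sketch of this inference step, assuming the graph from the previous step has already been imported (the function name is mine, not from classify_image.py):

```python
def classify_jpeg(image_path):
    """Run the loaded Inception graph on one JPEG and return the probabilities.

    Assumes the GraphDef has already been imported into the default graph;
    the tensor names are the ones used by classify_image.py.
    """
    import tensorflow as tf  # TF 1.x API

    with tf.gfile.FastGFile(image_path, 'rb') as f:
        image_data = f.read()  # raw JPEG bytes, not a decoded pixel array

    with tf.Session() as sess:
        # Tensors in the loaded graph are fetched by name
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})
    return predictions
```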
Those two steps, loading the graph and running inference, comprise essentially all of the “machine learning code” in the classify_image.py script. Most of the other code pertains to loading the image file into a string, downloading and extracting the zip file that contains the GraphDef file, and translating the indices of the output softmax tensor into human-readable categories. The predictions variable holds a 1×1000 array of the probabilities that the input image belongs to each of the classes. Translating the indices of this array to actual classes is just a matter of using the imagenet_2012_challenge_label_map_proto.pbtxt and imagenet_synset_to_human_label_map.txt files used in the NodeLookup class (unzipped along with the GraphDef file). For simplicity’s sake, I skip that step in my bunny-finding script.
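The top-5 selection itself is just an index sort over that probability array; here is a sketch with a stand-in array in place of a real predictions output:

```python
import numpy as np

# Stand-in for the real predictions output: a fake 1x1000 probability row
rng = np.random.default_rng(0)
fake = rng.random((1, 1000))
predictions = fake / fake.sum()

# classify_image.py does essentially this to get the top 5 class indices,
# which NodeLookup would then translate into human-readable labels
top_5 = predictions[0].argsort()[-5:][::-1]
```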
You can find my bunny-finding script at the gist here.
To run the script, call it from the command line and provide an image URL as an argument.
python bunnyfinder.py image_url
The script instantiates the BunnyFinder class and passes a list of URLs to the BunnyFinder.findbunnies(url_list) method, which returns a tuple containing the list of bunny-containing image URLs and a list of their respective confidences. The script can be tweaked to find any of the ImageNet classes by changing the RABBIT_CATEGORY_IDS array defined at the top of the file to whatever object IDs you would like. As stated before, the label map and human label map text files found in the GraphDef zip file can be used to look up the IDs of objects you might want to detect.
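The category-filtering step can be sketched like this; note that the ID values below are hypothetical placeholders, and the real RABBIT_CATEGORY_IDS values in bunnyfinder.py come from the label map files mentioned above:

```python
import numpy as np

# Hypothetical class IDs for rabbit-like categories; the real values come
# from the label map files bundled with the GraphDef zip
RABBIT_CATEGORY_IDS = [330, 331, 332]

def bunny_confidence(predictions, category_ids=RABBIT_CATEGORY_IDS):
    """Return the best probability among the target categories.

    A sketch of the filtering idea: findbunnies() keeps an image when a
    score like this clears some confidence threshold.
    """
    return max(predictions[0][i] for i in category_ids)

# Usage with a stand-in predictions array
fake = np.zeros((1, 1000))
fake[0, 331] = 0.83
```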
This has been an introductory tutorial on machine learning and neural networks with a focus on seeing the basics in action. From here, I encourage you to explore the documentation on the TensorFlow site. You could also extend the functionality of the bunnyfinder script to detect images of categories not included in the ImageNet classes by retraining Inception-V3 on a new class (here might be a good start).