Amazon Cloud 101

In this post I will explain in relatively simple terms what a neural network is in the context of machine learning. I’ll sprinkle in some machine learning terminology along the way with links to resources you can learn more from.

We’ll start with a simple statement:

A neural network takes input data and transforms it to a desired output format. It can be trained to do this in a way that reflects the real world.

And now we break it down.

What is the input data?

Numbers, essentially. These numbers can be in various forms and represent various things. An image as input data might be a 3-D array of red, green, and blue color channel values:

A three-dimensional array representation of an image with blue, green, and red color channels.
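As a rough sketch of what that looks like in practice, here is a tiny 2x2 image built with NumPy. The pixel values and the red, green, blue channel order are assumptions chosen for illustration; some libraries store channels in blue, green, red order instead.

```python
import numpy as np

# A tiny 2x2 "image": height x width x 3 color channels (red, green, blue).
# Values are 8-bit intensities from 0 to 255 (illustrative data only).
image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
], dtype=np.uint8)

print(image.shape)  # (height, width, channels)

# Slicing out one channel gives a 2-D array of just the red values.
red_channel = image[:, :, 0]
```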


Input data could also be one-dimensional rows of a spreadsheet, database table, or other table-like structure:

UserId   Hours   Heart Rate (BPM)   Temperature (F)
23       8.5     72                 96

A table row that can be treated as a one-dimensional array (vector).
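The same row can be written as a NumPy vector. This is a minimal sketch; dropping the UserId column is an assumption about how such data is often handled, not something the table itself requires.

```python
import numpy as np

# Column names and values taken from the table row above.
columns = ["UserId", "Hours", "Heart Rate (BPM)", "Temperature (F)"]
row = np.array([23, 8.5, 72, 96], dtype=np.float32)

# An identifier like UserId usually carries no predictive signal,
# so this sketch drops it (an assumption, not a rule).
features = row[1:]
```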

What is the desired output format?

Again, it’s just numbers – but more specifically numbers that answer the problem you’re trying to solve. By “format” I just mean the shape of the numbers (array dimensionality and size). If your input data consists of images of different breeds of dogs and you want to know what kind of dog appears in each image, your desired output format could be a number that correlates to a breed (you would assign each breed a number before training). If you also want to know the location of the dog in the image, the output could then be the breed and the pixel coordinates of the top left and bottom right corners of the bounding box containing the dog (a vector of 5 numbers).

How does the neural network transform the input data?

There are several operations (or layers) commonly used to transform data in neural networks. Some operations are incredibly simple, like ReLU, which looks at all of the input values and replaces those that are less than zero with zero. Some are more complex, like convolution layers, which could, for example, slide a small 2D array called a kernel across a 2D input array and compute a weighted sum at each position, similar to a weighted average. Regardless of complexity, all of the layers take numbers in and spit numbers out. Again, these numbers are arrays of varying dimensions.
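Both operations fit in a few lines of NumPy. The sketch below is illustrative only: the function names are made up for this example, the kernel uses equal weights so that it is literally an average, and the loop-based convolution (technically a cross-correlation, which is what deep learning libraries call convolution) favors clarity over speed.

```python
import numpy as np

def relu(x):
    """Replace every negative value with zero."""
    return np.maximum(x, 0)

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each output is a weighted sum of one patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.array([[-1.0, 2.0, 3.0],
              [4.0, -5.0, 6.0],
              [7.0, 8.0, -9.0]])
kernel = np.full((2, 2), 0.25)  # equal weights: a plain 2x2 average

activated = relu(x)                             # negatives become 0
averaged = convolve2d_valid(activated, kernel)  # shape shrinks from (3, 3) to (2, 2)
```

Note that the output shape differs from the input shape, which is typical: layers are free to change the dimensions of the data as it flows through.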

This is an example of how a layer might transform its input. As you can see the shape changes from (100, 100, 3) to (50, 50, 3). The output array of each layer is called its activations.


The sequence of operations is what comprises the internals of the neural network (often called the “hidden layers”). FYI, the use of neural networks is considered “deep learning” when there are a bunch of hidden layers.

A key part of the layers with regard to training is the weights. Some layers have one or more arrays of numbers called weights (their dimensionality depends on the type of layer) that they use to perform their transformation. For example, the kernel in the convolutional layer I mentioned earlier would be included in its weights. The values of these weights can be changed during training as we’ll see later. All the weights across the layers of the network are what allow it to store information and “learn.”


The above basic matrix calculation demonstrates how a weight matrix might be used to transform the input in a neural network layer (this specific example is part of what occurs inside a type of neural network layer called a fully-connected layer). For reference, the input matrix above could represent pixels of an image (black and white in this case since the matrix doesn’t have a third dimension for color channels), or more likely, the output activations of some layer before it.  
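For a concrete version of this kind of calculation, here is a minimal NumPy sketch of a weight matrix transforming an input. The numbers are invented for illustration, and a real fully-connected layer typically also adds a bias vector, which is omitted here.

```python
import numpy as np

inputs = np.array([[1.0, 2.0, 3.0]])   # one sample with 3 values, shape (1, 3)
weights = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])        # shape (3, 2): maps 3 inputs to 2 outputs

# Matrix multiplication: each output is a weighted sum of all the inputs.
activations = inputs @ weights          # shape (1, 2)
```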

So how do you know what hidden layers to use and how to hook them up? To build your network, you need some understanding of the purpose that the different kinds of layers serve and how they might help you extract the important features from your input data and eventually output an answer in the desired form. That said, there are popular neural network architectures that deal with certain input data types and solve particular kinds of problems well, so you often don’t have to start from scratch. For example, image recognition is often done using a ResNet architecture.

How do we train?

To start with, the neural network, a.k.a. a series of matrix operations, doesn’t know how to do much of anything. All of the weights throughout its layers comprise a “brain” of sorts – a means of storing learned information. But those weights start out as random numbers, so this brain has no idea what it’s doing. The network needs to learn what the values of these weights should be.

During training we pass in input data samples to the network a few at a time. For each sample we pass in, we get an output. How do we take that output and learn from it?

Training relies on an error function which is a measure of how much the output (what the neural network produces when fed an input sample) differs from the actual answer. This “measure” is of course just a number. Error functions, also called loss functions, come in many different forms, but they all tell you how wrong the model is with respect to the training data.
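One common error function is mean squared error, sketched below. It is just one of the many forms mentioned above, picked here because it is easy to read; the function name is chosen for this example.

```python
import numpy as np

def mean_squared_error(predicted, actual):
    """Average of the squared differences between outputs and correct answers."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

# The model predicted 2 and 4, but the correct answers were 1 and 6.
loss = mean_squared_error([2.0, 4.0], [1.0, 6.0])
```

The further the predictions are from the answers, the larger the number, and a perfect model scores zero.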

The need for an “actual answer” brings up an important point I skipped over when introducing input data. For each sample of input data you train with, you must have a corresponding correct output data sample. For instance, each picture of a dog in the training data should be labeled with a breed (and possibly coordinates of the bounding box if that’s what you’re interested in). Providing the model with an input data sample and an answer is what makes this approach supervised learning. The network uses these sample and answer pairs to “learn.” That’s the “real world” piece of our original statement about neural networks. We tell the model how the real world behaves through the sample and answer pairs we train it on, and it learns how to emulate them.

So now we know what the model uses to learn, but how does the model use it? As you feed in training samples to the neural network, each of the weights in the layers is slightly incremented or decremented to decrease the value of the error function. How do we know whether increasing or decreasing a specific weight will increase or decrease the error? This is possible because the error function and the layer operations are differentiable. In other words, you can take the function’s derivative with respect to a weight and determine if increasing or decreasing that weight will decrease or increase the error. If you’ve tuned out at this point because “derivative” is too much calculus you don’t remember, just think of the learning process as tweaking each of the weights up and down a small amount and keeping the “tweak” that decreases the error. That’s the basic idea behind stochastic gradient descent and backpropagation: scary machine learning terms that describe a relatively simple process. Little by little, sample by sample, this is how the network learns. Provide an input sample to the network, transform it to the output format, calculate the error based on what the output was supposed to be, calculate derivatives (backpropagation), increment or decrement each of the neural network weights, and repeat. Slightly simplified, but this is the basic idea of how neural networks are trained.
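To make that loop concrete, here is a deliberately tiny sketch: a "network" with a single weight that learns to multiply its input by 3. The function name, learning rate, and data are all assumptions for illustration; real frameworks compute the derivatives automatically across millions of weights.

```python
def train_single_weight(samples, learning_rate=0.05, epochs=50):
    """Learn w in prediction = w * x by nudging w against the error's derivative."""
    w = 0.0  # arbitrary starting value; real networks start from random weights
    for _ in range(epochs):
        for x, y in samples:
            prediction = w * x
            # Derivative of the squared error (w*x - y)**2 with respect to w.
            grad = 2 * (prediction - y) * x
            # Step in the direction that decreases the error.
            w -= learning_rate * grad
    return w

samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by y = 3x
learned_w = train_single_weight(samples)
print(learned_w)
```

The sign of the derivative says which direction to move the weight, and the learning rate controls how big each step is.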

Once all of the training input data samples have been passed through the neural network once, a training epoch has completed. Training usually involves several epochs, and after each one the samples are usually shuffled.
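The epoch structure can be sketched as a small generator that shuffles the samples and hands them out a few at a time. The function name, batch size, and seed are hypothetical; this sketch shuffles at the start of each epoch, which has the same effect as shuffling after the previous one.

```python
import random

def iterate_epochs(samples, num_epochs, batch_size, seed=0):
    """Yield (epoch, batch) pairs, reshuffling the samples for every epoch."""
    rng = random.Random(seed)
    samples = list(samples)
    for epoch in range(num_epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            yield epoch, samples[start:start + batch_size]

# 10 samples in batches of 4 gives 3 batches per epoch (4 + 4 + 2).
batches = list(iterate_epochs(range(10), num_epochs=2, batch_size=4))
```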

This whole training process is automated (the grunt work math is buried in tools like TensorFlow, PyTorch, etc.), but it’s not foolproof. You won’t get a quality model by just training any neural network on any data. Here are a few of the challenges that machine learning engineers face to get a worthwhile model:

  • Choosing the right neural network architecture and adjusting it if need be (swapping out layers, introducing additional layers, etc.)
  • Providing the model with quality training data to learn from (problems such as missing data or too little data all hurt the result)
  • Tuning a myriad of training parameters and model parameters (the training process has tunable values such as the learning rate and many layers have some as well)
  • Changing the training schedule (how many epochs, varying the learning rate as training proceeds, temporarily “freezing” layers so their weights don’t get updated)

After the error is satisfactorily minimized, the neural network can be put to the test on real world data. If it performs well on real world input data it has never seen before, we say that the model generalizes well. If it performs poorly, the model has probably overfit to the training data. In the example of recognizing dog breeds, if the only Beagle images the model saw during training were those of Beagles panting, an overfit model may not be able to generalize to images of Beagles with closed mouths. The model has “learned” that if a dog isn’t panting it’s not a Beagle, but that’s a detail in the training images we didn’t necessarily want it to pick up on. It has overfit and learned too much from the training data, reducing its ability to generalize to the vast variety of Beagle images in the real world. We can’t be expected to train it with pictures of Beagles in every possible orientation, so we want it to learn enough about Beagles to distinguish them from other breeds, but not too much.

The beauty of neural networks is that the right architecture with the right training and model parameters will learn the important features and interactions from the input data on its own. It’s not magic – it’s just math, and there is certainly work required to tune neural networks correctly. However, manually pinpointing the important features and interactions in a data set and translating them into an algorithm is considerably more work than tuning some parameters and letting a neural network do its thing.

This is by no means everything you need to know about neural networks, but it hopefully provides a handhold for someone interested in how they work. There are a lot of details that were not covered here and I encourage you to dive deeper into the types of neural network architectures, what they’re used for, and how the training process works.

At the end of next month, I’ll be hanging up my project manager hat and embarking on a new journey. Over the past year, I’ve been balancing my role here at Oak City Labs with graduate school as I pursue my Master of Arts in Teaching degree from Meredith College. Though I’ve spent the better part of the past decade working in the technology industry, I felt a shift in my goals and interests and, with the amazing support of the Oak City Labs leadership team, I decided to make a career change.

Why tell you this?

Because in some ways, I’m not really leaving the technology industry at all. I’ll just be applying my skills in a different way to a different set of “clients” (read: elementary school students).

In my time spent in graduate school and in field placement positions this past year, it has become increasingly clear to me that there is more of a need for globally-minded, technologically-equipped educators than ever before. The reality is that educators need to be preparing students for jobs that don’t even exist yet. Yes, you read that correctly. According to the World Economic Forum, 65% of students entering elementary school now (aka my future “clients” if you will) will hold jobs that don’t even exist yet. And that data is two years old. The numbers have likely increased since then.

I think about some of my recent blog posts on artificial intelligence, machine learning, computer vision and machine vision. As cutting edge as these technologies are, odds are they will have significantly evolved by the time current primary and younger secondary school students graduate high school in 8-10 years. Therefore, instead of preparing students for specific jobs, we are charged with preparing students with skill sets that will grow with them as this world also grows.

Figuratively, that preparation is a multi-layered, interdisciplinary approach to learning beginning with the earliest grades through high school graduation. It looks different for every student and every teacher. Practically, that preparation begins with integrating meaningful technology in the classroom, expanding student learning through social studies and science, as well as enhancing student understanding through the arts.

The hope is through all of our efforts, we’ll prepare students not for the jobs that artificial intelligence will certainly replace, but for new jobs that work alongside artificial intelligence. While machine vision may eliminate the need for a factory worker to inspect products, machine vision will certainly create the need for software engineers to manage the inspection system. And desirable software engineers will need to possess specific skills in technology, along with soft skills like critical thinking/problem solving, collaboration, communication and creativity/innovation.

So I leave this role, company and industry with a lot of change ahead, but I’m hopeful that my efforts will foster students that are better prepared for those jobs that don’t even exist yet. And maybe even some future employees of Oak City Labs.

PS – Did you hear? We’re currently on the hunt for a project manager and software developer. Check out our Careers page for more information and job details.

In my last post I shared how to train an image classifier on your own images using the fastai library. To continue our machine learning journey, we will look today at how to train a model on structured data like that found in a database or a spreadsheet. Specifically, we will look at how to turn a list of categorical and continuous variables into a prediction for a single response variable.

For example, say you have a spreadsheet with columns Month, State, Average Temperature, and Precipitation in inches. You want to be able to predict Precipitation in inches for a specific Month, State, and Temperature. In this case Month and State would be categorical variables (while their values can be represented as numbers, their relationships to each other are likely more complex than a one-dimensional continuous spectrum can represent), Temperature would be a continuous variable (because it’s a floating point number), and the precipitation in inches would be the response variable.

The code used in this post is based on Lesson 4 of the fast.ai deep learning course. You will need the fastai library to run it.

We will use the pandas library to load and manipulate our data into the proper format. It’s already imported in fastai for us.

You may need to clean up your dataframe and/or add some additional features (columns). I’ll leave that up to you, but here are the docs for the pandas library to get you started.

We need to convert our categorical variables into category types in the dataframe as well as standardize our continuous variable types to float32:
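A minimal sketch of that conversion using plain pandas is shown here. The dataframe contents and the cat_vars and cont_vars lists are hypothetical, modeled on the Month, State, Temperature, and Precipitation example; substitute your own column names.

```python
import pandas as pd

# Hypothetical data matching the spreadsheet example in this post.
df = pd.DataFrame({
    "Month": [1, 2, 1, 2],
    "State": ["NC", "NC", "VA", "VA"],
    "Temperature": [41.5, 44.0, 36.2, 39.1],
    "Precipitation": [3.5, 3.3, 3.1, 2.9],
})

cat_vars = ["Month", "State"]   # categorical variables
cont_vars = ["Temperature"]     # continuous variables

# Categorical columns become ordered pandas categories.
for col in cat_vars:
    df[col] = df[col].astype("category").cat.as_ordered()

# Continuous columns are standardized to 32-bit floats.
for col in cont_vars:
    df[col] = df[col].astype("float32")
```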

In order to train the neural network, the dataframe must contain only numeric values and we must separate out the response variable (the variable we are interested in training the model to calculate).

This proc_df function separates out our “y” variable and converts all columns of the dataframe to a numeric type.
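To show the idea without depending on fastai, here is a pandas-only stand-in for that step. It is not the library's implementation: the function name is made up, and shifting the category codes by one so that 0 is left free for missing values is an assumption of this sketch.

```python
import pandas as pd

def split_and_numericalize(df, y_col):
    """Return (features, y) with categorical columns replaced by integer codes."""
    df = df.copy()
    y = df[y_col].to_numpy()          # pull out the response variable
    df = df.drop(columns=[y_col])
    for col in df.columns:
        if str(df[col].dtype) == "category":
            # pandas codes start at 0 (-1 for missing); shift so missing maps to 0.
            df[col] = df[col].cat.codes + 1
    return df, y

# Hypothetical, already-typed data.
df = pd.DataFrame({
    "State": pd.Series(["NC", "VA", "NC"], dtype="category"),
    "Temperature": pd.Series([41.5, 36.2, 44.0], dtype="float32"),
    "Precipitation": [3.5, 3.1, 3.3],
})
features, y = split_and_numericalize(df, "Precipitation")
```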

Now we get the indexes for our validation set. Note that my example data here is time series data, so instead of randomly sampling the data set to get the validation set I take the last 25% of it. This ensures that the model will get trained to predict future values instead of random values interspersed along the timeline.
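Taking the last 25% can be sketched in a couple of lines. The helper name is hypothetical, and the sketch assumes the rows are already sorted chronologically.

```python
def last_fraction_indexes(n_rows, fraction=0.25):
    """Indexes of the final `fraction` of rows, assuming chronological order."""
    n_val = int(n_rows * fraction)
    return list(range(n_rows - n_val, n_rows))

# With 100 rows, the validation set is rows 75 through 99.
val_idx = last_fraction_indexes(100)
```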

Next the model is constructed and trained.

I will briefly go through how the learner is created and what the parameters to the get_learner function mean. fastai assembles a model consisting of several fully connected layers based on the data you provide to the get_learner method. Here is the definition of get_learner from the source code and a breakdown of some of the parameters:

def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops, y_range=None, use_bn=False, **kwargs)

  • emb_szs – a list of tuples that specifies how the embedding matrices for the categorical variables will be structured. Embedding matrices essentially take each category of a categorical variable and represent it as a vector of numbers. All categories in a categorical variable have vectors of equal length and the combination of all of the category vectors forms the embedding matrix for that variable. These vectors become part of the neural network “weights” that get trained through backpropagation and stochastic gradient descent, resulting in a rich representation of each category as a point in multidimensional space. The tuples hold the number of categories and the length of the embedding vectors as their two terms. The tuples are listed in the same order as the cat_vars variable defined earlier and passed to ColumnarModelData.from_data_frame.
  • n_cont – the number of continuous variables.
  • emb_drop – the dropout rate for the embedding matrices. Dropout is a technique used to avoid overfitting by introducing some randomness during training: on each pass, a random subset of values is “dropped” by setting them to 0. In this case a dropout rate of 0.4 means that each embedding value has a 0.4 chance of being zeroed out.
  • out_sz – the number of output variables of the model. We just have one, the response variable.
  • szs – an array of sizes of the model’s fully connected layers. The number of sizes determines the number of layers the model will have.
  • drops – an array of dropout rates for each fully connected layer.
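To make emb_szs less abstract, the sketch below builds the list of tuples and looks up embedding vectors with NumPy. The cardinalities are hypothetical, and the sizing rule (half the number of categories, capped at 50) is a rule of thumb assumed for this example rather than a requirement. In a real model the matrices would be trained; here they are just random numbers.

```python
import numpy as np

# Hypothetical cardinalities: 12 months and 50 states.
cat_sizes = {"Month": 12, "State": 50}

# One (number of categories, embedding vector length) tuple per variable.
emb_szs = [(n, min(50, (n + 1) // 2)) for n in cat_sizes.values()]

# Each embedding matrix has one row (vector) per category.
rng = np.random.default_rng(0)
embedding_matrices = [rng.normal(size=(n, width)) for n, width in emb_szs]

# One input row: Month index 3 and State index 7.
row = [3, 7]

# Look up each category's vector and join them into one input vector.
embedded = np.concatenate([m[i] for m, i in zip(embedding_matrices, row)])
```

The concatenated vector, together with the continuous variables, is what the fully connected layers receive as input.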

To use your model to get a prediction after training it, you can use the following code

Here cat_values is a 2D array with column values that match the categorical variable indices (make sure the columns are in the same order as in your pandas dataframe) and cont_values is the same thing just with the continuous variables. Your single number prediction should be in the pred numpy array. You can also pass in multiple rows to get multiple predictions at once.

As an aside, you can get the categorical variable category indices by looking at the attribute, where column_name is the name of the column you’re interested in and df is the dataframe (this will only work after the column has been set as categorical with df[column_name].astype('category').cat.as_ordered()). Just set the correct column in the cat_values array to the index of the category you want.
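One way to look up those indices with plain pandas is the cat.categories accessor, sketched below on hypothetical data. Whether this matches what the model expects is an assumption: if your preprocessing shifted the codes (for example by adding 1 to reserve 0 for missing values), apply the same shift here.

```python
import numpy as np
import pandas as pd

# Hypothetical column, converted the same way as during training.
df = pd.DataFrame({"State": ["NC", "VA", "GA", "NC"]})
df["State"] = df["State"].astype("category").cat.as_ordered()

# Categories are stored in sorted order; a category's position is its index.
categories = list(df["State"].cat.categories)
state_index = categories.index("NC")

# A single-row, single-column array of categorical inputs.
cat_values = np.array([[state_index]])
```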

Feature engineering is arguably the hardest part of training any model. Getting the best model will rely heavily on how relevant the columns of data you have are to the problem you are trying to solve. If your model is performing poorly, rethink what data you’re training it on and experiment with extracting new features from your data and introducing other data that might be correlated.

Today we’re back again to share the basics of two not-so-emergent technology concepts, computer vision and machine vision, and to break down the differences between them.

It is possible that you’re reading this blog and have never heard of computer vision or machine vision. The concepts are well-known and discussed within the technology world, but the same can’t be said for the general public. Despite that unfamiliarity, the general public is already experiencing computer vision and machine vision in ways they may be surprised by. Read on to learn more.

Computer Vision

Computer vision falls under the Artificial Intelligence umbrella just like machine learning does. The goal is to utilize computers to acquire, process, analyze and understand digital images or videos. For instance, computer vision is being employed when a train station has a computer use security camera footage to count the number of people entering and exiting instead of manually counting with a turnstile. Or computer vision is in use when driverless cars use a live video feed to make decisions about turning, braking, speed, etc.

Have you seen the augmented reality capabilities from IKEA? The company encourages you to use your device to video your living room and then they virtually place sofas, coffee tables and chairs in real time for your consideration before making the big purchase. That’s possible because of computer vision. Summed up, computer vision is attempting to use a computer to emulate the human eye, visual cortex and brain when acquiring, processing, analyzing and understanding images.

We’ve talked about computer vision before here on our blog.

Machine Vision

When computer vision is put in place in an industrial (and sometimes non-industrial) setting to inform operations and functions of a machine, we call that machine vision. An inspection task at a manufacturing facility once performed by humans can now be performed by machine vision.

Machine vision is in use when a machine at a manufacturing facility scans (read: computer vision) a bottle to ensure that the liquid product (like cleaning solution, soda, medicine, etc.) is correct, the fill level is correct, the container is free of flaws, the correct label is placed (and placed straight!), the expiration date is correct, etc. When one or more of these conditions isn’t met, machine vision has logic in place to tell the production line to reject the item. The beauty of machine vision is that all of the sample analyses I gave above are performed by one machine, with a high degree of accuracy, over and over again.

At Oak City Labs, our mission is to help businesses and organizations solve daily problems with technology. Computer vision and machine vision are excellent ways to accomplish that. Do you have a problem that you need help solving? If so, let us know! We’d love to chat.

At Oak City Labs, we enjoy solving all kinds of problems. Our projects span subject areas from IoT to mining data from social media to integrating video capture hardware. One of my favorite projects we’ve worked on recently involves computer vision and real-time video analysis of data from a medical device.

Our client, Altaravision, “has developed the most portable, high-definition endoscopic imaging system on the market today”, called NDŌʜᴅ. A Fiberoptic Endoscopic Evaluation of Swallowing or FEES system like this allows a medical professional to observe and record a patient swallowing food. The NDŌʜᴅ system is portable and uses an application running on a MacBook to display the endoscope feed in real time and record the swallowing test to a video file.

After the test is completed on the patient, the video is reviewed to evaluate the efficiency of swallowing. Ideally, the patient will swallow all of the food, but a range of conditions can result in the patient being unable to adequately swallow all the material. Particles that aren’t swallowed may be aspirated and cause pneumonia. When reviewing the test footage, the test administrator has traditionally had to carefully estimate the amount of residual material after swallowing. Not only is this extremely time-consuming, it also introduces human error and compromises the reproducibility of results.

Oak City Labs has been working with Altaravision to tackle this problem. How can we remove the tedious aspect from the FEES test and make the results available faster and with better consistency? As with all our automation projects, we’d like a computer to handle the boring, repetitive parts of the process. Using computer vision techniques, we’d like the NDŌʜᴅ application to process each frame of the FEES test footage, categorize pixels by color and produce a single numerical value representing the residual food material left in the throat after swallowing. We should give the user this feedback in real-time as the test is being performed.

The NDŌʜᴅ application runs on macOS, so we can leverage Core Image (CI) as the basis for our computer vision solution. CI provides an assortment of image processing filters, but the real power lies in the ability to write custom filters. A pair of these custom filters will solve the core of our problem.

Our first task is to remove the very dark and the very bright portions of our image. We’ll ignore the dark portions because we just can’t see them very well, so we can’t classify their color. Very bright portions of the image are just overlit by our camera and we can’t really see the color there either. Our first custom filter looks at each pixel in the image and evaluates its position in color space with respect to the line from absolute black to absolute white. Anything close enough to this grey line should be ignored, so we set it to be transparent. After some testing, it turned out that it was difficult to pick a colorspace distance threshold that worked well at the light end and the dark end, so we use a different value at each end of the grey spectrum and linearly interpolate between the two.

Throat no filter
Throat transparent filter

The top image is the original image data. The lower image is the image after the bright and dark areas have been removed. In particular, the dark area, deeper down the throat, in the bottom center has been filtered out as well as the camera light’s bright reflection in the top right corner.
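The logic of this first filter can be sketched in NumPy. This is not the Core Image kernel used in the application; it is an illustration of the same idea under assumed conventions: RGB values scaled to the range 0 to 1, a made-up function name, and placeholder thresholds for the dark and light ends.

```python
import numpy as np

def mask_near_grey(rgb, dark_threshold=0.10, light_threshold=0.25):
    """Return an alpha mask (1 = keep, 0 = transparent) for an H x W x 3 image in [0, 1]."""
    # Position along the black-to-white line: 0 at black, 1 at white.
    position = rgb.mean(axis=-1)
    # The closest point on the grey line has all three channels equal to the mean.
    nearest_grey = position[..., None] * np.ones(3)
    distance = np.linalg.norm(rgb - nearest_grey, axis=-1)
    # Linearly interpolate the cutoff between the dark-end and light-end values.
    threshold = dark_threshold + (light_threshold - dark_threshold) * position
    return (distance >= threshold).astype(np.float32)

image = np.array([
    [[0.02, 0.02, 0.03], [0.98, 0.97, 0.99]],  # nearly black, nearly white
    [[0.80, 0.20, 0.20], [0.10, 0.70, 0.20]],  # reddish, greenish
])
alpha = mask_near_grey(image)
```

The two near-grey pixels end up transparent while the strongly colored pixels are kept.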

Now that we have an image with only the interesting colors remaining, we can classify each pixel based on color. In a FEES test, the food is dyed blue or green to help distinguish it from the throat. We need our second pass filter to separate out the reddish pixels from the blueish and greenish pixels. In our second custom CI filter, we examine each pixel and classify it as either red, green or blue by looking at its colorspace distance from the absolute red, green and blue tips of the color cube. We convert each pixel to its corresponding nearest absolute color.

Throat no filter
Throat color filter

The top image is the original image. The bottom image is the fully processed image, sorted into red and green (no blue pixels in this example). Note how the green areas visually match up against the residual material in the original image.
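The classification step can be sketched the same way in NumPy. Again, this illustrates the idea rather than the actual Core Image filter; the function names are hypothetical and RGB values are assumed to lie between 0 and 1.

```python
import numpy as np

PRIMARIES = np.array([[1.0, 0.0, 0.0],   # absolute red
                      [0.0, 1.0, 0.0],   # absolute green
                      [0.0, 0.0, 1.0]])  # absolute blue

def classify_pixels(rgb):
    """For each pixel of an H x W x 3 image, return the index of the nearest primary."""
    # Broadcasting yields one distance per pixel per primary: shape (H, W, 3).
    distances = np.linalg.norm(rgb[..., None, :] - PRIMARIES, axis=-1)
    return distances.argmin(axis=-1)

def snap_to_primaries(rgb):
    """Replace each pixel with its nearest absolute color."""
    return PRIMARIES[classify_pixels(rgb)]

image = np.array([[[0.8, 0.2, 0.2], [0.1, 0.7, 0.2]]])  # reddish, greenish
labels = classify_pixels(image)    # 0 = red, 1 = green, 2 = blue
snapped = snap_to_primaries(image)
```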

Finally, our image has been fully processed. Transparent pixels are ignored and every remaining pixel is either absolute blue, red or green. Now we use vImage from Apple’s very powerful Accelerate Framework to build a histogram of color values. Using this histogram data, we can easily compute our residual percentage as simply the sum of the green and blue pixel counts over the total number of non-transparent pixels (red + green + blue). This residual value is our single numerical representation of the swallowing efficiency for this frame of data.
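The residual calculation itself is simple arithmetic on the histogram counts. The sketch below uses NumPy's bincount in place of vImage; the label encoding (0 for red, 1 for green, 2 for blue) and returning 0.0 for a frame with no visible pixels are assumptions of this example.

```python
import numpy as np

RED, GREEN, BLUE = 0, 1, 2

def residual_fraction(labels, alpha):
    """(green + blue) / (red + green + blue), counting only non-transparent pixels."""
    visible = labels[alpha > 0]
    counts = np.bincount(visible, minlength=3)  # histogram of the three colors
    total = counts.sum()
    if total == 0:
        return 0.0  # nothing visible in this frame (assumed convention)
    return float((counts[GREEN] + counts[BLUE]) / total)

labels = np.array([[RED, RED, RED],
                   [GREEN, BLUE, GREEN]])
alpha = np.array([[1, 1, 0],
                  [1, 1, 0]])  # the last column is transparent
residual = residual_fraction(labels, alpha)
```

Here two of the four visible pixels are green or blue, so the residual for the frame is one half.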

In this process, we’ve been very careful to use high performance and highly optimized tools to ensure our solution can perform in real-time. The Core Image framework, including our custom filters, takes advantage of graphics hardware to run very, very quickly. Likewise, vImage is heavily optimized for graphics operations. We also use a little bit of the Metal API to display our CI images on screen, which is very speedy as well. While we’re enhancing NDŌʜᴅ on macOS, these tools are also quite fast on iOS.

At Oak City Labs, we love challenging problems. Working with real-time video processing for a medical imaging device has been particularly fun. As Altaravision continues to push NDŌʜᴅ forward, we look forward to discovering new challenges and innovating new solutions.