In my last post I shared how to train an image classifier on your own images using the fastai library. To continue our machine learning journey, today we will look at how to train a model on structured data like that found in a database or a spreadsheet. Specifically, we will look at how to turn a list of categorical and continuous variables into a prediction for a single response variable.
For example, say you have a spreadsheet with columns Month, State, Average Temperature, and Precipitation in inches, and you want to predict Precipitation in inches for a specific Month, State, and Temperature. In this case Month and State would be categorical variables (while their values can be represented as numbers, their relationships to each other are likely more complex than a one-dimensional continuous spectrum can represent), Temperature would be a continuous variable (because it's a floating point number), and Precipitation in inches would be the response variable.
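To make this concrete, here is a tiny hypothetical dataframe in that shape (the column names match the example above; the values are made up for illustration):

```python
import pandas as pd

# Hypothetical weather data: Month and State are categorical,
# Temperature is continuous, and Precipitation (in inches) is the
# response variable we want to predict.
df = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
    'State': ['OH', 'CA', 'OH', 'CA'],
    'Temperature': [28.5, 57.1, 31.0, 59.4],
    'Precipitation': [2.1, 3.4, 1.8, 3.9],
})
```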
The code used in this post is based on Lesson 4 of the fast.ai deep learning course. You will need the fast.ai library to run it.
We will use the pandas library to load and manipulate our data into the proper format. It’s already imported in fastai for us.
You may need to clean up your dataframe and/or add some additional features (columns). I’ll leave that up to you, but here are the docs for the pandas library to get you started.
We need to convert our categorical variables into category types in the dataframe as well as standardize our continuous variable types to float32:
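A minimal sketch of that conversion, assuming hypothetical cat_vars and contin_vars lists for the weather example (substitute your own column names):

```python
import pandas as pd

# Hypothetical column lists; use your own dataframe's columns here.
cat_vars = ['Month', 'State']
contin_vars = ['Temperature']

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Jan'],
    'State': ['OH', 'CA', 'CA'],
    'Temperature': [28, 59, 31],          # loaded as ints
    'Precipitation': [2.1, 3.9, 1.8],
})

# Mark the categorical columns as ordered category types...
for v in cat_vars:
    df[v] = df[v].astype('category').cat.as_ordered()

# ...and standardize the continuous columns to float32.
for v in contin_vars:
    df[v] = df[v].astype('float32')
```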
In order to train the neural network, the dataframe must contain only numeric values, and we must separate out the response variable (the variable we are training the model to predict).
This proc_df function separates out our “y” variable and converts all columns of the dataframe to a numeric type.
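proc_df comes from the fastai library; a rough pure-pandas stand-in (which ignores proc_df's additional handling of missing values and scaling) shows the core idea:

```python
import pandas as pd
import numpy as np

def simple_proc_df(df, y_fld):
    """Rough sketch of what proc_df does: split off the response
    variable and turn every remaining column into numeric codes.
    (The real proc_df also fills missing values and can scale data.)"""
    y = df[y_fld].values
    df = df.drop(y_fld, axis=1).copy()
    for col in df.columns:
        if df[col].dtype.name == 'category':
            # fastai uses codes + 1 so that 0 can mean "missing"
            df[col] = df[col].cat.codes + 1
    return df, y

df = pd.DataFrame({
    'Month': pd.Categorical(['Jan', 'Feb', 'Jan']),
    'Temperature': np.array([28.5, 59.0, 31.0], dtype='float32'),
    'Precipitation': [2.1, 3.9, 1.8],
})
x, y = simple_proc_df(df, 'Precipitation')
```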
Now we get the indexes for our validation set. Note that my example data here is time series data, so instead of randomly sampling the data set to get the validation set I take the last 25% of it. This ensures that the model will get trained to predict future values instead of random values interspersed along the timeline.
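For example, with a simple slice (assuming the rows are already sorted chronologically; n is a hypothetical row count, in practice len(df)):

```python
# Hypothetical row count; in practice use len(df) after preprocessing.
n = 1000

# Hold out the most recent 25% of rows as the validation set so the
# model is validated on "future" data it has not trained on.
val_idx = list(range(int(n * 0.75), n))
```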
Next the model is constructed and trained.
I will briefly go through how the learner is created and what the parameters to the get_learner function mean. fast.ai assembles a model consisting of several fully connected layers based on the data you provide through ColumnarModelData.from_data_frame. Here is the definition of get_learner from the fast.ai source code and a breakdown of some of the parameters:
def get_learner(self, emb_szs, n_cont, emb_drop, out_sz, szs, drops, y_range=None, use_bn=False, **kwargs)
- emb_szs – a list of tuples that specifies how the embedding matrices for the categorical variables will be structured. Embedding matrices essentially take each category of a categorical variable and represent it as a vector of numbers. All categories in a categorical variable have vectors of equal length and the combination of all of the category vectors forms the embedding matrix for that variable. These vectors become part of the neural network “weights” that get trained through backpropagation and stochastic gradient descent, resulting in a rich representation of each category as a point in multidimensional space. The tuples hold the number of categories and the length of the embedding vectors as their two terms. The tuples are listed in the same order as the cat_vars variable defined earlier and passed to ColumnarModelData.from_data_frame.
- n_cont – the number of continuous variables.
- emb_drop – the dropout rate for the embedding matrices. Dropout is a technique used to avoid overfitting and introduce some randomness by taking some of the model’s weights and “dropping” them, setting their value to 0. A dropout rate of 0.4, for example, means that each embedding weight has a 0.4 chance of being zeroed out.
- out_sz – the number of output variables of the model. We just have one, the response variable.
- szs – an array of sizes of the model’s fully connected layers. The number of sizes determines the number of layers the model will have.
- drops – an array of dropout rates for each fully connected layer.
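Putting these pieces together, here is a sketch of how emb_szs might be assembled, using the rule of thumb from the fast.ai course (embedding width of roughly half the cardinality, capped at 50). The commented-out get_learner call requires the fast.ai library and a ColumnarModelData object md, so treat it as illustrative rather than runnable as-is:

```python
import pandas as pd

cat_vars = ['Month', 'State']  # hypothetical categorical columns
df = pd.DataFrame({
    'Month': pd.Categorical(['Jan', 'Feb', 'Mar', 'Jan']),
    'State': pd.Categorical(['OH', 'CA', 'OH', 'NY']),
})

# Cardinality of each categorical variable (+1 leaves a slot for "unknown").
cat_szs = [(v, len(df[v].cat.categories) + 1) for v in cat_vars]

# Course rule of thumb: embedding width is about half the cardinality,
# capped at 50.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_szs]

# The learner itself would then be built along these lines (fastai 0.7,
# with md built via ColumnarModelData.from_data_frame):
# m = md.get_learner(emb_szs, n_cont=1, emb_drop=0.04, out_sz=1,
#                    szs=[1000, 500], drops=[0.001, 0.01])
```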
To use your model to get a prediction after training it, build a 2D array cat_values whose columns hold the categorical variable indices (make sure the columns are in the same order as in your pandas dataframe), along with a matching 2D array cont_values for the continuous variables, and pass both to the trained learner. Your single-number prediction comes back in a numpy array, and you can pass in multiple rows to get multiple predictions at once.
As an aside, you can get a categorical variable’s category indices by looking at the df.column_name.cat.categories attribute, where column_name is the name of the column you’re interested in and df is the dataframe (this will only work after the column has been made categorical with df[column_name].astype('category').cat.as_ordered()). Just set the correct column in the cat_values array to the index of the category you want.
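A sketch of that lookup and of assembling the input arrays (the final prediction line is commented out because it needs a trained fastai learner m; predict_array is the fastai 0.7 method for this, so verify against your version):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'],
                   'Temperature': [28.5, 31.0, 44.2]})
df['Month'] = df['Month'].astype('category').cat.as_ordered()

# Look up the integer index a category maps to. Note that pandas sorts
# the categories, so 'Mar' is not necessarily at position 2 of the
# original column order.
month_idx = df.Month.cat.categories.get_loc('Mar')

# Build the input arrays for a single prediction (offset the code by 1
# to match the codes + 1 convention, where 0 is reserved for missing).
cat_values = np.array([[month_idx + 1]])
cont_values = np.array([[44.2]], dtype=np.float32)

# With a trained fastai learner m, prediction would look like:
# pred = m.predict_array(cat_values, cont_values)
```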
Feature engineering is arguably the hardest part of training any model. Getting the best model will rely heavily on how relevant the columns of data you have are to the problem you are trying to solve. If your model is performing poorly, rethink what data you’re training it on, experiment with extracting new features from your data, and try introducing other data that might be correlated.