Welcome to part 2 of our Image Recognition with Neural Networks on Cloud TPUs series. In the previous article we made the necessary preparations for the 3 important processes that we will run through in this part: training, evaluation of the trained model and finally prediction on “previously unseen” data!

You may think that there is quite a lot to neural networks. And to be honest, it’s true! But the basics that will allow you to solve the most common problems are not that many. Most of the network parameters are just something you have to experiment with. A lot of the time you’ll be making what are called educated guesses: tweaking layers and test data with, say, 75% certainty that the change will improve on the previous configuration. You can rarely be 100% sure what the impact will be on your network. Neural networks are simply too complex a beast.

So don’t get discouraged if you don’t fully understand all the code in this and the previous article. Just keep googling and experimenting until you do. I promise you there’s nothing quite like the feeling of building a neural network that finds answers where classical algorithms (and people, for that matter) simply break.

Let’s get to work!

Alright, with the proper introductions out of the way, we can continue where we left off. We’ve already prepared the test data and configured the layers of our neural network model. What we have to do next is compile our model. Nothing actually gets compiled here the way a computer program is. We are only adding crucial bits of information on how our model will train to improve its accuracy. There are 3 important choices (input parameters for the compilation function) to make here:

  • Loss Function. This is a function that will execute at the end of each single data pass through the model. It measures the quality of the model by comparing the expected result (Y_DataSet) with the one that the network predicted on its own. As you can imagine, we expect this difference to get smaller as the network learns from the training data.
  • Optimizer. This is an algorithm that is executed once the loss function tells us whether we did better or worse than on previous passes. The optimizer then updates the weights of the connections between neurons, layer by layer, trying to “punish” connections that lowered the overall accuracy and “reward” those that helped the network predict more accurately.
  • Metrics. This is what you as a data scientist will see as the output of training and use to decide whether the changes you made to your network had a positive or negative impact. You can pick one or several of the available metrics, which are calculated by functions very much like the loss function (but their result does not affect model training).
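To make the difference between a loss function and a metric more tangible, here is a minimal sketch in plain Python. It is not the Keras implementation, just my hand-written version of the idea behind sparse_categorical_crossentropy and an accuracy metric. The labels and predicted probabilities are made up for illustration.

```python
import math

def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean cross-entropy for integer class labels.

    y_true: list of integer class indices (the expected results).
    y_pred: list of probability lists, one per sample (the network outputs).
    """
    eps = 1e-7  # keeps log() away from zero
    losses = [-math.log(max(probs[label], eps))
              for label, probs in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    hits = sum(1 for label, probs in zip(y_true, y_pred)
               if probs.index(max(probs)) == label)
    return hits / len(y_true)

# Hypothetical labels and predictions: three samples, three classes.
y_true = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]

loss_confident = sparse_categorical_crossentropy(y_true, confident)
loss_unsure = sparse_categorical_crossentropy(y_true, unsure)

print("confident:", loss_confident, accuracy(y_true, confident))
print("unsure:   ", loss_unsure, accuracy(y_true, unsure))
```

Both sets of predictions pick the right class every time, so the accuracy metric is identical, yet the loss is much lower for the confident predictions. That is why the optimizer works with the loss while you mostly watch the metrics.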

Sometimes network_model.summary(), which prints out the layer configuration, simply isn’t enough to visualize the structure of your model. In these cases the plot_model function comes in handy. It creates an image of your layers, including the number of inputs and outputs for each layer.

Talk to the Keras

Now, before our model can be trained/fitted, we need to tell Keras and the underlying TensorFlow which hardware to use as its computational platform. You don’t talk to a CPU, GPU and TPU in the same way, so here we use a helper method to prepare our model to be executed on a TPU. This pretty much wraps our model in an object containing some information helpful when working on TPUs, like the “computation distribution strategy”. We use a TPU cluster made available to us courtesy of GoogleColab. It’s only 1 TPU instance with 8 cores (at the time of writing this notebook), but it’s free.

The helper method

Next we use the helper method we created earlier to load 4 sets of data (well, 2 really, but as you remember we split the test and train datasets into test x/y and train x/y). Finally we train our model by passing the training data to it! The model will use the selected optimizer combined with the loss function to try to improve on each data pass. As you can see in the output plotted on an image lower in this article, the loss function result gets smaller with each epoch. This means the network is improving, because the loss measures the difference between the predicted value and the known/correct value from Y_Dataset (it is often read loosely as a percentage, so 0.08 as 8%, although strictly speaking a cross-entropy loss is a score and the accuracy metric is what counts wrong answers). The output should show that the first pass, which always starts from randomized weights, is right pretty much half of the time, but this quickly improves as the neuron weights are adjusted by the optimizer, down to about 4% of mistakes in predictions!
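If you would like to see the “loss gets smaller with each epoch” effect without any framework, the sketch below trains a single neuron with plain gradient descent. This is a simplified stand-in, not what Keras does inside fit(): the data is a made-up toy set, the optimizer is ordinary gradient descent rather than Adagrad, and the weights start at zero instead of random values so the run is reproducible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: one feature per sample, label 0 or 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0        # fixed start instead of random weights, for reproducibility
learning_rate = 0.1
history = []           # loss recorded at the start of every epoch

for epoch in range(50):
    # Forward pass: the neuron's prediction for every sample.
    preds = [sigmoid(w * x + b) for x in xs]
    # Loss: mean binary cross-entropy between predictions and labels.
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, preds)) / len(xs)
    history.append(loss)
    # Gradients of the mean cross-entropy with respect to w and b.
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
    # Optimizer step: move the weights against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first epoch loss: {history[0]:.4f}, last epoch loss: {history[-1]:.4f}")
```

With zero weights the neuron outputs 0.5 for everything, which is the “right half of the time” starting point, and every epoch after that nudges the loss a little lower.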

Beginning of training/fitting process output:

End of the output:

Power of visualization

It is often helpful to visualize a changing value on a graph, so below you can see how we can do this for our loss function. The graph shows how the loss gets lower with each epoch. This is exactly the optimizer’s goal: to change the weights of the neuron links on each data pass in such a way as to minimize the value of the loss function for the whole model, meaning we are getting closer to the results we know are correct for the particular train dataset values.

Now it’s time for our evaluation session. In this step we want to see what the result will be for the evaluation data, that is, data this model has not yet seen during training. As you remember, we earlier separated the input data into 2 groups, 80/20. We have been training with the 80%, so our network got really good at predicting values from that dataset. Now we will present it with the remaining 20% to see if it still performs well. For my session the result was slightly worse than in training, as the loss function averaged 0.07, which still isn’t bad for a simple model like ours.
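As a reminder of what the 80/20 split boils down to, here is a small sketch. The split_dataset function, its parameters and the seed are my own illustration and not the helper from the notebook. The rows are just stand-in indices (the classic Iris dataset has 150 rows).

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(150))  # stand-in for 150 dataset rows
train_rows, test_rows = split_dataset(rows)

print(len(train_rows), len(test_rows))
```

The important part is that no row ends up in both groups. Otherwise the evaluation would be checking data the network has already memorized.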

If we are happy with our results we can store the most important product of the training: the weights of the connections between neurons. This way we can later quickly load the model and populate the weights without the need to retrain it.
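The idea of storing weights and loading them back can be shown without Keras at all. The sketch below uses a JSON file as a stand-in for the weights file that Keras writes. The layer names and numbers are made up, and a real model would use its own save and load methods instead.

```python
import json
import os
import tempfile

# Hypothetical weights of a tiny network: layer name -> matrix of connection weights.
weights = {
    "hidden": [[0.12, -0.40], [0.33, 0.08]],
    "output": [[0.51], [-0.27]],
}

path = os.path.join(tempfile.mkdtemp(), "weights.json")

# "Save": write the trained weights to disk.
with open(path, "w") as f:
    json.dump(weights, f)

# "Load": read them back later, with no retraining involved.
with open(path) as f:
    restored = json.load(f)

print(restored == weights)
```

Note that only the numbers are stored here. Just like with a real network, you still need to rebuild the same layer structure before the restored weights are of any use.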

Finally we will run a few actual predictions on our model using the .predict function. To do this we need some input data (something the network has not seen in either training or validation sessions), so we will yet again create a DataFrame to hold our input parameters (8 Iris instances, each containing 4 features describing some class of the flower).

The print() function will show something like this:

And now, finally, the moment you have been waiting for! We will pass our input data to the model and run the prediction. Then we will print the network’s prediction cross-referenced with what we know should be the correct class of Iris for each input row.
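The cross-referencing step can be sketched roughly like this. A classification network typically returns one probability per class for every input row, and we pick the most probable one. The probabilities below are invented and the order of names in CLASS_NAMES is an assumption, so treat this as an illustration rather than the notebook code.

```python
CLASS_NAMES = ["setosa", "versicolor", "virginica"]  # assumed label order

# Hypothetical network outputs: one probability per class for each input row.
predictions = [
    [0.97, 0.02, 0.01],
    [0.05, 0.85, 0.10],
    [0.01, 0.30, 0.69],
]
expected = ["setosa", "versicolor", "virginica"]

def to_class_name(probabilities):
    """Return the name of the class with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best]

for probs, truth in zip(predictions, expected):
    guess = to_class_name(probs)
    print(f"predicted={guess:<12} expected={truth:<12} match={guess == truth}")
```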

If all went well you should see output similar to this:

 

So there you are, our experiment is done. If you would like to run it on a CPU instead of a TPU to check the difference in time or anything else, all you need to do is run the sync_to_cpu() method. This basically switches the wrapper used for our initial model to one better suited for CPUs.

You can find all this code with comments in the GoogleColab notebook I have prepared right here. If I did my job right in explaining the above, the one thing still not clear is likely the choice of loss function and optimizer. This could easily be another article on its own – and I am not ruling out writing one. But for now just know this – there are optimizers and loss functions better suited for different problems.

For example, our loss function sparse_categorical_crossentropy is useful for classification problems where the result is going to fall under just one category. And our optimizer, AdagradOptimizer (Adaptive Gradient), is also well suited for classification problems where only 1 category is going to describe the input data. Understanding why that is will require a bit more math and explanation.
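For the curious, here is a sketch of the basic Adagrad update rule on a single weight. It is a textbook version and not the TensorFlow implementation. The function name, the learning rate and the toy problem (finding the minimum of (w - 3)^2) are my own choices for illustration.

```python
import math

def adagrad_minimize(grad, w, learning_rate=0.5, steps=200, eps=1e-8):
    """Minimize a one-dimensional function given its gradient, using Adagrad."""
    accumulated = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        accumulated += g * g
        # The step shrinks as the history of squared gradients grows.
        w -= learning_rate * g / (math.sqrt(accumulated) + eps)
    return w

# Toy problem: f(w) = (w - 3)^2 has its minimum at w = 3; its gradient is 2 * (w - 3).
w_final = adagrad_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

The “adaptive” part is the division by the accumulated squared gradients: a weight that has already received large updates takes smaller steps later on. In a real network every weight keeps its own accumulator.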

I will leave it for now, but if you are interested here are some materials on the subject which you may find useful:

https://blog.algorithmia.com/introduction-to-optimizers/

https://blog.algorithmia.com/introduction-to-loss-functions/

As always, feel free to leave your thoughts or questions in the comments below. And stay tuned for my next article. I will write about neural networks that don’t even need labelled training data to do their magic!

Software developer at Aspire Systems Poland. Problem solver. The more complicated the problem is, the more motivated he gets. Whether it’s designing, improving processes, architecture or coding, he will be the first one to jump right in.