Away with ideas

Simple deep learning

2020-10-27T22:34:08+00:00

The field of deep learning is vast. The sheer number of publications on the subject is enough to overwhelm anyone. In this series we’ll be taking a step back. We’ll forget about the latest tips and tricks that are pushing the state of the art. Instead, through the use of simple datasets and toy problems, we’ll explore the fundamentals of deep learning to give you a better understanding of the big picture.

This series currently contains the following posts:

MNIST extended: a dataset for semantic segmentation and object detection

2020-10-27T22:33:58+00:00

Most open source datasets for computer vision are huge and complex. Building a model from scratch using ImageNet or COCO is impossible without days of training on specialised hardware such as GPUs or TPUs. I’ve often found myself in need of a simple and small dataset to test model architectures. I don’t always have a GPU available and I don’t want to wait hours for the results of my experiments.
For image classification I often use the MNIST dataset. It’s an incredibly useful dataset of small digits (if you’re not familiar with it, don’t worry, we’ll see what it looks like soon). However, in its raw form it’s really only useful for image classification tasks. For more complex tasks such as semantic segmentation and object detection I created MNIST extended, a dataset as simple as MNIST but that can be used for more than just image classification. In this post, I will describe how to use MNIST extended and share a few details on the simple code that is used to generate it.

This dataset is used in my “Simple deep learning” series in the following posts:

A simple example of semantic segmentation with tensorflow keras

This post won’t go into the details of how the dataset is created; rather, we’ll focus on what the dataset is composed of. However, the code is well documented and easy to understand. You can find all the functions used here and the Jupyter notebook version of this post in my GitHub.

MNIST dataset

MNIST is a dataset of handwritten digits. The original dataset can be downloaded from Yann LeCun’s website. However, we do not need to download the data from there since we will be using a Keras function to do that for us. This dataset forms the base of both the semantic segmentation and object detection components of MNIST extended.

import tensorflow as tf

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

print(train_images.shape, train_labels.shape)
print(test_images.shape, test_labels.shape)

(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)

As you can see, there are 70000 images in total. Let’s display a few just to get an idea of what MNIST looks like.

from simple_deep_learning.mnist_extended.mnist import display_digits

display_digits(images=train_images, labels=train_labels, num_to_display=20)

Original MNIST digits

These digits form the base of MNIST extended. Let’s see how we can turn those single digit images into a semantic segmentation dataset.

Semantic segmentation

Semantic segmentation is the task of assigning a label to each pixel of an image. It can be seen as an image classification task, except that instead of classifying the whole image, you’re classifying each pixel individually.

The input image is created by randomly overlaying digits from the original MNIST dataset on an empty array. The target array is of shape (height, width, num_classes), with one channel per class, so every pixel carries a label for each class.
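To make the idea concrete, here’s a rough sketch of how one digit could be pasted onto a blank canvas while the matching channel of the target is filled in. This is not the actual code from the package: the overlay_digit function, the fixed positions and the fake square “digit” are all made up for illustration.

```python
import numpy as np

def overlay_digit(canvas, target, digit, label, top, left):
    """Paste one digit onto the canvas and mark its pixels in the target."""
    h, w = digit.shape
    region = canvas[top:top + h, left:left + w, 0]
    # Keep the brighter value where digits overlap.
    canvas[top:top + h, left:left + w, 0] = np.maximum(region, digit)
    # Mark every non-zero digit pixel in the channel of its class.
    mask = (digit > 0).astype(target.dtype)
    target[top:top + h, left:left + w, label] = np.maximum(
        target[top:top + h, left:left + w, label], mask)

height, width, num_classes = 60, 60, 5
canvas = np.zeros((height, width, 1), dtype=np.float32)
target = np.zeros((height, width, num_classes), dtype=np.float32)

# A fake 28x28 "digit": a bright square standing in for a real MNIST image.
fake_digit = np.zeros((28, 28), dtype=np.float32)
fake_digit[8:20, 8:20] = 1.0

overlay_digit(canvas, target, fake_digit, label=2, top=5, left=10)
overlay_digit(canvas, target, fake_digit, label=4, top=30, left=25)

print(canvas.shape, target.shape)  # (60, 60, 1) (60, 60, 5)
print(target.sum(axis=(0, 1)))     # number of labelled pixels per class
```

Only channels 2 and 4 of the target contain non-zero pixels, because those are the only two classes that were pasted onto the canvas.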

Let’s take a look at what this might look like. We’ll generate images of height and width 60 pixels and choose digits 0-4 (i.e. num_classes = 5).
We’re just going to use the basic parameters of the create_semantic_segmentation_dataset function. For more customisation, take a look at the documented code or check the “Customisation” section at the end of the article.

import numpy as np
np.random.seed(seed=9)

from simple_deep_learning.mnist_extended.semantic_segmentation import (create_semantic_segmentation_dataset, display_segmented_image,
                                                                       display_grayscale_array, plot_class_masks)

train_x, train_y, test_x, test_y = create_semantic_segmentation_dataset(num_train_samples=100,
                                                                        num_test_samples=10,
                                                                        image_shape=(60, 60),
                                                                        num_classes=5)

Below is a randomly selected example from the dataset and its shape. As you can see, the input is of shape (height, width, 1), which is expected since the input is a simple grayscale image. The output is of shape (height, width, num_classes), with one channel per class. We’ll see what each channel contains in a bit.

import numpy as np

i = np.random.randint(len(train_x))
print(train_x[i].shape)
print(train_y[i].shape)

(60, 60, 1)
(60, 60, 5)

The following code displays the input image. As already mentioned, this is simply the original digits of MNIST randomly overlaid on a blank canvas.

from simple_deep_learning.mnist_extended.semantic_segmentation import display_grayscale_array

i = np.random.randint(len(train_x))
display_grayscale_array(array=train_x[i])

MNIST extended input example

The target array is a lot more interesting. It has a 3rd dimension of length equal to the number of classes to predict. That is, if our input images are composed of MNIST digits 0-4, then our target array will have a shape (height, width, 5).

In the following cell, we have a function that indexes the target array along the third axis (the classes axis) and displays each slice individually.

from simple_deep_learning.mnist_extended.semantic_segmentation import plot_class_masks
plot_class_masks(train_y[i])

MNIST extended semantic segmentation slices

Each slice contains only one type of digit. In our case, the input image is composed of 2 twos and 2 fours, therefore the target array has 2 twos at slice 2 and 2 fours at slice 4.

It’s by separating the digits of a certain class into different slices that we tell our model which pixels correspond to which class. When training a model, we want it to be able to separate pixels of the original image into their respective slice.

By default, in our dataset, classes are not exclusive. That means a pixel can be part of more than one digit at a time. This will affect our choice of loss function when building models but is not particularly important here. If you want exclusive classes, you can set labels_are_exclusive=True in the create_semantic_segmentation_dataset function, in which case pixels that belong to multiple digits will only have one class, selected at random.
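If you’re wondering what “selected at random” could look like in practice, here’s a small sketch. It’s my own illustration of the idea, not the code behind labels_are_exclusive; the make_exclusive function and the toy target are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_exclusive(target, rng):
    """Return a copy where each pixel keeps at most one active class."""
    exclusive = np.zeros_like(target)
    # Random scores, set to -1 for inactive classes so they are never picked.
    scores = np.where(target > 0, rng.random(target.shape), -1.0)
    chosen = scores.argmax(axis=-1)
    # Only pixels that had at least one class keep a label.
    has_label = target.sum(axis=-1) > 0
    rows, cols = np.nonzero(has_label)
    exclusive[rows, cols, chosen[rows, cols]] = 1
    return exclusive

# Toy 2x2 target with 3 classes; pixel (0, 0) belongs to classes 0 and 2.
target = np.zeros((2, 2, 3), dtype=np.float32)
target[0, 0, [0, 2]] = 1
target[1, 1, 1] = 1

exclusive = make_exclusive(target, rng)
print(exclusive.sum(axis=-1))  # at most one class per pixel
```

Background pixels stay all zeros, and the overlapping pixel keeps exactly one of its two original classes.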

Below is another way of displaying the digits. This time instead of separating the slices, we give each slice a particular colour. Here’s what that looks like:

from simple_deep_learning.mnist_extended.semantic_segmentation import display_segmented_image
display_segmented_image(y=train_y[i])

MNIST extended semantic segmentation example

That’s it for the basic information on the semantic segmentation dataset. If you want an example of how this dataset is used to train a neural network for image segmentation, check out my tutorial: A simple example of semantic segmentation with tensorflow keras

Object detection

Object detection is the task of drawing a bounding box around objects of interest. The input data for the object detection problem is exactly the same as for the semantic segmentation. The target however is different. Instead of classifying each pixel, we want to output the coordinates of a bounding box and a class label for each predicted bounding box.

Generating the target for an object detection task is more complicated than for semantic segmentation. Different models use different target arrays. To remain generic, I have decided to output the bounding boxes and labels as lists. This cannot be used directly as a target for machine learning models but can be processed to produce a suitable target array for a given model.

from simple_deep_learning.mnist_extended.object_detection import create_object_detection_dataset    

train_x, train_bounding_boxes, train_labels, test_x, test_bounding_boxes, test_labels = create_object_detection_dataset(
    num_train_samples=100, num_test_samples=10, image_shape=(60, 60))

The input array (i.e. x) is in the same format as for semantic segmentation.

from simple_deep_learning.mnist_extended.semantic_segmentation import display_grayscale_array

i = np.random.randint(len(train_x))
display_grayscale_array(array=train_x[i])

MNIST extended input example

Let’s take a look at the bounding boxes and labels.

print(train_bounding_boxes[i])
print(train_labels[i])

[[ 9  2 37 30]
 [27 11 55 39]]
[4 1]

We see the (xmin, ymin, xmax, ymax) coordinates of each bounding box, as well as the associated label. As mentioned before, this cannot directly be used as a target because different images have a different number of bounding boxes and the output of most neural networks (RNNs being an exception) is of fixed size for an input of a given size. For anyone interested in how to construct the target for an object detection model, I recommend checking the architecture of single shot detection (SSD) models. They are very commonly used models for object detection and relatively simple.
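As a very rough illustration of the fixed-size problem, here’s one simple way to turn a variable number of boxes into arrays of a fixed length by padding. This is not how SSD builds its target, and pad_boxes, max_boxes and the -1 padding label are names and values I’ve made up for the sketch.

```python
def pad_boxes(boxes, labels, max_boxes, pad_label=-1):
    """Pad (or truncate) boxes and labels so every image has max_boxes entries."""
    padded_boxes = [list(box) for box in boxes[:max_boxes]]
    padded_labels = list(labels[:max_boxes])
    while len(padded_boxes) < max_boxes:
        padded_boxes.append([0, 0, 0, 0])  # empty box used as padding
        padded_labels.append(pad_label)    # marks "no object here"
    return padded_boxes, padded_labels

boxes = [[9, 2, 37, 30], [27, 11, 55, 39]]
labels = [4, 1]

padded_boxes, padded_labels = pad_boxes(boxes, labels, max_boxes=4)
print(padded_boxes)
print(padded_labels)  # [4, 1, -1, -1]
```

A model trained on such a target would also need a loss that ignores the padded entries, which is part of what makes object detection targets more involved than segmentation targets.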

I have created a function to draw the bounding boxes on the array and return a PIL image.

from simple_deep_learning.mnist_extended.object_detection import draw_bounding_boxes

a = np.array(draw_bounding_boxes(train_x[i], bounding_boxes=train_bounding_boxes[i], labels=train_labels[i]))
display_grayscale_array(a)

MNIST extended object detection example

MNIST extended customisation

So far we’ve only used the main parameters of the create dataset functions. I recommend checking the code to find how to change things such as the maximum number of digits per image, the maximum IOU (intersection over union) of two digits in the same image and more.
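In case IOU is new to you, here’s a small sketch of how it can be computed for two boxes in the (xmin, ymin, xmax, ymax) format used above. The iou function is my own helper for illustration and not necessarily how the package computes it.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# The two bounding boxes printed earlier in this post.
print(iou((9, 2, 37, 30), (27, 11, 55, 39)))
```

An IOU of 0 means the boxes don’t overlap at all and an IOU of 1 means they are identical, so a maximum IOU limits how much two digits are allowed to cover each other.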

In this post, we’ve been using the end to end functions create_semantic_segmentation_dataset and create_object_detection_dataset.

These perform the following tasks:

Download the original MNIST dataset.
Preprocess the original MNIST images.
Overlay the MNIST digits to create the new input image.
Create the target/output arrays.

These components are all part of the MNIST extended package and are very modular. This provides you with a lot of freedom to customise the dataset as you would like. For instance, if you want to perform additional preprocessing on the original MNIST digits, that’s totally possible. You might want to modify the digits by randomly changing their size, in which case you can use the individual functions for downloading and preprocessing the MNIST digits. Then you can add a function to modify their size and finally feed the modified images and labels into create_object_detection_data_from_digits or create_semantic_segmentation_data_from_digits.
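As an example of the kind of extra step you could add, here’s a sketch of a nearest-neighbour resize written with plain numpy. The resize_nearest function is hypothetical and not part of the package, and the random array only stands in for a real MNIST digit. How the resized digits are then passed to the *_from_digits functions depends on their signatures, so check the code for that part.

```python
import numpy as np

def resize_nearest(digit, new_size):
    """Resize a 2D digit to (new_size, new_size) by nearest neighbour."""
    height, width = digit.shape
    # For each output row/column, pick the closest source row/column.
    rows = np.arange(new_size) * height // new_size
    cols = np.arange(new_size) * width // new_size
    return digit[rows][:, cols]

rng = np.random.default_rng(0)
# Stand-in for a 28x28 MNIST digit.
digit = rng.integers(0, 256, size=(28, 28)).astype(np.uint8)
# Random target size between 14 and 42 pixels.
new_size = int(rng.integers(14, 43))
resized = resize_nearest(digit, new_size)

print(digit.shape, '->', resized.shape)
```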

I hope you have a lot of fun playing around with this dataset. I’ve certainly found it very useful for experimenting with model architectures and learning about deep learning more generally. I would love to hear what you’ve done with the dataset so please post a comment below or send me a message via LinkedIn!

Have a great day.

Luke

A simple example of semantic segmentation with tensorflow keras

2020-10-27T22:33:49+00:00

This post is about semantic segmentation. This is the task of assigning a label to each pixel of an image. It can be seen as an image classification task, except that instead of classifying the whole image, you’re classifying each pixel individually. From this perspective, semantic segmentation is actually very simple. Let’s see how we can build a model using Keras to perform semantic segmentation.

This tutorial is posted on my blog and in my github repository where you can find the jupyter notebook version of this post.

We’re going to use MNIST extended, a toy dataset I created that’s great for exploring and playing around with deep learning models. In this post, we won’t look into how the data is generated. For more information on that, you can check out my post: MNIST Extended: A simple dataset for image segmentation and object localisation

In this post I assume a basic understanding of deep learning computer vision notions such as convolutional layers, pooling layers, loss functions, tensorflow/keras etc.

Import packages

Let’s start by importing a few packages. I’ve printed the tensorflow version we’re importing. We’ll only be using very simple features of the package, so any version of tensorflow 2 should work.

import tensorflow as tf
print(tf.__version__)

import numpy as np
print(np.__version__)

import matplotlib
from matplotlib import pyplot as plt
print(matplotlib.__version__)

0.0
19.1
3.1

Semantic segmentation dataset

from simple_deep_learning.mnist_extended.semantic_segmentation import create_semantic_segmentation_dataset

If you’re running the code yourself, you might have a few dependencies missing. You can either install the missing dependencies yourself, or you can pip install the requirements file from the github repository. It’s also possible to install the simple_deep_learning package itself (which will also install the dependencies). Check out the README.md in the github repository for installation instructions.

np.random.seed(1)
train_x, train_y, test_x, test_y = create_semantic_segmentation_dataset(num_train_samples=1000,
                                                                        num_test_samples=200,
                                                                        image_shape=(60, 60),
                                                                        max_num_digits_per_image=4,
                                                                        num_classes=3)

Let’s take a quick look at what this input and output looks like.

import numpy as np
from simple_deep_learning.mnist_extended.semantic_segmentation import display_grayscale_array, plot_class_masks

print(train_x.shape, train_y.shape)

i = np.random.randint(len(train_x))

display_grayscale_array(array=train_x[i])

plot_class_masks(train_y[i])

(1000, 60, 60, 1) (1000, 60, 60, 3)

Input image example

Target example

I’ve printed the shapes of the train inputs and targets. As expected, the input is a grayscale image. The output is slightly strange, however: it’s essentially a grayscale image for each class we have in our semantic segmentation task. Here we chose num_classes=3 (i.e. digits 0, 1 and 2) so our target has a last dimension of length 3. If this is strange to you, I strongly recommend you check out my post on MNIST extended where I explain this semantic segmentation dataset in more detail.

Semantic segmentation modelling

Model architecture

This post is part of the simple deep learning series. My objective here is to achieve reasonably good results with a simple model. This helps understand the core concepts related to a particular deep learning task. It’s then very possible to gradually include components from state of the art models to achieve better results or a more efficient model.

Before I give you the simplest model architecture for semantic segmentation, I’d like you to spend a bit of time trying to imagine what that would be.

Need help? I’ll give you a hint. For semantic segmentation, the width and height of our output should be the same as our input (semantic segmentation is the task of classifying each pixel individually) and the number of channels should be the number of classes to predict.

The simplest model that achieves that is simply a stack of 2D convolutional layers! It’s that simple. If you’re familiar with image classification, you might remember that you need pooling to gradually reduce the input size on top of which you add a dense layer. For semantic segmentation this isn’t even needed because your output is the same size as the input! This very simple model of stacking convolutional layers is called a Fully Convolutional Network (FCN).

Let’s see whether this is good enough. We’ll be using tf.keras’s sequential API to create the model.

import tensorflow as tf
from tensorflow.keras import datasets, layers, models

tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', input_shape=train_x.shape[1:], padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=train_y.shape[-1], kernel_size=(3, 3), activation='sigmoid', padding='same'))

We’re not going to bother ourselves with fancy activations, let’s just go with relu for the intermediate layers and sigmoid for the last layer. I chose sigmoid for the output because it produces an activation between 0 and 1 (i.e. a probability) and our classes are non exclusive. If they were exclusive, we could use a softmax along the channels axis instead.
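To see the difference between the two, here’s a tiny pure-Python sketch that applies both activations to the three class outputs of a single pixel. The logit values are made up for illustration.

```python
import math

def sigmoid(logits):
    """Independent probability for each class."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One pixel, three classes; the pixel plausibly belongs to classes 0 and 1.
pixel_logits = [3.0, 2.5, -4.0]

print(sigmoid(pixel_logits))
print(softmax(pixel_logits))
```

With sigmoid, both class 0 and class 1 end up above 0.5, which is what we want when a pixel can belong to two overlapping digits. With softmax the probabilities have to sum to 1, so the two classes are forced to share.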

“Same” padding is perfectly appropriate here: we want our output to be the same size as our input and same padding does exactly that.
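Here’s a quick sketch of the arithmetic, comparing what would happen to a 60 pixel wide input after 11 convolutions with a 3 by 3 kernel under “same” and “valid” padding. The conv_output_size helper is my own and simply applies the standard output size formulas, assuming a stride of 1 as in the model above.

```python
import math

def conv_output_size(size, kernel_size, stride=1, padding='same'):
    """Output height (or width) of a convolution for a given padding mode."""
    if padding == 'same':
        return math.ceil(size / stride)
    if padding == 'valid':
        return (size - kernel_size) // stride + 1
    raise ValueError(f'Unknown padding: {padding}')

size_same = size_valid = 60
for _ in range(11):  # our model stacks 11 convolutional layers
    size_same = conv_output_size(size_same, kernel_size=3, padding='same')
    size_valid = conv_output_size(size_valid, kernel_size=3, padding='valid')

print(size_same, size_valid)  # 60 38
```

Without padding, each 3 by 3 convolution would trim 2 pixels off the height and width, and the output would no longer line up with the target.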

I’m not going to claim some sort of magical intuition for the number of convolutional layers or the number of filters. When experimenting for this article, I started with an even smaller model, but it wasn’t managing to learn anything. So I gradually increased the size until it started learning.

I’ve got a deep learning hint for you. If you’re ever struggling to find the correct size for your models, my recommendation is to start with something small. If that small model isn’t managing to fit the training dataset, then gradually increase the size of your model until you manage to fit the training set. Unless you’ve made a particularly bad architectural decision, you should always be able to fit your training dataset, if not, your model is probably too small.

Let’s look at how many parameters our model has.

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 60, 60, 16)        160       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 60, 60, 32)        4640      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 60, 60, 16)        4624      
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 60, 60, 3)         435       
=================================================================
Total params: 74,595
Trainable params: 74,595
Non-trainable params: 0
_________________________________________________________________

About 75000 trainable parameters. For reference, VGG16, a well known model for image feature extraction, contains 138 million parameters. In comparison, our model is tiny. That’s good, because it means we should be able to train it quickly on CPU.
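If you want to check where the numbers in the summary come from, a convolutional layer has one weight per kernel position, input channel and filter, plus one bias per filter. Here’s a short sketch that recomputes the total; conv2d_params is my own helper, not a Keras function.

```python
def conv2d_params(in_channels, filters, kernel_size=(3, 3)):
    """Number of trainable parameters of a Conv2D layer with biases."""
    kernel_height, kernel_width = kernel_size
    weights = kernel_height * kernel_width * in_channels * filters
    biases = filters
    return weights + biases

# Filters of each layer in the model above: 16, eight times 32, 16, then 3.
filters_per_layer = [16] + [32] * 8 + [16, 3]

in_channels = 1  # grayscale input
total = 0
for filters in filters_per_layer:
    total += conv2d_params(in_channels, filters)
    in_channels = filters  # the next layer sees this layer's output channels

print(total)  # 74595
```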

Let’s choose our training parameters. Adam is my go-to gradient descent based optimisation algorithm. I don’t want to go into the details of how Adam works, but it’s often a good default that I and others recommend.

For the loss function, I chose binary crossentropy. This is a good loss when your classes are non exclusive which is the case here. If your labels are exclusive, you might want to look at categorical crossentropy or something else.
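To make the loss a bit more concrete, here’s a pure-Python sketch of binary crossentropy averaged over a handful of pixel outputs. It follows the usual definition; the eps value is just my choice to avoid taking the log of 0, and the predictions are made up.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy over a flat list of targets and predictions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

y_true = [1, 0, 0, 1]
good = binary_crossentropy(y_true, [0.9, 0.1, 0.2, 0.8])
bad = binary_crossentropy(y_true, [0.4, 0.6, 0.7, 0.3])

print(good, bad)
```

Each output is scored on its own, which is why this loss works when a pixel can belong to several classes at once. The confident, correct predictions get a much lower loss than the wrong ones.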

Keras allows you to add metrics to be calculated while the model is training. These don’t influence the training process but are useful to follow training performance. Accuracy is often the default, but here accuracy isn’t very meaningful. Our classes are so imbalanced (i.e. there are a lot more background pixels than digit pixels) that even a model that always predicts 0 will have a great accuracy. For that reason I added recall and precision. Those metrics are a lot more useful to evaluate performance, especially in the case of a class imbalance.
I was slightly worried that the class imbalance would prevent the model from learning (I think it does a bit at the beginning) but eventually the model learns.
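Here’s a small sketch of why accuracy is misleading in this setting. The pixel_metrics function is my own simplified version of these metrics (using a 0.5 threshold) and the pixel counts are made up.

```python
def pixel_metrics(y_true, y_pred, threshold=0.5):
    """Accuracy, recall and precision over a flat list of pixel labels."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, y_pred):
        predicted = score > threshold
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision

# 100 pixels, only 4 of which belong to a digit (made-up numbers).
y_true = [1] * 4 + [0] * 96
always_background = [0.0] * 100

print(pixel_metrics(y_true, always_background))  # (0.96, 0.0, 0.0)
```

A model that never predicts a digit gets 96% accuracy here, but its recall and precision are both 0, which tells the real story.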

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(),
                       tf.keras.metrics.Recall(),
                       tf.keras.metrics.Precision()])

Train and evaluate

Let’s train the model for 20 epochs. This takes about 13 minutes on my 2017 laptop with CPU only (roughly 40 seconds per epoch in the log below). If you have a GPU available, then use it. Your model will train a lot faster (approx 10x speed depending on your GPU/CPU). If you’re familiar with Google Colab then you can also run the notebook version of the tutorial on there and utilise the free GPU/TPU available on the platform (you will need to copy or install the simple_deep_learning package to generate the dataset).

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))

Train on 1000 samples, validate on 200 samples
Epoch 1/20
1000/1000 [==============================] - 39s 39ms/sample - loss: 0.2830 - binary_accuracy: 0.9458 - recall: 0.0266 - precision: 0.0711 - val_loss: 0.0752 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 2/20
1000/1000 [==============================] - 38s 38ms/sample - loss: 0.0709 - binary_accuracy: 0.9598 - recall: 0.0000e+00 - precision: 0.0000e+00 - val_loss: 0.0641 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 3/20
1000/1000 [==============================] - 36s 36ms/sample - loss: 0.0595 - binary_accuracy: 0.9590 - recall: 0.0381 - precision: 0.6183 - val_loss: 0.0568 - val_binary_accuracy: 0.9580 - val_recall: 0.0781 - val_precision: 0.5779
Epoch 4/20
1000/1000 [==============================] - 38s 38ms/sample - loss: 0.0548 - binary_accuracy: 0.9575 - recall: 0.1105 - precision: 0.6330 - val_loss: 0.0527 - val_binary_accuracy: 0.9551 - val_recall: 0.2162 - val_precision: 0.6086
Epoch 5/20
1000/1000 [==============================] - 38s 38ms/sample - loss: 0.0508 - binary_accuracy: 0.9555 - recall: 0.2354 - precision: 0.6553 - val_loss: 0.0486 - val_binary_accuracy: 0.9558 - val_recall: 0.2681 - val_precision: 0.6815
Epoch 6/20
1000/1000 [==============================] - 37s 37ms/sample - loss: 0.0469 - binary_accuracy: 0.9554 - recall: 0.3294 - precision: 0.7167 - val_loss: 0.0447 - val_binary_accuracy: 0.9545 - val_recall: 0.3860 - val_precision: 0.7028
Epoch 7/20
1000/1000 [==============================] - 39s 39ms/sample - loss: 0.0429 - binary_accuracy: 0.9559 - recall: 0.4088 - precision: 0.7670 - val_loss: 0.0404 - val_binary_accuracy: 0.9568 - val_recall: 0.4336 - val_precision: 0.7997
Epoch 8/20
1000/1000 [==============================] - 45s 45ms/sample - loss: 0.0408 - binary_accuracy: 0.9564 - recall: 0.4440 - precision: 0.7983 - val_loss: 0.0384 - val_binary_accuracy: 0.9569 - val_recall: 0.4866 - val_precision: 0.8160
Epoch 9/20
1000/1000 [==============================] - 46s 46ms/sample - loss: 0.0375 - binary_accuracy: 0.9572 - recall: 0.4902 - precision: 0.8368 - val_loss: 0.0371 - val_binary_accuracy: 0.9581 - val_recall: 0.4432 - val_precision: 0.8508
Epoch 10/20
1000/1000 [==============================] - 39s 39ms/sample - loss: 0.0352 - binary_accuracy: 0.9577 - recall: 0.5175 - precision: 0.8581 - val_loss: 0.0364 - val_binary_accuracy: 0.9568 - val_recall: 0.5367 - val_precision: 0.8272
Epoch 11/20
1000/1000 [==============================] - 42s 42ms/sample - loss: 0.0347 - binary_accuracy: 0.9578 - recall: 0.5271 - precision: 0.8666 - val_loss: 0.0345 - val_binary_accuracy: 0.9574 - val_recall: 0.5920 - val_precision: 0.8554
Epoch 12/20
1000/1000 [==============================] - 43s 43ms/sample - loss: 0.0334 - binary_accuracy: 0.9582 - recall: 0.5425 - precision: 0.8822 - val_loss: 0.0332 - val_binary_accuracy: 0.9581 - val_recall: 0.5709 - val_precision: 0.8741
Epoch 13/20
1000/1000 [==============================] - 40s 40ms/sample - loss: 0.0323 - binary_accuracy: 0.9584 - recall: 0.5491 - precision: 0.8882 - val_loss: 0.0351 - val_binary_accuracy: 0.9566 - val_recall: 0.6564 - val_precision: 0.8446
Epoch 14/20
1000/1000 [==============================] - 42s 42ms/sample - loss: 0.0320 - binary_accuracy: 0.9585 - recall: 0.5577 - precision: 0.8914 - val_loss: 0.0316 - val_binary_accuracy: 0.9581 - val_recall: 0.5888 - val_precision: 0.8752
Epoch 15/20
1000/1000 [==============================] - 44s 44ms/sample - loss: 0.0301 - binary_accuracy: 0.9589 - recall: 0.5743 - precision: 0.9084 - val_loss: 0.0329 - val_binary_accuracy: 0.9578 - val_recall: 0.6012 - val_precision: 0.8701
Epoch 16/20
1000/1000 [==============================] - 41s 41ms/sample - loss: 0.0301 - binary_accuracy: 0.9588 - recall: 0.5755 - precision: 0.9048 - val_loss: 0.0298 - val_binary_accuracy: 0.9588 - val_recall: 0.6040 - val_precision: 0.9025
Epoch 17/20
1000/1000 [==============================] - 38s 38ms/sample - loss: 0.0290 - binary_accuracy: 0.9590 - recall: 0.5847 - precision: 0.9143 - val_loss: 0.0295 - val_binary_accuracy: 0.9586 - val_recall: 0.6410 - val_precision: 0.8998
Epoch 18/20
1000/1000 [==============================] - 37s 37ms/sample - loss: 0.0280 - binary_accuracy: 0.9592 - recall: 0.5929 - precision: 0.9206 - val_loss: 0.0301 - val_binary_accuracy: 0.9586 - val_recall: 0.6418 - val_precision: 0.9001
Epoch 19/20
1000/1000 [==============================] - 37s 37ms/sample - loss: 0.0277 - binary_accuracy: 0.9593 - recall: 0.5955 - precision: 0.9240 - val_loss: 0.0280 - val_binary_accuracy: 0.9590 - val_recall: 0.6358 - val_precision: 0.9098
Epoch 20/20
1000/1000 [==============================] - 37s 37ms/sample - loss: 0.0269 - binary_accuracy: 0.9594 - recall: 0.6037 - precision: 0.9294 - val_loss: 0.0271 - val_binary_accuracy: 0.9594 - val_recall: 0.6150 - val_precision: 0.9231

We’ve stopped the training before the loss plateaued. As you can see, both train and validation loss were still going down after 20 epochs, which means that some extra performance might be gained from training longer. However, we’re not here to get the best possible model.

At the end of epoch 20, on the test set we have an accuracy of 95.9%, a recall of 61.5% and a precision of 92.3%. Remember, these metrics are computed over individual pixels, so they don’t give us a great idea of how our segmentation actually looks. To get a better idea, let’s look at a few predictions from the test data.

test_y_predicted = model.predict(test_x)

from simple_deep_learning.mnist_extended.semantic_segmentation import display_segmented_image

np.random.seed(6)
for _ in range(3):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    i = np.random.randint(len(test_y_predicted))
    print(f'Example {i}')
    display_grayscale_array(test_x[i], ax=ax1, title='Input image')
    display_segmented_image(test_y_predicted[i], ax=ax2, title='Segmented image', threshold=0.5)
    plot_class_masks(test_y[i], test_y_predicted[i], title='y target and y predicted sliced along the channel axis')

Example 138

Example 106

Example 109

These randomly selected samples show that the model has at least learnt something. It does quite a good job of detecting the digits but it has some problems. By looking at a few examples, it becomes apparent that the model is far from perfect. In my opinion, this model isn’t good enough. There’s no sign of overfitting (the validation loss is still tracking the training loss), so we could train for longer or increase the size of the model, but we can do better than that.

Improvements

We can improve our model by adding a few max pooling layers. The first benefit of these pooling layers is computational efficiency. By reducing the size of the intermediate layers, our network performs fewer computations, which will speed up training a bit. However, the number of parameters remains the same because our convolutions are unchanged. The problem with adding the pooling layers is that our output will no longer have the same height and width as the input image. To solve that problem we can use upsampling layers. These simple upsampling layers perform essentially the inverse of the pooling layer. A (2, 2) upsampling layer will transform a (height, width, channels) volume into a (height * 2, width * 2, channels) volume simply by repeating each pixel in a 2 by 2 block, so each value appears 4 times. By applying the same number of upsampling layers as max pooling layers, our output is of the same height and width as the input.
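Here’s a small numpy sketch of what a (2, 2) max pooling followed by a (2, 2) upsampling does to a tiny array. The two helper functions are my own simplified stand-ins for the Keras layers and assume an even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """Keep the maximum of each 2x2 block of a (height, width, channels) array."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x2(x):
    """Repeat each pixel in a 2x2 block, doubling height and width."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
restored = upsample_2x2(pooled)

print(pooled[..., 0])    # 2x2 array holding the maximum of each block
print(restored[..., 0])  # back to 4x4, each value repeated 4 times

# Shapes for our 60x60 input with two poolings and two upsamplings.
volume = np.zeros((60, 60, 32), dtype=np.float32)
small = max_pool_2x2(max_pool_2x2(volume))
print(small.shape, upsample_2x2(upsample_2x2(small)).shape)
```

The shape comes back to the original, but the fine detail thrown away by the pooling does not. It’s up to the convolutions after each upsampling layer to sharpen the result.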

Another, more intuitive, benefit of adding the pooling layers is that it forces the network to learn a compressed representation of the input image. It’s not totally evident how this helps, but by forcing the intermediate layers to hold a volume of smaller height and width than the input, the network is forced to learn the important elements of the input image as a whole as opposed to simply passing all information through. As you’ll see, the pooling layers not only improve computational efficiency but also improve the performance of our model!

This idea of compressing a complex input to a compact representation and using that representation to construct an output is very common in deep learning. Such models are often called “encoder-decoder” models. They’re not only used in computer vision: in this more advanced deep learning post, I explore the use of encoder-decoders for time series prediction.

tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', input_shape=train_x.shape[1:], padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.UpSampling2D(size=(2, 2)))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.UpSampling2D(size=(2, 2)))
model.add(layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=train_y.shape[-1], kernel_size=(3, 3), activation='sigmoid', padding='same'))

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(),
                       tf.keras.metrics.Recall(),
                       tf.keras.metrics.Precision()])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 60, 60, 16)        160       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 60, 60, 32)        4640      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 30, 30, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 15, 15, 32)        9248      
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 15, 15, 32)        9248      
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 30, 30, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 60, 60, 32)        0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 60, 60, 32)        9248      
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 60, 60, 16)        4624      
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 60, 60, 3)         435       
=================================================================
Total params: 74,595
Trainable params: 74,595
Non-trainable params: 0
_________________________________________________________________
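The parameter counts in this summary can be checked by hand. A convolution has one weight per kernel position and input channel, plus one bias, for every filter. The sketch below redoes that arithmetic. The helper `conv2d_params` is written for this check, and the layer list is read off the summary above, assuming a single-channel input.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Number of trainable parameters of a 2D convolution with biases."""
    kernel_height, kernel_width = kernel_size
    return (kernel_height * kernel_width * in_channels + 1) * filters

# (input channels, filters) for conv2d ... conv2d_10, in order.
# Pooling and upsampling layers have no parameters, so they are left out.
conv_layers = [(1, 16), (16, 32)] + [(32, 32)] * 7 + [(32, 16), (16, 3)]

per_layer = [conv2d_params((3, 3), c_in, f) for c_in, f in conv_layers]
total = sum(per_layer)

print(per_layer)  # [160, 4640, 9248, 9248, 9248, 9248, 9248, 9248, 9248, 4624, 435]
print(total)      # 74595
```

None of these numbers depend on the height or width of the volume, which is why pooling reduces computation but leaves the number of parameters unchanged.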

history = model.fit(train_x, train_y, epochs=20,
                    validation_data=(test_x, test_y))

Train on 1000 samples, validate on 200 samples
Epoch 1/20
1000/1000 [==============================] - 19s 19ms/sample - loss: 0.3355 - binary_accuracy: 0.9403 - recall: 0.0318 - precision: 0.0616 - val_loss: 0.1344 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 2/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0972 - binary_accuracy: 0.9598 - recall: 0.0000e+00 - precision: 0.0000e+00 - val_loss: 0.0818 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 3/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0773 - binary_accuracy: 0.9598 - recall: 0.0000e+00 - precision: 0.0000e+00 - val_loss: 0.0723 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 4/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0694 - binary_accuracy: 0.9598 - recall: 0.0000e+00 - precision: 0.0000e+00 - val_loss: 0.0661 - val_binary_accuracy: 0.9601 - val_recall: 0.0000e+00 - val_precision: 0.0000e+00
Epoch 5/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0648 - binary_accuracy: 0.9598 - recall: 0.0000e+00 - precision: 0.0000e+00 - val_loss: 0.0623 - val_binary_accuracy: 0.9601 - val_recall: 1.3908e-04 - val_precision: 1.0000
Epoch 6/20
1000/1000 [==============================] - 18s 18ms/sample - loss: 0.0599 - binary_accuracy: 0.9597 - recall: 0.0242 - precision: 0.8687 - val_loss: 0.0583 - val_binary_accuracy: 0.9583 - val_recall: 0.1040 - val_precision: 0.6663
Epoch 7/20
1000/1000 [==============================] - 18s 18ms/sample - loss: 0.0541 - binary_accuracy: 0.9581 - recall: 0.1451 - precision: 0.7368 - val_loss: 0.0524 - val_binary_accuracy: 0.9566 - val_recall: 0.2215 - val_precision: 0.6927
Epoch 8/20
1000/1000 [==============================] - 18s 18ms/sample - loss: 0.0502 - binary_accuracy: 0.9578 - recall: 0.1983 - precision: 0.7569 - val_loss: 0.0477 - val_binary_accuracy: 0.9577 - val_recall: 0.2330 - val_precision: 0.7623
Epoch 9/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0474 - binary_accuracy: 0.9573 - recall: 0.2586 - precision: 0.7650 - val_loss: 0.0490 - val_binary_accuracy: 0.9541 - val_recall: 0.3124 - val_precision: 0.6513
Epoch 10/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0464 - binary_accuracy: 0.9563 - recall: 0.3471 - precision: 0.7608 - val_loss: 0.0457 - val_binary_accuracy: 0.9553 - val_recall: 0.4305 - val_precision: 0.7479
Epoch 11/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0397 - binary_accuracy: 0.9576 - recall: 0.4826 - precision: 0.8528 - val_loss: 0.0355 - val_binary_accuracy: 0.9593 - val_recall: 0.4879 - val_precision: 0.9086
Epoch 12/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0346 - binary_accuracy: 0.9585 - recall: 0.5753 - precision: 0.8972 - val_loss: 0.0332 - val_binary_accuracy: 0.9583 - val_recall: 0.5929 - val_precision: 0.8888
Epoch 13/20
1000/1000 [==============================] - 18s 18ms/sample - loss: 0.0308 - binary_accuracy: 0.9592 - recall: 0.6123 - precision: 0.9226 - val_loss: 0.0300 - val_binary_accuracy: 0.9594 - val_recall: 0.5996 - val_precision: 0.9226
Epoch 14/20
1000/1000 [==============================] - 18s 18ms/sample - loss: 0.0283 - binary_accuracy: 0.9596 - recall: 0.6402 - precision: 0.9383 - val_loss: 0.0269 - val_binary_accuracy: 0.9603 - val_recall: 0.6119 - val_precision: 0.9553
Epoch 15/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0258 - binary_accuracy: 0.9600 - recall: 0.6598 - precision: 0.9533 - val_loss: 0.0291 - val_binary_accuracy: 0.9596 - val_recall: 0.6172 - val_precision: 0.9294
Epoch 16/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0256 - binary_accuracy: 0.9601 - recall: 0.6609 - precision: 0.9533 - val_loss: 0.0249 - val_binary_accuracy: 0.9601 - val_recall: 0.7022 - val_precision: 0.9524
Epoch 17/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0243 - binary_accuracy: 0.9603 - recall: 0.6760 - precision: 0.9623 - val_loss: 0.0238 - val_binary_accuracy: 0.9603 - val_recall: 0.7151 - val_precision: 0.9571
Epoch 18/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0230 - binary_accuracy: 0.9605 - recall: 0.6821 - precision: 0.9672 - val_loss: 0.0229 - val_binary_accuracy: 0.9606 - val_recall: 0.6724 - val_precision: 0.9634
Epoch 19/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0223 - binary_accuracy: 0.9605 - recall: 0.6864 - precision: 0.9696 - val_loss: 0.0235 - val_binary_accuracy: 0.9603 - val_recall: 0.7354 - val_precision: 0.9565
Epoch 20/20
1000/1000 [==============================] - 17s 17ms/sample - loss: 0.0223 - binary_accuracy: 0.9605 - recall: 0.6853 - precision: 0.9691 - val_loss: 0.0225 - val_binary_accuracy: 0.9605 - val_recall: 0.7063 - val_precision: 0.9638

Incredibly, this small modification to our model has allowed us to gain 10 percentage points in recall! The training process also takes about half the time.
Let’s see how that looks by displaying the examples we checked earlier.

test_y_predicted = model.predict(test_x)

from simple_deep_learning.mnist_extended.semantic_segmentation import display_segmented_image

np.random.seed(6)
for _ in range(3):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    i = np.random.randint(len(test_y_predicted))
    print(f'Example {i}')
    display_grayscale_array(test_x[i], ax=ax1, title='Input image')
    display_segmented_image(test_y_predicted[i], ax=ax2, title='Segmented image')
    plot_class_masks(test_y[i], test_y_predicted[i], title='y target and y predicted sliced along the channel axis')

Example 138

Example 106

Example 109

The difference is huge: the model no longer gets confused between the 1 and the 0 (example 117) and the segmentation looks almost perfect.

Conclusion

State of the art models for semantic segmentation are far more complicated than what we’ve seen so far. What we’ve created isn’t going to get us on the leaderboard of any semantic segmentation competition… However, hopefully you’ve understood that the core concepts behind semantic segmentation are actually very simple. This post is just an introduction. I hope your journey won’t end here and that I have encouraged you to experiment with your own modelling ideas. Perhaps you could look at the concepts behind state of the art semantic segmentation models and try to implement them yourself on this simple dataset. A good starting point is this great article that provides an explanation of more advanced ideas in semantic segmentation.

I hope you enjoyed reading this post. If you have any questions or have done something cool with this dataset that you would like to share, comment below or reach out to me on LinkedIn. I love hearing from you.

Have a great day,
Luke

The optimal python project structure

2020-03-21T14:26:22+00:00

In this post, I will describe a python project structure that I have found extremely useful over a wide variety of projects. We’re going to build this structure from the ground up so that you can better understand the ideas that have led me to this optimal layout. In this post, I will only include what I consider to be absolutely necessary for any python project. You can find the full project structure in my github repository.

Let’s start with a directory called “my_project”. This is the project directory. As you can see, it’s totally empty.

The project vs the package

The python project is everything in the base directory. All files related to your python application will be in the project directory.

The package, on the other hand, is a subdirectory inside the project with the same name as the project itself. This package contains the source code of your application. The reason for having this package directory is to separate source code from other files. When we pip install our project, we will tell pip to only include the files contained in the package directory.

Many people get confused by the distinction between project and package. As a reminder: the project goes to source control, the package gets installed.

Since the source code is the most important part of our project, let’s start by adding a package to our project structure.

Why __init__.py?

The package needs to contain at least an __init__.py file. This tells python that this directory is indeed a package. When python loads this package, it automatically runs the __init__.py. Therefore, it can be useful to include initialisation steps for the package. In all my projects, I add at least two things to this __init__.py:

A variable called ROOT_DIR containing the absolute path to the location of the package. I have always found it useful for a package to know its absolute location. For instance, this can be used to load non-source files contained in the package. Relying on the current working directory is not a good idea. Anyone can modify the current working directory, and your application won’t always be launched from the same location.
The configuration of the logger for my package. Python logging is a broad topic and a post for another time. It’s enough to say that the logger for a package must be initialised once for the whole package. It therefore makes sense to have it inside the __init__.py.

Here is an example of what my __init__.py might look like:

from os.path import dirname, abspath

ROOT_DIR = dirname(abspath(__file__))

###Logging initialisation code would go here###
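As a rough sketch of how ROOT_DIR might be used, the helper below builds the path to a non-source file shipped inside the package. The helper name `resource_path`, the example install location and the file `data/settings.json` are all hypothetical. In a real package you would pass in the ROOT_DIR defined in the __init__.py.

```python
import os

def resource_path(root_dir: str, *parts: str) -> str:
    """Build the path to a file stored inside the package directory."""
    return os.path.join(root_dir, *parts)

# Hypothetical install location, standing in for my_project.ROOT_DIR.
example_root = os.path.join(os.sep, 'opt', 'my_project', 'my_project')

settings_path = resource_path(example_root, 'data', 'settings.json')
print(settings_path)
```

Because the path is derived from the package location, it does not change when the application is launched from a different working directory.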

Our project now looks like this:

Helping git with a .gitignore

Any software project should be version controlled. By far the most widely used version control system is Git. A .gitignore file is a text file describing some files that should not be included in version control. There are many reasons for not wanting to source control certain files. For instance, the file could contain sensitive data such as passwords. You might also want to exclude large data files such as images.

Some examples of .gitignore files can be found in this github repository.
I usually start with a small number of files and build gradually. My basic .gitignore might resemble this:

# Pycache and compiled python files.
__pycache__/
*.py[cod]

# Jupyter notebook checkpoints
.ipynb_checkpoints

#Egg info files produced by pip installation
*egg-info

Helping yourself and others with a README.md

A readme is a documentation file often written in markdown.

It contains useful information so that others can understand what your project is about. A readme should at least contain a simple description of your project and instructions for how to install and use the package.

# My Project

A simple project providing a useful base structure for python packages.

## Installation
To install this project's package run:

pip install /path/to/my_project


To install the package in editable mode, use:

pip install --editable /path/to/my_project

Helping pip with a setup.py

A setup.py is a python file that contains information about the package you are installing.

A very minimal setup.py contains the following :

import setuptools

setuptools.setup(name='my_project', packages=['my_project'])

Here we are saying that the name of our package should be “my_project”. This name will be used in the package metadata stored by pip. The “packages” parameter takes the name of the package directory to install. Previously, I said only the package part of our project structure would be installed; this “packages” parameter is why.

At this point, your project contains a perfectly valid, pip installable package. The huge advantages that come with having a pip installable package might not be immediately apparent. But trust me, making a project pip installable isn’t just about shipping your application to other users. It’s also extremely useful for development reasons. But that’s a discussion for another post.

Tracking requirements with requirements.txt

As your project grows, it will likely include more and more dependencies. A good way of tracking dependencies is through a requirements.txt.

This requirements.txt contains all the packages that your project needs and that are not part of the standard library. We will use this requirements.txt to make pip automatically download and install requirements for us.

Let’s make a very simple requirements.txt and add numpy to it (numpy is a package for scientific computing in python).

numpy==1.18.2

This file alone is not very useful: we need to tell pip about these requirements. For that, we must slightly update our setup.py.

import setuptools

with open('requirements.txt', 'r') as f:
    install_requires = f.read().splitlines()

setuptools.setup(name='my_project',
                 packages=['my_project'],
                 install_requires=install_requires)

As you can see, we have added the packages contained in the requirements.txt to our package setup. Therefore, when pip installs our package, it will search for that version of numpy. If it does not find it, it will download it for us.
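One caveat with reading the file line by line is that a requirements.txt may also contain blank lines, comments or pip options, none of which are valid entries for install_requires. A slightly more defensive sketch is shown below. The helper name `parse_requirements` and the example content (including pandas) are only illustrative.

```python
def parse_requirements(text):
    """Return the requirement lines of a requirements.txt as a list.

    Blank lines, comment lines and pip option lines (such as '-r other.txt')
    are skipped. Trailing inline comments are not handled in this sketch.
    """
    requirements = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#') or line.startswith('-'):
            continue
        requirements.append(line)
    return requirements

example = """\
# scientific computing
numpy==1.18.2

-r other-requirements.txt
pandas>=1.0
"""

install_requires = parse_requirements(example)
print(install_requires)  # ['numpy==1.18.2', 'pandas>=1.0']
```

In a setup.py, the text would come from reading requirements.txt, and the resulting list would be passed to the install_requires argument as before.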

There are many more things to say about package dependencies and requirements.txt. For instance, your requirements.txt does not have to specify exact versions, and there are tools for automatically generating requirements files based on source code (e.g. pipreqs).

Package installation and dependency management can be hugely improved through the use of an environment manager. Check out my article on python anaconda for more information about using package managers.

The License

Including a license in your python project structure is important, especially if it is going to be deployed publicly. Others should know what they are entitled to do with your software, or it might prevent them from using it. If you don’t know which license to choose, take a look at choosealicense.com.

Last but not least: tests

Testing is often overlooked in software development. In this post I will not go into the details of testing. However, I always keep my tests in a separate directory from the package source code.

Conclusion

I have just shown you what I consider to be the absolute minimal python project structure. There are many benefits to keeping your python projects well structured. Whether it’s for professional or personal projects, a systematic approach to organising your code will speed up development and bring clarity to your work.

Let’s take a look at our final python project structure:

There are many ways to extend this basic structure. This sample project is stored on GitHub, so don’t hesitate to fork or clone the repository and play around with it yourself. For potential ideas on extending your base python project, take a look at this repository from Neuraxio. It contains a more detailed version of a setup.py and a small test example.

This post is included in a series on python development fundamentals. Please check out the other posts in the series for more information.

I hope you enjoyed reading. Have a great day.

Python logging – A practical guide

2020-03-20T08:00:00+00:00

Python logging isn’t easy. When I was learning python, I made many attempts to use logging in my applications. Usually I would end up frustrated and thinking that setting everything up correctly wasn’t worth the hassle. It’s only when I started building larger applications and logging became a necessity that I finally figured out what was going on. Learning to use python logging is rather like learning to ride a bicycle. It’s difficult to start with, but once it clicks, it’s something you’ll never forget.

This post is a step-by-step guide into the world of python logging. By the end of this guide, you should also have had that “click” moment and will be able to use python logging effectively in your application development.

This article exists in notebook format in my github repository. I highly recommend you fork or clone this repository so that you can run the code and experiment with it yourself. However, this is not a requirement to understand the content of the article.

A simple logger

This is the most basic example of a fully functional logger. There are two main components to this logging example.

A Logger object.
A Handler object.

import logging
import sys

logger = logging.getLogger('package') # Create logger with name 'package'
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)
logger.warning('This is a warning message to stdout')

This is a warning message to stdout

The logger is the primary interface for creating log messages.
We created the logger using the logging.getLogger(name) method and gave the logger the name ‘package’.

We then asked the logger to log a message by calling logger.warning. When this method is called, the logger creates a warning message (also called a LogRecord). However, on its own, the logger cannot deliver the log message. For that it needs a handler.

Handlers are specific objects whose role is to take the log message created by the logger and deliver it to wherever it needs to go. There are many types of handlers. In this example, we used a StreamHandler whose role is to pass the log message to STDOUT, which is displayed by the console. For other types of handlers, check the official documentation on handlers.

Last resort

Wait a minute… I just said that a logger needed a handler to deliver messages. In the following example, I can clearly display this message in STDERR (jupyter notebook displays STDERR in red) without using a handler. So what’s going on?

import logging

logger = logging.getLogger('package') # Create logger with name 'package'
logger.warning('This is a warning message')

This is a warning message

Since python 3.2, the logging module has something called logging.lastResort. This is a StreamHandler delivering to STDERR. If no Handlers are found to deliver a message, this lastResort Handler is used. We can check out a few details of this lastResort Handler by printing it. As I said, this does seem to be a StreamHandler pointing to STDERR. But what does this WARNING mean? Let’s talk about log levels.

print(logging.lastResort)

<_StderrHandler stderr (WARNING)>

Log levels

The logging module isn’t just capable of logging warning messages. By default, there are 4 other types of log messages that can be produced. According to the python docs, these are the functions of each level:

DEBUG: Detailed information, typically of interest only when diagnosing problems.
INFO: Confirmation that things are working as expected.
WARNING: An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected.
ERROR: Due to a more serious problem, the software has not been able to perform some function.
CRITICAL: A serious error, indicating that the program itself may be unable to continue running.
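Under the hood, each of these levels is just an integer, and a higher number means a more severe message. The short snippet below prints the value behind each of the five names.

```python
import logging

levels = [logging.DEBUG, logging.INFO, logging.WARNING,
          logging.ERROR, logging.CRITICAL]

# getLevelName maps a numeric level back to its name.
names = [logging.getLevelName(level) for level in levels]

for name, level in zip(names, levels):
    print(f'{name:<8} = {level}')
# DEBUG    = 10
# INFO     = 20
# WARNING  = 30
# ERROR    = 40
# CRITICAL = 50
```

These are the numbers you will see whenever a level is printed, for instance 30 for WARNING.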

Loggers can be configured to only accept messages of a certain severity, meaning that anything less important will be discarded. The lastResort Handler has a level of WARNING meaning that anything below (DEBUG, INFO) will be discarded. Let’s check this out.

import logging

logger = logging.getLogger('package') # Create logger with name 'package'
logger.debug('This will not be displayed.')
logger.info('Neither will this.')
logger.warning('This will be displayed!')

This will be displayed!

By default, loggers also have an effective level of WARNING, meaning that even if we gave our logger a Handler, we would have to reset the logger level to have lower-severity logs displayed.

import logging
import sys

logger = logging.getLogger('package') # Create logger with name 'package'

print(f"This logger's level is: {logger.getEffectiveLevel()}") # 30 means WARNING

handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)

logger.debug('This will not be displayed.')

logger.setLevel(logging.DEBUG)
print(f"This logger's level is: {logger.getEffectiveLevel()}") # 10 means DEBUG

logger.debug('This will be displayed!')

This logger's level is: 30
This logger's level is: 10
This will be displayed!
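Handlers have a level of their own, just like the lastResort Handler we printed earlier. A message must pass the logger's level first and the handler's level second. The sketch below illustrates this with an in-memory stream standing in for the console. The logger name 'level_demo' is arbitrary.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console

logger = logging.getLogger('level_demo')
logger.setLevel(logging.DEBUG)   # the logger accepts everything
logger.propagate = False         # keep this demo isolated from other handlers

handler = logging.StreamHandler(stream=stream)
handler.setLevel(logging.ERROR)  # the handler only delivers ERROR and above
logger.addHandler(handler)

logger.warning('Accepted by the logger, dropped by the handler.')
logger.error('Accepted by both.')

output = stream.getvalue()
print(output)  # Accepted by both.
```

This makes it possible to attach several handlers with different levels to the same logger.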

basicConfig and the root logger

So far we have been creating our own logger object called ‘package’. But the logging module contains a default logger called the “root” logger. This logger can be retrieved simply with getLogger() (without the name parameter). Let’s examine this root logger.

import logging

logger = logging.getLogger()

print(f"Root logger's handlers: {logger.handlers}")
print(f"Root logger's level: {logger.getEffectiveLevel()}")

Root logger's handlers: []
Root logger's level: 30

Nothing surprising so far. The root logger looks just like any other logger: no Handlers and a default level of WARNING. But there are a few things that differentiate the root logger from the other loggers. Look at this:

logging.warning("This will be displayed. But what's this formatting!?")

WARNING:root:This will be displayed. But what's this formatting!?

With any other logger, if no Handlers are available to deliver the message, then the lastResort Handler will be used. The module-level functions such as logging.warning(), which log on the root logger, behave differently: if the root logger has no Handlers when one of them is called, logging.basicConfig() is called first. This function adds a default StreamHandler to the root logger and associates it with what is called a Formatter. We’ll talk more about formatters in a second, but let’s look into what has just happened.

print(f"Root logger's handlers: {logger.handlers}")
[handler] = logger.handlers
print(f"Handler's formatters: {handler.formatter}")

Root logger's handlers: [<StreamHandler stderr (NOTSET)>]
Handler's formatters: <logging.Formatter object at 0x7f8b0bf94310>

A handler does indeed appear to have been automatically added to the root logger.
This makes sense: the logging module is basically making it easy for anyone to write a simple, formatted log without having to understand or know about handlers or formatters. You can call the log functions directly from the logging module itself, without even retrieving a reference to the root logger!

logging.warning('A log created directly on the root logger!')

WARNING:root:A log created directly on the root logger!
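basicConfig can also be called explicitly, which gives some control over the root logger without creating handlers by hand. The sketch below is one possible configuration. The in-memory stream and the format string are choices made for this illustration, and the force argument (available since python 3.8) is used so that any handlers already on the root logger are replaced.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console

logging.basicConfig(stream=stream,
                    level=logging.INFO,
                    format='%(levelname)s - %(message)s',
                    force=True)  # python 3.8+: replace existing root handlers

logging.debug('Below the configured level, so dropped.')
logging.info('Configured explicitly.')

output = stream.getvalue()
print(output)  # INFO - Configured explicitly.
```

Without force=True, basicConfig does nothing if the root logger already has a handler.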

Hopefully from the previous examples you have a good understanding of loggers and handlers. Creating loggers with different names allows you to add different handlers to your loggers and therefore deal with logging messages differently. For instance, you can attach a StreamHandler to one logger, and a FileHandler to another so that some messages are sent to the console and others are sent to a file!
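Here is a small sketch of that idea, with one logger writing to a stream and another writing to a file. The logger names, the temporary directory and the file name 'demo.log' are arbitrary choices for this example, and an in-memory stream stands in for the console.

```python
import io
import logging
import os
import tempfile

# Logger 1: delivers to a stream.
console_stream = io.StringIO()
console_logger = logging.getLogger('demo_console')
console_logger.propagate = False
console_logger.addHandler(logging.StreamHandler(stream=console_stream))

# Logger 2: delivers to a file in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), 'demo.log')
file_handler = logging.FileHandler(log_path)
file_logger = logging.getLogger('demo_file')
file_logger.propagate = False
file_logger.addHandler(file_handler)

console_logger.warning('Sent to the console.')
file_logger.warning('Sent to the file.')

file_handler.close()  # release the file before reading it back
with open(log_path) as f:
    file_contents = f.read()

print(console_stream.getvalue())  # Sent to the console.
print(file_contents)              # Sent to the file.
```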

Now let’s take a look at Formatters and how you can add formatters to handlers to modify the default message format.

Formatters

The default formatting for log messages (as seen with the StreamHandler in the first example) is to simply display the message. We have also seen how calling logging.basicConfig() adds a formatter to the root logger’s handler, which generates logs in the following format:
LOG_LEVEL:LOGGER_NAME:LOG_MESSAGE

Let’s extend our first example by adding our own formatter.
Let’s add two handlers to our logger, one with a formatter and another without.

import logging
import sys

logger = logging.getLogger('package') # Create logger with name 'package'

handler_without_formatter = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler_without_formatter)

handler_with_formatter = logging.StreamHandler(stream=sys.stdout)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler_with_formatter.setFormatter(formatter)
logger.addHandler(handler_with_formatter)

logger.warning('This message will be displayed with and without formatting!')

from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

This message will be displayed with and without formatting!
2020-03-29 16:30:36,910 - package - WARNING - This message will be displayed with and without formatting!

Our format displays some very useful information: the time at which the log was created, the name of the logger that was called, the level (i.e. severity) of the log message and the message itself. This information makes our log messages a lot more interpretable.

And that’s really all there is to know about formatting. If you want more information about which attributes can be displayed in the log message, take a look at the official documentation. In particular the list of LogRecord attributes describes how you can generate your own formatting string.
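As one more illustration of LogRecord attributes, the sketch below uses funcName, which holds the name of the function that issued the log call. The logger name 'format_demo' and the function `do_work` are made up for the example, and an in-memory stream stands in for the console.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console

handler = logging.StreamHandler(stream=stream)
handler.setFormatter(
    logging.Formatter('%(levelname)s | %(name)s | %(funcName)s | %(message)s'))

logger = logging.getLogger('format_demo')
logger.propagate = False
logger.addHandler(handler)

def do_work():
    # funcName in the formatted log will be the name of this function.
    logger.warning('disk space low')

do_work()

formatted = stream.getvalue()
print(formatted)  # WARNING | format_demo | do_work | disk space low
```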

Python logging is powerful: Filters

We’ve seen how we can create an arbitrary number of loggers, attach any number of handlers to these loggers and format each message individually using formatters. That’s already a lot! What more might you want? Python provides even more flexibility with filters. Filters are objects that can be attached to loggers or handlers and determine whether a message gets processed or dropped. You can see levels as a sort of automatically applied filter. If the level of the log is at or above the level of the logger (or handler) it gets processed, otherwise it’s dropped.

Since python 3.2, the simplest way to create a filter is with a function. It must take a LogRecord as input and return zero if the LogRecord is to be dropped, and non-zero otherwise. Let’s go back to our simple example. I have added type hinting to the function to emphasise the input and output of a valid filter function.

import logging
from logging import LogRecord
import sys

def gandalf_filter(log_record: LogRecord) -> int:
    """A simple filter.

    This filter drops logs that contain
    the string 'you shall not pass' in their message.
    """

    if 'you shall not pass' in log_record.msg:
        return 0
    else:
        return 1


logger = logging.getLogger('package') # Create logger with name 'package'
handler = logging.StreamHandler(stream=sys.stdout)
handler.addFilter(gandalf_filter)
logger.addHandler(handler)
logger.warning('This message will pass!')
logger.warning('This message will not pass: you shall not pass')

This message will pass!
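One subtlety: log_record.msg is the raw message passed to the logging call, before any arguments are merged into it. If your code logs with arguments, such as logger.warning('%s shall not pass', 'you'), a filter based on msg would miss the match. A variant using LogRecord.getMessage(), which returns the final text, could look like the sketch below (the logger name is arbitrary and an in-memory stream stands in for the console).

```python
import io
import logging

def gandalf_filter(log_record: logging.LogRecord) -> bool:
    """Drop logs whose final text contains 'you shall not pass'."""
    # getMessage() merges the arguments into the message before we inspect it.
    return 'you shall not pass' not in log_record.getMessage()

stream = io.StringIO()  # in-memory stand-in for the console
handler = logging.StreamHandler(stream=stream)
handler.addFilter(gandalf_filter)

logger = logging.getLogger('filter_demo')
logger.propagate = False
logger.addHandler(handler)

logger.warning('This message will pass!')
logger.warning('%s shall not pass', 'you')  # dropped once the argument is merged

output = stream.getvalue()
print(output)  # This message will pass!
```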

You can pretty much do whatever you want with the LogRecord, you can even modify it in place to change the message.

import logging
from logging import LogRecord
import sys

def gandalf_the_white_filter(log_record: LogRecord):
    """A simple filter.

    Adds ' the white' to a log message if it ends with gandalf.
    """

    if log_record.msg.endswith('gandalf'):
        log_record.msg = log_record.msg + ' the white'

    return 1

logger = logging.getLogger('package') # Create logger with name 'package'
handler = logging.StreamHandler(stream=sys.stdout)
handler.addFilter(gandalf_the_white_filter)
logger.addHandler(handler)
logger.warning('I am gandalf')
logger.warning('You shall not pass!')

I am gandalf the white
You shall not pass!
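Modifying the record is not limited to the message. A filter can also attach a new attribute to the LogRecord, which a formatter on the same handler can then display. In the sketch below, the attribute name 'realm' and its value are invented for the example.

```python
import io
import logging

def add_realm(log_record: logging.LogRecord) -> bool:
    """Attach a custom attribute to every record and let it through."""
    log_record.realm = 'middle-earth'  # hypothetical contextual information
    return True

stream = io.StringIO()  # in-memory stand-in for the console
handler = logging.StreamHandler(stream=stream)
handler.addFilter(add_realm)
# The handler's filters run before formatting, so %(realm)s is available here.
handler.setFormatter(logging.Formatter('[%(realm)s] %(message)s'))

logger = logging.getLogger('context_demo')
logger.propagate = False
logger.addHandler(handler)

logger.warning('I am gandalf')

output = stream.getvalue()
print(output)  # [middle-earth] I am gandalf
```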

You are now familiar with the 4 main components of python logging:

Loggers
Handlers
Formatters
Filters

You should be able to set up some pretty fancy logging with all that knowledge. There’s one last thing I would like to tell you about: logger hierarchy.

Logger hierarchy in python logging

There’s something I have been hiding from you. Python loggers are not independent. They are organised in a hierarchy with the root logger at the top. This raises a couple of questions…

How do you create a hierarchy?

Firstly, all loggers are children of the root logger.
Secondly, hierarchy is specified by a dot-separated naming convention.
For instance, a logger named ‘package1’ is a child of the root logger and a logger named ‘package1.module1’ is a child of the ‘package1’ logger and the root logger.

The python logging hierarchy
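The hierarchy can be inspected directly. In the sketch below, the parent attribute of each logger points one step up the chain, and getChild builds the dotted name for you. The parent logger is created first here so that the chain is complete.

```python
import logging

root_logger = logging.getLogger()
package_logger = logging.getLogger('package1')
module_logger = logging.getLogger('package1.module1')

print(module_logger.parent is package_logger)  # True
print(package_logger.parent is root_logger)    # True

# getChild('module1') on 'package1' returns the same 'package1.module1' logger.
print(package_logger.getChild('module1') is module_logger)  # True
```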

What does this hierarchy do?

This hierarchy has several effects. When a child logger creates a log message (and that message passes the filters of that logger) all handlers of parent loggers of that child will receive this message. Let’s take a look at an example:

import logging
import sys

handler = logging.StreamHandler(stream=sys.stdout)

logger = logging.getLogger('package') # Create logger with name 'package'
logger.addHandler(handler)

child_logger = logging.getLogger('package.module')
child_logger.addHandler(handler)

child_logger.warning('This will be printed by the parent and child handlers.')

from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

This will be printed by the parent and child handlers.
This will be printed by the parent and child handlers.

It is possible to disable the propagation of LogRecords to the parent’s handlers by setting “logger.propagate = False” on the logger:

import logging
import sys

handler = logging.StreamHandler(stream=sys.stdout)

logger = logging.getLogger('package') # Create logger with name 'package'
logger.addHandler(handler)

child_logger = logging.getLogger('package.module')
child_logger.addHandler(handler)

child_logger.propagate = False # Prevent propagation of the LogRecord to parent handlers.

child_logger.warning('This will only be printed by the child logger.')

This will only be printed by the child logger.

The python documentation contains an excellent diagram explaining how a LogRecord gets created by a logger and then passed to all parents. I will display that diagram here, but the original can be found in the official python logging documentation.

Flow diagram of python logging

The second major effect of this hierarchy is that the level of a logger can be determined by its parents. For instance, if a child logger does not have a level set (logging.NOTSET), then the logging module will move up the chain of parent loggers until it finds a level that is set and will use that level. This level is called the “effective level” for that logger. Let’s take a look:

import logging
import sys

logger = logging.getLogger('package') # Create logger with name 'package'
root_logger = logging.getLogger()

print(f"The logger's level is: {logger.level}") # 0 == logging.NOTSET
print(f"The root logger's level is: {root_logger.level}") # 30 == logging.WARNING

print(f"This logger's effective level is: {logger.getEffectiveLevel()}") # 30 == logging.WARNING


The logger's level is: 0
The root logger's level is: 30
This logger's effective level is: 30

As you can see, the level of the logger is NOTSET, hence the logging module will search the parent loggers until it finds one with a level different from NOTSET. The only parent of ‘package’ is the root logger and we know that this logger has a log level of WARNING by default. Therefore, the logger’s effective level is 30. It’s the effective level (not the logger.level attribute) that determines whether a log message gets passed to the handlers.
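The same lookup applies to any parent, not only the root logger. Here is a minimal sketch, assuming two loggers with the hypothetical names 'demo_package' and 'demo_package.module' (they are not part of the examples above). Setting a level on the parent alone is enough to change the effective level of the child:

```python
import logging

# Hypothetical logger names, used only for this illustration.
parent = logging.getLogger('demo_package')
child = logging.getLogger('demo_package.module')

parent.setLevel(logging.DEBUG)  # Set an explicit level on the parent only.

print(f"The child's level is: {child.level}")  # Still logging.NOTSET
print(f"The child's effective level is: {child.getEffectiveLevel()}")  # Inherited from the parent
print(f"DEBUG enabled for the child: {child.isEnabledFor(logging.DEBUG)}")
```

The child never had its own level set, so the first parent with an explicit level decides which of its messages get through.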

The LogRecord being passed to handlers and the effective level of loggers are the main effects of the logging hierarchy. The logging hierarchy is useful for not having to add handlers to every single logger you create. You might be wondering what’s the point in creating child loggers if the messages get passed to the parent handlers anyway. That’s a great question, and to understand, you need to know how naming is generally used in logging for large python applications.

In my graph of “logging hierarchy” you might have noticed that I named the loggers “package” and “module”. This is because, when developing packages, it’s best practice to create a logger for every module and for each logger to be named according to its location in the package. This becomes very useful when you want to find out about the provenance of logging messages because formatters can include the name of the logger in the log message itself. Another advantage of naming loggers after your modules and packages is that it prevents collisions with loggers from other packages.
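To illustrate how the logger name shows up in the output, here is a small sketch. The package name 'mypackage' and the module path 'mypackage.io.reader' are made up for the example, and the handler writes to an in-memory buffer rather than STDOUT only so that the result is easy to inspect. In a real module you would typically pass `__name__` to getLogger instead of a hard-coded string.

```python
import io
import logging

buffer = io.StringIO()  # In-memory stream so the output can be inspected below.
handler = logging.StreamHandler(stream=buffer)
# %(name)s is replaced by the name of the logger that created the record.
handler.setFormatter(logging.Formatter('%(name)s - %(levelname)s - %(message)s'))

# The handler is attached once, at the top of the (hypothetical) package.
package_logger = logging.getLogger('mypackage')
package_logger.addHandler(handler)

# A module-level logger, named after its location in the package.
module_logger = logging.getLogger('mypackage.io.reader')
module_logger.warning('File not found')

print(buffer.getvalue())
```

The record is created by the module logger and handled by the package handler, yet the formatted message still carries the full name of the module it came from.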

hint: You can use the variable __loader__.name to automatically retrieve the full path to the module in a package. Contrary to __name__ which gets given the value ‘__main__’ when the module is used as an entrypoint, __loader__.name will contain the full module name even if it’s the entry point of the application.

Application logging vs library logging

You now know almost everything you need to build the most amazing logging for your applications.
However, so far I have mainly described the python logging module and its functionality. I have not talked about which logging functionality should be used when.

When deciding how to write your logs, you should ask yourself the question:

Am I writing a library?
Am I writing an application?

A library is intended to be used by other software developers as a foundation. An application is intended to be used by yourself or your clients directly.

Logging for a library

When developing a library, it is recommended not to add handlers to your loggers. Instead, just let the LogRecords flow to the root logger and if the user of your library wants to do something with your logs, they can create the handler and formatters they want. Even better is to add a logging.NullHandler() to the base of your logging hierarchy (i.e getLogger(‘package’)). That way, if the user of your library does not configure logging for their application, your messages won’t be displayed by the lastResort handler. This is described in more detail in the official documentation on configuring logging for a library.
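As a rough sketch, this is what it could look like for a library with the hypothetical name 'mylibrary'. The first two statements would usually live in the top-level `__init__.py` of the package:

```python
import logging

# In the top-level __init__.py of a hypothetical library called 'mylibrary'.
library_logger = logging.getLogger('mylibrary')
library_logger.addHandler(logging.NullHandler())  # Does nothing with the records it receives.

# Somewhere in a module of the library.
module_logger = logging.getLogger('mylibrary.module')
module_logger.warning('Not displayed unless the application configures logging.')
```

Because a handler is now found in the hierarchy, the lastResort handler is not used. If the application adds its own handlers (to the root logger for instance), the library's messages will still reach them through propagation.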

Logging for an application

Your application is part of a larger multiprocess application:
The best practice for logging within a large application is to only send messages to STDOUT (i.e. only use logging.StreamHandler(stream=sys.stdout)). By doing that, you are essentially delegating the role of storing logs to your application’s environment. Imagine if your application is one process in a larger multiprocess application with each sending its logs somewhere different. It would be a nightmare for the application environment to compile all the logs and analyse them together. If all processes send their logs to STDOUT, the environment knows where everything is and can decide what to do with the STDOUT stream.
Your application is standalone:
If your application is standalone, and not being managed by some external process, then it’s OK to be handling the output of your log yourself. In that case, you can configure logging for your application with a configuration file or directly in code as we have been doing so far.
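For the STDOUT-only case, a minimal configuration in code could look like the sketch below. The function name configure_logging, the format string and the logger name 'app.worker' are my own choices for the illustration, not a prescribed setup:

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Send all log records to STDOUT through a single handler on the root logger."""
    handler = logging.StreamHandler(stream=sys.stdout)
    handler.setFormatter(
        logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(level)
    return handler


# Called once, at the entry point of the application.
stdout_handler = configure_logging()

# Loggers anywhere in the application need no handlers of their own.
logging.getLogger('app.worker').info('Worker started')
```

Attaching the handler to the root logger means every logger in the application, and in the libraries it uses, ends up in the same stream.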

Conclusion

If you’ve reached this point, well done and thank you! This guide has taken you through all the major components of python logging. We’ve even touched upon best practices when logging in large applications or libraries. The python logging module is extremely powerful. Good logging practices can help immensely with debugging and healthchecking your application, so go ahead and start experimenting.

If you’ve found the content of this article useful, I highly recommend checking out some of my other articles. This post is part of a series on python development fundamentals. If you’re wondering how to structure your python application, then take a look at my article on the optimal python project structure.

The 4 streams of deep learning and how to use them successfully

2020-02-16T17:05:05+00:00

Deep learning and more generally machine learning is a particular paradigm of programming. Rather than telling the computer explicitly each operation it must perform, you tell the computer how to learn the operations it must perform. Despite what some people might say, this is still programming, and many of the best coding practices still apply. Nonetheless, it’s a sufficiently significant change to lead otherwise successful software companies astray.

In this post, I will describe a framework for thinking about deep learning, and will use this term throughout the article, although many of the ideas also apply to machine learning more generally. This framework describes 4 streams of work related to deep learning. Many companies fail to derive value from deep learning because they generally focus on only one or two of the following streams.

This post is for anyone who works in or around deep learning. This framework is useful whether you’re a project manager, a data scientist, a machine learning engineer, or a team leader wondering why deep learning isn’t providing the expected return on investment.

The 4 streams of deep learning are:

Models
Model integration
Data
Infrastructure

If one of these streams is lacking in your deep learning strategy then you’re probably not as effective as you could be.

Models Stream

Models are the most obvious aspect of deep learning: they are the learning elements of the deep learning ecosystem. The models stream’s objective is to implement and train the deep learning algorithms to achieve good performance on a very specific task. For instance, if you were working in the models stream, your role would be to optimise some very specific performance metrics (precision/recall, IOU, RMSE) on a well defined test set. Other important elements of this stream include framing your problem in terms of input and output (or input and loss), selecting deep learning algorithms and model architecture, and hyperparameter tuning.

These are usually the skills taught in an introduction to deep learning course at university or online. Most people claiming to be machine learning engineers are competent in this area: they know how to select, train and evaluate models. Companies usually understand the importance of this stream and hire accordingly. However, the rise of frameworks such as Tensorflow and Pytorch, as well as pretrained models on large open source datasets have arguably made this stream less critical to the success of deep learning projects. Unless your team is working on a very specific supervised learning problem or using a less mature deep learning method such as reinforcement learning, your models stream is probably not your biggest weakness. This isn’t to say that there aren’t some very concrete challenges in modelling, but that’s a discussion for another time.

Key takeaway: A model that performs correctly on a well-defined dataset is a necessity. Nevertheless, huge advances in machine learning open source software packages and datasets have massively reduced the amount of R&D that you will need in this stream. Before spending all your resources for a minimal gain in performance, make sure you are up to speed in the other streams.

Model Integration Stream

The model integration stream’s objective is to take a model that performs correctly on a well defined deep learning dataset and make it useful in the real world. This stream is essential for companies and should be the core of any deep learning initiative. Despite the huge improvements of end-to-end models in areas such as natural language processing or computer vision, most deep learning models do not solve real world problems on their own. They must be integrated into larger applications composed of more traditional coding methods, such as signal processing, rule based decision making systems etc.

When evaluating the value of a deep learning model, many stop at its performance on the well defined labeled dataset that was used to train the model. Far more important is the value this model provides to the application in which it is integrated. This evaluation might be a lot less obvious. Imagine you work for a robotics company and have trained an object detection system. Your model is performing 3% better than your old one. This is probably huge from a modelling perspective, but how important is that for the robot as a whole?

Companies are more or less susceptible to overlooking the importance of the model integration stream depending on their type of business. Companies whose core business is not in machine or deep learning are very likely to overlook the importance of model integration. They might hire one or two data scientists who will focus on the specific task of getting a model to perform very well on a labeled dataset (i.e the models stream). However when it comes to deploying and integrating the model they will likely lack the software development skills, vision and support for their work to impact the business as a whole. On the other hand, companies that were founded on advances in deep learning are more likely to have a better idea of how the performance of individual models impacts the business as a whole. They might also be more technologically oriented and will have an easier time with the software aspect of the integration.

Key takeaway: Never forget that deep learning is simply a method to achieve a broader objective. If deep learning doesn’t help with your end goal or isn’t more effective than simpler, more traditional methods, it’s not worth it.

Data Stream

Data are likely the second most obvious aspect of deep learning. When people think of data in the context of deep learning, they usually imagine a static set of labeled data. For instance, they might think of an open source image dataset such as Coco or ImageNet, or a dataset with labels provided by a wearable device. They view the data as a fixed asset that does not evolve. Business objectives change, therefore for deep learning to promote business objectives, the data that supports it must be evolving and dynamic. Nowadays, no one would ever assume that a software application is a monolithic bloc that never changes. Data is the same.

For your data to be considered dynamic, it must be updated with feedback from the system in production. This allows you to constantly reevaluate the impact of the deep learning models on the end goal and answer questions such as: “Is the model still performing as expected? Or has there been a shift in the input data that is affecting performance?”. This feedback from the field is also likely to provide raw material for new training data. For instance, people using a face recognition app would constantly be providing raw images that can be labeled to produce more training data. This data, once labeled, might even become more valuable than the initial datasets since it represents the most recent data from the field.

Another characteristic of dynamic data is the ability to search and join it with other data sources to provide new views and perspectives. Valuable data has metadata, and is indexed to help with searching and joining.

Key takeaway: For data to be valuable, it must be dynamic. A labeled dataset with little in common with the real world is only useful to bootstrap model development. Make sure you are collecting and using data from your production system to better understand how deep learning is helping and how your models can be improved. Collecting data alone is not enough: the data must be indexed and searchable to provide value.

Infrastructure Stream

The infrastructure stream must support the development of the models and data streams by providing adapted data storage and computing. In this context, infrastructure also includes software that supports data flows and the tasks that one might call data engineering. Depending on the type of business and data requirements, this might mean providing other streams with the correct cloud infrastructure, the necessary hardware (e.g. GPUs/TPUs) and data APIs. Embedded deep learning systems might add additional constraints.

Infrastructure is particularly relevant for the data stream. Data cannot be dynamic if it is stored in CSVs on a hard drive. For data to be dynamic it must be stored on systems providing fast search capabilities and easy access. The infrastructure stream must select databases adapted to the type of data and the right tools to query and search it.

For small deep learning projects, in which datasets are homogeneous and mostly static, infrastructure might not be a huge concern. But as the impact of deep learning grows in the organisation, the supporting infrastructure becomes fundamental. I have seen deep learning initiatives fall apart and creativity slow to a halt due to a lack of flexibility in how the data could be viewed and analysed. For deep learning solutions to emerge, there must be minimal friction between an idea and the dataset needed to test that idea.

Key takeaway: The infrastructure is the backbone of any large deep learning initiative. Without a well designed infrastructure, deep learning engineers lack the flexibility to test new ideas and be creative.

Conclusion

The importance of focusing on all four of the deep learning streams cannot be overstated. One of the most undeniable demonstrations of the importance of going beyond the models stream and its monolithic datasets, is provided by companies such as Google, Facebook and Microsoft. These companies have open sourced huge labeled deep learning datasets and made available incredibly powerful frameworks for developing neural networks. They would never have done such a thing if they thought open sourcing these resources was a threat to their business. These companies fundamentally believe that the value of deep learning comes not from the models themselves, but from the ability to integrate them into their core business, and to constantly improve these models with real world data.

I hope the 4 Streams of Deep Learning will help you plan your deep learning strategy. Deep learning is an extremely powerful tool. All you need is to know how to use it.

How to use anaconda python effectively

2019-10-20T19:36:35+00:00

In this article, I will talk about effective use of the Anaconda python distribution and all the tools that come with it. I will use a few commands and post screenshots for demonstration purposes. These will be from my Ubuntu machine, but all ideas and methods described in this article are also valid on Windows and macOS.

The python programming language is open source software and its license allows redistribution of the software by anyone. This means that when you install python, you must pick a specific distribution that you want to install. A commonly used distribution of python is Anaconda. Anaconda bundles together a version of python, a package manager, an environment manager and over 1500 python packages.

If you have not yet installed Anaconda, you can do so from the Anaconda website. If you do not want to install the 1500 packages that you might not need, you can install Miniconda. Miniconda is faster to download and it is easy to install the additional packages once they’re needed. From now on I will use Anaconda to refer to both Anaconda and Miniconda (I have Miniconda on my machine).

Conda environments

A conda environment is a location on your computer managed by the conda package manager.

For all practical purposes, a conda environment is two things:

A directory on your computer where software can be installed without conflicting with other installed software.
An activation step, which simply tells your terminal to search for software installed in that directory before searching anywhere else.

The environment directory

Let’s explore this in a bit more detail. Assuming conda is installed (e.g through Anaconda) and available from the command line, run the command:

conda info --envs

You will see a list of conda environments available. If you have not created any environments yourself, you will simply see the “base” environment.

List of conda environments

I have two environments, “base” and “standard”. Let’s create another one called “experiment”. You can do that by running:

conda create -n experiment

Now I have three environments:

New list of conda environments

As you can see, next to the environment name is the location of that environment. Conda will use this location for everything we install in this environment. Remember, the first thing I said you needed to understand about a conda environment is that it’s a simple directory. Navigate to that location, you will see everything that the environment contains. If you installed conda through Miniconda, your environment will be empty except for a “conda-meta” directory. If you installed through Anaconda, your environment will be full of things like /bin, /lib etc.

For those using Miniconda, your environment is pretty useless. Let’s install a few things in it. If you’re using Anaconda, you already have everything we’re going to install. Going through the process will be interesting nonetheless as it will help you understand how conda manages its environments. Now let’s install a package.

The activation step

But wait… how does conda know in which environment it should install the packages? Should it use “base” or “experiment”? This is where the second important part of the conda environments comes into play: the activation step. For an environment to be useful, your terminal has to know it exists, it has to be “activated”. Essentially, the activation step is the task of prepending the location of the environment onto the PATH environment variable so that the terminal will start by searching there, when looking for software. Activation is done with:

conda activate name_of_environment

The following screenshot shows how activating the experiment environment prepended its location to the PATH environment variable. You can also see how conda modified the command line by adding “(experiment)”. This is conda’s way of letting us know that this environment is now activated.

Effect of conda activate command (full path not visible)

Now that the environment is activated, conda knows which environment to use.
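To see why prepending to PATH is enough, here is a small python sketch that mimics the lookup done by the terminal. It does not use conda at all: the two temporary directories stand in for a system /bin and an environment /bin, and 'tool' is a made-up program name. I am assuming a Linux or macOS machine here, since the executable bit is what makes the file visible to the lookup.

```python
import os
import shutil
import stat
import tempfile


def make_fake_tool(directory, name='tool'):
    """Create a tiny executable file called `name` in `directory`."""
    path = os.path.join(directory, name)
    with open(path, 'w') as f:
        f.write('#!/bin/sh\n')
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


with tempfile.TemporaryDirectory() as system_bin, \
        tempfile.TemporaryDirectory() as env_bin:
    make_fake_tool(system_bin)
    make_fake_tool(env_bin)

    path_before = system_bin                         # PATH before "activation".
    path_after = env_bin + os.pathsep + path_before  # "Activation": prepend the environment.

    # shutil.which searches the directories of `path` in order, like a terminal would.
    found_before = shutil.which('tool', path=path_before)
    found_after = shutil.which('tool', path=path_after)

print(f"Before activation: {found_before}")
print(f"After activation:  {found_after}")
```

Both directories contain a program with the same name, but the one in the directory listed first wins. This is all that activation relies on.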

Package management with conda

Let’s install python. To do this, run the following command in your activated environment:

conda install python

You’ll see that your environment directory is no longer empty. For instance, it contains a /bin and /lib directory for executable programs and libraries respectively. For any Linux user, this will sound familiar. Conda is essentially creating a directory structure similar to the root directory which makes sense because it’s building an isolated environment.

Installing python has also installed pip, which is the python package manager. Anaconda will generally recommend using “conda install” when installing packages within a conda environment. I have found that a combination of “pip install” and conda commands can be an extremely powerful tool. Additionally, pip installation might be the only choice available when installing personal packages or packages not available through “conda install”.

Python package management with pip

That’s it for a basic description of the internals of conda environments. However, I would like to go a bit deeper. I would like to explore the use of pip to install python packages within a conda environment as this will be very useful for your python development.

To demonstrate this, we will install Numpy which is a very useful python package for manipulating arrays of data.

First we need to understand a tiny bit about how pip installs python packages. If you search for your python directory under the /lib directory of your conda environment you will see where the python is installed (e.g mine is called “lib/python3.7”). In that directory, you will find a directory called “site-packages”. This is where pip puts the python packages it installs.
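If you would rather not browse the directories by hand, the python interpreter can report these locations itself. The following sketch only uses the standard library. The variable names are mine, and the exact paths printed will depend on your machine and on which environment is activated:

```python
import sys
import sysconfig

environment_root = sys.prefix                     # Root directory of the running python installation.
python_executable = sys.executable                # The python executable that is currently running.
site_packages = sysconfig.get_paths()['purelib']  # Default install location for pure-python packages.

print(f"Environment root: {environment_root}")
print(f"Python executable: {python_executable}")
print(f"Package directory: {site_packages}")
```

When run with the python of an activated conda environment, I would expect the first path to be the environment location shown by “conda info --envs” and the last one to be its “site-packages” directory.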

If you installed Anaconda, Numpy will already be in there. If you installed Miniconda it won’t.

We’re going to use pip to install it and, as before, we have the question: how does pip know to install into the python site-packages of the correct environment? When we installed python into our environment, pip was installed into that environment as well and is now available in its /bin directory. Since the environment is activated, and the environment location was added to the PATH environment variable, the first pip found by the terminal is the one in that directory. Pip will therefore install numpy in the python installed in that environment. In Ubuntu (and macOS), you can check the correct pip is found by running (in Windows a similar command is “where”):

which pip

Now, let’s run:

pip install numpy

If you look into the site-packages directory, Numpy will be installed.

Conclusion

Conda environments provide a great way of installing multiple versions of the same software (e.g. python) on a single computer without the risk of conflicts. Hopefully this small article has given you practical information on how conda environments work. This should help you use conda more effectively in the future, especially for python development. I have not explored all the things you can do with conda. If you want more information, this documentation is excellent.

This article is part of a series on Python development fundamentals.

I hope you enjoyed the article. Have a great day.

Python development fundamentals

2019-10-20T19:36:25+00:00

Python is a wonderful language. It’s also incredibly popular in all areas of software development. This popularity is due, in part, to its ease of use and apparent simplicity. The “Hello World” of python is a one liner in a single file. No compilation, no complicated project structure, no boilerplate code. Just a simple print(“Hello World”) saved in hello_world.py.

This simplicity means that you can go a long way without having to think about things like project structure, testing or deployment. However, there comes a point when even python developers need structure. This series is about rigorous python development fundamentals. It will be composed of small, digestible chunks that can be read mostly independently but together will paint a broader picture of effective python development. I will not talk about the language itself, rather I will discuss things such as the use of virtual environments, the directory structure of a python project, making a project pip installable for reuse by yourself or others, and testing.

The following articles are part of this series (more are still to come):

Keras implementation of an encoder-decoder for time series prediction using architecture

2019-07-22T22:13:44+00:00

I created this post to share a flexible and reusable implementation of an encoder/decoder model for time series prediction using Keras.

I drew inspiration from two other posts:

“Sequence to Sequence (seq2seq) Recurrent Neural Network (RNN) for Time Series Prediction” by Guillaume Chevalier
“A ten-minute introduction to sequence-to-sequence learning in Keras” by François Chollet.

I strongly recommend visiting Guillaume’s repository for some great projects.
François Chollet is the primary author and currently the maintainer of Keras. His post presents an implementation of a seq2seq model for machine translation.

All the code for this post can be found on my github page, feel free to download and use it as you wish.

Context

Time series prediction is a widespread problem. Applications range from price and weather forecasting to biological signal prediction.

This post describes how to implement a Recurrent Neural Network (RNN) encoder-decoder for time series prediction using Keras. I will focus on the practical aspects of the implementation, rather than the theory underlying neural networks, though I will try to share some of the reasoning behind the ideas I present. I assume a basic understanding of how RNNs work. If you need to catch up, a good place to start is the classic “Understanding LSTM Networks” by Christopher Olah.

What is an encoder-decoder and why are they useful for time series prediction?

The simplest RNN architecture for time series prediction is a “many to one” implementation.

A “many to one” recurrent neural net takes as input a sequence and returns one value. For a more detailed description of the difference between many to one, many to many RNNs etc. have a look at this Stack Exchange answer.

How can a “many to one” neural network be used for time series prediction? A “many to one” RNN can be seen as a function f, that takes as input n steps of a time series, and outputs a value. An RNN can, for instance, be trained to take the past 4 values of a time series as input and output a prediction of the next value.
Let X be a time series and X_t the value of that time series at time t, then:

f(X_t-3, X_t-2, X_t-1, X_t) = Xpredicted_t+1

The function f is composed of 4 RNN cells and can be represented as follows:

Recurrent neural network with 4 cells

If more than one prediction is needed (which is often the case) then the value predicted can be used as input and a new prediction can be made. The following is a representation of 3 runs through an RNN model to produce predictions for 3 steps in the future.

f(X_t-2, X_t-1, X_t, Xpredicted_t+1) = Xpredicted_t+2

Many to one neural network

As you can see, the basis of the prediction model f is a single unit, the RNN cell, that takes as input X_t and the state of the network (not represented in these graphs for clarity) and outputs a single value (discarded unless all the input values have been input to the cell). The function f described above is evaluated by running the cell of the network 4 times, each time with a new input and the state output from the previous step.
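The repeated use of f to predict several steps into the future can be sketched in a few lines of plain python. Note that predict_next is only a hypothetical stand-in for a trained model (it simply averages its inputs, it is not an RNN). The point of the sketch is the loop that feeds each prediction back in as an input:

```python
def predict_next(window):
    """Hypothetical stand-in for the trained model f: a plain average, not an RNN."""
    return sum(window) / len(window)


def predict_steps(history, num_steps, window_size=4):
    """Predict `num_steps` future values by feeding predictions back as inputs."""
    window = list(history[-window_size:])
    predictions = []
    for _ in range(num_steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        # Drop the oldest value and append the prediction, as in
        # f(X_t-2, X_t-1, X_t, Xpredicted_t+1) = Xpredicted_t+2
        window = window[1:] + [next_value]
    return predictions


series = [1.0, 2.0, 3.0, 4.0]
predictions = predict_steps(series, num_steps=3)
print(predictions)  # [2.5, 2.875, 3.09375]
```

From the second step onwards, the inputs contain predicted values rather than observed ones, so any error made by the model is carried into the following predictions.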

Extending the many to one neural network

There are multiple reasons why this architecture might not be the best for time series prediction, compounding errors is one. However, in my opinion, there is a more important reason as to why it might not be the best method. In a time series prediction problem there are intuitively two distinct tasks. Human beings predicting a time series would proceed by looking at the known values of the past, and use their understanding of what happened in the past to predict the future values. These two tasks require two distinct skillsets:

The ability to look at the past values and create an idea of the state of the system in the present.
The ability to use that understanding of the state of the system to predict how the system will evolve in the future.

By using a single RNN cell in our model we are asking it to be capable of both memorising important events of the past and using these events to predict future values. This is the reasoning behind considering the encoder-decoder for time series prediction. Rather than having a single multi-tasking cell, the model will use two specialised cells. One for memorising important events of the past (encoder) and one for converting the important events into a prediction of the future (decoder).

Many to many neural network without encoder/decoder

Many to many neural network with encoder/decoder

This idea of having two cells (an encoder and a decoder) is used in other machine learning tasks, the most prominent being perhaps machine translation. In machine translation, the idea behind having two separate tasks is even clearer. Let’s say we’re creating a system that translates French to English. First we need an element (encoder) that is capable of understanding French. Its only task is to understand the input sentence and create a representation of what that sentence means. Then we need a second system (decoder) that is capable of converting a representation of the meaning of the French sentence to a sentence in English with the same meaning. Instead of having a super intelligent cell that can understand French and speak English, we can create two cells: the encoder understands French but cannot speak English and the decoder knows how to speak English but cannot understand French. By working together, these specialised cells outperform the super cell.

How to create an encoder-decoder for time series prediction in Keras?

Now that we have an explanation as to why an encoder-decoder might work, we are going to implement one.

We will be training our model on an artificially generated dataset. Our time series will be composed of the sum of 2 randomly generated sine waves (random amplitude, frequency, phase and offset). The idea to use such a dataset came from Guillaume Chevalier (link in the beginning of the notebook) although I rewrote his functions to suit my needs. The dataset generators will be imported from utils.py at the root of this repository. This code is python 3 compatible (some things won’t work in python 2).

Import modules/packages

Let’s start by importing some packages and defining a couple of utility functions.
The utility functions (random_sine and plot_prediction) are mostly unimportant for understanding the encoder/decoder. If you wish, you can jump to the next section.

import keras
import numpy as np
from matplotlib import pyplot as plt

# This section contains modified code licensed under the MIT License:
# Copyright (c) 2017 Guillaume Chevalier
# For more information, visit:
# https://github.com/guillaume-chevalier/seq2seq-signal-prediction
# https://github.com/guillaume-chevalier/seq2seq-signal-prediction/blob/master/LICENSE

"""Contains functions to generate artificial data for predictions as well as a function to plot predictions."""

def random_sine(batch_size, steps_per_epoch,
                input_sequence_length, target_sequence_length,
                min_frequency=0.1, max_frequency=10,
                min_amplitude=0.1, max_amplitude=1,
                min_offset=-0.5, max_offset=0.5,
                num_signals=3, seed=43):
    """Produce a batch of signals.
    The signals are the sum of randomly generated sine waves.
    Arguments
    ---------
    batch_size: Number of signals to produce.
    steps_per_epoch: Number of batches of size batch_size produced by the
        generator.
    input_sequence_length: Length of the input signals to produce.
    target_sequence_length: Length of the target signals to produce.
    min_frequency: Minimum frequency of the base signals that are summed.
    max_frequency: Maximum frequency of the base signals that are summed.
    min_amplitude: Minimum amplitude of the base signals that are summed.
    max_amplitude: Maximum amplitude of the base signals that are summed.
    min_offset: Minimum offset of the base signals that are summed.
    max_offset: Maximum offset of the base signals that are summed.
    num_signals: Number of signals that are summed together.
    seed: The seed used for generating random numbers
    
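    Yields
    ------
    ([encoder_input, decoder_input], decoder_output): encoder_input has shape
        (batch_size, input_sequence_length, 1); decoder_input (all zeros) and
        decoder_output have shape (batch_size, target_sequence_length, 1).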
    """
    num_points = input_sequence_length + target_sequence_length
    x = np.arange(num_points) * 2*np.pi/30

    while True:
        # Reset seed to obtain same sequences from epoch to epoch
        np.random.seed(seed)

        for _ in range(steps_per_epoch):
            signals = np.zeros((batch_size, num_points))
            for _ in range(num_signals):
                # Generate random amplitude, frequency, offset, phase
                amplitude = (np.random.rand(batch_size, 1) * 
                            (max_amplitude - min_amplitude) +
                             min_amplitude)
                frequency = (np.random.rand(batch_size, 1) * 
                            (max_frequency - min_frequency) + 
                             min_frequency)
                offset = (np.random.rand(batch_size, 1) * 
                         (max_offset - min_offset) + 
                          min_offset)
                phase = np.random.rand(batch_size, 1) * 2 * np.pi 
                         

                # Add the offset as well, otherwise it is generated but never used
                signals += amplitude * np.sin(frequency * x + phase) + offset
            signals = np.expand_dims(signals, axis=2)
            
            encoder_input = signals[:, :input_sequence_length, :]
            decoder_output = signals[:, input_sequence_length:, :]
            
            # The output of the generator must be ([encoder_input, decoder_input], [decoder_output])
            decoder_input = np.zeros((decoder_output.shape[0], decoder_output.shape[1], 1))
            yield ([encoder_input, decoder_input], decoder_output)

def plot_prediction(x, y_true, y_pred):
    """Plots the predictions.
    
    Arguments
    ---------
    x: Input sequence of shape (input_sequence_length,
        dimension_of_signal)
    y_true: True output sequence of shape (target_sequence_length,
        dimension_of_signal)
    y_pred: Predicted output sequence of shape (predicted_sequence_length,
        dimension_of_signal)
    """

    plt.figure(figsize=(12, 3))

    output_dim = x.shape[-1]
    for j in range(output_dim):
        past = x[:, j] 
        true = y_true[:, j]
        pred = y_pred[:, j]

        label1 = "Seen (past) values" if j==0 else "_nolegend_"
        label2 = "True future values" if j==0 else "_nolegend_"
        label3 = "Predictions" if j==0 else "_nolegend_"

        plt.plot(range(len(past)), past, "o--b",
                 label=label1)
        plt.plot(range(len(past),
                 len(true)+len(past)), true, "x--b", label=label2)
        plt.plot(range(len(past), len(pred)+len(past)), pred, "o--y",
                 label=label3)
    plt.legend(loc='best')
    plt.title("Predictions v.s. true values")
    plt.show()

if __name__ == '__main__':

    # This is an example of the plot function and the signal generator
    gen = random_sine(3, 3, 15, 15)
    for i, data in enumerate(gen):
        # The generator yields ([encoder_input, decoder_input], decoder_output)
        (input_seq, _), output_seq = data
        for j in range(input_seq.shape[0]):
            plot_prediction(input_seq[j, :, :],
                            output_seq[j, :, :],
                            output_seq[j, :, :])
        if i > 2:
            break
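To make the structure of a batch concrete, here is a minimal standalone sketch. The function toy_batch is a hypothetical, stripped-down stand-in for random_sine (a single sine wave per signal, no frequency or offset randomisation) and is not part of the original utilities. It only illustrates how a signal is split into the encoder input, the all-zero decoder input and the target.

```python
import numpy as np

def toy_batch(batch_size, input_len, target_len, seed=0):
    """Hypothetical simplified stand-in for random_sine (one sine per signal)."""
    rng = np.random.RandomState(seed)
    x = np.arange(input_len + target_len) * 2 * np.pi / 30
    amplitude = rng.rand(batch_size, 1)
    phase = rng.rand(batch_size, 1) * 2 * np.pi
    signals = amplitude * np.sin(x + phase)        # shape (batch, time)
    signals = np.expand_dims(signals, axis=2)      # shape (batch, time, 1)

    # The first part of each signal is seen by the encoder,
    # the remainder is what the decoder has to predict.
    encoder_input = signals[:, :input_len, :]
    decoder_output = signals[:, input_len:, :]
    decoder_input = np.zeros_like(decoder_output)  # placeholder input of zeros
    return [encoder_input, decoder_input], decoder_output

(enc_in, dec_in), target = toy_batch(batch_size=4, input_len=15, target_len=15)
print(enc_in.shape, dec_in.shape, target.shape)
```

All three arrays share the layout (batch, time, features), which is the layout the Keras RNN layers used below expect.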

Hyperparameters and model configuration

This model uses a Gated Recurrent Unit (GRU). Other units (LSTM) would also work with a few modifications to the code.

keras.backend.clear_session()

# Number of hidden neurons in each layer of the encoder and decoder
layers = [35, 35] 

learning_rate = 0.01
decay = 0 # Learning rate decay

# Other possible optimiser "sgd" (Stochastic Gradient Descent)
optimiser = keras.optimizers.Adam(lr=learning_rate, decay=decay) 

# The dimensionality of the input at each time step. In this case a 1D signal.
num_input_features = 1 
# The dimensionality of the output at each time step. In this case a 1D signal.
num_output_features = 1 
# There is no reason for the input sequence to be of the same dimension as the output sequence.
# For instance, one could use 3 input signals (consumer confidence, inflation and house prices) to predict future house prices.

# Other loss functions are possible, see Keras documentation.
loss = "mse" 

# Regularisation isn't really needed for this application
lambda_regulariser = 0.000001 # Will not be used if regulariser is None
regulariser = None # Possible regulariser: keras.regularizers.l2(lambda_regulariser)

# batch_size * steps_per_epoch = total number of training examples
batch_size = 512
steps_per_epoch = 200
epochs = 15

input_sequence_length = 15 # Length of the sequence used by the encoder
target_sequence_length = 15 # Length of the sequence predicted by the decoder
num_steps_to_predict = 20 # Length to use when testing the model

# The number of random sine waves that compose the signal. The more sine waves, the harder the problem.
num_signals = 2 

Create model

Create encoder

The encoder is first created by instantiating a graph, which is a description of the operations applied to the tensors (that will later hold the data). This is common among many neural network frameworks.

# Define an input sequence.
encoder_inputs = keras.layers.Input(shape=(None, num_input_features))

# Create a list of RNN Cells, these are then concatenated into a single layer
# with the RNN layer.
encoder_cells = []
for hidden_neurons in layers:
    encoder_cells.append(keras.layers.GRUCell(hidden_neurons,
                                              kernel_regularizer=regulariser,
                                              recurrent_regularizer=regulariser,
                                              bias_regularizer=regulariser))

encoder = keras.layers.RNN(encoder_cells, return_state=True)

encoder_outputs_and_states = encoder(encoder_inputs)

# Discard encoder outputs and only keep the states.
# The outputs are of no interest to us, the encoder's
# job is to create a state describing the input sequence.
encoder_states = encoder_outputs_and_states[1:]
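To illustrate what "a state describing the input sequence" means, here is a NumPy sketch of a single-layer GRU encoder with random, untrained weights. It uses one common formulation of the GRU update and is only an illustration: the function names and weight layout are assumptions and do not mirror the internals of keras.layers.GRUCell. The point it shows is that the final state has a fixed size regardless of the length of the input sequence.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h, params):
    """One GRU update. x_t: (batch, features), h: (batch, hidden)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x_t @ Wz + h @ Uz + bz)                  # update gate
    r = sigmoid(x_t @ Wr + h @ Ur + br)                  # reset gate
    h_candidate = np.tanh(x_t @ Wh + (r * h) @ Uh + bh)  # candidate state
    return z * h + (1.0 - z) * h_candidate

def encode(sequence, params, hidden):
    """Run the GRU over the time axis and return only the final state."""
    h = np.zeros((sequence.shape[0], hidden))
    for t in range(sequence.shape[1]):
        h = gru_step(sequence[:, t, :], h, params)
    return h

rng = np.random.RandomState(0)
features, hidden = 1, 35

# Three (input weights, recurrent weights, bias) triples: update, reset, candidate.
params = []
for _ in range(3):
    params += [rng.randn(features, hidden) * 0.1,
               rng.randn(hidden, hidden) * 0.1,
               np.zeros(hidden)]

short_state = encode(rng.randn(8, 15, features), params, hidden)
long_state = encode(rng.randn(8, 40, features), params, hidden)
print(short_state.shape, long_state.shape)
```

Both calls return an array of shape (batch, hidden), whether the input has 15 or 40 time steps. This fixed-size summary is what gets handed to the decoder as its initial state.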
   

Create decoder

The decoder is created similarly to the encoder.

# The decoder input will be set to zero (see random_sine function of the utils module).
# Do not worry about the input size being 1, I will explain that in the next cell.
decoder_inputs = keras.layers.Input(shape=(None, 1))

decoder_cells = []
for hidden_neurons in layers:
    decoder_cells.append(keras.layers.GRUCell(hidden_neurons,
                                              kernel_regularizer=regulariser,
                                              recurrent_regularizer=regulariser,
                                              bias_regularizer=regulariser))

decoder = keras.layers.RNN(decoder_cells, return_sequences=True, return_state=True)

# Set the initial state of the decoder to be the output state of the encoder.
# This is the fundamental part of the encoder-decoder.
decoder_outputs_and_states = decoder(decoder_inputs, initial_state=encoder_states)

# Only select the output of the decoder (not the states)
decoder_outputs = decoder_outputs_and_states[0]

# Apply a dense layer with linear activation to set the output to the correct dimension
# and scale (tanh is the default activation for GRU in Keras, our output sine function can be larger than 1)
decoder_dense = keras.layers.Dense(num_output_features,
                                   activation='linear',
                                   kernel_regularizer=regulariser,
                                   bias_regularizer=regulariser)

decoder_outputs = decoder_dense(decoder_outputs)

Create model and compile

A notable detail here is the inputs to the model. The training model has two inputs: encoder_inputs and decoder_inputs. What encoder_inputs should hold is clear: the input series. But what about the decoder inputs?

In machine translation applications (see “A ten minute introduction to sequence-to-sequence learning in keras”) something called teacher forcing is used. In teacher forcing, the input to the decoder during training is the target sequence shifted by 1. This supposedly helps the decoder learn and is an effective method for machine translation. I tested teacher forcing for sequence prediction and the results were bad. I am not entirely sure why this is the case. My intuition is that, unlike in machine translation, if you feed the decoder the correct sequence shifted by one, your model becomes “lazy” because it only has to look at the value input in the step before and apply a small modification to it. In other words, the gradients of the truncated back propagation beyond the n-1 step will be very small and the model will develop a short memory. However, if the input to the decoder is 0, the model is forced to really memorise the values that are fed to the encoder, since it has nothing else to work on. In some sense, teacher forcing might artificially induce vanishing gradients.

If you look into the random_sine function defined earlier, you will see that the decoder input is simply set to zero. You may be asking yourself why the decoder has an input at all if it’s set to zero. The reason is simple: Keras RNNs must take an input value alongside the state vector.

# Create a model using the functional API provided by Keras.
# The functional API is great, it gives an amazing amount of freedom in architecture of your NN.
# A read worth your time: https://keras.io/getting-started/functional-api-guide/ 
model = keras.models.Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)
model.compile(optimizer=optimiser, loss=loss)
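The difference between the two choices of decoder input discussed above can be sketched on a toy batch. The array names are illustrative, and using a zero as the first value of the teacher-forcing input is an assumption made for this sketch (other start values, such as the last encoder input, are also possible).

```python
import numpy as np

# Toy target batch: 2 sequences, 5 time steps, 1 feature.
target = np.arange(10, dtype=float).reshape(2, 5, 1)

# Option used in this post: the decoder receives zeros at every step.
zero_decoder_input = np.zeros_like(target)

# Teacher forcing: the decoder receives the target shifted right by one step.
# The first step has no previous target value, so a zero is used as the
# start value here (an assumption of this sketch).
start = np.zeros((target.shape[0], 1, target.shape[2]))
teacher_forcing_input = np.concatenate([start, target[:, :-1, :]], axis=1)

print(target[0, :, 0])                 # values 0, 1, 2, 3, 4
print(teacher_forcing_input[0, :, 0])  # values 0, 0, 1, 2, 3
```

With teacher forcing, the value the decoder must output at step t is always just one step ahead of the value it receives as input, which is the shortcut described above. With zeros as input, the only information available is the state passed on by the encoder.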

Fit model to data

I like using the fit_generator in Keras. In this case it’s not really useful/necessary since my training examples easily fit into memory. In cases when the training data doesn’t fit into memory, the fit_generator is definitely the way to go. A standard python generator is usually fine for the fit_generator function, however, Keras provides a nice class keras.utils.Sequence that you can inherit from to create your own generator. This is a requirement to guarantee that the elements of the generator are only selected once in the case of multiprocessing (which isn’t guaranteed with the standard generator). A simple example of using the data generators in Keras is “A detailed example of how to use data generators with Keras” by Shervine Amidi.

# random_sine returns a generator that produces batches of training samples ([encoder_input, decoder_input], decoder_output)
# You can play with the min max frequencies of the sine waves, the number of sine waves that are summed etc...
# Another interesting exercise could be to see whether the model generalises well on sums of 3 signals if it's only been
# trained on sums of 2 signals...
train_data_generator = random_sine(batch_size=batch_size,
                                   steps_per_epoch=steps_per_epoch,
                                   input_sequence_length=input_sequence_length,
                                   target_sequence_length=target_sequence_length,
                                   min_frequency=0.1, max_frequency=10,
                                   min_amplitude=0.1, max_amplitude=1,
                                   min_offset=-0.5, max_offset=0.5,
                                   num_signals=num_signals, seed=1969)

model.fit_generator(train_data_generator, steps_per_epoch=steps_per_epoch, epochs=epochs)
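For reference, here is a sketch of the index-based interface that keras.utils.Sequence is built around: a \_\_len\_\_ method returning the number of batches and a \_\_getitem\_\_ method returning one batch. To keep the example runnable without Keras, the hypothetical class SineBatches below is a plain Python class. In real use it would inherit from keras.utils.Sequence, and the pre-computed signals array is an assumption of this sketch.

```python
import numpy as np

class SineBatches:
    """Index-based batch provider (would subclass keras.utils.Sequence in real use)."""

    def __init__(self, signals, input_len, batch_size):
        self.signals = signals        # shape (num_examples, time, 1)
        self.input_len = input_len
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller).
        return int(np.ceil(len(self.signals) / self.batch_size))

    def __getitem__(self, index):
        # Return batch number `index` in the format expected by the model.
        start = index * self.batch_size
        batch = self.signals[start:start + self.batch_size]
        encoder_input = batch[:, :self.input_len, :]
        decoder_output = batch[:, self.input_len:, :]
        decoder_input = np.zeros_like(decoder_output)
        return [encoder_input, decoder_input], decoder_output

# 100 pre-computed sine waves of 30 points each, with increasing frequency.
x = np.arange(30) * 2 * np.pi / 30
signals = np.sin(np.linspace(0.5, 5, 100)[:, None] * x)[:, :, None]

batches = SineBatches(signals, input_len=15, batch_size=32)
print(len(batches), batches[3][1].shape)
```

Because each batch is addressed by its index rather than pulled from a running generator, several workers can fetch different batches without producing the same one twice.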

It is now possible to use this model to make predictions:

test_data_generator = random_sine(batch_size=1000,
                                  steps_per_epoch=steps_per_epoch,
                                  input_sequence_length=input_sequence_length,
                                  target_sequence_length=target_sequence_length,
                                  min_frequency=0.1, max_frequency=10,
                                  min_amplitude=0.1, max_amplitude=1,
                                  min_offset=-0.5, max_offset=0.5,
                                  num_signals=num_signals, seed=2000)

(x_encoder_test, x_decoder_test), y_test = next(test_data_generator) # x_decoder_test is composed of zeros.

y_test_predicted = model.predict([x_encoder_test, x_decoder_test])

A limitation of using this model to make predictions is that we can only predict a sequence of the same length as the training data. This can be a problem if we want to predict fewer or more steps than the training sequence length. In the next section I will show how to create “prediction” models that can predict sequences of arbitrary length.

Create “prediction” models

When using the encoder-decoder to predict a sequence of arbitrary length, the encoder first encodes the entire input sequence. The state of the encoder is then fed to the decoder which then produces the output sequence sequentially. Although a new model is being created with the keras.models.Model class, the input and output tensors of the model are the same as those used during training, hence the weights of the layers applied to the tensors are preserved.

As you will see, creating the prediction models also gives the ability to inspect the state of the model at different points throughout the prediction process. We could study how the encoder creates a representation of the input data. For instance, how does the model represent the offset? Or the frequency? Does it decompose the signal into its constituent sine waves and represent them as different dimensions of the state vector? These are very interesting questions for another time.

encoder_predict_model = keras.models.Model(encoder_inputs,
                                           encoder_states)

decoder_states_inputs = []

# Read layers backwards to fit the format of initial_state
# For some reason, the states of the model are ordered backwards (state of the first layer at the end of the list)
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # One state for GRU
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))

decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)

decoder_outputs = decoder_outputs_and_states[0]
decoder_states = decoder_outputs_and_states[1:]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_predict_model = keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

# Let's define a small function that predicts based on the trained encoder and decoder models

def predict(x, encoder_predict_model, decoder_predict_model, num_steps_to_predict):
    """Predict time series with encoder-decoder.
    
    Uses the encoder and decoder models previously trained to predict the next
    num_steps_to_predict values of the time series.
    
    Arguments
    ---------
    x: input time series of shape (batch_size, input_sequence_length, input_dimension).
    encoder_predict_model: The Keras encoder model.
    decoder_predict_model: The Keras decoder model.
    num_steps_to_predict: The number of steps in the future to predict
    
    Returns
    -------
    y_predicted: output time series of shape (batch_size, num_steps_to_predict,
        output_dimension)
    """
    y_predicted = []

    # Encode the values as a state vector
    states = encoder_predict_model.predict(x)

    # The states must be a list
    if not isinstance(states, list):
        states = [states]

    # Generate first value of the decoder input sequence
    decoder_input = np.zeros((x.shape[0], 1, 1))


    for _ in range(num_steps_to_predict):
        outputs_and_states = decoder_predict_model.predict(
            [decoder_input] + states, batch_size=batch_size)
        output = outputs_and_states[0]
        states = outputs_and_states[1:]

        # add predicted value
        y_predicted.append(output)

    return np.concatenate(y_predicted, axis=1)
        

The aim of this tutorial isn’t to present how to evaluate the model or investigate the training. We could plot evaluation metrics such as RMSE over time, compare train and test batches for overfitting, produce validation and learning curves to analyse the effect of the number of epochs or training examples, have fun playing with tensorboard etc… We would need at least a whole other post for this. However, let’s at least make sure that our model can predict correctly. We ask the generator to produce a batch of samples. Don’t forget to set the seed to something other than what was used for training, or you will be testing on training data. The built-in next function asks the generator to produce its first batch.

test_data_generator = random_sine(batch_size=1000,
                                  steps_per_epoch=steps_per_epoch,
                                  input_sequence_length=input_sequence_length,
                                  target_sequence_length=target_sequence_length,
                                  min_frequency=0.1, max_frequency=10,
                                  min_amplitude=0.1, max_amplitude=1,
                                  min_offset=-0.5, max_offset=0.5,
                                  num_signals=num_signals, seed=2000)

(x_test, _), y_test = next(test_data_generator)

y_test_predicted = predict(x_test, encoder_predict_model, decoder_predict_model, num_steps_to_predict)

# Select 10 random examples to plot
indices = np.random.choice(range(x_test.shape[0]), replace=False, size=10)


for index in indices:
    plot_prediction(x_test[index, :, :], y_test[index, :, :], y_test_predicted[index, :, :])
    
# The model seems to struggle on very low frequency signals. But that makes sense, the model doesn't see enough of the signal to make a good estimation of the frequency components.

train_data_generator = random_sine(batch_size=1000,
                                   steps_per_epoch=steps_per_epoch,
                                   input_sequence_length=input_sequence_length,
                                   target_sequence_length=target_sequence_length,
                                   min_frequency=0.1, max_frequency=10,
                                   min_amplitude=0.1, max_amplitude=1,
                                   min_offset=-0.5, max_offset=0.5,
                                   num_signals=num_signals, seed=1969)

(x_train, _), y_train = next(train_data_generator)

y_train_predicted = predict(x_train, encoder_predict_model, decoder_predict_model, num_steps_to_predict)

# Select 10 random examples to plot
indices = np.random.choice(range(x_train.shape[0]), replace=False, size=10)

for index in indices:
    plot_prediction(x_train[index, :, :], y_train[index, :, :], y_train_predicted[index, :, :])
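Beyond visual inspection, a metric such as the RMSE mentioned earlier can be computed per forecast step. The sketch below uses synthetic stand-in arrays rather than the real model output, and the helper rmse_per_step is a hypothetical name. Note that with the settings above the true targets have 15 steps while the prediction has 20, so only the steps both arrays share are compared.

```python
import numpy as np

def rmse_per_step(y_true, y_pred):
    """RMSE at each forecast step, over the steps present in both arrays."""
    steps = min(y_true.shape[1], y_pred.shape[1])
    error = y_pred[:, :steps, :] - y_true[:, :steps, :]
    # Average over the batch and feature axes, keep the time axis.
    return np.sqrt(np.mean(error ** 2, axis=(0, 2)))

# Synthetic stand-ins with the shapes used in this post:
# 1000 examples, 15 true future steps, 20 predicted steps.
rng = np.random.RandomState(0)
y_true = rng.randn(1000, 15, 1)
y_pred = np.concatenate([y_true + 0.1, rng.randn(1000, 5, 1)], axis=1)

scores = rmse_per_step(y_true, y_pred)
print(scores.shape)
```

Plotting such a curve for the test and train batches would show how quickly the error grows as the model predicts further into the future.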

Next steps & Discussion

There are many things that could be done to either extend or improve this model. Here are a few ideas.

There’s no reason why the encoder and decoder should have the same complexity or the same number of layers. As well as doing a simple hyper parameter search, it could be interesting to implement a model with different encoder and decoder sizes. To do this, one would have to add a dense layer after retrieving the states of the encoder to transform them into the correct size.
Encapsulate the encoder-decoder by creating a class with a fit/predict interface. This is actually something I have done; it’s extremely useful as it allows seq2seq models to be instantiated as easily as a scikit-learn model.
Add the ability to add context vectors to the state output by the encoder. The encoder is able to produce an input vector for the decoder based on the time series. It is possible to add constant features to the model by duplicating them at each input timestep. However, adding the ability to extend the encoder output state with a constant vector that represents context might also be a good idea (for example, if you’re predicting the evolution of housing prices, you might want to tell your model which geographical area you are in, since prices might not evolve in the same manner depending on location). This is not the attention mechanism often used in NLP, which also produces what is called a context vector (a context vector that is updated at each step of the decoder). However, since adding attention to NLP seq2seq applications has hugely improved the state of the art, it might also be worth looking into attention for sequence prediction.
As described above, study how the encoder creates a representation of the input sequence by looking at the state vector.
It appears that our model struggles on signals that have low frequency. One explanation might be that the model must “see” at least a certain number of periods to determine the frequency of the signal. An interesting question to answer might be: how many periods of the constituent signals are required for the model to be accurate?
Although our model was only trained on output sequences of length 15, it appears to be able to predict beyond that limit. This is something we can exploit with the prediction models.
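The point about low frequencies in the list above can be made concrete with a little arithmetic. In random_sine the time axis is np.arange(num_points) * 2*np.pi/30, so a component with frequency f completes f periods every 30 samples. The helper functions below are hypothetical names introduced for this sketch.

```python
# Taken from random_sine above: x = np.arange(num_points) * 2*np.pi/30,
# so sin(frequency * x) completes `frequency` periods every 30 samples.
SAMPLES_PER_UNIT_FREQUENCY = 30

def periods_seen(frequency, input_sequence_length=15):
    """Number of periods of one component visible to the encoder."""
    return frequency * input_sequence_length / SAMPLES_PER_UNIT_FREQUENCY

def samples_per_period(frequency):
    """Number of samples covering one period of the component."""
    return SAMPLES_PER_UNIT_FREQUENCY / frequency

for frequency in (0.1, 1, 2, 10):
    print(frequency, periods_seen(frequency), samples_per_period(frequency))
```

With min_frequency=0.1 and an input of 15 steps, the encoder sees only about a twentieth of a period of the slowest components, which is consistent with the difficulty noted above. At the other extreme, max_frequency=10 leaves only 3 samples per period.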

Thanks for reading 🙂

I welcome questions or comments, you can also find me on LinkedIn.

Author: Luke Tonin
LinkedIn: https://linkedin.com/in/luketonin
Github: https://github.com/LukeTonin/
