In the age of Big Data and AI, every company wants every bit of info from their data. What can they do when huge datasets aren't available for training?
These days, every company is trying not only to analyze its data, but to squeeze every bit of information from it. The problem is that companies often don't have datasets large enough for proper training. Recently, my colleague Michał Izworski and I had to determine how much of an image a specific cosmetic occupies, relative to the whole image. We had a dataset of about 3000 images, with about 200 positives. For this article, we will suppose that the cosmetic is Bioderma Micellar Water, but please note that this is solely for the article.
The problem to be solved
We wanted to determine the area occupied by Bioderma Micellar Water bottles. Specifically, we wanted to know how many bottles there are and the area occupied by every bottle, in order to create an image summary. For clarification, here’s an example – we'll call the types of bottles that we are interested in "Micellar" and every other bottle "Nonmicellar", even if it's a Micellar Water of a different brand.
In the example above, we have one bottle as a result because there's only one Bioderma Micellar Water bottle. The second bottle (on the left side of the image) is not the type we are concerned with. We can then find statistics for how much area is occupied by every Bioderma Micellar Water bottle.
Our solution can be described as a pipeline composed of smaller steps. We made extensive use of deep learning in these steps. Convolutional Neural Networks were present in almost all of them, as they are the state of the art in computer vision.
It would be tremendously hard to train a single end-to-end neural network that produces a mask and statistics directly from a raw image. And if you consider the size of our training set, it becomes impossible. To make the task feasible, we split it into smaller parts, so that every part would be easier to grasp and more atomic. Splitting the task into smaller parts was also a way of injecting domain knowledge into the solution. The knowledge that every micellar water bottle is actually a bottle may sound trivial, but it significantly reduces the complexity of the problem.
- Bottle Detection – To find the Bioderma Micellar Water bottles, we first identified every bottle in the image and used these for later processing. For every bottle in the image, we found a bounding box surrounding it and cropped it. This makes the task of classification much easier, as the head classifier can assume that it is dealing with a bottle (which occupies essentially the whole image) and only needs to decide whether this is a Bioderma Micellar Water bottle or not. Mathematically speaking, we narrow down the space of all possible inputs to bottles.
- Classification – This is the classic task for a Convolutional Neural Network, the input being a bottle (occupying the whole image) so that the network needs to decide only whether it is a Bioderma Micellar Water bottle or not. The problem is a very small training set – in a standard CNN training case the training set contains tens of thousands of images per class. But there are some workarounds, which we happily used.
- Segmentation – After we classified the image as a type of bottle that concerned us, we needed to find the area it occupied. As the bounding boxes are not rotated, we cannot assume that a bounding box approximates the area. We considered approximating the area by mathematically calculating the size of the box the bottle would use if it were placed upright, but ultimately decided that segmentation works better.
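The way the three steps fit together can be sketched roughly as follows. This is only an illustration of the pipeline's shape, not our actual code: `summarize_image` and its three callable parameters are hypothetical names, and each callable stands in for one of the models described above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels


def summarize_image(
    image: object,
    detect_bottles: Callable[[object], Sequence[Box]],
    is_micellar: Callable[[object, Box], bool],
    relative_area: Callable[[object, Box], float],
) -> Dict[str, object]:
    """Run detection, classification and segmentation in sequence."""
    areas: List[float] = []
    for box in detect_bottles(image):            # step 1: find every bottle
        if not is_micellar(image, box):          # step 2: keep only "Micellar"
            continue
        areas.append(relative_area(image, box))  # step 3: area of that bottle
    return {"count": len(areas), "areas": areas, "total_area": sum(areas)}
```

Passing the steps in as functions keeps each stage replaceable, which mirrors the reason for splitting the task in the first place.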
In the previous section, I presented the overview of the solution. Here, I'd like to discuss problems that appeared in the smaller tasks and look at these from a more technical perspective.
Without a doubt, the biggest advantage of creating the pipeline and isolating 3 separate tasks was that the first part became a general problem. We had to find bottles, not specific types of them, and this general problem had already been solved with some heavy machinery. Object detection is one of the canonical problems in computer vision. It is described as finding bounding boxes (described by the vertices of a rectangle) that surround the objects of interest. The classic way of dealing with object detection is RCNN. It consists of two parts. The first is to create region proposals, which are candidates for being an object of some kind. The second is to crop these candidates and process them with a trained CNN to check whether each one is a bottle/car/airplane, or maybe nothing in particular.
The most important region proposal algorithm for RCNN is selective search. It looks for blobs of low contrast and then groups them hierarchically to create bigger blobs. We used Faster-RCNN, which achieves results comparable to RCNN in terms of quality metrics, but is faster. It is a different architecture, but the progress from RCNN can be described as squeezing as much engineering efficiency as possible out of RCNN. For further investigation, I highly recommend this blog post from Dhruv Parthasarathy at Athelas. We used "Faster RCNN Inception ResNet V2" – the winner of the 2016 COCO competition for detection. It is available in the TensorFlow model zoo. Given the bounding boxes from the network, we cropped the boxes that were classified as bottles and used only these for further processing.
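As a rough illustration of the cropping step, the sketch below filters detections and converts them to pixel boxes. It rests on several assumptions that are not stated above: that the detector returns boxes as normalized `[ymin, xmin, ymax, xmax]` values together with class ids and scores, that "bottle" has id 44 in the label map, and that 0.5 is a sensible score threshold. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

BOTTLE_CLASS_ID = 44  # assumption: id of "bottle" in the detector's label map


def bottle_boxes_in_pixels(
    boxes: Sequence[Sequence[float]],
    classes: Sequence[int],
    scores: Sequence[float],
    image_height: int,
    image_width: int,
    min_score: float = 0.5,  # hypothetical threshold
) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) pixel boxes for confident bottle detections."""
    result = []
    for (ymin, xmin, ymax, xmax), cls, score in zip(boxes, classes, scores):
        if cls != BOTTLE_CLASS_ID or score < min_score:
            continue
        top, left = int(ymin * image_height), int(xmin * image_width)
        bottom, right = int(ymax * image_height), int(xmax * image_width)
        if bottom > top and right > left:  # skip degenerate boxes
            result.append((top, left, bottom, right))
    return result


def crop(image: Sequence[Sequence[object]], box: Tuple[int, int, int, int]) -> List[List[object]]:
    """Crop a row-major image (a list of pixel rows) to the given pixel box."""
    top, left, bottom, right = box
    return [list(row[left:right]) for row in image[top:bottom]]
```

Each crop produced this way is what the classifier in the next step receives as input.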
Classification with transfer learning
We had the bottles, so the rest should have been easy. However, this is where we had to deal with a small training set, which posed a challenge. If we had a 100x larger training set, then we could have trained a CNN from scratch. Unfortunately, that wasn’t possible. We used two types of weapons against the small training set – Transfer Learning (using pretrained models to transfer what they learned to our classifier) and Data Augmentation (artificially enlarging a training set by creating images from our dataset with transformations like zooming, rotating or "shearing").
Transfer learning is one of the most important concepts in applied computer vision. Because training convolutional neural networks from scratch requires huge datasets and lots of computing time on powerful machines, it is invaluable to be able to use results from other teams and transfer them to different problems. With that knowledge, we could again make use of our “heavy artillery”, but in a different way than before. The whole concept is founded on the idea that features learned in layers near the input can be reused in other tasks. Putting it another way, neural networks consist of many layers, and they process images so that the first layers learn basic concepts about the image, such as edges and so-called "Gabor filters", while deeper layers learn shapes and features like eyes or body parts when recognizing people. The closer a layer is to the top (the output), the more specific its learned features are to the training task. Therefore, an outstanding thing about transfer learning is that it's highly adjustable to the training set size: the smaller the training set, the fewer top layers are retrained and the more of the pretrained network is kept frozen.
Building on this, we could make use of another pretrained neural network hit – ResNet50, from the ResNet family that won the ILSVRC 2015 competition and surpassed reported human performance on the ImageNet dataset. Our initial training set was very small, so we took off only the last layer and froze every previous layer, then trained our classifier on top of that using the features that the network generated. That kind of preprocessing is a mapping of images into a vector space. The vector corresponding to an image is therefore an embedding, and what we did was embed every image into this vector space, then train our classifier using the vector representations. What we got from transfer learning is a guarantee that this embedding is useful, because it already served as an embedding space (being a hidden layer of a network) for a classification task.
The embedding of input images into 3-dimensional space was chosen in such a way that it is possible to separate Bioderma Micellar Water bottles from other bottles. As you can see, clusters of other products also emerged (though they are not as separable). More technically, it’s a t-SNE projection of one of the layers after flattening.
On top of that, we could have trained another neural network, but the extracted features were at such a high level of abstraction that a simple linear classifier like Logistic Regression worked as well as a neural network. That's what we used on top.
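To make the idea of "a linear classifier on top of frozen embeddings" concrete, here is a minimal sketch of logistic regression trained by gradient descent on fixed vectors. It is not our production code: the tiny 2-dimensional vectors in the usage note stand in for the real embeddings produced by the frozen network, the hyperparameters are arbitrary, and in practice a library implementation would normally be used instead of a hand-written loop.

```python
import math
from typing import List, Sequence


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability that embedding x belongs to the positive class."""
    # zip stops at len(x), so the trailing bias term is added separately.
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    learning_rate: float = 0.5,  # arbitrary illustrative value
    epochs: int = 500,           # arbitrary illustrative value
) -> List[float]:
    """Fit weights (the last entry is the bias) with batch gradient descent."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(embeddings, labels):
            error = predict_proba(weights, x) - y
            for i in range(dim):
                grad[i] += error * x[i]
            grad[dim] += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


# Toy stand-ins for embeddings: label 1 = "Micellar", label 0 = "Nonmicellar".
toy_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels = [1, 1, 0, 0]
toy_weights = train_logistic_regression(toy_embeddings, toy_labels)
```

Only the small weight vector is learned here; the network that produces the embeddings is never updated, which is what makes this approach workable with a few hundred positives.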
With the above setup, we got the following results:
- Accuracy: 98.35%
- AUC: 99.12%
- Recall: 89.47%
- Precision: 91.63%
These stats were measured on the bottle crops, that is, by classifying the output of bottle detection, so to get end-to-end stats we would have to include the errors of the bottle cropping network (both detection errors and bounding box errors). However, we hadn’t yet made use of our second weapon. With Data Augmentation, we could enlarge our training set by about 20 times. The quality of this dataset would not be as good as if we had 20 times more pictures, but it was still beneficial to augment the data. The idea is very straightforward: use simple transformations to create new training samples. These transformations include rotations, zooms, shifts, horizontal flips (not in our case), color changes, etc. Of course, it’s possible to combine these transformations, with the parameters of each one drawn uniformly within some bounds (such as a min and max angle for rotation). An important thing about augmentation is that with more data, it’s possible to be a bit bolder about transfer learning: instead of only taking off the top layer and training Logistic Regression on top, we can freeze everything apart from the last few fully connected layers and train these few layers instead of just the top one.
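The sampling part of augmentation can be sketched as below. The bounds are hypothetical (the values we actually used are not given here), as are the names `BOUNDS` and `sample_augmentations`; the sketch only draws the transformation parameters and leaves applying them to an image to whatever image library is in use. Horizontal flips are left out, as in our case.

```python
import random
from typing import Dict, List

# Hypothetical bounds; chosen only for illustration.
BOUNDS = {
    "rotation_deg": (-15.0, 15.0),
    "zoom": (0.9, 1.1),
    "shift_x": (-0.1, 0.1),  # fraction of image width
    "shift_y": (-0.1, 0.1),  # fraction of image height
}


def sample_augmentations(copies: int, seed: int = 0) -> List[Dict[str, float]]:
    """Draw `copies` parameter sets, each value uniform within its bounds."""
    rng = random.Random(seed)  # seeded so the augmented set is reproducible
    return [
        {name: rng.uniform(low, high) for name, (low, high) in BOUNDS.items()}
        for _ in range(copies)
    ]


# About 20 augmented variants per original image, as in the text.
params_for_one_image = sample_augmentations(copies=20)
```

Because every parameter set mixes several transformations at once, each generated image differs from the original in more than one way.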
With augmented data and 4 layers trained on top, we achieved the following:
- Accuracy: 98.89%
- AUC: 99.38%
- Recall: 90.47%
- Precision: 92.04%
Half of a percentage point might not seem like substantial progress, but it reduced the classification error (100% minus accuracy) from 1.65% to 1.11%, a decrease of about 30%!
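The arithmetic behind that figure, using the two accuracy values reported above (the function name is just for illustration):

```python
def relative_error_reduction(accuracy_before: float, accuracy_after: float) -> float:
    """Fraction by which the error rate (100 - accuracy, in percent) shrinks."""
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    return (error_before - error_after) / error_before


reduction = relative_error_reduction(98.35, 98.89)
print(f"{reduction:.1%}")  # prints 32.7%
```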
After we had the crops and classified what was a Bioderma Micellar Water and what was not, the only thing left was to calculate the relative area that each bottle occupied. This can be described as image segmentation. Segmentation is like object detection, but instead of finding boxes surrounding objects, the objective is to assign every pixel to a class, in order to get a mask representing the objects in the image. There are a few architectures for semantic segmentation, one of which (Mask-RCNN) is presented in the blog post mentioned above. We used CRF-RNN.
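Once a mask is available, the relative area is a simple pixel count. The sketch below assumes a binary mask (1 for bottle pixels, 0 otherwise) given as a list of rows, and divides by the size of the full original image; both the mask format and the function name are assumptions made for illustration.

```python
from typing import Sequence


def relative_area(mask: Sequence[Sequence[int]], image_height: int, image_width: int) -> float:
    """Share of the full image covered by pixels marked 1 in the mask."""
    object_pixels = sum(1 for row in mask for value in row if value == 1)
    return object_pixels / (image_height * image_width)


# A 2x3 crop mask with 4 bottle pixels, taken from a 4x6 image.
example_mask = [[0, 1, 1],
                [0, 1, 1]]
example_area = relative_area(example_mask, image_height=4, image_width=6)
```

Note that the denominator is the whole image rather than the crop, since the goal is the share of the original picture that the bottle takes up.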
Training and transition to Cloud
We used Google Cloud in two ways:
- Google Dataflow – a system for parallel computing that flourishes when it deals with the processing of independent batches, as in our case. We used it for detecting bottles on images (the first part of processing).
- ML Engine – With ML Engine, we trained networks with and without augmented images. We estimated that ML Engine sped up training by about 20 times. Considering that training on the augmented training set took us over 12 hours on ML Engine, it would have taken about 10 days of non-stop work to train the network on a single machine.
Some additional notes on the solution:
- Segmentation overestimates the area when there are many bottles in one crop, so we calculated an “expected area” based on the ratio of the sides of a rectangle. If the area calculated from segmentation differed too much, we used the “expected” one (which had a greater probability of error but was easier to control).
- Because our training set was small, our results (accuracy, recall, etc.) were calculated with a confidence threshold and should be read with some caution, especially considering that there might be a kind of product that is extremely difficult for our classifier and was not even in our training set.
- Because we used a pretrained detection model (Faster-RCNN, in our case), we depended on our class belonging to the set of classes that the model could detect. If there had been no bottles in the training set of this model, the problem would have been significantly more difficult.
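The area fallback described above (using the “expected area” when segmentation disagrees with it too much) can be sketched as follows. How the expected area is derived from the rectangle's side ratio is not shown here, so it is taken as an input; the tolerance of 30% and the function name are hypothetical.

```python
def choose_area(
    segmented_area: float,
    expected_area: float,
    max_relative_difference: float = 0.3,  # hypothetical tolerance
) -> float:
    """Prefer the segmentation result unless it strays too far from the estimate."""
    if expected_area <= 0:
        # No usable estimate to compare against; keep the segmentation result.
        return segmented_area
    difference = abs(segmented_area - expected_area) / expected_area
    return expected_area if difference > max_relative_difference else segmented_area
```

For example, `choose_area(0.10, 0.09)` keeps the segmented value, while `choose_area(0.20, 0.09)` falls back to the expected one.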