YOLO - You only look once
YOLO!!! So do we only live once? I sure do not know. What I know is that we only have to LOOK once. Wait what?
That’s right. If you want to detect and localize objects in an image, there is no need to go through the whole process of proposing regions of interest, classify them and correct their bounding boxes. If you recall from my previous article, this is exactly what models like RCNN and Faster RCNN do.
Do we really need all that complexity and computation? Well if we want top-notch accuracy we certainly do. Luckily there is another simpler way to perform such a task, by processing the image only once and output the prediction immediately. These types of models are called Single shot detectors.
Single shot detectors
Instead of having a dedicated system to propose regions of interests, we have a set of predefined boxes to look for objects, which are forwarded to a bunch of convolutional layers to predict class scores and bounding box offsets. Then for each box we predict a number of bounding boxes with a confidence score assigned to each one, we detect one object centered in that box and we output a set of probabilities for each possible class. Once we have all that, we simply and maybe naively keep only the box with a high confidence score. And it works. With very impressive results actually. To elaborate the overall flow even better, let’s use one of the most popular single shot detectors called YOLO .
You only look once (YOLO)
There have been 3 versions of the model so far, with each new one improving the previous in terms of both speed and accuracy. The number of predefined cells and the number of predicted bounding boxes for each cell is defined based on the input size and the classes. In our case, we are going to use the actual numbers used to evaluate the PASCAL VOC dataset.
First, we divide the image into a grid of 13x13, resulting in 169 cells in total.
For every one of the cells, it predicts 5 bounding boxes (x,y,w,h) with a confidence score, it detects one object regardless the number of boxes and 20 probabilities for the 20 classes.
In total, we have 169*5=845 bounding boxes and the shape of output tensor of the mode is going to be (13,13,5*5+20)= (13,13,45). The whole essence of the YOLO models is to build this (13,13,45) tensor. To accomplish that, it uses a CNN network and 2 fully connected layers to perform the actual regression.
The final prediction is extracted after keeping only the bounding boxes with a high confidence score( higher than a threshold such as 0.3)
Because the model may output duplicate detections for the same object, we use a technique called Non-maximal suppression to remove duplicates. In a simple implementation, we sort the predictions by the confidence score and as we iterate them we keep only the first appearances of each class.
As far as the actual model is concerned, the architecture is quite trivial as it consists of only convolutional and pooling layers, without any fancy tricks. We train the model using a multiple loss function, which includes a classification loss, a localization loss and a confidence loss.
The most recent versions of YOLO have introduced some special tricks to improve the accuracy and reduce the training and inference time. Some examples are batch normalization, anchor boxes, dimensions clusters and others. If you want to get into more details, you should definitely check the original papers.
The power of YOLO is not its spectacular accuracy or the very clever ideas behind it, is its superb speed, which makes it ideal for embedded systems and low-power applications. That’s why self-driving cars and surveillance cameras are its most common real-world use cases.
As deep learning continues to play along with computer vision (and it will sure do), we can expect many more models to be tailored for low-power systems even if they sometimes sacrify accuracy. And dont forget the whole Internet of Things kind of thing. This is where these models really shine.