What is YOLO, and how does version 3 detect objects? We'll discuss these points now. This lecture is a brief introduction to YOLO version 3. Hi everyone. My name is Valentyn Sichkar. Let's get started. The lecture is organized as follows. Firstly, we will identify what YOLO is in general. Next, we will look at the YOLO version 3 architecture. After that, we will discuss what the input to the network is. Next, we will compare detections at different scales. Then, we will define how the network actually produces output. We will also describe how the network is trained, what anchor boxes are, and how predicted bounding boxes are calculated. Finally, we will explain what the objectness score is and make a conclusion. Let's start now with the first topic. YOLO is a shortened form of "You Only Look Once", and it uses Convolutional Neural Networks for Object Detection. YOLO can detect multiple objects in a single image. It means that, apart from predicting the classes of the objects, YOLO also detects the locations of these objects in the image. YOLO applies a single Neural Network to the whole image. This Neural Network divides the image into regions and produces probabilities for every region. After that, YOLO predicts a number of Bounding Boxes that cover some regions of the image and chooses the best ones according to the probabilities. To fully understand the principal idea of how YOLO version 3 works, the following terminology needs to be known: Convolutional Neural Networks, Residual Blocks, Skip connections, Up-sampling, the Leaky ReLU activation function, Intersection over Union, and Non-maximum suppression. We will cover these topics in separate lectures. Let's turn now to the next topic and have a look at the architecture of YOLO version 3. YOLO uses convolutional layers, and YOLO version 3 originally consists of 53 convolutional layers, also called Darknet-53. But for detection tasks, the original architecture is stacked with 53 more layers, which gives us a 106-layer architecture for YOLO version 3.
That's why, when you run any command in the Darknet framework, you will see the process of loading an architecture that consists of 106 layers. The detections are made at three layers: 82, 94 and 106. We will talk about detections in a few minutes. This latest version 3 incorporates some of the most essential elements: Residual Blocks, Skip connections and Up-sampling. Each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation function. There are no pooling layers; instead, additional convolutional layers with stride 2 are used to down-sample feature maps. Why? Because using convolutional layers to down-sample feature maps prevents the loss of low-level features that a pooling layer would simply exclude. As a result,

capturing low-level features helps to improve the ability to detect small objects. A good example of this is in the images, where we can see that pooling excludes some numbers, while convolution takes all numbers into account. Let's look now at the input to the network. What does the input to the Network look like? The input is a batch of images of the following shape: (n, 416, 416, 3), where n is the number of images. The next two numbers are width and height. The last one is the number of channels: red, green and blue. The middle two numbers, width and height, can be changed and set to 608, or any other number that is divisible by 32 without leaving a remainder (832, 1024). Why this number must be divisible by exactly 32, we will consider in a few moments. Increasing the resolution of the input might improve the model's accuracy after training. In the current lecture, we will assume that we have an input of size 416 by 416. These numbers are also called the input network size. The input images themselves can be of any size; there is no need to resize them before feeding them to the network. They will all be resized according to the input network size. And there is a possibility to experiment with keeping or not keeping the aspect ratio, by adjusting parameters when training and testing in the original Darknet framework, in TensorFlow, Keras or any other framework you want to use. Then you can compare and choose which approach best suits your custom model. Now we'll move on to the next topic and discuss detections at 3 scales. How does the Network detect objects? YOLO version 3 makes detections at three different scales and at three separate places in the Network. These separate places for detections are layers 82, 94 and 106. The Network downsamples the input image by the following factors: 32, 16 and 8 at those separate places in the Network accordingly.
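To see what these downsampling factors mean in practice, here is a minimal sketch in plain Python, assuming the 416 input network size used in this lecture:

```python
# Output grid sizes at the three detection layers (82, 94, 106),
# assuming an input network size of 416 and the strides 32, 16, 8.
input_size = 416          # must be divisible by 32 without a remainder
strides = (32, 16, 8)     # downsampling factors at the three detection places

assert input_size % 32 == 0, "input network size must be divisible by 32"

grid_sizes = [input_size // s for s in strides]
print(grid_sizes)  # [13, 26, 52]
```

Changing `input_size` to 608 or 832 gives correspondingly larger grids, which is exactly why the input network size must be divisible by 32.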
These three numbers are called the strides of the network, and they show how much smaller the output at the three separate places in the Network is than the input to the Network. For instance, if we consider stride 32 and an input network size of 416 by 416, it will give us an output of size 13 by 13. Consequently, for stride 16 the output will be 26 by 26, and for stride 8 the output will be 52 by 52. 13 by 13 is responsible for detecting large objects; 26 by 26 is responsible for detecting medium objects; and 52 by 52 is responsible for detecting small objects. That is why, a few moments ago, we discussed that the input to the Network must be divisible by 32 without leaving a remainder. Because if it is true for 32, then it is true for 16 and 8 as well. The next topic I'd like to focus on is detection kernels. To produce output, YOLO version 3 applies 1 by 1 detection kernels at these three separate places in the Network. The 1 by 1 convolutions are applied to the downsampled feature maps: 13 by 13, 26 by 26 and 52 by 52. Consequently, the resulting feature maps will have the same spatial dimensions. The shape

of the detection kernel also includes a depth, which is calculated by the following equation: b × (5 + c). Here, "b" represents the number of bounding boxes that each cell of the produced feature map can predict. YOLO version 3 predicts 3 bounding boxes for every cell of these feature maps. That is why "b" is equal to 3. Each bounding box has 5 + c attributes, which describe the following: the centre coordinates of the bounding box; the width and height, which are the dimensions of the bounding box; the objectness score; and a list of confidences for every class this bounding box might belong to. We will assume that YOLO version 3 is trained on the COCO dataset, which has 80 classes. Then "c" is equal to 80, and the total number of attributes for each bounding box is 85. The resulting equation is as follows: 3 multiplied by 85, which gives us 255 attributes. Now we can say: each feature map produced by the detection kernels at the three separate places in the Network has one more dimension, depth, which incorporates the 255 attributes of the bounding boxes for the COCO dataset. And the shapes of these feature maps are as follows: 13 by 13 by 255; 26 by 26 by 255; and 52 by 52 by 255. Let's move now to the next topic of grid cells. We already know that YOLO version 3 predicts 3 bounding boxes for every cell of the feature map. Each cell, in turn, predicts an object through one of its bounding boxes if the centre of the object belongs to the receptive field of this cell. And this is the task of YOLO version 3 while training: to identify the cell into which the centre of the object falls. Again, this is one of the feature map's cells produced by the detection kernels that we discussed before. When YOLO version 3 is training, it has one ground truth bounding box that is responsible for detecting one object. That's why, firstly, we need to define which cells this bounding box belongs to.
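Returning for a moment to the kernel-depth arithmetic above: b × (5 + c), with b = 3 boxes per cell and c = 80 COCO classes, can be checked in a few lines of plain Python:

```python
# Depth of the 1x1 detection kernel: b * (5 + c).
b = 3    # bounding boxes predicted per grid cell
c = 80   # number of classes in the COCO dataset

depth = b * (5 + c)   # the 5 covers: centre x, centre y, width, height, objectness
print(depth)  # 255

# Feature map shapes at the three detection scales:
shapes = [(g, g, depth) for g in (13, 26, 52)]
print(shapes)  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```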
And to do that, let's consider the first detection scale, where we have 32 as the stride of the Network. The input image of 416 by 416 is downsampled into a 13 by 13 grid of cells, as we calculated a few moments ago. This grid now represents the produced output feature map. When all the cells that the ground truth bounding box belongs to are identified, the centre cell is assigned by YOLO version 3 to be responsible for predicting this object. And the objectness score for this cell is equal to 1. Again, this is the corresponding feature map's cell that is now responsible for detecting the lemon. But during training, all cells, including this one, predict 3 bounding boxes each. Which one should we choose then? Which one should be assigned as the best predicted bounding box for the lemon? We will open up these questions in the next topic. To predict bounding boxes, YOLO version 3 uses pre-defined default bounding boxes that are called anchors or priors. These anchors are used later to calculate the predicted bounding box's real width and real height. In total, 9 anchor boxes are used: three anchor boxes for each scale. It means that at each scale, every grid cell of the feature map can predict 3 bounding boxes by using

3 anchors. To calculate these anchors, k-means clustering is applied in YOLO version 3. The widths and heights of the 9 anchors for the COCO dataset are as follows. They are grouped according to the scale at the three separate places in the Network. Let's consider a graphical example of how one of the 3 anchor boxes is chosen, in order to calculate later the real width and real height of the predicted bounding box. We have an input image of shape 416 by 416 by 3. The image goes through the YOLO version 3 deep CNN architecture until the first separate place that we discussed earlier, which has stride 32. The input image is downsampled by this factor to the dimension 13 by 13, with the 255-deep feature map produced by the detection kernels, as we calculated earlier. Since we have 3 anchor boxes, each cell encodes information about 3 predicted bounding boxes. Each predicted bounding box has the following attributes: centre coordinates; predicted width and predicted height; objectness score; and a list of confidences for every class this bounding box might belong to. As we use the COCO dataset as an example, this list has 80 class confidences. And now we need to extract probabilities among the 3 predicted bounding boxes of this cell, to identify whether a box contains a certain class. To do that, we compute the following elementwise product of the objectness score and the list of confidences. Then, we find the maximum probability and can say that this box detected the class lemon with probability 0.55. These calculations are applied to all 13 by 13 cells, across 3 predicted boxes and across 80 classes. The number of predicted boxes at this first scale in the Network is 507. Moreover, these calculations are also applied at the other scales in the Network, giving us 2028 predicted boxes and 8112 predicted boxes.
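The elementwise product just described can be sketched as follows. The objectness and confidence values here are made up purely for illustration (with 3 classes shown instead of COCO's 80), chosen so the best score comes out near the 0.55 from the lemon example:

```python
# Class score for one predicted box: objectness * class confidence,
# then take the maximum over classes.
# These values are illustrative, not real network outputs.
objectness = 0.9
class_confidences = [0.10, 0.61, 0.05]   # pretend 3 classes instead of 80

scores = [objectness * conf for conf in class_confidences]
best_class = max(range(len(scores)), key=lambda i: scores[i])
print(best_class, round(scores[best_class], 3))  # 1 0.549
```

In the real network, this product is computed for every one of the 3 boxes in every grid cell, at all three scales.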
In total, YOLO version 3 predicts 10 647 boxes, which are filtered with the non-maximum suppression technique. Let's move to the next topic and identify how predicted bounding boxes are calculated. We already know that anchors are bounding box priors, and that they were calculated by using k-means clustering. For the COCO dataset they are as follows. To predict the real width and real height of the bounding boxes, YOLO version 3 calculates offsets to the predefined anchors. This offset is also called a log-space transform. To predict the centre coordinates of the bounding boxes, YOLO version 3 passes the outputs through a sigmoid function. Here are the equations that are used to obtain the predicted bounding box's width, height and centre coordinates: bx = σ(tx) + cx; by = σ(ty) + cy; bw = pw·e^tw; bh = ph·e^th. Here bx, by, bw, bh are the centre coordinates, width and height of the predicted bounding box, and tx, ty, tw and th are the outputs of the Network after training. To better understand these outputs, let's again have a look at how YOLO version 3 is trained. It has one ground truth bounding box and one centre cell responsible for this object. The weights of the Network are trained to predict this centre cell and the bounding box's coordinates as accurately as possible.
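The box totals mentioned above (507, 2028, 8112, and 10 647 overall) follow directly from the grid sizes and the 3 anchors per cell; a quick check in plain Python:

```python
# Number of predicted boxes per scale: grid * grid * 3 anchors per cell.
boxes_per_scale = [g * g * 3 for g in (13, 26, 52)]
print(boxes_per_scale)        # [507, 2028, 8112]
print(sum(boxes_per_scale))   # 10647
```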

After training, and after a forward pass, the Network outputs the coordinates tx, ty, tw and th. Next, cx and cy are the coordinates of the top-left corner of the corresponding cell on the grid. Finally, pw and ph are the anchor box's width and height. YOLO version 3 doesn't predict absolute values of width and height. Instead, as we discussed above, it predicts offsets to the anchors. Why? Because it helps to eliminate unstable gradients during training. That's why the values cx, cy, pw and ph are normalized to the real image width and real image height, and the centre coordinates tx, ty are passed through a sigmoid function, which gives values between 0 and 1. Consequently, to get absolute values after prediction, we simply need to multiply them by the whole image width and height. Let's move now to the next topic and interpret what the objectness score is. We already discussed that for every cell, YOLO version 3 outputs bounding boxes with their attributes. These attributes are tx, ty, tw, th, p0 and 80 confidences for every class this bounding box might belong to. And these outputs are used later to choose anchor boxes by calculating scores, and to calculate the predicted bounding box's real width and real height by using the chosen anchors. p0 here is the so-called objectness score. Do you remember that YOLO version 3, when training, assigns the centre cell of the ground truth bounding box to be responsible for predicting the object? Consequently, this cell and its neighbours have an objectness score of nearly 1, whereas corner cells have an objectness score of almost 0. In other words, the objectness score represents the probability that this cell is a centre cell responsible for predicting one particular object, and that the corresponding bounding box contains an object inside. The difference between the objectness score and the 80 class confidences is that the class confidences represent the probabilities that the detected object belongs to a particular class, like person, car, cat, etcetera.
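The decoding described above can be sketched as a small Python function. The example inputs at the bottom are made-up numbers (the anchor width and height of 3.6 and 2.8 are hypothetical), just to show the mechanics:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into a predicted bounding box.

    bx = sigmoid(tx) + cx   -- centre x, offset inside the grid cell
    by = sigmoid(ty) + cy   -- centre y, offset inside the grid cell
    bw = pw * exp(tw)       -- width, as a log-space offset to the anchor
    bh = ph * exp(th)       -- height, as a log-space offset to the anchor
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Example: zero offsets place the centre in the middle of cell (6, 5)
# and return the anchor's own width and height unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, 6, 5, 3.6, 2.8))  # (6.5, 5.5, 3.6, 2.8)
```

When cx, cy, pw and ph are normalized, multiplying the results by the whole image width and height gives absolute pixel values, as described above.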
Whereas, the objectness score represents the probability that the bounding box contains an object inside. Mathematically, the objectness score can be represented as P(object) × IoU, where P(object) is the predicted probability that the bounding box contains an object, and IoU is the intersection over union between the predicted bounding box and the ground truth bounding box. The result is passed through a sigmoid function, which gives values between 0 and 1. To summarise this lecture, let's move to the final topic. We discussed all the major points of how YOLO version 3 works. Now we can come back to the definition and update it, giving more details. YOLO version 3 applies convolutional neural networks to the input image. To predict bounding boxes, it downsamples the image at three separate places in this Network, which are also called scales. While training, it uses 1 by 1 detection kernels that are applied to the grid of cells at these three separate places in the Network. The Network is trained to

assign only one cell to be responsible for detecting one object: the cell into which the centre of this object falls. 9 predefined bounding boxes are used to calculate the spatial dimensions and coordinates of the predicted bounding boxes. These predefined boxes are called anchors or priors: 3 anchor boxes for each scale. In total, YOLO version 3 predicts 10 647 bounding boxes, which are filtered with the non-maximum suppression technique, leaving only the right ones. That was the extended definition, according to what we covered during the lecture. Are you interested in training your own detector based on YOLO version 3? Then join the course! You will create your custom dataset, build your model, and train a detector to use on images, video and with a camera. Find the link in the description right below this video. Thank you very much for watching. I hope you found this lecture useful and motivating to keep studying. Please be sure to like this video and subscribe to the channel if you haven't already. See you soon with more great stuff.