Object Detection on CityScapes Dataset
Author: Tamás Illés
Motivation
I worked at Continental for three months as part of my summer internship, and during this time I implemented two separate tasks, namely semantic segmentation and depth estimation. These two tasks were parts of a larger, multi-task model. My goal was to implement a robust framework with different tasks and functionality: all of them receive the same input data and produce their various outputs according to their specification. In the autumn semester I wanted to implement a third task that can detect and identify objects, in other words a standard object detection task. The long-term goal was a well-performing multi-task model, and the groundwork was to implement the tasks with the same structure, in order to ease their integration into the model.
Requirements
As I mentioned, the target model was a multi-task model. These networks are similar to standard single-task models in that they produce task-specific outputs, for example the count and class of the identified objects, or the distances from the viewpoint. The main difference is the structure: these models usually have a fixed architecture, and the built-in tasks are just parts of it, each one a separate branch or decoder head. In my case I also followed a fixed pattern, based on Kendall et al. (2018) [1].
Figure 1 shows the structure of my model and the desired pattern. An input image flows through an encoder unit, which is connected to the separate task decoders. These decoder heads produce the desired outputs and the task losses, which are used to calculate the multi-task loss and the gradient vector that trains the global model.
This pattern had to be followed, as my other two tasks were also designed this way. Specifically, I had to use CityScapes [2] as the input dataset, ResNet [3] as the shared encoder, and my own task-specific decoder head with its own loss value.
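To make the pattern concrete, here is a minimal sketch of such a model, assuming PyTorch; the decoder heads and the plain summed loss are illustrative placeholders, not my exact implementation (Kendall et al. [1] weight the task losses with learned uncertainty terms instead of a plain sum).

```python
import torch.nn as nn
import torchvision

class MultiTaskModel(nn.Module):
    """Shared ResNet encoder feeding separate task decoder heads."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Drop the classification head; keep the convolutional feature extractor.
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        # Illustrative decoder heads; each task defines its own in practice.
        self.segmentation_head = nn.Conv2d(2048, 19, kernel_size=1)  # 19 CityScapes classes
        self.depth_head = nn.Conv2d(2048, 1, kernel_size=1)
        self.detection_head = nn.Conv2d(2048, 2, kernel_size=1)      # 2-D center-offset vectors

    def forward(self, image):
        features = self.encoder(image)
        return {
            "segmentation": self.segmentation_head(features),
            "depth": self.depth_head(features),
            "detection": self.detection_head(features),
        }

def multi_task_loss(task_losses):
    # Simplest possible combination: a plain sum of the per-task losses.
    return sum(task_losses.values())
```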
Instance segmentation approach
The CityScapes dataset contains annotations for many different tasks, but sadly object detection is not among them. That is why I decided to approach the problem from the instance segmentation side. This seemed perfect, because those annotations were available and there were public papers [1][4] on this topic too, so I could easily compare the performance metrics.
The annotation for instance segmentation in CityScapes is a JSON file, which contains the pixels belonging to each object and the object's name. Although it is a detailed annotation, in this form it was barely usable. Separating object pixels from the background is the easy part, because they differ strongly from the adjacent pixels; discriminating between the pixels of different instances, for example cars that are parked next to each other and overlap in the picture, is a much harder problem.
The solution was a transformation: instead of using the pixels of each object directly, I determined their center points. As I mentioned, the annotations were pixel coordinates, so I took the mean of the X and Y coordinates of one object; this mean denoted the center point of that object. After that, all the other pixels were transformed: every value in the matrix became a 2-dimensional vector pointing to the center point. So the model did not focus on pixels, but tried to assign a vector to each pixel. If the labeled vector and the predicted vector did not point to exactly the same point, I could calculate the error and use it in the model's loss value. The visual explanation of the method is shown in Figure 2.
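A minimal NumPy sketch of this target transformation, assuming each object's annotated pixel coordinates have already been extracted into arrays (the function name is illustrative):

```python
import numpy as np

def build_offset_targets(height, width, objects):
    """objects: list of (N_i, 2) arrays of (x, y) pixel coordinates, one per instance.
    Returns an (H, W, 2) array in which each annotated pixel stores the 2-D vector
    pointing from that pixel to its instance's center point."""
    targets = np.zeros((height, width, 2), dtype=np.float32)
    for pixels in objects:
        center = pixels.mean(axis=0)  # mean of the X and Y coordinates: the center point
        for x, y in pixels:
            targets[int(y), int(x)] = center - (x, y)  # vector pointing to the center
    return targets
```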
Although the method performs very well, it is not perfect for object detection. I used RMSE as the performance metric, as the public papers did, and the best training run reached an 11.45 pixel error, which is close to the state-of-the-art solutions (15.19 pixels in Kendall's work [1] and 11.34 in Sener's [4]).
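For reference, the pixel error above is the usual root-mean-square error between the labeled and predicted vectors; a minimal sketch, assuming it is computed component-wise over the offset maps:

```python
import numpy as np

def rmse(predicted, target):
    """Root-mean-square error between predicted and labeled (H, W, 2) offset maps,
    expressed in pixels."""
    return float(np.sqrt(np.mean((predicted - target) ** 2)))
```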
As you can see, it performs very well as an instance segmentation model, but for object detection the vectors need to be grouped to identify the actual objects. For this purpose I used the KMeans clustering method, but with the default cluster-count parameter the algorithm could not identify the right number of objects. If I set the correct object count, the output was much better, but this number is not available without the label file, so at prediction time, where no labels exist, the predictions failed. Because this was a functional instance segmentation task but I wanted object detection, I tried a different approach.
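A sketch of this grouping step with scikit-learn, assuming the per-pixel predictions have already been converted into candidate center points (pixel position plus predicted vector); n_objects is exactly the parameter that is unavailable without the label file:

```python
from sklearn.cluster import KMeans

def group_centers(predicted_centers, n_objects):
    """predicted_centers: (N, 2) array of per-pixel center predictions.
    Groups them into n_objects instances and returns the cluster centers
    and the per-pixel instance assignments."""
    kmeans = KMeans(n_clusters=n_objects, n_init=10).fit(predicted_centers)
    return kmeans.cluster_centers_, kmeans.labels_
```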
Mask R-CNN approach
After I found the documentation of Mask R-CNN [5], I decided to try it; of course, I had to fine-tune the model. As I mentioned, the required structure was fixed, but fortunately this tool also uses an encoder-decoder structure, and the encoder part is easily configurable, for example ResNet can also be chosen. The decoder part is unique to the model, so there was no difficulty in using the implemented one. Pretrained weights are also available for the model, trained on ImageNet [6], which is larger and more varied than CityScapes. After I set the parameters, I fine-tuned the model for a few epochs, and the output was excellent.
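As a sketch of the fine-tuning setup, assuming torchvision's implementation of Mask R-CNN (there, the detection weights are trained on COCO on top of an ImageNet-pretrained ResNet backbone); the class count below, the 8 CityScapes instance classes plus background, is illustrative:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_finetune_model(num_classes):
    """Mask R-CNN with a ResNet-50 FPN encoder; only the task-specific
    predictor heads are replaced before fine-tuning on CityScapes."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Swap the box-classification head for our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Swap the mask head the same way.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model

model = build_finetune_model(num_classes=9)  # 8 instance classes + background
```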
References
[1] Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7482–7491).
[2] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., … & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213–3223).
[3] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
[4] Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (pp. 527–538).
[5] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969).
[6] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE.