Deep Learning-based Semantic Segmentation in Simulation and Real-World for Autonomous Vehicles
Author: Zsombor Tóth
People often assume that designing a self-driving car is not a convoluted task, yet driving is one of the more complicated activities humans routinely perform. Following the rules of the road is not enough to drive as well as a human does, because humans handle unexpected situations, react to weather conditions, and sometimes make decisions against the rules in order to avoid endangering human life. In the age of artificial intelligence, self-driving is one of the most actively researched areas of the automotive industry. Many multinational companies are investing enormous human and financial capital in the research and development of hardware and software that support this function. These components, such as lidars, cameras, other sensors, and the algorithms running behind them, must work closely together to achieve fully autonomous driving.
Our most important "sensors" in road driving are our visual sensors, our eyes, with which we identify our environment quickly and accurately. Based on this observation, we can say that the success of self-driving vehicles largely depends on well-functioning image recognition. The available evidence indicates that this task is currently best accomplished using deep neural networks.
Such a complex system is built from several camera-based components that a fully autonomous vehicle requires. These components cover many aspects, of which semantic segmentation is a core element. Semantic segmentation is a special image processing step in which we try to separate two or more elements of an image and define the boundaries of individual entities. On the technical side, this process classifies the image at pixel level, assigning one of a set of predefined classes to each pixel. The major difficulty in development is data gathering, as training a complex deep learning algorithm requires vast quantities of input data. Simulators, which can automatically generate training data, might help solve this problem.
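To make the pixel-level formulation concrete, here is a toy sketch in Python/NumPy. The shapes match the 640×480 images and four classes used later in this post; everything else is purely illustrative, not my actual code:

```python
import numpy as np

# Toy illustration of the per-pixel formulation: a semantic segmentation
# model maps an RGB image of shape (H, W, 3) to a label map of shape
# (H, W), where each entry is one of the predefined class indices.
H, W, NUM_CLASSES = 480, 640, 4

image = np.zeros((H, W, 3), dtype=np.uint8)   # input camera frame
label_map = np.zeros((H, W), dtype=np.uint8)  # one class index per pixel

label_map[300:320, :] = 1  # e.g. a horizontal band labeled as class 1
assert label_map.max() < NUM_CLASSES
```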
Dataset from simulator
Of the available simulators, I first chose Duckietown because it provides a completely open-source and comprehensible development environment. The advantages of Duckietown are that the simulator is transparent and modifiable, and it is still under active development. Its biggest drawback, however, is that it does not support collecting semantically segmented data, so I had to solve this problem myself.
The first task was to design and implement a tool capable of saving the original and the labeled image at the same time. To accomplish this, I used two cameras that render the original and the segmented image simultaneously. To create the segmented image, I applied custom textures to every object of the simulated world. I created unique, pixel-precise textures in Photoshop, coloring everything black except the parts I wanted to segment; this amounted to 45 unique textures in total. I chose the colors (red, green, blue, and black) so that they can be easily distinguished from each other, which allows me to apply a color filter to the image of the secondary (semantically segmented) camera, separate the objects from each other, and assign the appropriate class labels. I identified four classes that cover the most important entities of the roadway: the solid white line indicating the edge of the roadway, the yellow dashed centerline, the red line indicating a stop line, and a fourth category that includes everything not listed above, i.e., the unlabeled class. Using this tool, I generated a training set of 5,000 samples and a validation set of 1,000 samples; all images are color images of 640×480 pixels.
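A minimal sketch of the color-filtering step is below. Only the palette (red, green, blue, black) and the four classes are stated above; which texture color encodes which class is my assumption here, as is the exact-match filtering:

```python
import numpy as np

# Hypothetical color-to-class mapping; the palette comes from the text,
# but which color encodes which roadway entity is an assumption.
PALETTE = {
    (0, 0, 0):   0,  # black -> unlabeled (everything else)
    (255, 0, 0): 1,  # red   -> stop line (assumed)
    (0, 255, 0): 2,  # green -> white edge line (assumed)
    (0, 0, 255): 3,  # blue  -> yellow dashed centerline (assumed)
}

def color_mask_to_labels(seg_img: np.ndarray) -> np.ndarray:
    """Convert the secondary camera's RGB image (H, W, 3) into an
    integer label map (H, W) by filtering on the palette colors."""
    labels = np.zeros(seg_img.shape[:2], dtype=np.uint8)
    for color, cls in PALETTE.items():
        # Exact matching assumes the renderer outputs pure palette colors;
        # a distance threshold would be safer if colors get blended.
        match = np.all(seg_img == np.array(color, dtype=seg_img.dtype), axis=-1)
        labels[match] = cls
    return labels
```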
U-Net
Over the years, a number of deep neural network architectures have been proposed for semantic segmentation; however, in addition to segmentation quality, the most important factor is execution time. I emphasize that real-time performance is often necessary, since semantic labeling is usually employed only as a preprocessing step for other time-critical tasks. Taking these aspects into account, I chose the U-Net architecture, which was originally designed for biomedical image segmentation. Compared to the original article, I reduced the number of convolutional layers and then trained the model on my generated dataset. The most commonly used loss function for image segmentation is a pixel-wise cross-entropy loss; as the optimizer, I used stochastic gradient descent (SGD) with momentum 0.9, learning rate 0.01, and weight decay 10⁻⁴.
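The post does not name a framework, so the sketch below assumes PyTorch; the channel counts and depth of the reduced U-Net are likewise my guesses, while the loss and optimizer settings (pixel-wise cross-entropy, SGD with momentum 0.9, learning rate 0.01, weight decay 10⁻⁴) come directly from the text:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    """A U-Net with fewer convolutional layers than the original paper;
    the exact depth and width used here are illustrative assumptions."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.enc1 = double_conv(3, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                 # per-pixel class logits

model = SmallUNet()
criterion = nn.CrossEntropyLoss()  # pixel-wise cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# One illustrative training step on a dummy 640x480 batch
images = torch.randn(2, 3, 480, 640)
labels = torch.randint(0, 4, (2, 480, 640))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

The skip connections (the `torch.cat` calls) are what let U-Net combine coarse semantic context with fine spatial detail, which matters for thin structures such as lane markings.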
On simulated images, the validation pixel accuracy reached 98.42% by the end of training. I note that there are more suitable metrics for measuring segmentation quality (e.g., mean IoU), whose implementation will be part of my future work. I tested the model on both simulated and real images; the results are shown in the images below.
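For reference, here is a sketch of both metrics: pixel accuracy, used above, and mean IoU, mentioned as future work. This is the generic definition of each metric, not my own implementation; `pred` and `target` are integer label maps of shape (H, W):

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    # Fraction of pixels whose predicted class matches the ground truth
    return float(np.mean(pred == target))

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 4) -> float:
    # Intersection-over-union per class, averaged over the classes present
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Mean IoU is the stricter metric here: with a dominant unlabeled class, a model can score high pixel accuracy while segmenting the thin line classes poorly.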
In summary, the results are encouraging, although real-world images are much more diverse than simulated ones. The work performed here allows me to pay more attention to model design and optimization problems in the future.
For a better understanding of U-Net, more information can be found in this article:
Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation", arXiv:1505.04597, 2015.