Author: Fábián Füleki
As self-driving becomes part of everyday life, safety has to be guaranteed in every possible driving condition. One key component of self-driving is the perception of the environment, as planning is based on perception and acting is based on planning. For sensing, the collection of 3-dimensional data is mandatory for the precise localization of objects in the scene. This task can be solved easily using high-precision, high-resolution lidars, but these are quite expensive. One way to resolve this issue is to utilize cameras for 3D perception as well. As camera sensors only return 2D color information from the scene, some form of 3D estimation is needed. The evident solution is to determine a distance for every pixel in the RGB image, a task known as depth estimation.
Depth estimation can be addressed using deep neural networks trained in a fully supervised manner, with the RGB image(s) as input and the estimated depth as output. As collecting dense depth ground truth in the real world is difficult, a synthetic dataset called Synthia has been utilized for training, which provides RGB images, depth maps, and semantic segmentation from stereo cameras. The architecture of the deep convolutional neural network was an encoder-decoder with skip connections. During the research, four different approaches have been tested:
- Monocamera setup: Single RGB image as input and a single depth output (although both left and right camera images were used for training)
- Stereo cameras setup: RGB images from the left and right camera were concatenated, which resulted in a 6-channel input
- Monocamera with weighted loss: Using semantic segmentation for loss calculation (the main contribution, explained below)
- Multitask setup: Predicting depth and semantic segmentation at the same time
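The monocular and stereo input configurations above differ only in how the camera images are stacked along the channel axis. A minimal NumPy sketch (the resolution is illustrative, not taken from the article):

```python
import numpy as np

H, W = 480, 640  # illustrative resolution, not necessarily Synthia's

# Dummy left/right RGB frames in height-width-channel layout
left = np.zeros((H, W, 3), dtype=np.float32)
right = np.zeros((H, W, 3), dtype=np.float32)

# Monocamera setup: a single 3-channel image is the network input
mono_input = left                                      # shape (H, W, 3)

# Stereo setup: left and right images concatenated channel-wise
stereo_input = np.concatenate([left, right], axis=-1)  # shape (H, W, 6)
```

In both cases the network output is a single-channel depth map at the same spatial resolution as the input.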
The main contribution of the research was addressing the problem of weakly represented objects. In camera images, some objects are represented by only a very small number of pixels compared to the full image resolution, but this does not mean that the object is farther away or less relevant. A great example of this effect is pedestrians: they are heavily involved in traffic scenes and they are also very vulnerable, so precise distance detection for them is a safety requirement. The proposed method compensates for the small image size of such objects: with the help of the semantic segmentation data, the loss for important classes (pedestrians and vehicles) is taken into account with a bigger weight, while for less important classes, such as the sky or trees, the loss is reduced.
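The weighting idea can be sketched as a per-pixel weight map derived from the segmentation labels. The class ids and weight values below are illustrative assumptions, as the article does not state the exact numbers:

```python
import numpy as np

# Hypothetical class ids and loss weights (illustrative values only)
CLASS_WEIGHTS = {
    0: 1.0,   # other
    1: 5.0,   # pedestrian: up-weighted
    2: 3.0,   # vehicle: up-weighted
    3: 0.1,   # sky: down-weighted
    4: 0.5,   # tree/vegetation: down-weighted
}

def weighted_depth_loss(pred_depth, gt_depth, seg_labels):
    """L1 depth loss where each pixel is scaled by a class-dependent weight."""
    weight_map = np.ones_like(pred_depth, dtype=np.float32)
    for class_id, w in CLASS_WEIGHTS.items():
        weight_map[seg_labels == class_id] = w
    return float(np.mean(weight_map * np.abs(pred_depth - gt_depth)))
```

Under these example weights, a depth error on a pedestrian pixel costs fifty times more than the same error on a sky pixel, so the optimizer cannot trade away accuracy on small, safety-critical objects for accuracy on large background regions.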
Another promising approach is solving multiple self-driving-related tasks at the same time. This is called multitask learning, which usually results in some performance gain over the single-task equivalents. Semantic information for the corresponding pixels might help the task of depth estimation, as these two kinds of information are heavily correlated.
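A common way to train such a multitask network is a weighted sum of a depth loss and a segmentation loss computed over the two output heads. A minimal sketch, where the balancing factor `seg_weight` is an assumption rather than a value from the article:

```python
import numpy as np

def multitask_loss(depth_pred, depth_gt, seg_logits, seg_gt, seg_weight=0.5):
    """Combined loss: L1 on depth plus weighted pixel-wise cross-entropy on segmentation."""
    depth_loss = np.mean(np.abs(depth_pred - depth_gt))

    # Numerically stable softmax over the class axis (last axis of seg_logits)
    logits = seg_logits - seg_logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    # Cross-entropy of the ground-truth class at every pixel
    h, w = seg_gt.shape
    gt_probs = probs[np.arange(h)[:, None], np.arange(w)[None, :], seg_gt]
    seg_loss = -np.log(gt_probs + 1e-9).mean()

    return float(depth_loss + seg_weight * seg_loss)
```

Because both terms are backpropagated through the shared encoder, the segmentation supervision can shape features that also benefit the depth head.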
The research with the four proposed approaches revealed the following results:
Based on this research, depth estimation appears to be a problem that can be solved using a monocular camera, which is quite promising and appealing. My assumption is that, with some improvements, the proposed approaches could be utilized in the automotive industry. The novel idea of weighting the loss according to the semantic segmentation appears to be helpful. Based on the structural similarity results, the best solution is to utilize a multitask learning approach with semantic segmentation.
The image shows a sample input and output of the depth estimator neural network in the multitask learning case.
This research was done with the supervision of Dr. Bálint Gyires-Tóth and Róbert Zsolt Kabai.