Disjoint Datasets in Multi-task Learning with Deep Neural Networks for Autonomous Driving
Author: Tamás Illés
Introduction
In Machine Learning (ML), we typically care about optimizing for a particular metric, whether this is a score on a certain benchmark or a business Key Performance Indicator (KPI). In order to do this, we generally train a single model or an ensemble of models to perform our desired task. We then fine-tune and tweak these models until their performance no longer increases. While we can generally achieve acceptable performance this way, by being laser-focused on our single task, we ignore information that might help us do even better on the metric we care about. Specifically, this information comes from the training signals of related tasks. By sharing representations between related tasks, we can enable our model to generalize better on our original task. This approach is called Multi-Task Learning (MTL).
My goal
In a deep learning project, data is fundamental for development. Usually, there is either a fully annotated dataset or a pre-trained model with weights learned on another dataset. Annotating data is often a time-consuming and expensive process, so it would be advantageous if we only had to annotate a subset of the data for each task, rather than the entire dataset.
My goal was to develop a multi-task learning method that is trained on disjoint datasets, i.e., disjoint subsets of the data, each annotated for a different task. In this setting, some performance loss is obviously expected. The questions are how large this loss is, whether it can be minimized, and whether there is a solution that reaches the performance of the standard multi-task setup.
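To make the setting concrete, here is a minimal PyTorch-style sketch of how such disjoint task subsets could be produced (the 50/50 random split and the helper name are illustrative assumptions, not the exact protocol of the project):

```python
import torch
from torch.utils.data import Subset

def make_disjoint_task_splits(dataset, seed=0):
    """Split one dataset into two disjoint halves, one half per task."""
    generator = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=generator).tolist()
    half = len(dataset) // 2
    # In the disjoint setting, only the segmentation labels of the first half
    # and only the depth labels of the second half are used during training.
    semseg_subset = Subset(dataset, indices[:half])
    depth_subset = Subset(dataset, indices[half:])
    return semseg_subset, depth_subset
```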
Dataset and the model
There are plenty of datasets for autonomous driving scenarios. I had to choose one that is up to date, high quality, and well annotated for my tasks. Cityscapes meets these expectations perfectly: it is a relatively small dataset, so I can develop and test my model quickly, and it is well annotated for the tasks I am interested in, namely semantic segmentation and depth estimation.
For the model, I could not use the standard multi-task structure, because the input datasets are separate. Instead, I used two single-task models with shared weights.
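A minimal sketch of this idea in PyTorch (the encoder layers and head shapes are placeholders, not the architecture actually used; the point is only the weight sharing between the two single-task models):

```python
import torch.nn as nn

class SingleTaskModel(nn.Module):
    """A single-task model wrapping a shared encoder and its own task head."""
    def __init__(self, encoder, head):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, x):
        return self.head(self.encoder(x))

# One encoder instance is shared by reference between the two models,
# so its weights receive gradients from both tasks.
shared_encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
semseg_model = SingleTaskModel(shared_encoder, nn.Conv2d(64, 19, 1))  # 19 Cityscapes classes
depth_model = SingleTaskModel(shared_encoder, nn.Conv2d(64, 1, 1))    # per-pixel depth
```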
Knowledge distillation
The main part of my work focused on the training method. Because of the disjoint datasets, the standard training procedure was not usable, so I developed several naive ways to train my model. These trainings were iterative: in some iterations the segmentation model was trained, in others the depth model. These approaches were only partially suitable because of the "forgetting effect", a negative effect during training: the iterations bound to semantic segmentation pull the model's weights towards that task and vice versa, so the model learns something about one task and then the other task immediately makes it forget. To minimize this effect, I used a modified form of knowledge distillation.
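Continuing the sketch above, such a naive alternating schedule could look like this (the loss functions, learning rate, and the data loaders over the two disjoint subsets are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Deduplicate parameters so the shared encoder is not passed to the optimizer twice.
params = {id(p): p for m in (semseg_model, depth_model) for p in m.parameters()}
optimizer = torch.optim.Adam(params.values(), lr=1e-4)

# Alternate between the two disjoint loaders; every update sees only one task,
# which is where the "forgetting effect" on the other task comes from.
for (seg_img, seg_label), (dep_img, dep_label) in zip(semseg_loader, depth_loader):
    # Segmentation iteration: pulls the shared weights towards segmentation.
    optimizer.zero_grad()
    F.cross_entropy(semseg_model(seg_img), seg_label).backward()
    optimizer.step()

    # Depth iteration: pulls the shared weights back towards depth estimation.
    optimizer.zero_grad()
    F.l1_loss(depth_model(dep_img), dep_label).backward()
    optimizer.step()
```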
Knowledge distillation is a technique in which there are usually a student and a teacher model. The teacher is a robust, pre-trained model that already has knowledge about the problem, and usually much more: it has broader functionality than we need, while we only require a small part of it. That is why we use the student model, which is much smaller and simpler. During training, the teacher passes its knowledge about the task to the student, which can use this knowledge to perform better.
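As an illustration, a common form of the distillation loss matches the softened output distributions of student and teacher (the temperature value is a typical hyper-parameter choice, not one taken from this project):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soften both distributions with a temperature and match them via KL divergence."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps the gradient magnitude comparable to the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```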
In my case, I did not have a teacher model with pre-existing knowledge, only student models. The idea was to use the model's previous state as the teacher, so the model from the previous iteration taught the model in its current state. This reduced the forgetting effect, because the earlier model had the knowledge and passed it on to its later self. Although this solution was by far the most complex, it performed the best of all.
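Continuing the earlier sketches, one depth iteration with this kind of self-distillation could look as follows; the weighting factor, the frequency of refreshing the "teacher" copy, and the choice to distill only the segmentation head here are illustrative assumptions (the symmetric case, distilling the previous depth predictions during segmentation iterations with e.g. an L1 term, works analogously):

```python
import copy
import torch
import torch.nn.functional as F

# A frozen copy of the segmentation model from the previous iteration acts as the teacher.
teacher_semseg = copy.deepcopy(semseg_model).eval()
for p in teacher_semseg.parameters():
    p.requires_grad_(False)

# During a depth iteration, an extra distillation term keeps the segmentation
# predictions close to those of the previous state, reducing the forgetting effect.
optimizer.zero_grad()
with torch.no_grad():
    teacher_logits = teacher_semseg(dep_img)
loss = F.l1_loss(depth_model(dep_img), dep_label) \
     + 0.5 * distillation_loss(semseg_model(dep_img), teacher_logits)
loss.backward()
optimizer.step()

# After the step, refresh the teacher with the model's new state for the next iteration.
teacher_semseg.load_state_dict(semseg_model.state_dict())
```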
Results
In this section, I present my results compared to the baseline methods.
In this figure, there are metrics from the semantic segmentation and the depth task; the goal is to reach the top right corner. The blue point represents the standard multi-task model, trained on a fully annotated dataset in the standard way. This is the first baseline, as I tried to reach its performance with the modified methods. The second baseline is given by the orange points, where everything is the same as for the blue one, except that only half of the dataset is used. This is a perfect example of how essential a sufficient amount of training data is. So my goal was to achieve the blue point's performance with the training data of the orange baseline.
My work was successful, because most of the methods I developed fall between the orange and blue points. By refining the algorithms and fine-tuning the ideas, I obtained better and better solutions, and with knowledge distillation I could approach the baseline. So the answer to the question of whether the loss can be minimized is absolutely YES!