Deep Reinforcement Learning for Vehicle Control based on Segmentation

Feb 25, 2021

Author: Márton Tim

In Deep Reinforcement Learning (DRL), poor convergence and low performance of the resulting agent are common issues that only intensify as the problem becomes more complex. This is one reason why end-to-end applications of DRL might soon be outperformed by solutions that break the original problem down into smaller, meaningful subtasks.

Last semester, I worked on a robust, obstacle-avoiding lane follower agent that uses segmentation to simplify its observations. The idea, motivation and details of my solution are all discussed below, so keep on reading :)

The concept

As stated in the intro, end-to-end DRL might have trouble grasping the fine details of a problem once its complexity goes beyond a reasonable limit. If we were to develop automated driving software based on a camera stream without ML, we surely would not experiment with directly linking pixel values to control output. This would go against our own intuition and hierarchical view of the world, where everything consists of smaller pieces, just as anything might be a building block of something greater.

Instead, we would tackle the problem in much smaller, meaningful pieces, where each part has a clear aim within the context of the main application. For instance, in the driving problem, we could separate concerns, since observation processing and lane-keeping are two distinct functionalities within the application. That makes them easier to develop, and we can realistically hope for better performance.

So this is exactly what we should aim for with ML as well. Therefore I, along with my supervisors Róbert Moni and Márton Szemenyei, worked on a solution where the observation was reduced to a right-lane segmented image, on top of which a DRL agent was trained. It also involved simulator-to-real transfer learning, as I was working in the Duckietown ecosystem and wanted to make use of its simulator, called Gym-Duckietown.

Domain-independent Segmentation Network Training

The first step was robust processing of the input images in order to create an environment-independent abstraction of the measurements. This would not only allow the agent to learn from an already domain-invariant input, but also reduce the complexity of each task by separating purposes within the processing pipeline, as discussed above.

We used a modified version of Gym-Duckietown, where the added functionalities included generating right-lane labelled observations and manually recording drive runs together with the labelled camera images. For a start, only right lane areas were targeted, as our goal for the initial solution was a robust lane follower agent. We recorded about 55 drive runs covering different maps, obstacle setups and other randomized parameters, which resulted in 11,500 simulator images. Parameter randomization was a form of domain randomization, which we hoped would make the segmentation network robust to domain-specific features even when trained solely on simulator images.

An example of a simulator image and its label.

We also wanted to apply domain adaptation techniques to increase performance in the target domain (the real environment). First, however, we needed a baseline for comparison. This ended up being training on the source domain data set (the simulator images) alone, without any information from the real domain.

However, this did not stop us from applying a bunch of tricks and tweaks in the training process. First, data augmentation was used extensively to prepare our network for different types of anomalies (HSV-shifting, random cropping, noise or motion-like blur). Another technique employed was cyclical learning rate scheduling to avoid local minima. After training a Fully Convolutional DenseNet (FC-DenseNet) for three and a half cycles using the AdamW optimizer and L2 regularization, the result was truly promising, with an IoU of over 95% achieved on real images. Despite the high IoU, an empirical evaluation on real videos showed that it was evidently lacking in prediction stability and accuracy.
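To illustrate what a cyclical learning rate looks like, here is a minimal sketch of a triangular schedule. The triangular shape, the function name and all numeric values are assumptions made for illustration; the post does not state which cyclical variant or which learning rate bounds were used.

```python
def triangular_lr(step, base_lr, max_lr, steps_per_half_cycle):
    """Triangular cyclical schedule: base_lr -> max_lr -> base_lr, repeated."""
    cycle_len = 2 * steps_per_half_cycle
    pos = step % cycle_len  # position inside the current cycle
    # Distance from the peak, scaled to [0, 1]: 1 at the cycle ends, 0 at the peak.
    dist = abs(pos - steps_per_half_cycle) / steps_per_half_cycle
    return base_lr + (max_lr - base_lr) * (1.0 - dist)


# Hypothetical values: 8 steps per cycle, so 28 steps cover three and a half cycles.
schedule = [triangular_lr(s, 1e-4, 1e-2, 4) for s in range(28)]
```

The periodic return to a high learning rate is what gives the optimizer a chance to leave a poor minimum it has settled into.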

Domain adapting the segmentation network

Speaking of real images, there is a great selection of unlabelled videos on Duckietown Logs, of which several dozen, consisting of over 30,000 pictures, were selected for unsupervised adaptation. We also labelled 100 of these images for quantitative testing and for supervised adaptation.

The field of domain adaptation is under active research, which means there are relatively few go-to solutions. A good portion of our work went into experimenting with four of them:

Training on combined Source & Target domain labelled set (S&T)
Histogram Matching source images with target samples (HM)
Domain converting source set using CycleGAN
Semi-supervised DA via Minimax Entropy (MME)

As we can see, these algorithms cover a wide range both in simplicity and in utilizing available data (unsupervised — semi-supervised — supervised). All of them have some advantageous property compared to the others, so let’s go through each and explain them.

S&T combined domain training: Possibly the simplest solution, as we perform the same training as in the simulator-only case, but with a merged dataset. Or maybe not so simple? The 80 real images (the training set of the labelled real data) would likely have no effect if they were simply merged with more than 11k simulated ones. That is why we applied a 1:1 expected sampling ratio between the two domains when assembling the training batches.
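One way to get a 1:1 expected ratio is to weight every sample by the inverse size of its domain. The sketch below uses only the Python standard library; the function names are hypothetical and this is not the code used in the project.

```python
import random


def balanced_weights(n_sim, n_real):
    """Per-sample weights that give each domain half of the probability mass."""
    return [0.5 / n_sim] * n_sim + [0.5 / n_real] * n_real


def sample_batch(n_sim, n_real, batch_size, rng):
    """Draw dataset indices with replacement; indices >= n_sim are real images."""
    indices = range(n_sim + n_real)
    weights = balanced_weights(n_sim, n_real)
    return rng.choices(indices, weights=weights, k=batch_size)


rng = random.Random(0)
batch = sample_batch(11500, 80, 10000, rng)
# Fraction of real images in the draw; it should be close to one half.
real_share = sum(i >= 11500 for i in batch) / len(batch)
```

In a PyTorch training loop, weights like these could be handed to torch.utils.data.WeightedRandomSampler to get the same effect. Note that with so few real images, each one is seen far more often than any simulator image.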

Histogram Matching: Another uncomplicated solution, this time an unsupervised one. The aim is to make the input data numerically similar by matching the per-channel cumulative distribution functions of every source image to those of a randomly selected target domain image. Sounds complicated? I can assure you it’s not, just check out the original article, titled Keep it Simple.
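The matching step for a single colour channel can be sketched as follows. This is a plain Python illustration with hypothetical function names, operating on flat lists of integer pixel values; a real pipeline would apply it to each channel of an image array, for example with skimage.exposure.match_histograms.

```python
def cdf(values, levels=256):
    """Cumulative distribution of integer pixel values in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    running, out = 0, []
    for count in hist:
        running += count
        out.append(running / len(values))
    return out


def match_channel(source, target, levels=256):
    """Remap source pixels so that their CDF follows the CDF of target."""
    src_cdf, tgt_cdf = cdf(source, levels), cdf(target, levels)
    lut, t = [], 0
    for s in range(levels):
        # Smallest target level whose CDF reaches the source CDF at level s.
        while t < levels - 1 and tgt_cdf[t] < src_cdf[s]:
            t += 1
        lut.append(t)
    return [lut[v] for v in source]


# Toy example: a dark source channel takes on the value range of the target.
matched = match_channel([0, 0, 1, 1, 2, 2], [10, 10, 20, 20, 30, 30])
```

Because both CDFs are non-decreasing, the lookup table can be built in a single pass over the intensity levels.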

Domain transformation using CycleGAN: You have probably already heard of CycleGANs, the source of several funny applications like the classic horse-zebra or human face-ramen soup conversion. At its core, however, it is a converter between a pair of image domains, trained on unpaired samples, which allows us to transform simulator images to better resemble real ones.

Unfortunately, we faced a significant problem: the conversion had a great deal of trouble preserving road geometry, as curves were sometimes cut short or replaced with a straight section. This in turn made the original labels invalid for the transformed image. We tried to get around this by recording another 2500 simulator images without domain randomization or optical distortion.

Semi-supervised DA via Minimax Entropy: And finally, the most complex procedure, utilizing all available data. In practice, it is an extension of S&T training, which in this algorithm corresponds to the entropy minimization part. An adversarial learning scenario is achieved by adding an entropy maximization step on unlabelled samples, hopefully resulting in domain-invariant features.
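The quantity that both sides of this game fight over is the entropy of the predicted class distribution on unlabelled target samples. In the minimax entropy method, the classifier is updated to maximize it while the feature extractor is updated to minimize it. The sketch below only computes that entropy term, averaged over pixels, which is an assumption about how the idea carries over to segmentation; the function names are hypothetical and the adversarial update itself is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def mean_pixel_entropy(logit_map):
    """Average entropy over pixels; logit_map holds one logit list per pixel."""
    return sum(entropy(softmax(px)) for px in logit_map) / len(logit_map)


# Two-class toy "images" (right lane vs. background) with two pixels each.
confident = mean_pixel_entropy([[8.0, -8.0], [-8.0, 8.0]])
uncertain = mean_pixel_entropy([[0.1, 0.0], [0.0, 0.1]])
```

Confident predictions give an entropy near zero, while predictions close to 50/50 approach log(2) for two classes.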

Comparison of domain adaptation results on three random images.

Having trained with all the mentioned algorithms, the results were mixed: some methods produced impressive results, while others increased performance only in certain aspects. The quantitative evaluation was performed on the remainder of the labelled real images, and we also tested empirically by running the networks on videos not present in the real training data set.
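For reference, the IoU (intersection over union) metric used in the quantitative evaluation can be sketched as below for a binary right-lane mask. Treating the task as a single foreground class and the function name are assumptions for illustration.

```python
def iou(pred, label):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, l in zip(pred, label) if p and l)
    union = sum(1 for p, l in zip(pred, label) if p or l)
    # Two empty masks agree perfectly, so define their IoU as 1.
    return intersection / union if union else 1.0


# One pixel in common out of three pixels marked by either mask.
score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

With only a small number of test images, a handful of difficult frames can shift the average noticeably, which is worth keeping in mind when comparing methods.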

Numerical evaluation results. Note that the test set count was relatively small.

Qualitative evaluation results based on a prediction on two real-domain videos.

We concluded that only two solutions were able to significantly increase performance in our case: S&T and MME, the simplest and the most complicated methods. Their performance was also quite similar, which is unsurprising given that MME is an extension of our S&T implementation. S&T turned out to be the best performer in the end, but the low number of test images possibly could not capture the domain generalization performance of minimax entropy.

HM and CycleGAN could also increase stability, but this came at the cost of more false positives. In the case of histogram matching, the reason could be that the synthetic source images were too simple, consisting of only a few elements and colours, which caused the processed images to look ugly to human perception, and therefore to machines as well. As for CycleGAN, the reason why domain conversion failed to increase performance is probably the already mentioned geometric distortion, which could not be entirely eliminated even with the simpler dataset.

Oh, and something worth mentioning: as far as we knew, semi-supervised DA via minimax entropy had never been used for segmentation before our solution. This led us to write an article about it, as well as about the comparison of DA techniques. I also had the chance to present it at the ISMCR 2020 conference, held in Budapest, which was an exceptional experience for me. I suggest checking it out if you are curious about theoretical and implementation details.

Deep Reinforcement Learning-based Agent Training

Now that we finally have a domain-independent, processed observation, the only thing left is to learn to drive based on it. As I was completely new to reinforcement learning, I wanted to choose the algorithm most likely to produce meaningful results before jumping into additional experiments.

So I chose Proximal Policy Optimization (PPO), as it is both a reliable and well-performing method. Coding my own version of PPO would have been cumbersome and computationally inefficient, and as I had already used Ray Tune for hyperparameter tuning, I opted for the turnkey solution of Ray RLlib. Unfortunately, even so, coding a working training setup was really time-consuming, and as I wanted to compete in AIDO 5, I had only a month left for experimenting with different hyperparameters.
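For readers who have not met PPO before, its core idea fits in a few lines: the policy update is limited by clipping the probability ratio between the new and the old policy. The sketch below shows the per-sample clipped surrogate objective with a hypothetical function name and a commonly used clip value of 0.2; it is meant as an illustration of the objective only, not of how RLlib implements it.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage estimates how much better
    the action was than the policy's average behaviour in that state.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the ratio outside
    # [1 - clip_eps, 1 + clip_eps] in the direction that improves the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 1.0)   # good action, ratio already too large
loss_kept = clipped_surrogate(1.5, -1.0)    # bad action made more likely: no cap
```

This pessimistic bound is a large part of why PPO tends to train stably without very careful step-size tuning.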

The preprocessing pipeline consisted of resizing the input image and feeding it into the loaded segmentation network; the segmented image was then processed by the agent. In my trials, the RLlib default network was used, which was a simple convolutional network that output a heading direction value at a fixed speed (the fastest possible).

The first converging training run was really promising, as the agent stayed in its lane and handled turns and slaloms without a problem. However, a significant issue was wiggling, which both makes driving less safe and reduces the effective speed of the robot. This is presumably the result of the agent lacking information about dynamics, as at every step the control decision was made based on only a single image. Also, in the official evaluation on the Duckietown servers, the performance was unacceptable, with the agent instantly leaving its lane and crashing. As the official evaluation was still performed in a Gym-Duckietown environment, the only possible explanation was that our agent was not robust to the differently parametrized dynamics.

So we wanted to both include dynamic information and prepare our agent for uncertainty in the dynamics. The first aim was achieved by providing the agent with the last three observations. We also added dynamics randomization, which included a lower frame rate (10–20 fps), randomly repeated observations, and delayed actions. A final addition was population-based training from Ray Tune, in order to train several agents in parallel and prefer those that perform better.

We were surprised to see that, only a week before the AIDO 5 event, these modifications resulted in agents whose performance in simulation was significantly above that of the other competitors. The wiggling was completely eliminated, and the agent was as cautious as possible at the fixed maximum speed. It managed to stay in its lane while storming ahead, though it still failed to get around obstacles.
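The observation stacking, repeated observations and delayed actions can be sketched as a thin wrapper around the simulator. Everything below is a hypothetical illustration: the class names, the probabilities and the toy environment are assumptions, the step method returns a simplified (observation, reward, done) tuple rather than the full Gym interface, and the frame rate randomization is left out.

```python
import random
from collections import deque


class CounterEnv:
    """Toy stand-in for the simulator: the observation is the step count."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 20  # observation, reward, done


class StackedRandomizedEnv:
    """Stacks the last n_frames observations and randomizes the dynamics."""

    def __init__(self, env, n_frames=3, repeat_prob=0.2, max_delay=2, seed=0):
        self.env = env
        self.rng = random.Random(seed)
        self.n_frames = n_frames
        self.repeat_prob = repeat_prob
        self.max_delay = max_delay

    def reset(self, neutral_action=0.0):
        obs = self.env.reset()
        # Fill the history with the first frame so the stack is always full.
        self.frames = deque([obs] * self.n_frames, maxlen=self.n_frames)
        # Actions wait in a queue of random length before reaching the simulator.
        delay = self.rng.randint(0, self.max_delay)
        self.pending = deque([neutral_action] * delay)
        return list(self.frames)

    def step(self, action):
        self.pending.append(action)
        obs, reward, done = self.env.step(self.pending.popleft())
        if self.rng.random() < self.repeat_prob:
            obs = self.frames[-1]  # simulate a dropped frame by repeating the last one
        self.frames.append(obs)
        return list(self.frames), reward, done


env = StackedRandomizedEnv(CounterEnv())
first_stack = env.reset()
stack, reward, done = env.step(0.5)
```

Because the delay is redrawn at every reset, the agent cannot rely on one fixed latency and has to learn a policy that tolerates a range of them.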

The final result of our agent. As you can see, wiggling is practically eliminated and we are rushing ahead along the track.

In the last days, some of my university teammates were able to achieve even better performance, but even so, my solution was featured in the official event and was presented briefly at NeurIPS. I cannot express how much that meant to me.

Future plans

So a lot of work is behind me: it took a great deal to create an amazing domain-independent segmentation network, just as I had to invest a lot of my time into training a driving agent. Still, there is so much more to do and experiment with, starting with the only thorn in my side from last semester: I wasn’t able to test my agent on a real device, because by the time I was ready for it, my university was forced to close its buildings because of the pandemic. Besides that, I have plenty of new ideas and plans to test, for instance training a more complicated segmentation network that would allow an agent to overtake using the left lane, or allowing it to set its own speed.

If you have any questions, or possibly some ideas to share with me, I would be happy to hear them! Until then, take a look at my article or feel free to browse some of the GitHub repositories I have written so far during this project.