Using Transfer Learning to solve the simulator-to-real problem in the Duckietown environment
Author: Zoltán Lőrincz
Throughout the previous semesters, I used Imitation Learning to carry out lane following in the Duckietown¹ environment. The agents were trained in the Duckietown simulator using different Imitation Learning methods such as Behavioral Cloning², Dataset Aggregation³ (DAgger) and Generative Adversarial Imitation Learning⁴ (GAIL). Even though the models performed well in the simulated environment, they could not generalize well enough in the real-world environment and failed to succeed in the lane following task.
In this semester I focused on applying Transfer Learning techniques to solve the simulator-to-real problem. I will present my work in the next sections.
The term Transfer Learning has multiple meanings in the field of Deep Learning.
Probably the most well-known one is the technique applied in the field of Supervised Learning, where a pretrained network is used to train a model for a different task or on a different dataset (with different labels).
Another type of Transfer Learning is Curriculum Learning, which is quite similar to the previous technique. First, the model is trained on one task (e.g. lane following), then the model is fine-tuned to perform a related but more difficult objective (e.g. lane following with other vehicles).
Domain Transfer Learning is also a commonly used method, where the model is trained in one domain (e.g. in a simulator), and it is tested and used in a different domain (e.g. in the real world). The task of the model remains the same. The aim of this approach is to apply methods during the training phase that bridge the differences between the training and the testing domains. During my work, I have used this form of Transfer Learning to ensure that the models trained in the simulated environment have equally good performance in the real-world domain.
During my work, I have used three different methods to solve the simulator-to-real problem: Domain Randomization, Image Thresholding and Visual Domain Adaptation using UNIT networks (Unsupervised Image-to-Image Translation Networks). I will present these solutions in the following sections.
Domain Randomization is a commonly used technique to perform simulator-to-real domain transfer. Instead of training the model in a single simulated environment, different parameters of the simulator are randomized to expose the model to a wide range of environments at training time. With enough variability, the real world may appear to the model as just another variation of the simulator. This way the model will learn general features that are applicable to the real world as well. The randomized variables of the simulator are usually either visual parameters (e.g. textures, lighting conditions, camera parameters, etc.) or physical parameters (e.g. friction coefficients, the gravitational acceleration, masses, sizes or other attributes of objects, etc.).
The Duckietown simulator has a built-in Domain Randomization functionality, which changes the parameters of the simulator each time the simulator is reset. I applied this technique by simply turning on this feature of the simulator during the process of collecting demonstrations, so that the agent is trained on domain randomized observations.
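As a rough illustration of what the simulator randomizes between episodes, the sketch below samples a fresh set of visual and physical parameters on each reset. The parameter names and ranges here are purely illustrative, not the Duckietown simulator's actual configuration keys:

```python
import random

def sample_randomized_params():
    """Sample one set of simulator parameters for a new episode.

    Names and ranges are hypothetical placeholders; a real domain
    randomization setup would use the simulator's own config schema.
    """
    return {
        # Visual parameters
        "light_intensity": random.uniform(0.5, 1.5),
        "camera_fov_deg": random.uniform(70.0, 90.0),
        "ground_color_jitter": [random.uniform(-0.1, 0.1) for _ in range(3)],
        # Physical parameters
        "wheel_friction": random.uniform(0.8, 1.2),
        "robot_mass_scale": random.uniform(0.9, 1.1),
    }

# A new parameter set would be drawn every time the environment resets,
# so consecutive episodes look and behave slightly differently.
params = sample_randomized_params()
```

With enough variation across these parameters, the policy cannot overfit to any single rendering of the track, which is exactly the effect described above.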
In the case of vision-based algorithms (such as ours), a feasible way to perform the domain transfer is by applying Visual Domain Adaptation. The aim of this technique is to transfer the observations from the training and testing domains to a common domain, which is then used to train the agent to perform the given task. Due to recent advances in image-to-image translation, this approach is becoming increasingly popular.
Transferring the images of the Duckietown simulator and the real-world environment to a common domain can be easily achieved by using image thresholding. By finding the right thresholding values for each domain, it is possible to extract the significant parts (e.g. driving lane markings) from the observations. This way the RGB images from both environments can be converted into similar-looking binary images. By using the binary observations during both training and testing time, the two different environments appear similar to the model, so its performance can be equally good in both domains.
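A minimal sketch of this idea is shown below: an RGB observation is reduced to a binary mask that keeps only the white and yellow lane markings. The threshold values are made-up examples; as noted above, in practice they must be tuned separately for the simulated and the real domain:

```python
import numpy as np

def to_binary_observation(rgb, white_thresh=180, yellow_thresh=(150, 150, 100)):
    """Convert an RGB observation (H, W, 3, uint8) into a binary lane mask.

    Threshold values are illustrative; real values are tuned per domain
    (simulator vs. real camera) so both produce similar-looking masks.
    """
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    # White markings: all three channels bright.
    white = (r > white_thresh) & (g > white_thresh) & (b > white_thresh)
    # Yellow markings: red and green high, blue low.
    yellow = (r > yellow_thresh[0]) & (g > yellow_thresh[1]) & (b < yellow_thresh[2])
    return (white | yellow).astype(np.uint8)
```

Feeding such masks to the policy in both domains is what makes the simulator and the real environment look alike from the model's perspective.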
Visual Domain Adaptation using UNIT networks
In this work, I also used a Visual Domain Adaptation method. This approach utilizes Unsupervised Image-to-Image Translation Networks⁵ (UNIT) to transfer the observations from the simulated (X_sim) and the real (X_real) domains into a common latent space (Z). After the UNIT network is properly trained and the quality of the image-to-image translation is satisfactory, the control policy is trained from this common latent space Z using the labels/demonstrations c from the expert in the simulator. The method is demonstrated in the figure below.
The main advantage of this method is that it does not require pairwise correspondences between images in the simulated and real-world training sets to perform the image-to-image translation. Furthermore, it does not require real-world labels either, the lane-following agent can be trained by using only the demonstrations from the simulator.
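The overall structure can be sketched as follows: two domain-specific encoders map into the same latent space, and the policy head is trained only on simulator latents with expert actions. This is a heavily simplified PyTorch stand-in for the actual UNIT architecture (layer sizes, latent dimension, and the two-output action head are assumptions, and the adversarial/cycle losses of UNIT are omitted):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a (3, 64, 64) observation to a shared latent vector z.

    A simplified stand-in for the UNIT encoders; one instance per domain.
    """
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Policy(nn.Module):
    """Predicts an action (e.g. velocity, steering) from the latent z."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.head = nn.Linear(latent_dim, 2)

    def forward(self, z):
        return self.head(z)

# One encoder per domain, both mapping into the SAME latent space Z.
enc_sim, enc_real, policy = Encoder(), Encoder(), Policy()

# Training: only simulated observations and expert demonstrations c are used.
obs_sim = torch.randn(4, 3, 64, 64)     # batch of simulated observations
expert_actions = torch.randn(4, 2)      # expert demonstrations c
loss = nn.functional.mse_loss(policy(enc_sim(obs_sim)), expert_actions)
loss.backward()

# Deployment: real observations pass through enc_real into the same Z,
# so the policy trained on sim latents can act on real images unchanged.
action_real = policy(enc_real(torch.randn(1, 3, 64, 64)))
```

The key point the sketch captures is that no real-world labels appear anywhere in the training loss; only the encoders need to see (unpaired) real images during the UNIT training stage.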
As can be observed in the figure above, the UNIT network achieves high image-to-image translation quality. The network even managed to learn how to remove the background that is above the horizon and replace it with the sky when performing the real-to-sim translation. This is also true for the other way around: during the sim-to-real translation, the network removes the sky and replaces it with background objects.
I used the following procedure to evaluate the trained models in the real-world environment. For each model, two episodes were run, each for 60 seconds (or less if the robot left the track). During the first episode, the robot was placed in the outer loop, while in the second episode, the robot started from the inner loop. The starting positions were the same for every model in each episode. Both were valid starting positions: the robots were placed in the middle of a specific straight, inside the right driving lane. In each episode, the survival time was measured and the road tiles visited by the robot were counted. Finally, the metrics of the two episodes were averaged. The table below presents the best results for each Transfer Learning method.
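The per-model score is simply the mean of the two episode measurements. A tiny sketch of that aggregation (the record field names are illustrative, not from an actual evaluation script):

```python
def average_metrics(episodes):
    """Average survival time and visited-tile count over the given episodes.

    Each episode record is a dict with hypothetical keys
    'survival_s' (seconds survived, capped at 60) and 'tiles_visited'.
    """
    n = len(episodes)
    return {
        "survival_s": sum(e["survival_s"] for e in episodes) / n,
        "tiles_visited": sum(e["tiles_visited"] for e in episodes) / n,
    }

# Example: one outer-loop and one inner-loop episode for a single model.
score = average_metrics([
    {"survival_s": 60, "tiles_visited": 14},
    {"survival_s": 42, "tiles_visited": 9},
])
```

The numbers above are invented for the example; the actual measured values are those reported in the results table.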
The best performing method was Image Thresholding. It is important to note, however, that this method has its limitations as well. It only works well if the thresholding values are properly set. I fine-tuned these values to fit the conditions of the Duckietown environment at home, and as a result, the driving policy has a decent performance in this setup. On the other hand, it is possible that in an environment with different visual conditions the method could function poorly.
Both Domain Randomization and UNIT network-based Visual Domain Adaptation achieved good results as well. By their nature, these methods are also more robust to environmental changes than Image Thresholding.
In conclusion, all three Transfer Learning methods managed to successfully solve the simulator-to-real problem, as the real-world robots could properly follow the right driving lane without committing critical mistakes. It is also evident that in the real environment these techniques are not only useful but necessary, as the model without any form of Transfer Learning completely failed at the lane-following task.
1. L. Paull et al., "Duckietown: An open, inexpensive and flexible platform for autonomy education and research," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 1497–1504, doi: 10.1109/ICRA.2017.7989179.
2. M. Bain and C. Sammut, "A Framework for Behavioural Cloning," Machine Intelligence 15, 15:103, 1999.
3. S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in AISTATS, pp. 627–635, 2011.
4. J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
5. M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised Image-to-Image Translation Networks," in Advances in Neural Information Processing Systems (NIPS), 2017.