Imitation Learning in the Duckietown environment

SmartLab AI
May 30, 2020

Author: Zoltán Lőrincz

Duckietown is an open-source, inexpensive platform for education and research in autonomy. The platform consists of two main parts: Duckiebots and Duckietowns. Duckiebots are differential wheeled robots that serve as small autonomous vehicles, while Duckietowns are the miniature cities in which these vehicles operate. The platform is also featured in the AI Driving Olympics (AI-DO), a global competition held every six months since December 2018 that focuses on AI for self-driving cars and robotics. The Duckietown competition has three main challenges: Lane following (LF), Lane following with vehicles (LFV), and Lane following with vehicles and intersections (LFVI).

The aim of this project was to solve the Duckietown lane following challenge using different imitation learning techniques. Imitation learning is a deep learning approach that assumes we have access to an expert who can solve the given problem efficiently and optimally. This expert provides demonstrations of the task, which we use to train the agent. The agent eventually learns the expert’s policy, which allows it to “imitate” the expert and behave in the environment just as the expert would.

Imitation learning has two main branches: Direct Policy Learning and Inverse Reinforcement Learning. In the first approach, the agent learns the policy directly from the expert’s demonstrations, hence the name. The aim of the second approach is to recover a reward function from the demonstrations, which is then used to learn the policy with reinforcement learning. In this project I experimented only with Direct Policy Learning methods, namely Behavioral Cloning and DAgger.

First, I tried to solve the lane following task using Behavioral Cloning, which is the simplest form of imitation learning. It is a direct policy learning method that learns the expert’s policy with supervised learning: given the expert’s demonstrations, we split them into state-action pairs, treat these pairs as i.i.d. examples, and finally apply supervised learning.
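
As a rough illustration, a Behavioral Cloning step could look like the sketch below. PyTorch is assumed here, and the policy network as well as the tensors of observations and expert actions are placeholders rather than the project’s actual code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def behavioral_cloning(policy, observations, actions, epochs=10, lr=1e-3):
    """Fit the policy to expert state-action pairs with supervised regression."""
    dataset = TensorDataset(observations, actions)   # i.i.d. (state, action) pairs
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                           # continuous wheel commands -> regression

    for _ in range(epochs):
        for obs, act in loader:
            optimizer.zero_grad()
            loss = loss_fn(policy(obs), act)         # match the expert's action in each state
            loss.backward()
            optimizer.step()
    return policy
```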

Although this algorithm can work excellently in some simpler applications, it is bound to fail in tasks where long-term planning is required. The main reason is that errors made in different states add up: a few mistakes by the agent can easily put it into a state that the expert has never visited and that the agent has never been trained on. As a result, the agent does not know how to recover from such states. I experienced exactly this problematic behavior in the lane following task: after a certain time, the robot made an error, could not recover from it, and ended up leaving the track. Because of this, I could not build a proper lane following agent using Behavioral Cloning.

The next algorithm I tried was DAgger (Dataset Aggregation), whose operation is illustrated in Figure 1. DAgger is an iterative method that assumes we have access to the expert at training time as well. We start with an initial policy obtained from the initial demonstrations using supervised learning (just as in Behavioral Cloning). Then we iterate until convergence. In each iteration, we collect trajectories (sequences of state-action pairs) by rolling out the current policy (the one obtained in the previous iteration). Next, for every visited state we collect feedback from the expert (what it would have done in that state). Finally, we train a new policy on this feedback. For the algorithm to work efficiently, it is important to train on all previously collected data as well, so that the agent “remembers” all the mistakes it made in the past. DAgger eliminates the problem described above, because each time the agent makes an error, it is shown how to correct it. This makes the approach suitable for tasks where long-term planning is needed. A code sketch of the loop follows Figure 1.

Figure 1: The DAgger algorithm
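
In code, the DAgger loop can be summarized schematically as follows; train_supervised, rollout and expert are hypothetical helpers standing in for the components described in the rest of this post.

```python
# Schematic DAgger loop. train_supervised, rollout and expert are hypothetical
# helpers: supervised policy fitting, driving the environment with a policy,
# and querying the expert for its action in a given state.
def dagger(env, expert, initial_demos, n_iterations):
    dataset = list(initial_demos)              # (state, expert_action) pairs
    policy = train_supervised(dataset)         # step 0: plain Behavioral Cloning

    for _ in range(n_iterations):
        states = rollout(env, policy)          # let the *current* policy drive
        # label every visited state with what the expert would have done there
        dataset += [(s, expert(s)) for s in states]
        policy = train_supervised(dataset)     # retrain on ALL data collected so far
    return policy
```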

In the following sections, I present my DAgger implementation. I created a pure pursuit P controller that served as the expert. This controller could be run in the simulator environment to collect demonstrations, and it could also be queried at any given location (state) to provide the computed actions (the PWM signals of the two driven wheels).
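
A heavily simplified sketch of such an expert is shown below. It assumes the simulator exposes the robot’s pose and a lookahead point on the lane center line; the gain, the reference velocity and the conversion to wheel commands are illustrative values only, not those of the actual controller.

```python
import numpy as np

GAIN = 4.0     # steering gain (illustrative)
V_REF = 0.35   # reference forward velocity (illustrative)

def pure_pursuit_expert(robot_pos, robot_heading, lookahead_point):
    """Return (left, right) wheel commands that steer towards the lookahead point."""
    to_target = lookahead_point - robot_pos
    to_target = to_target / np.linalg.norm(to_target)
    heading = np.array([np.cos(robot_heading), np.sin(robot_heading)])
    # signed angle between the current heading and the direction of the lookahead point
    cross = heading[0] * to_target[1] - heading[1] * to_target[0]
    alpha = np.arctan2(cross, np.dot(heading, to_target))
    omega = GAIN * alpha                       # proportional steering towards the target
    # differential drive: split (v, omega) into left/right wheel commands (unit wheel base)
    return V_REF - 0.5 * omega, V_REF + 0.5 * omega
```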

I used a simple neural network consisting of four convolutional layers and one fully connected layer, with batch normalization and ReLU activation functions.
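
One possible PyTorch realization of such a network is sketched below. The channel counts, kernel sizes and the single-channel 40x80 input (the preprocessed image described next) are assumptions, not the exact architecture used in the project.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolution + batch normalization + ReLU, halving the spatial resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class LaneFollowingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16),    # 1x40x80 -> 16x20x40
            conv_block(16, 32),   # -> 32x10x20
            conv_block(32, 64),   # -> 64x5x10
            conv_block(64, 128),  # -> 128x3x5
        )
        self.head = nn.Linear(128 * 3 * 5, 2)  # two outputs: left/right wheel commands

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(start_dim=1))
```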

The input images (the observations at each state) were preprocessed as follows. First, they were resized to a resolution of 80x60. The top third of each image was then cropped away, as this area does not contain information useful for the driving task. The cropped image was then converted to the HSV color space, and adaptive image thresholding was applied to extract the lane markings. The preprocessing procedure is illustrated in Figure 2; a code sketch of these steps follows the figure.

Figure 2: The results of the image preprocessing procedure
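
A sketch of this preprocessing with OpenCV could look as follows; the exact adaptive thresholding parameters and the choice of HSV channel to threshold are assumptions.

```python
import cv2
import numpy as np

def preprocess(frame_rgb):
    """Turn a raw camera frame into a small binary lane-marking image."""
    img = cv2.resize(frame_rgb, (80, 60))        # downscale to 80x60 (width x height)
    img = img[20:, :, :]                         # crop away the top third (sky/background)
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)   # switch to the HSV representation
    value = hsv[:, :, 2]                         # brightness channel (assumed choice)
    # adaptive thresholding keeps locally bright pixels, i.e. the lane markings
    lanes = cv2.adaptiveThreshold(value, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 11, -5)
    return lanes.astype(np.float32) / 255.0      # binary mask in [0, 1]
```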

Before the DAgger training, the agent was pretrained on the pre-collected demonstrations using Behavioral Cloning. During the training phase, in each iteration the agent drove in the simulator environment while the expert was queried at every visited state and its actions were logged. After each iteration, the agent was retrained on both the newly acquired and all previously collected demonstrations. After a number of iterations, the training converged and the agent achieved expert-like behavior.
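
Put together, the training procedure might look roughly like the sketch below, where collect_rollout (logging expert-labelled frames while the current policy drives) and fit_supervised (the Behavioral Cloning step) are placeholders for the project’s actual code.

```python
from torch.utils.data import ConcatDataset, Dataset

def dagger_training(env, policy, expert, pretraining_demos: Dataset, n_iterations: int):
    all_demos = [pretraining_demos]                            # data used for pretraining
    policy = fit_supervised(policy, ConcatDataset(all_demos))  # Behavioral Cloning pretraining

    for _ in range(n_iterations):
        # drive with the current policy, log (preprocessed frame, expert action) pairs
        new_demos = collect_rollout(env, policy, expert)
        all_demos.append(new_demos)
        # retrain on the union of all demonstrations collected so far
        policy = fit_supervised(policy, ConcatDataset(all_demos))
    return policy
```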

Even though Behavioral Cloning failed to solve the Duckietown lane following challenge, I was able to achieve an excellent result with the DAgger algorithm. This allowed me to place first in AI-DO 3 in the Lane following (Simulator, Testing) category.

Written by SmartLab AI
Deep Learning and AI solutions from Budapest University of Technology and Economics. http://smartlab.tmit.bme.hu/