Reinforcement Learning on Quadcopter

An image of the playfield, our autonomous quadcopter, and 2 modified Roombas.


I spent the last year of my undergraduate studies working on a reinforcement learning based solution for my university’s new robotics team, which aimed to compete in Mission 7 of the International Aerial Robotics Competition. Mission 7 challenged teams to build an autonomous aerial vehicle that would “herd” 10 Roombas across a goal line, either by bumping into a Roomba to make it turn 180 degrees or by landing on top of it to make it turn 45 degrees. This had to be done amidst 4 moving obstacles, on a 20x20 meter playfield, in a GPS-denied environment, and in under 10 minutes.
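To make those interaction rules concrete, here is a minimal sketch of how the two Roomba interactions might be modeled in a simulator; the function name and the angle-based Roomba state are my own assumptions, and only the two turn angles come from the rules described above.

```python
import numpy as np

def apply_interaction(heading_rad: float, interaction: str) -> float:
    """Return a Roomba's new heading after an interaction with the vehicle.

    Hypothetical helper for a simulator; only the two turn angles come
    from the competition rules described above.
    """
    if interaction == "bump":        # vehicle bumps the Roomba's front bumper
        heading_rad += np.pi         # 180-degree turn
    elif interaction == "top_tap":   # vehicle lands on top of the Roomba
        heading_rad += np.pi / 4     # 45-degree turn
    # Wrap the heading back into [-pi, pi).
    return (heading_rad + np.pi) % (2 * np.pi) - np.pi
```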

My Role

The team was interested in applying reinforcement learning to the problem, and it fell to me to research the feasibility of an end-to-end reinforcement learning approach in which the model’s observation space would be raw camera input and its action space would be target coordinates for the vehicle.

Approach

I used OpenAI’s Gym to create a training environment and began with a simplified version of the problem in which the simulated vehicle had a significant speed advantage, there were no obstacles, and the vehicle had an unobstructed view of the field at all times. This seemed a logical first step in familiarizing myself with the capabilities and pitfalls of reinforcement learning before graduating to the full problem space.
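Below is a rough, hypothetical skeleton of what such a Gym environment could look like. Only the overall shape follows the post (raw-pixel observations in, target coordinates out, a 20x20 meter field); the class name, image size, and other details are assumptions for illustration.

```python
import gym
import numpy as np
from gym import spaces


class SimplifiedHerdingEnv(gym.Env):
    """Hypothetical skeleton of the simplified training environment."""

    def __init__(self, field_size_m=20.0, image_size=84):
        super().__init__()
        # Observation: a rendered top-down view of the field (raw camera-style pixels).
        self.observation_space = spaces.Box(
            low=0, high=255, shape=(image_size, image_size, 3), dtype=np.uint8)
        # Action: an (x, y) target coordinate for the vehicle on the field.
        self.action_space = spaces.Box(
            low=0.0, high=field_size_m, shape=(2,), dtype=np.float32)

    def reset(self):
        # Place the Roombas and the vehicle in their starting configuration
        # (no obstacles in the simplified version) and return the first
        # rendered observation.
        raise NotImplementedError

    def step(self, action):
        # Move the vehicle toward the commanded target (with its speed
        # advantage), advance the Roombas, apply any bump / top-tap
        # interactions, compute the reward, and return
        # (observation, reward, done, info).
        raise NotImplementedError
```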

Screenshot of the simulation for the simplified version of the game. (White circles represent goal Roombas, red circles represent moving obstacles, and shown in the center is the vehicle.)


After iterating through many flavors of reinforcement learning, I eventually settled on Proximal Policy Optimization (PPO) for its balance of simplicity and performance.
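The post does not say which PPO implementation was used; as an illustration only, here is how training on an environment like the one sketched above might look with Stable-Baselines3 (an assumption on my part). The entropy coefficient discussed later appears here as `ent_coef`.

```python
# Hypothetical training sketch: the PPO implementation and all
# hyperparameter values shown here are assumptions.
from stable_baselines3 import PPO

env = SimplifiedHerdingEnv()          # the Gym environment sketched earlier
model = PPO(
    "CnnPolicy",                      # convolutional policy for raw-pixel observations
    env,
    ent_coef=0.01,                    # entropy coefficient (exploration vs. exploitation)
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
model.save("herding_ppo")
```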

Progress

In training the model, I quickly learned the importance of reward engineering and introduced additional incentives on top of the points earned from successfully herding a Roomba across the goal line. The most important of these was the “direction incentive,” which rewarded the model when the average Roomba heading pointed towards the goal line (sketched after the figure below). Below is a figure showing the increase in points earned over the course of a training session.


Points earned vs. training steps. Notice the positive progress made after doubling the “direction incentive” and after halving the entropy coefficient, a hyperparameter that regulates the exploration vs. exploitation tradeoff.
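As an illustration of the reward shaping described above, here is a minimal sketch of a direction incentive; the exact formula, weight, and goal-line orientation used in the project are not stated in the post, so these are assumptions.

```python
import numpy as np

def direction_incentive(roomba_headings_rad, weight=0.1):
    """Reward-shaping term that is largest when the average Roomba heading
    points toward the goal line (assumed here to lie in the +x direction).
    The weight is a hypothetical value.
    """
    # cos(heading) is +1 for a Roomba pointing straight at the goal line
    # and -1 for one pointing straight away from it.
    alignment = np.mean(np.cos(roomba_headings_rad))
    return weight * alignment

# Inside the environment's step(), this would be added to the sparse,
# score-based reward, e.g.:
#   reward = points_scored_this_step + direction_incentive(headings)
```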

After months of iterating on hyperparameters and reward engineering schemes, I was eventually able to get the model to achieve an average of 75% of the maximum points available in the game. Below is a recording of the model playing a perfect game.

Simulation of the game. Notice the aerial vehicle interacting with the goal Roombas and how each crosses the goal line on the right side of the screen.


Unfortunately, progress had to stop when I graduated, but I felt I had successfully evaluated the feasibility of an end-to-end reinforcement learning solution. My conclusion was that such a solution was not appropriate. My advice to the team was that machine learning could still be used, but only for very specific tasks within the larger algorithm, because machine learning is inherently difficult to debug when it doesn’t perform as expected, and even more so when it comprises the entirety of your algorithm.

Overall, I learned a tremendous amount about reinforcement learning and machine learning in general, and I credit this experience with my persistent and deep fascination with the field. It was this experience that inspired me to go into data science, and I hope to return to research as intriguing and as satisfying as this was.