Efficient deep learning-based monocular 3D perception for agile robot navigation

Research output: Book/anthology/dissertation/report › Ph.D. thesis


Aerial robots, such as multirotor drones, have the potential to revolutionize many aspects of our lives thanks to their excellent agility and maneuverability. However, in real-world missions, their navigation speeds are often limited by their inability to understand their surroundings quickly and precisely enough in complex environments. This challenge is further exacerbated by the constrained computational resources and limited sensing options available onboard micro aerial vehicles. Therefore, it is crucial to study efficient perception methods, as they enable aerial robots to travel at high speeds and make full use of their capabilities in vital missions such as search-and-rescue operations and critical infrastructure inspections.

In this thesis, a lightweight convolutional neural network (CNN)-based method for 3D object perception and representation is proposed to facilitate high-speed navigation of agile drones in autonomous drone racing contexts. The proposed method uses only a monocular camera with a wide field-of-view fish-eye lens, but can precisely and robustly map racing gates in 3D space. We evaluate our approach on difficult real-world scenarios and show that it significantly outperforms existing methods in terms of inference speed and accuracy.

Obtaining sufficient data to train deep learning-based perception models is a general challenge for any data-driven approach. For agile robotics, collecting data is even more expensive and, in many applications, less feasible. Simulation is thus a natural consideration, and while realistic simulators with high real-world fidelity exist, the sim-to-real gap still negatively impacts the success of transferring trained models to reality. This thesis proposes a deep learning-based method using a morphological filtering abstraction that significantly increases the success rate of zero-shot sim-to-real transfer learning while also improving the robustness of the perception system in challenging conditions such as poor lighting and high motion blur.
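As a rough illustration of why a morphological abstraction can help with sim-to-real transfer, the sketch below computes a morphological gradient (dilation minus erosion) on a small grayscale image. The abstract does not specify which filter the thesis uses, so the operator, the 3x3 window, and all function names here are assumptions made purely for the example. The point it shows is that the filtered image keeps object edges while discarding a uniform brightness offset, one of the appearance differences that can separate simulated from real imagery.

```python
# Illustrative sketch only, not the thesis's method: a 3x3 morphological
# gradient on a grayscale image stored as a list of rows. The operator and
# window size are assumptions chosen for this example.

def _window(img, r, c):
    """Return the pixel values in the 3x3 window around (r, c), clipped at the borders."""
    rows, cols = len(img), len(img[0])
    return [
        img[rr][cc]
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
    ]

def dilate(img):
    """Grayscale dilation: each pixel becomes the maximum of its 3x3 window."""
    return [[max(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def erode(img):
    """Grayscale erosion: each pixel becomes the minimum of its 3x3 window."""
    return [[min(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def morphological_gradient(img):
    """Dilation minus erosion: large at edges, zero in flat regions."""
    dilated, eroded = dilate(img), erode(img)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]

# A 5x5 image with a bright 3x3 square on a dark background, and the same
# scene with a uniform brightness offset (a stand-in for a lighting change).
dark = [[110 if 1 <= r <= 3 and 1 <= c <= 3 else 10 for c in range(5)]
        for r in range(5)]
bright = [[value + 50 for value in row] for row in dark]

gradient_dark = morphological_gradient(dark)
gradient_bright = morphological_gradient(bright)
```

In this toy case both inputs yield the same gradient image, so a model trained on the filtered representation would see identical inputs despite the brightness change. Real sim-to-real differences are far richer than a constant offset, so this only hints at the intuition.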

Traditional pixel-based cameras, though commonly used in most practical applications, are poorly suited to agile navigation, as they suffer from visual degradation such as motion blur and high-dynamic-range effects. This thesis investigates the feasibility of safe navigation with a single event camera for autonomous drone racing. In our experiments, the proposed method combines a sparse recurrent CNN with sim-to-real transfer learning to perceive racing gates at high speeds under challenging lighting conditions, further encouraging the paradigm shift from pixel-based sensing to event-based neuromorphic sensing for more efficient and more robust robot perception.

Overall, the key contributions of this thesis are i) a comprehensive study of a lightweight monocular 3D perception model, suggesting that it can outperform more complex models in a task-specific scenario, ii) a demonstration that data augmentation with morphological filters can improve the robustness of perception models and their success rates in sim-to-real transfer learning, and iii) a feasibility demonstration of a robot navigation model using neuromorphic cameras for agile robotics. We hope that this work will lay another stepping stone toward a generalized navigation framework for agile aerial robots with minimal sensing and computing capabilities that can carry out complex missions at ultra-high speeds in meaningful real-world applications.
Original language: English
Publisher: Århus Universitet
Number of pages: 105
Publication status: Published - Mar 2023

