Unifying Audio-Visual Machine Perception – Tasks & Architectures

Videos
ML4MI
UW-Madison
Healthcare
Multimodal learning
Contrastive learning
Perception
Early-fusion
Autoencoder
Presenter

Pedro Morgado

Date

July 12, 2023

About this resource

Accurately recognizing, localizing, and separating sound sources is essential for effective audio-visual perception. Traditionally, these tasks have been approached independently, with separate methods developed for each. However, the interdependencies between source localization, separation, and recognition make it clear that independent models may yield suboptimal performance. To address this, our research focuses on unifying audio-visual learning tasks and architectures so that audio and visual cues are integrated for joint localization, separation, and recognition. In this talk, I will present our recent progress in this field. I will introduce a unified pre-training framework that enables simultaneous learning of audio-visual recognition, localization, and separation. Additionally, I will showcase a novel early fusion architecture that incorporates local audio-visual interactions and can be efficiently pre-trained with an audio-visual masked autoencoding framework. The objective of unified pre-training of early fusion models is to replicate human-like multimodal perception, promising a deeper and more sophisticated understanding of audio-visual interactions that is crucial for truly multimodal applications. Throughout the talk, I will share a sequence of compelling findings that demonstrate strong positive transfer between these tasks. Furthermore, I will highlight the substantial benefits that early audio-visual fusion can provide in enhancing model expressivity and, consequently, performance on challenging audio-visual applications.
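
To make the early fusion and masked autoencoding ideas concrete, the following is a minimal, illustrative sketch and not the speaker's implementation: video and audio-spectrogram patches are tokenized, a large fraction of tokens is masked, and a single joint Transformer encoder lets audio and visual tokens interact from the very first layer before per-modality heads reconstruct the masked patches. The class name, patch sizes, hyperparameters, and the simplified (mask-token replacement) masking scheme are all assumptions made for brevity.

```python
# Illustrative sketch of early-fusion audio-visual masked autoencoding (PyTorch).
# All names, sizes, and the simplified masking scheme are assumptions, not the
# architecture presented in the talk.
import torch
import torch.nn as nn


class EarlyFusionMaskedAutoencoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.video_proj = nn.Linear(3 * 16 * 16, dim)  # flattened 16x16 RGB patches
        self.audio_proj = nn.Linear(16 * 16, dim)      # flattened 16x16 spectrogram patches
        self.video_type = nn.Parameter(torch.zeros(1, 1, dim))  # modality embeddings
        self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # A single joint encoder: audio and visual tokens attend to each other in
        # every layer ("early fusion"), rather than being fused only at the output.
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.video_head = nn.Linear(dim, 3 * 16 * 16)  # reconstruct video patches
        self.audio_head = nn.Linear(dim, 16 * 16)      # reconstruct audio patches

    def forward(self, video_patches, audio_patches):
        # video_patches: (B, Nv, 768); audio_patches: (B, Na, 256)
        v = self.video_proj(video_patches) + self.video_type
        a = self.audio_proj(audio_patches) + self.audio_type
        tokens = torch.cat([v, a], dim=1)
        # Randomly replace a large fraction of tokens with a shared mask token.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        z = self.encoder(tokens)
        nv = video_patches.shape[1]
        v_rec = self.video_head(z[:, :nv])
        a_rec = self.audio_head(z[:, nv:])
        # Reconstruction loss is computed only on masked positions.
        loss = (
            ((v_rec - video_patches) ** 2).mean(-1)[mask[:, :nv]].mean()
            + ((a_rec - audio_patches) ** 2).mean(-1)[mask[:, nv:]].mean()
        )
        return loss


# Toy usage: 2 clips, each with 8 video patches and 4 audio patches.
model = EarlyFusionMaskedAutoencoder()
loss = model(torch.randn(2, 8, 3 * 16 * 16), torch.randn(2, 4, 16 * 16))
loss.backward()
```

The key design choice illustrated here is that fusion happens inside the encoder, so masked audio patches can be reconstructed using visible video patches and vice versa, which is what makes the pre-training signal genuinely cross-modal.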

Bio: Pedro is an Assistant Professor at the University of Wisconsin-Madison in the Department of Electrical and Computer Engineering, with an affiliation in the Department of Computer Sciences. His research interests lie at the intersection of computer vision and machine learning, focusing on developing systems that continuously learn to perceive the world through multiple sensory modalities without direct human supervision. Prior to joining UW-Madison, he was a postdoctoral fellow at Carnegie Mellon University, working with Abhinav Gupta. He earned his Ph.D. from the University of California, San Diego, advised by Prof. Nuno Vasconcelos, and his B.Sc. and M.Sc. degrees from Universidade de Lisboa, Portugal.

A netID is required to view ML4MI videos: View 2023-07-12 recording.

See also

  • ML4MI: Explore other talks from the ML4MI group at UW-Madison.