SANE 2025 - Speech and Audio in the Northeast

November 7, 2025

Building at 111 8th Ave, New York, NY

SANE 2025, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Friday November 7, 2025 at Google, in New York, NY.

It is the 12th edition in the SANE series of workshops, which started in 2012 and is typically held every year, alternating between Boston and New York. Since the first edition, the audience has steadily grown, with a new record of 200 participants and 53 posters in 2024.

SANE 2025 will feature invited talks by leading researchers from the Northeast as well as from the wider community. It will also feature a lively poster session, open to both students and researchers.

Details

  • Date: Friday, November 7, 2025
  • Venue: Google (Chelsea), New York, NY.

Schedule

8:30-9:05    Registration and Breakfast
9:05-9:10    Welcome
9:10-10:00   Dan Ellis (Google DeepMind)
             Recomposer: Event-roll-guided Audio Editing
10:00-10:50  Leibny Paola Garcia Perera (Johns Hopkins University)
             The Step-by-Step Journey of a Spontaneous Speech Dataset Toward Understanding
10:50-11:20  Coffee break
11:20-12:10  Yuki Mitsufuji (Sony AI)
             AI for Creators: Pushing Creative Abilities to the Next Level
12:10-13:00  Julia Hirschberg (Columbia University)
             Code-Switching in Multiple Languages in Speech and Text
13:00-13:30  Lunch
13:30-16:00  Poster Session + Coffee
16:00-16:50  Yoshiki Masuyama (MERL)
             Neural Fields for Spatial Audio Modeling
16:50-17:40  Robin Scheibler (Google DeepMind)
             Generative Methods for Speech Enhancement and Separation
17:40-17:45  Closing remarks
17:45-       Drinks nearby

Registration

SANE is now full. If you would like to be added to the waitlist, please send an email with your name and affiliation.

Directions

The workshop will be hosted at Google, in New York, NY. Google NY is located at 111 8th Ave, and the closest subway stop is the 14 St station on the A, C, and E lines. The entrance we will use is still TBD.

 

Organizing Committee

 

Sponsors

SANE remains a free workshop thanks to the generous contributions of the sponsors below.

MERL Google

Talks

 

Recomposer: Event-roll-guided Audio Editing

Dan Ellis

Google DeepMind

Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes, able to delete, insert, and enhance events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions -- action, class, timing. Our work demonstrates that "recomposition" is an important and practical application.
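For illustration only, here is a minimal sketch of the data construction described above: building an "event roll" conditioning matrix and a synthetic (input, desired output) pair by adding an isolated event to a background recording. It uses raw NumPy arrays as stand-ins for real audio; all names, frame sizes, and class counts are assumptions, not details of the actual Recomposer system, which operates on SoundStream representations with a transformer.

```python
# Minimal sketch (not the authors' code): constructing a synthetic
# (input, desired output) pair and an event-roll conditioning matrix,
# as described in the abstract. All names and parameters are illustrative.
import numpy as np

SR = 16000          # sample rate (assumption)
N_CLASSES = 10      # size of the sound-event vocabulary (assumption)
FRAME = 0.05        # event-roll frame length in seconds (assumption)

def make_event_roll(events, clip_len_s, n_classes=N_CLASSES, frame_s=FRAME):
    """events: list of (class_id, onset_s, offset_s). Returns a [classes, frames] 0/1 matrix."""
    n_frames = int(np.ceil(clip_len_s / frame_s))
    roll = np.zeros((n_classes, n_frames), dtype=np.float32)
    for cls, onset, offset in events:
        roll[cls, int(onset / frame_s):int(np.ceil(offset / frame_s))] = 1.0
    return roll

def make_training_pair(background, event, onset_s, gain=1.0):
    """Insert an isolated event into a dense background at a given onset.
    For an 'insert' edit, the input is the background alone and the desired
    output is the mixture; for 'delete', the roles are swapped."""
    out = background.copy()
    start = int(onset_s * SR)
    end = min(start + len(event), len(out))
    out[start:end] += gain * event[: end - start]
    return background, out

# Toy usage with random signals standing in for real audio.
bg = np.random.randn(SR * 10).astype(np.float32) * 0.1   # 10 s background
ev = np.random.randn(SR * 1).astype(np.float32)          # 1 s isolated event (e.g., "Door")
inp, target = make_training_pair(bg, ev, onset_s=3.0)
roll = make_event_roll([(0, 3.0, 4.0)], clip_len_s=10.0)  # class 0 active from 3 to 4 s
print(inp.shape, target.shape, roll.shape)
```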

Dan Ellis

Daniel P. W. Ellis received the Ph.D. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, where he was a Research Assistant in the Machine Listening Group of the Media Lab. He spent several years as a Research Scientist at the International Computer Science Institute, Berkeley, CA. In 2000, he took a faculty position with the Electrical Engineering Department, Columbia University, New York. In 2015, he left for his current position as a Research Scientist with Google in New York. His research is concerned with all aspects of extracting high-level information from audio, including speech recognition, music description, and environmental sound processing. He also runs the AUDITORY email list of over 4000 researchers worldwide in the perception and cognition of sound.

 

The Step-by-Step Journey of a Spontaneous Speech Dataset Toward Understanding

Leibny Paola Garcia Perera

Johns Hopkins University

The ultimate goal of working with speech data is to understand what is being said in a recording. This may seem like a simple goal, but it hides a complex journey. In this talk, we will go through the process, from building a spontaneous speech dataset to uncovering the layers of understanding it can provide. At the same time, we will reflect on the challenges that arise along the way. Collected from everyday phone conversations between familiar speakers, the dataset captures laughter, hesitations, interruptions, and overlaps that mirror natural dialogue. Metadata such as age, gender, and accent enriches the recordings, while a two-channel design supports accurate diarization for disentangling speakers and analyzing interactions. However, the spontaneous nature of the dialogues offers a distinct perspective, as the data exhibits noise across multiple dimensions: for example, imperfect annotations, long-form audio, frequent disfluencies, unbalanced speaker contributions, background noise, and occasional divergence from suggested topics, among others. Automatic processing further exposes limitations: diarization is not perfect, and automatic speech recognition (ASR) introduces errors. These imperfections highlight the inherent difficulty of working with spontaneous speech and the need for more robust tools. As the dataset evolves into part of a benchmark, evaluations reveal that even advanced audio-language models struggle with reasoning, comprehension, and speaker attribution when tested on spontaneous speech data. In tracing this step-by-step process, we highlight both the value of spontaneous speech for benchmarking and the challenges that remain for achieving deeper understanding.

Leibny Paola Garcia Perera

Leibny Paola Garcia Perera (PhD 2014, University of Zaragoza, Spain) joined Johns Hopkins University after extensive research experience in academia and industry, including highly regarded laboratories at Agnitio and Nuance Communications. She led a team of 20+ researchers from four of the best laboratories worldwide in far-field speech diarization and speaker recognition under the auspices of the JHU summer workshop 2019 in Montreal, Canada. She was also a researcher at Tec de Monterrey, Campus Monterrey, Mexico, for ten years. She was a Marie Curie researcher for the Iris project in 2015, exploring assistive technology for children with autism in Zaragoza, Spain. Recently, she has been working on children’s speech, including child speech recognition and diarization in day-long recordings. She collaborates with DARCLE.org and CCWD, which analyze child-centered speech. She is also part of the CHiME steering group. She has been part of JHU SRE teams since 2018. She has been an awardee of the AI2AI Amazon Awards in 2024 and 2025. Her interests include multimodal speech representation, understanding and reasoning, diarization, speech recognition, speaker recognition, machine learning, and language processing.

 

AI for Creators: Pushing Creative Abilities to the Next Level

Yuki Mitsufuji

Sony AI

This talk explores how cutting-edge generative AI is transforming creative workflows in music, cinema, and gaming. Led by Dr. Yuki Mitsufuji, the Music Foundation Model Team at Sony AI has developed multimodal frameworks such as MMAudio, which generate high-quality, synchronized audio from video and text inputs. Their research, recognized at top venues like NeurIPS, ICLR, and CVPR, has contributed to both content creation and protection, with practical demos integrated into commercial products. The session will highlight key innovations, including sound restoration projects and the future of AI-powered media production.

Yuki Mitsufuji

Yuki Mitsufuji is Lead Research Scientist and Vice President of AI Research at Sony AI, and a Distinguished Engineer at Sony Group Corporation. He holds a PhD from the University of Tokyo and leads the Creative AI Lab and Music Foundation Model Team, focusing on generative modeling for creative media. His work has been featured at CVPR, ICLR, and NeurIPS, and he has delivered tutorials on audio diffusion models at ICASSP and ISMIR. As an IEEE Senior Member, he also contributes to the academic community as an associate editor for a leading IEEE journal. From 2022 to 2025, he served as a specially appointed associate professor at Tokyo Institute of Technology, where he lectured on generative models.

 

Code-Switching in Multiple Languages in Speech and Text

Julia Hirschberg

Columbia University

For people who speak more than one language, code-switching (CSW) is a common phenomenon. However, spoken language recognition systems, including voice assistants, find it difficult to understand and appropriately react to this multilingual speech. We are studying how spoken and written CSW interacts with other aspects of communication, including the production of named entities and dialogue acts and the influence of entrainment, empathy, prosody, formality, and information load. Our goals are to improve prediction of when, why, and to what effect CSW occurs, as well as how to produce appropriate code-switched responses, to inform further development of voice assistants and their ability to successfully interact with multilingual users. We have studied many aspects of CSW to date: 1) Does the degree of formality of a conversation influence the degree of CSW in it? 2) What is the role of information load in predicting and explaining CSW? 3) Do speakers entrain on strategies of CSW in speech? 4) Is there a quantifiable relationship between CSW and empathy in speech? We are currently examining: 5) Which dialogue acts tend to be produced most often in CSW? 6) Does the presence of named entities prime CSW? 7) How do speakers produce intonational contours (ToBI) when they perform CSW: do these match either of their languages, or are they different from both? We are testing all these topics on speech and lexical features of Standard American English paired with Spanish, Mandarin Chinese, and Hindi.

Julia Hirschberg

Julia Hirschberg is the Percy K. and Vida L.W. Hudson Professor of Computer Science at Columbia University and was previously at Bell Laboratories/AT&T Labs, working on TTS. She currently studies spoken language: false information and intent on social media; radicalization/de-radicalization in online videos and social media; conversational entrainment, emotion, and empathy; code-switching; and deceptive and trusted/mistrusted speech. She has served on the ACL, CRA, IEEE SLTC, NAACL, and ISCA (president 2005-07) executive boards and the AAAI Council, and was editor of Computational Linguistics and Speech Communication. She is an AAAI, ISCA, ACL, ACM, and IEEE fellow and a member of the National Academy of Engineering, the American Academy of Arts and Sciences, and the American Philosophical Society, and received the IEEE Flanagan Award, the ISCA Medal for Scientific Achievement, and the ACL Distinguished Service Award. She has 6 PhD students and many research project students in Columbia Computer Science.

 

Neural Fields for Spatial Audio Modeling

Yoshiki Masuyama

MERL

Spatial audio is a long-standing research field concerned with recording, modeling, and generating sound in a 3D world. While traditional methods have typically relied on the physics of sound propagation and/or compressed sensing, the research field has now witnessed a paradigm shift with the rapid advances of deep learning. In particular, neural fields (NFs) have gained much attention for spatial interpolation of impulse responses due to their flexibility, where the network characterizes the sound field as a function of time, source position, and/or microphone position. This talk will focus on NFs for head-related transfer functions and room impulse responses. I will also discuss how we incorporate the physics of sound propagation into NFs under the concept of physics-informed neural networks.
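To make the formulation above concrete, the following is a minimal sketch of a neural field that maps (source position, microphone position, time) to a sound-field value and is fit to measured samples. The architecture (a small MLP with random Fourier features) and training details are illustrative assumptions, not MERL's models; a physics-informed variant would add a wave-equation residual penalty computed with automatic differentiation.

```python
# Minimal sketch (illustrative, not MERL's model): a neural field that maps
# (source position, microphone position, time) to a sound-field value, trained
# to fit measured room impulse response samples. A physics-informed variant
# would add a penalty on the acoustic wave-equation residual via autograd.
import torch
import torch.nn as nn

class SoundFieldNF(nn.Module):
    def __init__(self, hidden=256, n_freqs=16):
        super().__init__()
        # Random Fourier features help the MLP represent fine spatial/temporal detail.
        self.register_buffer("B", torch.randn(7, n_freqs) * 10.0)  # 3 + 3 + 1 input dims
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, src_xyz, mic_xyz, t):
        x = torch.cat([src_xyz, mic_xyz, t], dim=-1)            # [batch, 7]
        feats = x @ self.B                                       # [batch, n_freqs]
        feats = torch.cat([torch.sin(feats), torch.cos(feats)], dim=-1)
        return self.mlp(feats).squeeze(-1)                       # predicted pressure value

# Toy training loop on random stand-in data (real data: measured RIRs or HRTFs).
model = SoundFieldNF()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
src = torch.rand(1024, 3); mic = torch.rand(1024, 3); t = torch.rand(1024, 1)
target = torch.randn(1024)                                       # stand-in measured samples
for _ in range(10):
    loss = ((model(src, mic, t) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```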

Yoshiki Masuyama

Yoshiki Masuyama is a Visiting Research Scientist at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, Massachusetts. He received his B.E. and M.E. degrees from Waseda University and his Ph.D. from Tokyo Metropolitan University. His research interest is in integrating signal processing and machine learning technologies for efficient and robust audio processing. He is a recipient of the Best Student Paper Award at the IEEE Spoken Language Technology Workshop 2022.

 

Generative Methods for Speech Enhancement and Separation

Robin Scheibler

Google DeepMind

This talk presents a comprehensive overview of recent breakthroughs in generative speech enhancement and separation. These methods represent a paradigm shift from conventional regression techniques, mitigating long-standing issues like regression-to-the-mean and artifact leakage that often result in muffled or unnatural audio. A key advantage of the generative approach is its ability to provide a holistic framework for speech restoration, simultaneously addressing degradations that were historically treated as separate tasks, such as denoising, dereverberation, and bandwidth extension.
The presentation is structured in three parts. We will begin by examining generative models for single-channel enhancement and restoration, with a focus on influential architectures like Miipher and Universe. Next, the discussion will transition to the more complex task of speech separation, highlighting how diffusion models can be adapted to generate the multiple, distinct outputs required to isolate individual sources. To conclude, we will discuss the profound impact of these models on evaluation, arguing that their ability to generate highly plausible—yet not identical—outputs challenges the validity of traditional signal-fidelity metrics and necessitates a new paradigm in speech quality assessment.
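As a rough illustration of the generative approach discussed above, the sketch below shows one training step of a generic conditional denoising diffusion model for enhancement: the network learns to predict the noise added to clean speech while conditioned on the degraded observation. This is a textbook-style sketch under stated assumptions (PyTorch, waveform inputs, a toy convolutional denoiser), not the Miipher or Universe architectures.

```python
# Minimal sketch (illustrative; not Miipher or Universe): one training step of a
# conditional denoising diffusion model for speech enhancement. The model is
# trained to predict the Gaussian noise added to clean speech x0, conditioned on
# the degraded observation y; at inference, iterative denoising starting from
# pure noise (still conditioned on y) yields an enhanced waveform estimate.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Stand-in for a real noise-prediction network (e.g., a U-Net)."""
    def __init__(self, hidden=128):
        super().__init__()
        # Inputs: noisy latent x_t, conditioning y, and a scalar timestep channel.
        self.net = nn.Sequential(
            nn.Conv1d(3, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, 1, 5, padding=2),
        )

    def forward(self, x_t, y, t):
        t_embed = t.float().view(-1, 1, 1).expand(-1, 1, x_t.shape[-1]) / T
        return self.net(torch.cat([x_t, y, t_embed], dim=1))

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy batch of waveforms standing in for (clean, degraded) pairs.
x0 = torch.randn(8, 1, 16000)          # clean speech
y = x0 + 0.5 * torch.randn_like(x0)    # degraded observation

t = torch.randint(0, T, (8,))
a_bar = alphas_cumprod[t].view(-1, 1, 1)
eps = torch.randn_like(x0)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward diffusion of clean speech
loss = ((model(x_t, y, t) - eps) ** 2).mean()        # noise-prediction objective
opt.zero_grad(); loss.backward(); opt.step()
```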

Robin Scheibler

Robin Scheibler is a research engineer at Google DeepMind, where he works on teaching machines how to listen. His research tackles the computational cocktail party problem, exploring how to teach algorithms to focus on one voice in a crowd (extraction), untangle simultaneous conversations (source separation), and digitally remove echoes and other distortions (restoration).
He is the creator of Pyroomacoustics, a popular open-source tool that lets researchers and hobbyists build virtual rooms to test these very ideas. He holds a PhD from EPFL, and his work blends classic signal processing with modern machine learning to separate signal from noise, a skill he finds equally useful in audio processing and crowded conference halls. When not improving the hearing of AI, he enjoys the much clearer signals of good food, music, and the great outdoors.