SANE 2025 - Speech and Audio in the Northeast
November 7, 2025

SANE 2025, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Friday November 7, 2025 at Google, in New York, NY.
It is the 12th edition in the SANE series of workshops, which started in 2012 and is typically held every year, alternating between the Boston and New York areas. Since the first edition, the audience has steadily grown, with a new record of 200 participants and 53 posters in 2024.
SANE 2025 will feature invited talks by leading researchers from the Northeast as well as from the wider community. It will also feature a lively poster session, open to both students and researchers.
Details
- Date: Friday, November 7, 2025
- Venue: Google (Chelsea), 76 9th Avenue, New York, NY.
- Check-in: Check-in will only be open from 8:15-10:00, 10:50-11:20, and 13:00-13:30. Please bring a physical copy of a government-issued Photo ID. Only people who are registered will be allowed in.
Schedule
| 8:30-9:05 | Registration and Breakfast |
| 9:05-9:10 | Welcome |
| 9:10-10:00 | Dan Ellis (Google DeepMind) Recomposer: Event-roll-guided Audio Editing |
| 10:00-10:50 | Leibny Paola Garcia Perera (Johns Hopkins University) The Step-by-Step Journey of a Spontaneous Speech Dataset Toward Understanding |
| 10:50-11:20 | Coffee break |
| 11:20-12:10 | Yuki Mitsufuji (Sony AI) AI for Creators: Pushing Creative Abilities to the Next Level |
| 12:10-13:00 | Julia Hirschberg (Columbia University) Code-Switching in Multiple Languages in Speech and Text |
| 13:00-13:30 | Lunch |
| 13:30-16:00 | Poster Session + Coffee |
| 16:00-16:50 | Yoshiki Masuyama (MERL) Neural Fields for Spatial Audio Modeling |
| 16:50-17:40 | Robin Scheibler (Google DeepMind) Generative Methods for Speech Enhancement and Separation |
| 17:40-17:45 | Closing remarks |
| 18:15-20:15 | After-party at a secret taproom nearby (location will be disclosed to registrants) |
Registration
SANE is now full, and the wait list is closed.
Please consider joining next year: to be among the first to know about future SANE workshops, sign up for the SANE News Group.
Directions
The workshop will be hosted at Google, in New York, NY. We will enter Google NY through the entrance located at 76 9th Avenue. The closest subway stop is the 14 St station on the A, C, and E lines. Please remember to bring a physical copy of a government-issued Photo ID.
As was the case in 2024, check-in will only be open from 8:15-10:00, 10:50-11:20, and 13:00-13:30. It will not be possible to attend SANE if you do not arrive during these times. Only people who are registered will be able to enter the building.
Organizing Committee
- Jonathan Le Roux (MERL)
- Quan Wang (Google)
- John R. Hershey (Google)
Sponsors
SANE remains a free workshop thanks to the generous contributions of the sponsors below.
Talks
Recomposer: Event-roll-guided Audio Editing
Google DeepMind
Editing complex real-world sound scenes is difficult because individual sound sources overlap in time. Generative models can fill in missing or corrupted details based on their strong prior understanding of the data domain. We present a system for editing individual sound events within complex scenes, able to delete, insert, and enhance events based on textual edit descriptions (e.g., "enhance Door") and a graphical representation of the event timing derived from an "event roll" transcription. We present an encoder-decoder transformer working on SoundStream representations, trained on synthetic (input, desired output) audio example pairs formed by adding isolated sound events to dense, real-world backgrounds. Evaluation reveals the importance of each part of the edit descriptions -- action, class, timing. Our work demonstrates that "recomposition" is an important and practical application.
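The training data construction described above can be pictured with a small sketch. The snippet below is a hypothetical illustration, not the authors' code: it forms one synthetic (input, target) pair by mixing an isolated event into a background recording, together with a textual edit description and a toy event-roll timing map; the sample rate, frame rate, and function names are assumptions.

```python
# Hypothetical sketch: build one synthetic "insert" edit pair as described in the abstract.
import numpy as np

SR = 16000  # assumed sample rate

def make_edit_pair(background: np.ndarray, event: np.ndarray,
                   event_class: str, onset_s: float, gain: float = 1.0):
    """Return (input_audio, target_audio, edit_description, event_roll).

    For an "insert" edit, the input is the background alone and the target is
    the background with the isolated event mixed in at `onset_s`.
    """
    onset = int(onset_s * SR)
    target = background.copy()
    end = min(len(target), onset + len(event))
    target[onset:end] += gain * event[: end - onset]

    # Event roll: a binary (classes x frames) activity map, here at 100 Hz
    # with a single class row for brevity.
    frame_rate = 100
    n_frames = int(np.ceil(len(background) / SR * frame_rate))
    event_roll = np.zeros((1, n_frames))
    event_roll[0, int(onset_s * frame_rate): int(end / SR * frame_rate)] = 1.0

    edit_description = f"insert {event_class}"
    return background, target, edit_description, event_roll

# Usage with random audio standing in for real recordings:
bg = 0.05 * np.random.randn(10 * SR)
door = 0.5 * np.random.randn(SR // 2)
x, y, desc, roll = make_edit_pair(bg, door, "Door", onset_s=3.0)
```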
Daniel P. W. Ellis received the Ph.D. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, where he was a Research Assistant in the Machine Listening Group of the Media Lab. He spent several years as a Research Scientist at the International Computer Science Institute, Berkeley, CA. In 2000, he took a faculty position with the Electrical Engineering Department, Columbia University, New York. In 2015, he left for his current position as a Research Scientist with Google in New York. His research is concerned with all aspects of extracting high-level information from audio, including speech recognition, music description, and environmental sound processing. He also runs the AUDITORY email list of over 4,000 researchers worldwide in the perception and cognition of sound.
The Step-by-Step Journey of a Spontaneous Speech Dataset Toward Understanding
Johns Hopkins University
The ultimate goal of working with speech data is to understand what is being said in a recording. This may appear like a simple goal, but it hides a complex journey. In this talk, we will go through the process, from building a spontaneous speech dataset to uncovering the layers of understanding it can provide. At the same time, we will reflect on the challenges that arise along the way. Collected from everyday phone conversations between familiar speakers, the dataset captures laughter, hesitations, interruptions, and overlaps that mirror natural dialogue. Metadata such as age, gender, and accent enriches the recordings, while a two-channel design supports accurate diarization for disentangling speakers and analyzing interactions. However, the spontaneous nature of the dialogues offers a distinct perspective, as the data exhibits noise across multiple dimensions—for example, imperfect annotations, long-form audio, frequent disfluencies, unbalanced speaker contributions, background noise, and occasional divergence from suggested topics, among others. Automatic processing further exposes limitations: diarization is not perfect, and automatic speech recognition (ASR) introduces errors. These imperfections highlight the inherent difficulty of working with spontaneous speech and the need for more robust tools. As the dataset evolves into part of a benchmark, evaluations reveal that even advanced audio-language models struggle with reasoning, comprehension, and speaker attribution when tested on spontaneous speech data. In tracing this step-by-step process, we highlight both the value of spontaneous speech for benchmarking and the challenges that remain for achieving deeper understanding.
Leibny Paola Garcia Perera (PhD 2014, University of Zaragoza, Spain) joined Johns Hopkins University after extensive research experience in academia and industry, including highly regarded laboratories at Agnitio and Nuance Communications. She led a team of 20+ researchers from four of the best laboratories worldwide in far-field speech diarization and speaker recognition under the auspices of the JHU summer workshop 2019 in Montreal, Canada. She was also a researcher at Tec de Monterrey, Campus Monterrey, Mexico, for ten years. She was a Marie Curie researcher for the Iris project in 2015, exploring assistive technology for children with autism in Zaragoza, Spain. Recently, she has been working on children’s speech, including child speech recognition and diarization in day-long recordings. She collaborates with DARCLE.org and CCWD, which analyze child-centered speech. She is also part of the CHiME steering group. She has been part of JHU SRE teams since 2018. She has been an awardee of the AI2AI Amazon Awards in 2024 and 2025. Her interests include multimodal speech representation, understanding and reasoning, diarization, speech recognition, speaker recognition, machine learning, and language processing.
AI for Creators: Pushing Creative Abilities to the Next Level
Sony AI
This talk explores how cutting-edge generative AI is transforming creative workflows in music, cinema, and gaming. Led by Dr. Yuki Mitsufuji, the Music Foundation Model Team at Sony AI has developed multimodal frameworks such as MMAudio, which generate high-quality, synchronized audio from video and text inputs. Their research, recognized at top venues like NeurIPS, ICLR, and CVPR, has contributed to both content creation and protection, with practical demos integrated into commercial products. The session will highlight key innovations, including sound restoration projects and the future of AI-powered media production.
Yuki Mitsufuji is Lead Research Scientist and Vice President of AI Research at Sony AI, and a Distinguished Engineer at Sony Group Corporation. He holds a PhD from the University of Tokyo and leads the Creative AI Lab and Music Foundation Model Team, focusing on generative modeling for creative media. His work has been featured at CVPR, ICLR, and NeurIPS, and he has delivered tutorials on audio diffusion models at ICASSP and ISMIR. As an IEEE Senior Member, he also contributes to the academic community as an associate editor for a leading IEEE journal. From 2022 to 2025, he served as a specially appointed associate professor at Tokyo Institute of Technology, where he lectured on generative models.
Code-Switching in Multiple Languages in Speech and Text
Columbia University
For people who speak more than one language, code-switching (CSW) is a common phenomenon. However, spoken language recognition systems, including voice assistants, find it difficult to understand and react appropriately to this multilingual speech. We are studying how spoken and written CSW interacts with other aspects of communication, including the production of named entities and dialogue acts and the influence of entrainment, empathy, prosody, formality, and information load. Our goals are to improve prediction of when, why, and to what effect CSW occurs, as well as how to produce appropriate code-switched responses, to inform further development of voice assistants and their ability to successfully interact with multilingual users. We have studied many aspects of CSW to date: 1) Does the degree of formality of a conversation influence the degree of CSW in it? 2) What is the role of information load in predicting and explaining CSW? 3) Do speakers entrain on strategies of CSW in speech? 4) Is there a quantifiable relationship between CSW and empathy in speech? We are currently examining: 5) Which dialogue acts tend to be produced most often in CSW? 6) Does the presence of named entities prime CSW? 7) How do speakers produce intonational contours (ToBI) when they perform CSW -- do these match either of their languages, or are they different from both? We are testing all these topics on speech and lexical features of Standard American English paired with Spanish, Mandarin Chinese, and Hindi.
Julia Hirschberg is the Percy K. and Vida L.W. Hudson Professor of Computer Science at Columbia University and was previously at Bell Laboratories/AT&T Labs working on TTS. She currently studies spoken language: false information and intent on social media; radicalization/de-radicalization in online videos and social media; conversational entrainment, emotion, and empathy; code-switching; and deceptive, trusted/mistrusted speech. She served on the ACL, CRA, IEEE SLTC, NAACL, and ISCA (president 2005-7) Executive Boards and the AAAI Council, was editor of Computational Linguistics and Speech Communication, and is an AAAI, ISCA, ACL, ACM, and IEEE fellow and a member of the National Academy of Engineering, the American Academy of Arts and Sciences, and the American Philosophical Society. She received the IEEE Flanagan Award, the ISCA Medal for Scientific Achievement, and the ACL Distinguished Service Award. She has 6 PhD students and many research project students in Columbia Computer Science.
Neural Fields for Spatial Audio Modeling
MERL
Spatial audio is a long-standing research field concerned with recording, modeling, and generating sound in a 3D world. While traditional methods have typically relied on the physics of sound propagation and/or compressed sensing, the research field has now witnessed a paradigm shift with the rapid advances of deep learning. In particular, neural fields (NFs) have gained much attention for spatial interpolation of impulse responses due to their flexibility, where the network characterizes the sound field as a function of time, source position, and/or microphone position. This talk will focus on NFs for head-related transfer functions and room impulse responses. I will also discuss how we incorporate the physics of sound propagation into NFs under the concept of physics-informed neural networks.
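To make the neural-field formulation concrete, here is a minimal sketch (not the speaker's model): an MLP that maps a source position, microphone position, and time instant to a sound-field value, fit to measured impulse-response samples by simple regression. All names, sizes, and the random stand-in data are assumptions.

```python
# Minimal neural-field sketch for spatial interpolation of impulse responses.
import torch
import torch.nn as nn

class SoundFieldNF(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),   # input: [src_xyz, mic_xyz, t]
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),              # predicted pressure sample
        )

    def forward(self, src, mic, t):
        # src, mic: (B, 3); t: (B, 1) in seconds
        return self.net(torch.cat([src, mic, t], dim=-1))

model = SoundFieldNF()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One regression step against hypothetical measured RIR samples.
src = torch.rand(64, 3); mic = torch.rand(64, 3); t = torch.rand(64, 1)
target = torch.randn(64, 1)  # stands in for measured pressure values
loss = nn.functional.mse_loss(model(src, mic, t), target)
opt.zero_grad(); loss.backward(); opt.step()
```

A physics-informed variant would add a residual of the wave equation, evaluated by automatic differentiation of the network output, to this regression loss.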
Yoshiki Masuyama is a Visiting Research Scientist at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, Massachusetts. He received his B.E. and M.E. degrees from Waseda University and his Ph.D. from Tokyo Metropolitan University. His research interest is in integrating signal processing and machine learning technologies for efficient and robust audio processing. He is a recipient of the Best Student Paper Award at the IEEE Spoken Language Technology Workshop 2022.
Generative Methods for Speech Enhancement and Separation
Google DeepMind
This talk presents a comprehensive overview of recent breakthroughs in generative speech enhancement and separation. These methods represent a paradigm shift from conventional regression techniques, mitigating long-standing issues like regression-to-the-mean and artifact leakage that often result in muffled or unnatural audio. A key advantage of the generative approach is its ability to provide a holistic framework for speech restoration, simultaneously addressing degradations that were historically treated as separate tasks, such as denoising, dereverberation, and bandwidth extension.
The presentation is structured in three parts. We will begin by examining generative models for single-channel enhancement and restoration, with a focus on influential architectures like Miipher and Universe. Next, the discussion will transition to the more complex task of speech separation, highlighting how diffusion models can be adapted to generate the multiple, distinct outputs required to isolate individual sources. To conclude, we will discuss the profound impact of these models on evaluation, arguing that their ability to generate highly plausible—yet not identical—outputs challenges the validity of traditional signal-fidelity metrics and necessitates a new paradigm in speech quality assessment.
Robin Scheibler is a research engineer at Google DeepMind, where he works on teaching machines how to listen. His research tackles the computational cocktail party problem, exploring how to teach algorithms to focus on one voice in a crowd (extraction), untangle simultaneous conversations (source separation), and digitally remove echoes and other distortions (restoration).
He is the creator of Pyroomacoustics, a popular open-source tool that lets researchers and hobbyists build virtual rooms to test these very ideas. With a PhD from EPFL, his work blends classic signal processing with modern machine learning to separate signal from noise—a skill he finds equally useful in audio processing and crowded conference halls. When not improving the hearing of AI, he enjoys the much clearer signals of good food, music, and the great outdoors.
Posters
Instructions for posters: poster boards are 32"x40" and can be placed in portrait or landscape orientation.
- Representational similarity analysis of EEG reveals multiple spatiotemporal dynamics of auditory selective attention
Jinhee Kim, Wenkang An, Abigail L. Noyce, Barbara G. Shinn-Cunningham
Carnegie Mellon University
Abstract: Auditory selective attention requires coordination of multiple neural processes that operate at different timescales, each implemented by different neural architectures. Listeners can achieve the same outcome by attending to a spatial location or a specific talker, but the underlying mechanisms remain partially understood. We applied representational similarity analysis (RSA) to EEG data collected during auditory attention to examine time courses and topographies of these different attention processes. Each trial began with a visual cue of attention type (spatial, talker, or no-attention), followed by an auditory cue specifying the exact target (spatial: location; talker: voice). Four overlapping syllables (/ba/, /da/, or /ga/) then played, one of which matched the cued target, and participants reported its identity. 21 conditions included spatial attention (left, right), talker attention (male, female), and passive listening. EEG power spectra were computed and averaged within five frequency bands (delta: 2–4 Hz, theta: 4–8 Hz, alpha: 8–14 Hz, beta: 14–20 Hz, gamma: 20–50 Hz) to assess oscillatory activity. Two epochs were extracted from each trial: “cue” for preparatory attention, and “stimulus” for attention during sensory input. For each frequency band and the time-domain EEG signal, a cross-validated linear support vector machine (SVM) trained at each subject and time point classified between each pair of conditions. SVM accuracies yielded a time series of representational dissimilarity matrices for each frequency band and the time-domain signal. The channel-space weighting of each classifier allowed us to probe the spatial structure of neural information. We observed clear effects of attention vs. passive listening in all frequency bands in both the cue and stimulus epochs. In the time-domain signal, cue effects were transient (possibly sensory-driven responses), while oscillatory effects arose gradually and persisted across the trial. The alpha band showed the strongest effect during stimulus presentation, with time-domain signal, theta, beta, and gamma showing moderate differences. Topographies of the most informative channels show posterior contributions of alpha power, with two distinct time dynamics: one reflecting trial-long attention and another that faded after the target onset. The information encoded by alpha power and the time-domain signal appears largely independent of one another, suggesting that this approach captures multiple neural mechanisms. This work shows that RSA is a powerful analytical approach for revealing the dynamics of endogenous attention, advancing our understanding of cognitive control.
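A compact sketch of the decoding-to-RDM pipeline described in this abstract, under simplifying assumptions (random stand-in band-power features, a linear SVM per condition pair and time point, decoding accuracy used as the dissimilarity entry); the array sizes are illustrative, not the study's.

```python
# Pairwise SVM decoding at each time point, assembled into a time series of RDMs.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

n_cond, n_trials, n_chan, n_time = 21, 40, 64, 20
# Hypothetical data: power[condition, trial, channel, time]
power = np.random.randn(n_cond, n_trials, n_chan, n_time)

rdm = np.zeros((n_time, n_cond, n_cond))
for t in range(n_time):
    for i, j in combinations(range(n_cond), 2):
        X = np.concatenate([power[i, :, :, t], power[j, :, :, t]])
        y = np.array([0] * n_trials + [1] * n_trials)
        acc = cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean()
        rdm[t, i, j] = rdm[t, j, i] = acc  # decoding accuracy as dissimilarity
```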
- DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model
Massa Baali, Rita Singh, Bhiksha Raj
Carnegie Mellon University
Abstract: Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
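The pseudo-label step described above can be illustrated in a few lines. This is a hedged sketch, not the authors' pipeline: the ReDimNet frame-level embedding extraction is mocked with random vectors, and the cluster count is an assumption.

```python
# K-means over frame-level speaker embeddings to produce discrete pseudo-labels.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

n_frames, emb_dim, n_clusters = 100_000, 192, 500
frame_embeddings = np.random.randn(n_frames, emb_dim).astype(np.float32)
# In the real pipeline these would come from ReDimNet frame-level outputs.

kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096)
kmeans.fit(frame_embeddings)

pseudo_labels = kmeans.predict(frame_embeddings)  # (n_frames,) discrete targets
# These cluster indices serve as the masked-prediction targets during pre-training.
```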
- OpusLM: A Family of Open Unified Speech Language Models
Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe
Carnegie Mellon University, LY Corporation, NVIDIA Research
Abstract: This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance with existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
- Causal Tracing of Audio-Text Fusion in Large Audio Language Models
Wei-Chih Chen, Chien-yu Huang, Shinji Watanabe, Hung-yi Lee
Carnegie Mellon University, National Taiwan University
Abstract: Large Audio Language Models (LALMs) have shown strong multi-modal abilities. Yet it is still unclear how these models combine audio and text inside their networks or which components are responsible for this fusion. To address this gap, we present the first causal map of how LALMs process and merge audio with text. We apply causal tracing to LALMs in the task of audio-aware question answering using two main techniques. First, to identify key neuron activations, we create a "corrupted run" by changing the audio input. We then patch specific hidden states from a correct “clean run” into the corrupted one and check if the original prediction returns, which indicates the parts of the model that are most important. Second, to trace how audio influences text, we perform layer-by-layer patching of all text token states to find the earliest layer where the text representation begins to reflect audio information.
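A generic activation-patching sketch in the spirit of the causal tracing described above, written with PyTorch forward hooks. This is an illustration, not the authors' implementation: it assumes the patched layer returns a single tensor (many transformer layers return tuples) and that the clean and corrupted batches have matching shapes.

```python
# Patch one layer's activation from a "clean run" into a "corrupted run".
import torch

def run_with_patch(model, clean_inputs, corrupted_inputs, layer_module):
    cache = {}

    def save_hook(_m, _inp, out):
        cache["clean"] = out.detach()      # record the clean activation

    def patch_hook(_m, _inp, out):
        return cache["clean"]              # replace the corrupted activation

    h = layer_module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**clean_inputs)              # clean run: save the hidden state
    h.remove()

    h = layer_module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(**corrupted_inputs)  # corrupted run with the patch
    h.remove()
    return patched_out  # compare its prediction against the clean answer
```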
- Probing Invariances in Auditory Event Categorization Using Model Metamers
Hee So Kim, Abigail L. Noyce, Malinda J. McPherson-McNato, Jenelle Feather
Carnegie Mellon University, Purdue University
Abstract: Real-world acoustic inputs contain rich sensory information that we parse into discrete auditory objects and categories. Computational modeling helps bridge the transformation from waveforms to higher-level representations. Recent work suggests that while deep neural networks (DNNs) predict many aspects of human behavior and brain responses, these models often have invariances that are not shared with humans. Here, we examine categorical representations of non-speech stimuli and test whether the invariances of DNNs trained to categorize non-speech sounds are aligned with the invariances of human observers. We developed an experimental paradigm to compare humans and DNNs using forced-choice classification across 25 non-speech sound categories. We generated "model metamers"—stimuli whose activation produces the same response as the natural stimulus at a given model stage—and tested whether humans could recognize them as well as the originals. We tested DNNs pretrained on three different tasks: (1) word recognition, (2) auditory event recognition, and (3) a multi-task model for word, speaker, and event recognition. We further compared standard task-optimized models to models that were trained to be robust to small “adversarial” perturbations in the input cochleagram, a technique that was previously shown to help align the invariances of models with the invariances of human observers in the speech domain. We found that humans and models could reliably categorize natural sounds, demonstrating that our paradigm is well-suited for testing invariances in auditory categories. Human recognition accuracy for model metamers dropped significantly for all models, but metamers generated for the adversarially trained auditory event recognition model were the most recognizable. Our results suggest that both task-specific optimization and adversarial training play a significant role in aligning model and human invariances.
- Pretraining Semantic Music Representations Through Multi-Expert Supervision
Qirui Wang*, Christodoulos Benetatos*, Randal Leistikow, Yongyi Zang
Carnegie Mellon University, University of Rochester, Smule Labs
Abstract: While current music representation pretraining adopts masked autoencoder token prediction from speech methods, this approach requires a discrete tokenization step that can be time-consuming. We propose a tokenization-free pretraining method using multi-task learning with expert supervision. Our approach simultaneously trains a shared encoder with multiple prediction heads, each supervised by specialized MIR experts: local temporal tasks (beat tracking, structure segmentation, pitch transcription, etc.) use classification heads with expert-generated pseudo-labels, while global semantic understanding uses knowledge distillation from CLAP. To handle potentially conflicting gradients from diverse musical tasks, we employ gradient projection to compute a unified update direction that preserves positive transfer across all objectives. This eliminates the information bottleneck of tokenization while learning representations grounded in multiple aspects of musical understanding.
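The gradient-projection idea mentioned above can be sketched as follows (in the spirit of PCGrad; the exact scheme used by the authors may differ): when two task gradients conflict, one is projected onto the normal plane of the other before the gradients are summed, so no head's update direction is reversed.

```python
# Project away conflicting components of per-task gradients, then sum.
import torch

def project_conflicting(grads):
    """grads: list of flattened per-task gradient tensors. Returns one update."""
    projected = [g.clone() for g in grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # conflicting directions
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).sum(dim=0)

# Toy example with two conflicting task gradients (e.g., beat vs. pitch heads):
g_beat = torch.tensor([1.0, 1.0])
g_pitch = torch.tensor([-1.0, 0.5])
update = project_conflicting([g_beat, g_pitch])
```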
- ArrayDPS: Unsupervised Blind Speech Separation with a Diffusion Prior
Zhongweiyang Xu, Xulin Fan, Zhong-Qiu Wang, Xilin Jiang, Romit Roy Choudhury
University of Illinois Urbana-Champaign, Southern University of Science and Technology, Columbia University
Abstract: Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior, along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR.
- Sci-Phi: A Large Language Model Spatial Audio Descriptor
Xilin Jiang, Hannes Gamper, Sebastian Braun
Columbia University, Microsoft
Abstract: Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment.
- Teaching Audio Language Model to Hear and Think with Visual Language Model
Xilin Jiang, Qiaolin Wang, Junkai Wu, Linyang He, Vishal Choudhari, Nima Mesgarani
Columbia University, University of Washington
Abstract: We present our efforts to teach an audio language model (ALM) to hear and to think under the guidance of a strong visual language model (VLM) teacher. We first empirically establish a premise: given the same video, state-of-the-art VLMs (vision-only) typically outperform ALMs (audio-only) in sound recognition and in audio-visual question answering (AVQA). Building on this, our WASPAA 2025 paper, Bridging Ears and Eyes, introduces a cross-modal LLM distillation framework for sound recognition: for the same acoustic scene, an ALM student learns to match a VLM teacher’s generated answer tokens, yielding substantial gains on both seen and unseen sound classes. Our follow-up paper, SightSound-R1, extends distillation from answers to reasoning. We propose a structured pipeline in which a VLM generates step-by-step chains of thought; these traces are validated against the true audio to filter hallucinations; the cleaned-up thinking is distilled into an ALM via supervised fine-tuning followed by reinforcement learning. Beyond improving AVQA accuracy, the ALM acquires interpretable reasoning traces. Critically, because sound videos are abundant on the Internet, our cross-modal framework can scale using automatic VLM generation with minimal human supervision.
- The Cockatiel Party Problem: Array Processing Challenges in Bioacoustics
Irina Tolkova
Cornell University
Abstract: Passive acoustic monitoring has emerged as a highly effective approach for studying vocal wildlife and informing data-driven conservation efforts. As recording hardware becomes increasingly affordable, and machine-learning-based classification increasingly efficient, there is a rising interest in deploying sensor arrays rather than individual sensors. Similarly to commercial applications, acoustic arrays are used for localization, direction-of-arrival estimation, source separation, noise reduction, and more. However, problems in bioacoustics are characterized by distinct challenges: highly complex and uncertain environments, sparse ground-truth, and a need for practitioner accessibility. This poster will present two case studies utilizing array processing -- sound source separation of birdsong within a dawn chorus, and acoustic abundance estimation of migrating bowhead whales -- with an emphasis on the key technical challenges in both domains. Lastly, we highlight opportunities for cross-disciplinary collaboration to advance both audio research and wildlife conservation efforts.
- Enhancing Baleen Whale Conservation Through Advancements in Distributed Acoustic Sensing
Léa Bouffaut, Eric Snyder, Britney Pepper, Holger Klinck
Cornell University
Abstract: Distributed Acoustic Sensing (DAS) repurposes existing fiber optic cables into vast, high-density listening arrays. Instrumented from land, DAS can detect and track low-frequency baleen whales at regional scales with near real-time data delivery—a breakthrough for dynamic, responsive marine ecosystem management.
In contrast to usual passive acoustics, DAS provides a spatial dimension with high resolution, typically a channel every few meters over tens of kilometers, generating several TBs of data/day/array. Initial approaches for whale monitoring have focused on data visualization, SNR improvement, and localization. Meanwhile, deep learning approaches have proven successful on similar high-dimensional acoustic datasets, making them particularly promising for DAS. Specifically, applications range from data exploration and compression to the operationalization of DAS as a real-time conservation pipeline capable of species recognition, localization, and tracking.
By uniting advanced signal processing and machine learning, engineering, and ecological monitoring, DAS is an opportunity to bridge cutting-edge technology and actionable insights, paving the way for dynamic, data-driven marine mammal management. This transformative tool holds the potential to redefine how we monitor and protect whales in a rapidly changing ocean.
- Automatic Live Level Balancing
Matthew Keating, Michael Casey
Dartmouth College
Abstract: We present the task of automatic live level balancing (ALLB) for optimizing the volume levels of amplified instruments in live performance. Formally, for an ensemble of instruments each with a controllable gain/volume level, we seek to adjust the volume controllers such that the audience can clearly perceive each instrument. ALLB is similar to related work in automatic mixing but differs mainly because ALLB only has the total ensemble audio, not isolated stems, and must perform the task quickly in situations such as a live sound check. Therefore, we formulate ALLB as a reinforcement learning problem with a central agent receiving the audio state from a microphone. This agent takes actions by adjusting the volume knobs by a delta through a policy function, which is optimized through gradient ascent on a reward function. We optimize for immediate reward, which is calculated by averaging a clarity metric, where a high average clarity yields a high reward. We calculate clarity by performing template-based matching on Constant-Q Transformed audio with templates from a few seconds of isolated instrument audio for each instrument. We perform initial experiments with a custom gymnasium environment using synthesized audio with FluidSynth for midi-to-audio and Spotify Pedalboard for fx and gain control. We use a model-based approach with a simple linear policy and a two-layer neural network model approximator trained on ~30 seconds of generated audio, and observe loss optimization and high reward. Future work will involve creating a physical controller to test this framework with live bands, performing user listening studies on clarity metrics, optimizing the gymnasium environment audio synthesis, and experimenting with other policy optimization/pre-training strategies.
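One plausible reading of the clarity reward described above is sketched below; the abstract does not specify the matching function, so the cosine-similarity templates, sample rate, and function names here are assumptions for illustration only.

```python
# Clarity via CQT template matching, averaged over instruments to give a reward.
import numpy as np
import librosa

def cqt_profile(y, sr):
    C = np.abs(librosa.cqt(y, sr=sr))        # (bins, frames) magnitude CQT
    return C.mean(axis=1)                    # average magnitude per CQT bin

def clarity(ensemble_profile, template_profile):
    num = float(np.dot(ensemble_profile, template_profile))
    den = np.linalg.norm(ensemble_profile) * np.linalg.norm(template_profile) + 1e-9
    return num / den                         # cosine similarity as "clarity"

def reward(ensemble_audio, templates, sr=22050):
    mix = cqt_profile(ensemble_audio, sr)
    return float(np.mean([clarity(mix, t) for t in templates]))

# Usage: templates built once from isolated instrument clips (random audio here).
sr = 22050
templates = [cqt_profile(np.random.randn(3 * sr), sr) for _ in range(3)]
print(reward(np.random.randn(5 * sr), templates, sr))
```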
- OOV Injection: Fast UNK-based Adding of OOVs for Streaming ASR
Seyyed Saeed Sarfjoo, Kevin Sanders, Fred Mailhot, Jonas Robertson
Dialpad
Abstract: Recognizing words that were not seen during ASR training, known as out-of-vocabulary (OOV) words, remains an open research problem. Here, we propose a method for adding OOV words to the word-based language model by estimating higher n-gram probabilities from the n-grams that contain the unknown unigram [unk]. We investigate the effectiveness of the proposed method on beam search decoding models trained with subword units. In addition, this method is complementary to other contextual biasing techniques. The results on in-house and public LibriSpeech datasets show the effectiveness and generalizability of this method for streaming ASR, achieving comparable or improved F1 scores and word error rate (WER) performance relative to retrained LMs. This method complements the hotword-based contextual biasing techniques and improves the keyword boosting accuracy. This method is well-suited for streaming ASR systems, offering a scalable solution.
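An illustrative sketch of the injection idea described above, not the authors' estimator: a new word inherits probabilities from the corresponding n-grams that contain the [unk] token. The toy dict-based LM, the log-prob values, and the discount are all assumptions.

```python
# Inject an OOV word into a toy n-gram LM by copying [unk] n-grams.
from copy import deepcopy

lm = {
    ("call", "[unk]"): -2.3,      # bigram log P([unk] | call)
    ("hello", "[unk]"): -2.8,
    ("[unk]",): -4.0,             # unigram log P([unk])
}

def inject_oov(lm, oov_word, discount=0.3):
    """Copy every n-gram containing [unk], substituting the OOV word and
    applying a small log-prob discount so probability mass is not double-counted."""
    new_lm = deepcopy(lm)
    for ngram, logp in lm.items():
        if "[unk]" in ngram:
            mapped = tuple(oov_word if w == "[unk]" else w for w in ngram)
            new_lm[mapped] = logp - discount
    return new_lm

lm_with_oov = inject_oov(lm, "dialpad")
# ("call", "dialpad") now has an estimated log-prob, without retraining the LM.
```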
- Singing Voice Separation using Video Input as Privileged Information during Training
Teng (Aleksandra) Ma, Alexander Lerch
Georgia Institute of Technology
Abstract: Singing voice separation remains a challenging problem in music information retrieval, particularly in the presence of backing vocals. Recent advances in audiovisual speech recognition demonstrate that visual cues, such as lip movements, provide complementary information for separating vocal activity. However, audiovisual models are often too large and require video input during inference, which limits their practicality. We propose a knowledge distillation framework that leverages visual information as privileged guidance during training while producing a compact audio-only student model for inference. Pretrained embeddings from AV-HuBERT are used as intermediate supervisory signals in a FitNet-style setup, aligning the student’s latent representations with multimodal teacher embeddings. We evaluate the approach on MUSDB18-HQ and compare its performance against existing audio-only source separation models.
- A model of speech recognition reproduces behavioral and neural signatures of human speech perception
Gasser Elbanna, Josh McDermott
Harvard University, MIT
Abstract: Humans excel at extracting linguistic content from highly variable speech signals. Despite major advances in automatic speech recognition (ASR), it remains unclear whether machine systems can explain the neural and perceptual mechanisms of human speech perception. We introduce PARROT, a model of continuous speech recognition that combines a simulation of the human ear with convolutional and recurrent neural network modules. PARROT maps an acoustic signal into sequences of sub-word units serving as a working proxy for the perceptual code of speech. We trained PARROT on 7.5 million utterances superimposed on noisy backgrounds. We then compared PARROT—along with off-the-shelf ASR models—to humans on a suite of behavioral and neural evaluations: (i) a novel large-scale benchmark of nonword transcription and discrimination, (ii) a battery of established signatures of speech perception (categorical perception, neighborhood density, phonotactic probability, formant-based vowel spaces, and benefits of talker familiarity), and (iii) in-silico fMRI experiments targeting human auditory cortex. PARROT closely tracked human performance, exhibiting human-like patterns of recognizability and confusions, consonant categorical perception, sensitivity to corpus statistics, an emergent F1–F2 vowel space, and performance benefits from talker consistency. PARROT also recapitulated human auditory cortical response patterns to speech, unlike other ASR models. By providing both a candidate model and a comprehensive evaluation suite, this work enables systematic assessment of theories of speech perception and exposes critical gaps between off-the-shelf ASR models and brain responses.
- Acoustic Analysis of Patient-provider Conversations During Primary Care Visits to Screen for Cognitive Impairment
Joseph T. Colonel, Cara Faherty, Carolyn Hagler, Jacqueline Becker, Lili Chan, Laura Curtis, Juan Wisnivesky, Alex Federman, Baihan Lin
Icahn School of Medicine at Mount Sinai
Abstract: Question: Can acoustic features extracted from audio recordings of patient–physician conversations during routine primary care visits be used to screen for cognitive impairment?
Findings: In this study including 787 older adults without diagnosis of cognitive problems, machine learning models trained on acoustic features from audio segments of recordings of primary care visits achieved area under the receiver operating characteristic curve values of 0.72 for predicting cognitive impairment. The algorithm achieved a sensitivity of 81%, specificity of 54%, and positive predictive value of 31% - identifying a subset of primary care patients at higher risk for cognitive impairment. Models performed similarly on an external validation dataset of 179 participants. Interpretability analyses highlighted pause duration and pitch-related features as salient indicators of cognition status.
Meaning: These findings suggest that short segments of naturalistic clinical dialogue may contain useful acoustic signals for passively screening patients for cognitive impairment.
- From Geometry to Auralization: Neural Prediction of Energy Decay and Perceptually Valid Room Responses
Imran Muhammad, Gerald Schuller
Ilmenau University of Technology, Fraunhofer IIS
Abstract: We present a two-stage learning approach that turns room descriptions into listening-ready acoustics for speech and audio applications. First, a Long Short-Term Memory (LSTM) model predicts energy decay curves (EDCs) directly from room geometry, source–receiver positions, and frequency-dependent surface absorption. Trained on 6,000 simulated shoebox rooms, the predictor yields accurate decay parameters—e.g., EDT and T20 with mean absolute errors around 0.017–0.023 s and clarity C50 within ≈0.9 dB—demonstrating robust generalization across diverse conditions. Building on these EDCs, we reconstruct full room impulse responses (RIRs) via reverse differentiation with a “random sign-sticky” scheme that preserves temporal and spectral structure. Objective evaluations show strong EDC fidelity and RIR similarity (correlations ≈0.37–0.52; spectral MSE ≈55–58 dB), and a MUSHRA listening test finds no significant perceptual differences between reconstructed and reference RIRs, supporting practical use in auralization and real-time rendering. Together, these results outline a scalable pipeline—from room features to perceptually validated RIRs—that can accelerate design, simulation, and interactive audio for the “Speech and Audio in the Northeast” community, including VR/AR, speech enhancement, and room tuning tools.
- Synthetic Speech Detection Under Distribution Shifts: Benchmarking and Few-shot Adaptation
Ashi Garg, Zexin Cai, Lin Zhang, Henry Li Xinyuan, Leibny Paola García-Perera, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
Johns Hopkins University
Abstract: The problem of synthetic speech detection has received considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent can low error rates on academic benchmarks translate to more realistic conditions? In practice, while the training set is fixed at one point in time, test-time conditions often exhibit distribution shifts relative to the training conditions, such as changes in speaker characteristics, emotional expressiveness, language and acoustic conditions, and the emergence of novel synthesis methods.
To enable systematic benchmarking of model performance under distribution shifts, we introduce ShiftySpeech, a large-scale benchmark comprising over 3,000 hours of synthetic speech across 7 source domains, 8 TTS systems including both open-source and commercial systems, 12 vocoders, and 3 languages. ShiftySpeech is specifically designed to evaluate model generalization under controlled distribution shifts while ensuring broad coverage of modern synthetic speech generation techniques.
Few-shot learning methods offer a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper the zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to 32% relative EER reduction on deepfakes in Japanese and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
These results highlight the need for both robust evaluation benchmarks and adaptive detection methods to ensure reliability under evolving real-world conditions.
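For readers unfamiliar with prototypical networks, here is a minimal sketch of few-shot adaptation in that style (without the self-attention component the abstract mentions): class prototypes are the mean embeddings of a handful of in-distribution samples, and a test clip is scored by its distance to the bona fide vs. spoof prototypes. The embedding dimensions and stand-in data are assumptions.

```python
# Prototype-based few-shot classification over detector embeddings.
import torch

def prototypes(support_emb, support_labels, n_classes=2):
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def classify(query_emb, protos):
    d = torch.cdist(query_emb, protos) ** 2   # squared Euclidean distance
    return (-d).softmax(dim=-1)               # (n_query, n_classes) probabilities

# Hypothetical embeddings from a detector backbone: 10 in-distribution
# support samples (5 bona fide, 5 synthetic) and 3 query clips.
support = torch.randn(10, 256)
labels = torch.tensor([0] * 5 + [1] * 5)
query = torch.randn(3, 256)
probs = classify(query, prototypes(support, labels))
```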
- Long-Form Fuzzy Speech-to-Text Alignment for 1000+ Languages
Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Jeff Hwang, Vineel Pratap, Ju Lin, Ming Sun, Florian Metze
Meta
Abstract: Conventional speech-to-text forced alignment typically operates at the utterance level. In practice, however, we do not usually have short segments (e.g., 10 seconds) of audio with exact, verbatim transcriptions (e.g., the LibriSpeech corpus) as in lab conditions. Instead, audio often comes in long-form (e.g., an hour-long lecture recording), and the available transcription may be non-verbatim or include unspoken annotations, making it misaligned with the actual speech. This motivates the need for long-form fuzzy speech-to-text alignment, which has practical applications -- for example, preparing segmented supervised audio data for training machine learning models. We demonstrate the Torchaudio long-form aligner, which supports such use cases. Moreover, it can be equipped with any CTC model that predicts frame-wise labels, turning the model into a robust and powerful aligner.
- USAD: Universal Speech and Audio Representation via Distillation
Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
MIT CSAIL
Abstract: Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types–speech, sound, and music–into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
- Midi-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation
Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang
MIT CSAIL
Abstract: We present Midi-LLM, an LLM for generating multitrack MIDI music from free-form text prompts. Our approach expands a text LLM's vocabulary to include MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI abilities. By preserving the LLM’s weight signature, we can directly leverage the vLLM library for accelerated inference. Experiments show that Midi-LLM achieves higher quality, better text control, and faster inference compared to the recent Text2midi model. Live and static demos at https://midi-llm-demo.vercel.app.
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass
MIT, National Taiwan University (NTU), Academia Sinica
Abstract: Conversational Spoken Language Models (SLMs) promise natural, real-time interaction, yet their ability to manage temporal dynamics, such as timing, tempo, and simultaneous speaking, remains underexplored. To address this gap, we introduce the Game-Time Benchmark, a framework that systematically evaluates SLMs on both basic instruction-following tasks and advanced tasks with temporal constraints, including tempo adherence and synchronized responses. Our results reveal that while state-of-the-art models handle simple tasks reasonably well, nearly all struggle under temporal conditions, highlighting persistent challenges in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding research toward more temporally fluent conversational AI.
- Robust Audio Deepfake Detection with Layer Weighted XLS-R
Sawyer Sacks, Angel Carrillo Bermejo
Modulate.ai
Abstract: Recent advances in artificial intelligence have heightened concerns around the misuse of synthetic voices, where malicious actors can deploy highly realistic audio deepfakes to deceive individuals or organizations. To address this threat, we present a robust synthetic voice detection model built on top of XLS-R, a state-of-the-art multilingual speech foundation model pre-trained on nearly 500,000 hours of unlabeled speech spanning more than 100 languages. Our approach introduces a learned weighting mechanism across the model’s hidden layers, enabling the integration of features ranging from low-level acoustic cues in early layers, through mid-level phonetic and prosodic patterns, to high-level semantic and contextual representations in later layers. We fine-tune this architecture on a large-scale, balanced dataset of over half a million audio samples, combining diverse real and synthetic speech from public and internal sources, with variation in speakers, languages, recording conditions, and generation techniques. To further enhance generalization, we incorporate an advanced augmentation pipeline that simulates realistic distortions such as background noise, compression artifacts, and telephony codecs. As a result, our detector achieves strong robustness in real-world scenarios, including phone-call conditions, and demonstrates superior performance in Equal Error Rate (EER) compared to leading open-source baselines as well as competitive proprietary systems.
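The layer-weighting mechanism described above amounts to a learned softmax mixture over the encoder's hidden states. The sketch below is an illustration under assumptions (the XLS-R forward pass is mocked with random hidden states of a plausible shape; head size and pooling are not from the poster).

```python
# Softmax-weighted sum of transformer layer outputs, followed by a small head.
import torch
import torch.nn as nn

class LayerWeightedHead(nn.Module):
    def __init__(self, n_layers: int, dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Linear(dim, 1)  # real-vs-synthetic logit

    def forward(self, hidden_states):
        # hidden_states: (n_layers, batch, frames, dim), e.g. gathered from a
        # model called with output_hidden_states=True.
        w = self.layer_logits.softmax(dim=0).view(-1, 1, 1, 1)
        mixed = (w * hidden_states).sum(dim=0)   # weighted sum over layers
        pooled = mixed.mean(dim=1)               # average over time frames
        return self.head(pooled).squeeze(-1)

n_layers, dim = 25, 1024                          # plausible sizes for XLS-R 300M
hidden = torch.randn(n_layers, 2, 200, dim)       # stand-in activations
logits = LayerWeightedHead(n_layers, dim)(hidden)
```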
- Compositional Audio Representation Learning
Sripathi Sridhar, Mark Cartwright
New Jersey Institute of Technology
Abstract: Large language models (LLMs) are demonstrating remarkable emergent capabilities and are already widely applied in various audio understanding tasks, owing to their extensive world knowledge. Yet, their fundamental semantic understanding of audio events remains underexplored. Our work addresses this by investigating the ability of LLMs to infer audio semantics purely from textual acoustic descriptions, focusing on environmental sound classification with the ESC50 dataset. We prompt LLMs with varying levels of detail, from statistical acoustic attributes to rich natural language descriptions, and examine the influence of class context within prompts. We evaluate four prominent OpenAI models–GPT-4o, GPT-4o mini, o4-mini, and GPT-3.5 Turbo–to determine the impact of training scale, multimodal pre-training, and dedicated reasoning capabilities. Our findings show that while standard acoustic features are discriminative, LLMs struggle to utilize them effectively without contextual information. Contextual cues boost performance, particularly for larger models, but underperform predictions from LLM-generated class descriptions without semantic information. These insights benchmark text-based audio understanding within LLMs and highlight important interactions between prompt design, model scale, and reasoning ability, paving the way for further work at the intersection of LLMs and audio understanding.
- VoiceFX: CLAP-Based Audio Quality Improvement for Singing and Speech
Elena Georgieva, Pablo Ripollés, Brian McFee
New York University
Abstract: We introduce an automatic method for enhancing vocal audio quality in both singing and speech. Using recordings from the LibriSpeech and Smule DAMP datasets, we applied a set of degradations and tested a set of audio effect “remedies” designed to reverse them: a high shelf filter, de-esser, noise reduction, and high-pass filter. We used the CLAP (Contrastive Language-Audio Pretraining) model to estimate recording quality and recommend corrections by comparing audio clips to descriptive text prompts in the shared embedding space. To evaluate our method, we conducted a large-scale listener study with 234 participants and 4,600 ratings. While CLAP-based scores often favored remedies like noisereduce, listeners sometimes preferred the original, unprocessed clips—suggesting that perceptual artifacts introduced by enhancement may offset technical improvements. Our findings underscore the value of human judgment: embedding models can guide enhancement, but perceptual validation remains essential.
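The selection logic described above can be sketched with a few lines of similarity scoring. This is a hedged illustration that assumes CLAP audio and text embeddings have already been computed (the model call is mocked with random vectors); the prompt wording and candidate names are placeholders.

```python
# Score each candidate remedy by CLAP-space similarity to a quality prompt.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pick_remedy(candidate_embeddings, prompt_embedding):
    """candidate_embeddings: dict name -> CLAP audio embedding (np.ndarray)."""
    scores = {name: cosine(emb, prompt_embedding)
              for name, emb in candidate_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical embeddings for the original clip and three processed versions.
dim = 512
cands = {name: np.random.randn(dim)
         for name in ["original", "de-esser", "noisereduce", "high-pass"]}
prompt = np.random.randn(dim)   # embedding of a prompt like "clean, high quality vocals"
best, scores = pick_remedy(cands, prompt)
```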
- Investigating Modality Contribution in Audio LLMs for Music
Giovana Morais, Magdalena Fuentes
New York University
Abstract: Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model’s prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
- Balancing Information Preservation and Disentanglement in Self-Supervised Music Representation Learning
Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello
New York University
Abstract: Recent advances in self-supervised learning (SSL) methods offer a range of strategies for capturing useful representations from music audio without the need for labeled data. While some techniques focus on preserving comprehensive details through reconstruction, others favor semantic structure via contrastive objectives. Few works examine the interaction between these paradigms in a unified SSL framework. In this work, we propose a multi-view SSL framework for disentangling music audio representations that combines contrastive and reconstructive objectives. The architecture is designed to promote both information fidelity and structured semantics of factors in disentangled subspaces. We perform an extensive evaluation on the design choices of contrastive strategies using music audio representations in a controlled setting. We find that while reconstruction and contrastive strategies exhibit consistent trade-offs, when combined effectively, they complement each other; this enables the disentanglement of music attributes without compromising information integrity.
- Parametric Acoustic Field Learning for Efficient Scene-Aware Acoustic Rendering
Yi Wu, Christopher Ick, Agnieszka Roginska, Magdalena Fuentes
New York University
Abstract: Room acoustics are central to immersion in AR/VR and interactive media, yet real-time rendering across diverse environments remains a major challenge. Existing approaches face trade-offs: geometric methods are efficient but inaccurate at low frequencies, wave-based simulations are accurate but prohibitively expensive, and learning-based models either predict full room impulse responses (RIRs), which are unstable and costly to store, or rely on implicit fields that require per-scene training and fail to generalize. Inspired by parametric wave field coding, we introduce Parametric Acoustic Fields (PAFs), a learned representation of four perceptually grounded acoustic parameters: direct sound loudness (LDS), early reflection loudness (LER), early decay time (TER), and late reverberation time (TLR), expressed as dense, scene-aware fields. Our method predicts these parameter maps directly from point cloud inputs, leveraging pretrained scene encoders that capture geometry and object-level semantics. These fields drive a canonical filter bank within a lightweight rendering pipeline, enabling efficient and scalable real-time acoustics across arbitrary source–receiver pairs in novel environments, while still allowing full RIR reconstruction when needed.
- Adapting Music Source Separation Models for Binaural Audio
Richa Namballa, Magdalena Fuentes
New York University
Abstract: Binaural audio is increasingly important for immersive experiences in virtual reality, gaming, and accessibility applications, yet remains underexplored in music source separation (MSS). Existing MSS models process two-channel signals, but it is unclear how well they preserve the spatial cues critical for maintaining the listener’s sense of immersion. We evaluate several popular MSS models on both the stereo and a synthetic binaural version of MUSDB18-HQ, created using head-related transfer functions (HRTFs). Our results show that stereo-trained models often fail to retain spatial information, with the level of degradation depending on architecture and the target instrument. To address this, we investigate two methodologies: retraining an existing architecture with binaural data and introducing a spatial cue-based term into the loss function. Both strategies yield significant improvements in separation quality and spatial integrity, though the current spatial metrics remain limited in reflecting these advancements. Our findings highlight challenges and opportunities for developing MSS models that extend beyond signal fidelity to support truly immersive audio.
- A Modular Approach to Music Generation: Adding Music Controls to Neural Audio Compression Models
Daniel Faronbi, Peter Traver, Juan Bello
New York University
The embedding spaces generated by neural audio compression models can be decoded to high quality audio. However, controlling audio attributes directly in this space has not been explored. We propose a method to control the embedding space across pitch and velocity while maintaining a consistent timbre. To accomplish this task, a model that transforms compressed tokens across these musical parameters is trained. We show how our model trained on compressed tokens performs similarly to transformation models trained on raw audio signals and experiment with using different amounts of hierarchical codebooks to transform the signal. Finally, we show an application of this transformation model for modularly generating music by combining different models, like signal processing blocks directly applied to compressed codes.
- Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds
Andrew Chang, Yike Li, Iran R. Roman, David Poeppel
New York University, Queen Mary University of London, Max Planck Society
Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds, which are essential categories for both human cognition and machine perception. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for basic understanding of speech and auditory sciences, as well as developing audio BCI and cognitive computing.
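Spectrotemporal modulation features are often computed as a 2-D modulation spectrum of a time-frequency representation; the sketch below shows one simple variant (2-D FFT of a log spectrogram) purely to make the idea concrete. It is a simplified stand-in, not the exact feature pipeline used in the paper.

```python
import numpy as np
from scipy.signal import stft

def stm_features(audio, sr, n_fft=512, hop=128):
    """Crude spectrotemporal modulation (STM) representation: magnitude of the
    2-D Fourier transform of a log spectrogram. One axis corresponds to temporal
    modulation, the other to spectral modulation. Simplified illustration only."""
    _, _, spec = stft(audio, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    log_spec = np.log(np.abs(spec) + 1e-8)
    log_spec -= log_spec.mean()                    # remove DC before the 2-D FFT
    return np.abs(np.fft.fftshift(np.fft.fft2(log_spec)))

# Example on one second of noise:
mod = stm_features(np.random.randn(16000), sr=16000)
```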
- The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling
Patrick O'Reilly, Julia Barnett, Hugo Flores García, Annie Chu, Nathan Pruyne, Prem Seetharaman, Bryan Pardo
Northwestern University, Adobe Research
Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.
- Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects
Annie Chu, Patrick O’Reilly, Julia Barnett, Bryan Pardo
Northwestern University
This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx.
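The core recipe, single-instance optimization of effect parameters against a text embedding, can be sketched as below. The CLAP encoders and the differentiable EQ appear as placeholder callables (`embed_audio`, `embed_text`, `apply_eq`), since the abstract does not pin down a specific implementation; this is a sketch of the general idea, not the paper's code.

```python
import torch

def text_guided_fx(audio, prompt, embed_audio, embed_text, apply_eq,
                   n_bands=6, steps=200, lr=1e-2):
    """Optimize EQ gains so that the processed audio moves toward a text prompt
    in a shared text-audio embedding space. `embed_audio`, `embed_text`, and
    `apply_eq` are placeholders standing in for CLAP and a differentiable EQ."""
    gains_db = torch.zeros(n_bands, requires_grad=True)   # effect parameters
    target = embed_text(prompt).detach()                   # fixed text anchor
    opt = torch.optim.Adam([gains_db], lr=lr)
    for _ in range(steps):
        processed = apply_eq(audio, gains_db)              # differentiable FX chain
        sim = torch.cosine_similarity(embed_audio(processed), target, dim=-1)
        loss = -sim.mean()                                 # maximize similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gains_db.detach(), apply_eq(audio, gains_db.detach())
```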
- Adaptive Linearly Constrained Minimum Variance Framework for Volumetric Active Noise Control
Manan Mittal, Ryan M. Corey, Andrew C. Singer
Stony Brook University, University of Illinois Chicago
Traditional volumetric noise control typically relies on multipoint error minimization to suppress sound energy across a region, but offers limited flexibility in shaping spatial responses. This paper introduces a time domain formulation for linearly constrained minimum variance active noise control (LCMV ANC) for spatial control filter design. We demonstrate how the LCMV ANC optimization framework allows system designers to prioritize noise reduction at specific spatial locations through strategically defined linear constraints, providing a more flexible alternative to uniformly weighted multipoint error minimization. An adaptive algorithm based on filtered-x least mean squares (FxLMS) is derived for online adaptation of the filter coefficients. Simulation and experimental results validate the proposed method's noise reduction and constraint adherence, demonstrating effective, spatially selective, and broadband noise control compared to multipoint volumetric noise control.
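For readers unfamiliar with the adaptive piece, the following is a textbook single-channel filtered-x LMS loop. It deliberately omits the LCMV spatial constraints, which are the paper's contribution, and only illustrates the baseline update that the constrained algorithm builds on.

```python
import numpy as np

def fxlms(reference, disturbance, secondary_path, n_taps=64, mu=1e-3):
    """Textbook single-channel FxLMS. The LCMV constraints from the poster are
    NOT included; this only shows the baseline adaptive update."""
    assert len(secondary_path) <= n_taps
    w = np.zeros(n_taps)                   # control filter
    x_buf = np.zeros(n_taps)               # reference history
    y_buf = np.zeros(len(secondary_path))  # control-signal history through S(z)
    fx_buf = np.zeros(n_taps)              # filtered-reference history
    errors = np.empty(len(reference))
    for n in range(len(reference)):
        x_buf = np.roll(x_buf, 1); x_buf[0] = reference[n]
        y = w @ x_buf                                      # anti-noise sample
        y_buf = np.roll(y_buf, 1); y_buf[0] = y
        e = disturbance[n] + secondary_path @ y_buf        # residual at error mic
        fx = secondary_path @ x_buf[:len(secondary_path)]  # reference filtered by S(z)
        fx_buf = np.roll(fx_buf, 1); fx_buf[0] = fx
        w -= mu * e * fx_buf                               # LMS gradient step
        errors[n] = e
    return w, errors
```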
- WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling
Qihui Yang, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
UC San Diego
Despite rapid progress in end-to-end AI music generation, AI-driven modeling of professional Digital Signal Processing (DSP) workflows remains challenging. In particular, while there is growing interest in neural black-box modeling of audio effect graphs (e.g. reverb, compression, equalization), AI-based approaches struggle to replicate the nuanced signal flow and parameter interactions used in professional workflows. Existing differentiable plugin approaches often diverge from real-world tools, exhibiting inferior performance relative to simplified neural controllers under equivalent computational constraints. We introduce WildFX, a pipeline containerized with Docker for generating multi-track audio mixing datasets with rich effect graphs, powered by a professional Digital Audio Workstation (DAW) backend. WildFX supports seamless integration of cross-platform commercial plugins or any plugins in the wild, in VST/VST3/LV2/CLAP formats, enabling structural complexity (e.g., sidechains, crossovers) and achieving efficient parallelized processing. A minimalist metadata interface simplifies project/plugin configuration. Experiments demonstrate the pipeline's validity through blind estimation of mixing graphs, plugin/gain parameters, and its ability to bridge AI research with practical DSP demands.
- FlowSynth: Instrument Generation Through Distributional Flow Matching and Test-Time Search
Qihui Yang, Randal Leistikow, Yongyi Zang
UC San Diego, Smule Labs
Virtual instrument generation requires maintaining consistent timbre across different pitches and velocities, a challenge that existing note-level models struggle to address. We present FlowSynth, which combines distributional flow matching (DFM) with test-time optimization for high-quality instrument synthesis. Unlike standard flow matching that learns deterministic mappings, DFM parameterizes the velocity field as a Gaussian distribution and optimizes via negative log-likelihood, enabling the model to express uncertainty in its predictions. This probabilistic formulation allows principled test-time search: we sample multiple trajectories weighted by model confidence and select outputs that maximize timbre consistency. FlowSynth outperforms the current state-of-the-art TokenSynth baseline in both single-note quality and cross-note consistency. Our approach demonstrates that modeling predictive uncertainty in flow matching, combined with music-specific consistency objectives, provides an effective path to professional-quality virtual instruments suitable for real-time performance.
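The test-time search described above follows a sample-then-select pattern; the sketch below shows that generic pattern with placeholder callables for the stochastic DFM sampler, its confidence score, and the timbre-consistency metric. The exact scoring rule in the paper may differ.

```python
def best_of_n_synthesis(sample_fn, confidence_fn, consistency_fn,
                        condition, n_candidates=8):
    """Generic best-of-N selection: draw several stochastic samples, score each
    by model confidence times timbre consistency, and keep the best. All three
    callables are placeholders; this is not the paper's exact procedure."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        audio = sample_fn(condition)                 # one stochastic trajectory
        score = confidence_fn(audio) * consistency_fn(audio)
        if score > best_score:
            best, best_score = audio, score
    return best, best_score
```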
- Accenlent: An AI-Augmented Intraoral Device and Web Platform for Speech Rehabilitation in Clinical Populations
Hua-Ta Liang, ChunChen Lin, WeiYun Xu, WenChing Li
University of Illinois Urbana-Champaign
Speech and articulation disorders affect a wide range of clinical populations, including individuals with cleft lip and palate, stroke, Parkinson’s disease, and aphasia. Conventional rehabilitation methods often provide limited feedback on hidden articulators or airflow dynamics, making it difficult for patients to self-correct. We introduce Accenlent, a device–platform system designed to provide real-time, multimodal feedback for articulation rehabilitation.
The current prototype focuses on monitoring oral airflow during speech production, allowing users and therapists to assess the adequacy of exhalation patterns critical for clear articulation. The system streams sensor data via a wireless module to a web application, which delivers visual feedback (waveforms, airflow intensity plots) and haptic cues (vibratory signals) to guide patients in adjusting their respiratory support and speech clarity. A backend AI pipeline integrates OpenAI Whisper for automatic speech recognition, GPT-4 for adaptive coaching, and audio analysis libraries such as Librosa to align patient speech with therapeutic targets.
Preliminary evaluations with patients recovering from stroke, as well as individuals with congenital speech disorders and Parkinson’s disease, demonstrate that airflow-based feedback enhances both awareness and self-correction compared to audio-only therapy. Patients reported that haptic cues offered an intuitive way to adjust breath control during speech tasks.
As a next step, we are extending Accenlent with a soft intraoral sensor array capable of capturing tongue pressure and contact patterns, enabling more comprehensive articulatory feedback beyond airflow. Looking further ahead, we are exploring Near Field Communication (NFC) as a strategy to eliminate onboard batteries and chips, thereby improving safety, comfort, and device miniaturization. In parallel, we are preparing for FDA regulatory pathways to ensure clinical-grade safety and efficacy.
At SANE 2025, we will present the current system, preliminary clinical findings, and our roadmap toward multimodal intraoral sensing, wireless power integration, and regulatory translation. We believe Accenlent represents a promising step toward next-generation AI-assisted rehabilitation tools that combine airflow sensing, articulatory feedback, and patient-centered clinical design.
- Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
Dimitrios Bralios, Jonah Casebeer, and Paris Smaragdis
University of Illinois Urbana-Champaign, MIT
Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
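One way to read the "inner bottleneck" idea is as a small trainable module inserted into a frozen autoencoder's latent space and trained only with latent-space losses. The sketch below shows that wiring under those assumptions; the layer sizes and the identity-style loss are illustrative, and the structure-inducing terms from the paper would replace or augment it.

```python
import torch
import torch.nn as nn

class ReBottleneck(nn.Module):
    """Inner bottleneck for a frozen autoencoder's latent space (illustrative)."""
    def __init__(self, latent_dim, inner_dim):
        super().__init__()
        self.down = nn.Linear(latent_dim, inner_dim)
        self.up = nn.Linear(inner_dim, latent_dim)

    def forward(self, z):
        return self.up(self.down(z))

def latent_training_step(rebottleneck, frozen_encoder, optimizer, audio_batch):
    """Train only the inner bottleneck; the base encoder stays fixed."""
    with torch.no_grad():
        z = frozen_encoder(audio_batch)          # latents from the frozen model
    z_hat = rebottleneck(z)
    # Minimal objective: reproduce the latent. User-defined structure terms
    # (channel ordering, semantic alignment, equivariance) would be added here.
    loss = (z_hat - z).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```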
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
NVIDIA, University of Maryland
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to deliberately think before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. AF3 achieves new SOTA results on over 20 (long) audio understanding and reasoning benchmarks. We will open-source all our code, data, and checkpoints upon paper acceptance. Technical Appendix is in Supplementary Material. Demo: https://audioflamingo3.github.io.
- MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha Koroshinadze, Yao Liu, Leibny Paola Garcia Perera, Eleni Zanou, Themos Stafylakis, Joon Son Chung, David Harwath, Chao Zhang, Dinesh Manocha, Alicia Lozano-Diez, Santosh Kesiraju, Sreyan Ghosh, Ramani Duraiswami
University of Maryland, Brno University of Technology, Universidad Autonoma de Madrid, Telefonica, Tsinghua University, KAIST, Phonexia, Middlebury College, Tufts University, Universidad de Buenos Aires, Indian Institute of Technology, Microsoft, Carnegie Mellon University, Universiti Sains Malaysia, Johns Hopkins University, Athens University of Economics and Business, University of Texas at Austin, Shanghai Artificial Intelligence Laboratory
Audio comprehension—including speech, non-speech sounds, and music—is essential for achieving human-level intelligence. Consequently, AI agents must demonstrate holistic audio understanding to qualify as generally intelligent. However, evaluating auditory intelligence comprehensively remains challenging. To address this gap, we introduce MMAU-Pro, the most comprehensive and rigorously curated benchmark for assessing audio intelligence in AI systems. MMAU-Pro contains 5,305 instances, where each instance has one or more audio clips paired with human expert-generated question-answer pairs, spanning speech, sound, music, and their combinations. Unlike existing benchmarks, MMAU-Pro evaluates auditory intelligence across 49 unique skills and multiple complex dimensions, including long-form audio comprehension, spatial audio reasoning, and multi-audio understanding, among others. All questions are meticulously designed to require deliberate multi-hop reasoning, including both multiple-choice and open-ended response formats. Importantly, audio data is sourced directly "from the wild" rather than from existing datasets with known distributions. We evaluate 22 leading open-source and proprietary multimodal AI models, revealing significant limitations: even state-of-the-art models such as Gemini 2.5 Flash and Audio Flamingo 3 achieve only 57.33% and 45.9% accuracy, respectively, approaching random performance in multiple categories. Our extensive analysis highlights specific shortcomings and provides novel insights, offering actionable perspectives for the community to enhance future AI systems' progression toward audio general intelligence.
- Privacy-Aware Ambient Audio Sensing for Healthy Indoor Spaces
Bhawana Chhaglani, Jeremy Gummeson, Prashant Shenoy
University of Massachusetts Amherst
With the majority of our lives spent indoors, it is imperative to maintain healthy indoor air quality for our well-being. The importance of healthy air is further heightened during pandemic or flu season or for immuno-compromised individuals at higher risk of contracting airborne transmission. Indoor airborne transmission poses a significant health risk, yet current monitoring solutions are invasive, costly, or fail to address it directly. My research explores the untapped potential of ambient audio sensing to estimate key transmission risk factors such as ventilation, aerosol emissions, and occupant distribution - non-invasively and in real time. I develop privacy-preserving systems that leverage existing microphones to monitor the whole spectrum of indoor air quality which can have a significant effect on an individual's health. This work lays the foundation for privacy-aware airborne risk monitoring using everyday devices.
- Estimating Acoustic Power Spectral Density with Order Statistic Filters that are Universal over Rank
David Campos Anchieta, John R. Buck
University of Massachusetts Dartmouth
Loud transient signals increase the bias and variance of power spectral density (PSD) estimates in underwater acoustics. Order statistic filters (OSFs) mitigate these challenges, but real time applications require the developer to choose the rank for censoring outliers in advance. Choosing this rank creates a competitive tension. Censoring the observations above a small rank reduces the risk of outliers introducing bias but increases the variance of the PSD estimate. Censoring the observations above a large rank reduces the variance of the PSD estimate but increases the risk of bias from outliers. Additionally, the rate of transients occurring in practical environments often varies over time. This work proposes a performance-weighted blend of OSFs to avoid the need for explicitly choosing a fixed rank. This "mixture-of-experts" technique adapted from universal predictors allows an online algorithm to achieve mean squared error performance that asymptotically rivals the best OSF determined by processing the same data offline in batch mode. Simulations and underwater acoustic data confirm the performance of the blended OSF. [Work supported by the US Office of Naval Research]
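A mixture-of-experts over ranks in the universal-prediction style can be sketched as an exponentially weighted average of rank-k order statistics, with each rank's weight driven by its running prediction error. The learning rate, loss choice, and weighting below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def blended_osf_psd(periodograms, window=32, eta=1e-3):
    """Exponentially weighted blend of order statistic filters over rank, for a
    single frequency bin. Each expert outputs the k-th order statistic of the
    last `window` periodogram values; experts are weighted by
    exp(-eta * cumulative squared prediction error) against the next sample.
    Generic universal-prediction sketch, not the paper's algorithm."""
    losses = np.zeros(window)                 # cumulative loss per rank expert
    estimates = []
    for n in range(window, len(periodograms)):
        block = np.sort(periodograms[n - window:n])   # experts: order statistics
        weights = np.exp(-eta * losses)
        weights /= weights.sum()
        estimates.append(weights @ block)             # blended PSD estimate
        losses += (block - periodograms[n]) ** 2      # update each expert's loss
    return np.array(estimates)

# Example: exponentially distributed periodogram samples with rare loud transients.
rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=2000)
x[rng.random(2000) < 0.01] *= 50.0
psd_track = blended_osf_psd(x)
```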
- LJ-Spoof: A Single-Speaker, Variation-Dense Corpus for Anti-Spoofing and Synthetic Source Tracing
Surya Subramani, Hashim Ali, Hafiz Malik
University of Michigan
Speech anti-spoofing and synthesis-source tracing are central challenges in audio forensics. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans one speaker, including studio-quality recordings, 30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and over 3 million synthetic utterances. Beyond detection, we provide an extended source-tracing analysis that disentangles contributions along key TTS axes: pretraining datasets (lingual coverage, speaker counts, and hours), input types (text, phonemes, audio codes), acoustic and text encoders, generative controls (temperature, speaking rate, ODE steps), generation regime (autoregressive vs. non-autoregressive), acoustic model details (training objective, backbone, and output representation such as mel or discrete codes), waveform decoder choice (neural vocoder or neural codec), and training type (zero-shot prompting (ZS-TTS), speaker-specific pre-training (SS-TTS), or fine-tuning (FT-TTS)). This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source attribution, and positions LJ-Spoof as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.
- Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System
Hashim Ali, Surya Subramani, Lekha Bollinani, Nithin Sai Adupa, Sali El-Loh, Hafiz Malik
University of Michigan
The SAFE Challenge evaluates synthetic speech detection across three tasks: unmodified audio, processed audio with compression artifacts, and laundered audio designed to evade detection. We systematically explore self-supervised learning (SSL) front-ends, training data compositions, and audio length configurations for robust deepfake detection. Our AASIST-based approach incorporates WavLM large frontend with RawBoost augmentation, trained on a multilingual dataset of 256,600 samples spanning 9 languages and over 70 TTS systems from CodecFake, MLAAD v5, SpoofCeleb, Famous Figures, and MAILABS. Through extensive experimentation with different SSL front-ends, three training data versions, and two audio lengths, we achieved second place in both Task 1 (unmodified audio detection) and Task 3 (laundered audio detection), demonstrating strong generalization and robustness.
- One-class classification for Speaker-Specific Audio Spoof Detection
Hashim Ali, Surya Subramani, Sali El-loh
University of Michigan
Advancements in text-to-speech (TTS) and voice conversion (VC) technologies have significantly increased the threat posed by audio spoofing attacks, particularly in high-profile applications such as political or public figure impersonation. Existing binary classification-based Audio Spoof Detection (ASD) methods face critical limitations in generalizing to novel and unseen spoofing techniques due to the growing diversity and sophistication of synthetic speech. This paper presents a speaker-specific framework for detecting audio deepfakes, leveraging self-supervised learning (SSL) embeddings and one-class classification to address these challenges. The proposed methodology employs a one-class Support Vector Machine (SVM) trained exclusively on genuine speech samples from individual speakers to identify deviations indicative of synthetic speech. We conducted extensive evaluations on controlled datasets (ASVSpoof 2019 and DFADD) and real-world scenarios (In-The-Wild and political figures dataset) to demonstrate the effectiveness of this approach. Using vision transformer-based SSL embeddings (e.g., SSAST), our method achieves remarkably low Equal Error Rates (EERs) of 1.30%, 1.82%, and 1.99% across these datasets, substantially outperforming established baselines including AASIST, RawNet2, and wav2vec2-AASIST. Our results highlight that speaker-specific modeling offers superior robustness to novel spoofing attacks and is particularly valuable for protecting known individuals in high-stakes applications.
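The classifier half of this pipeline maps naturally onto scikit-learn's OneClassSVM. The sketch below assumes SSL embeddings have already been extracted for the target speaker's genuine speech; the kernel and nu value are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def train_speaker_detector(genuine_embeddings, nu=0.05):
    """Fit a one-class SVM on SSL embeddings of genuine speech from one speaker.
    Hyperparameters here are illustrative, not those reported in the paper."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    model.fit(genuine_embeddings)
    return model

def is_genuine(model, embedding):
    """+1 -> consistent with the speaker's genuine speech, -1 -> likely spoof."""
    return model.predict(embedding.reshape(1, -1))[0] == 1

# Dummy usage with random 768-dim "embeddings":
genuine = np.random.randn(200, 768)
detector = train_speaker_detector(genuine)
print(is_genuine(detector, np.random.randn(768)))
```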
- Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion
Yu Zhang, Baotong Tian, Zhiyao Duan
University of Rochester
Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the speaker representation of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics.
- Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation
Frank Cwitkowitz, Zhiyao Duan
University of Rochester
Multi-Pitch Estimation (MPE) continues to be a sought-after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both monophonic and polyphonic pitch estimation, but these still remain inferior to supervised methods. In this work, we extend the classic supervised MPE paradigm by incorporating several self-supervised objectives based on pitch-invariant and pitch-equivariant properties. This joint training results in a substantial improvement under closed training conditions, which naturally suggests that applying the same objectives to a broader collection of data will yield further improvements. However, in doing so we uncover a phenomenon whereby our model simultaneously overfits to the supervised data while degenerating on data used for self-supervision only. We demonstrate and investigate this and offer our insights on the underlying problem.
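In its simplest form, a pitch-equivariance objective asks the model's salience output for a pitch-shifted input to equal the pitch-shifted output for the original input. The sketch below expresses that constraint with a placeholder model and a bin-roll as a stand-in for transposition; the paper's actual objectives may be formulated differently.

```python
import torch

def equivariance_loss(model, spec, shift_bins):
    """Self-supervised pitch-equivariance term: shifting the input along the
    frequency axis should shift the multi-pitch salience map by the same amount.
    `model` maps (batch, freq, time) -> (batch, pitch_bins, time) and is a
    placeholder; torch.roll (which wraps around) is a simplification of a true
    pitch shift, and assumes input and output bins share the same resolution."""
    shifted_in = torch.roll(spec, shifts=shift_bins, dims=1)
    out_of_shifted = model(shifted_in)
    shifted_out = torch.roll(model(spec), shifts=shift_bins, dims=1)
    return (out_of_shifted - shifted_out).pow(2).mean()
```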
- Towards Perception-Informed Latent HRTF Representations
You Zhang, Andrew Francl, Ruohan Gao, Paul Calamia, Zhiyao Duan, Ishwarya Ananthabhotla
University of Rochester, Meta, University of Maryland
Personalized head-related transfer functions (HRTFs) are essential for ensuring a realistic auditory experience over headphones, because they take into account individual anatomical differences that affect listening. Most machine learning approaches to HRTF personalization rely on a learned low-dimensional latent space to generate or select custom HRTFs for a listener. However, these latent representations are typically learned in a manner that optimizes for spectral reconstruction but not for perceptual compatibility, meaning they may not necessarily align with perceptual distance. In this work, we first study whether traditionally learned HRTF representations are well correlated with perceptual relations using auditory-based objective perceptual metrics; we then propose a method for explicitly embedding HRTFs into a perception-informed latent space, leveraging a metric-based loss function and supervision via Metric Multidimensional Scaling (MMDS). Finally, we demonstrate the applicability of these learned representations to the task of HRTF personalization. We suggest that our method has the potential to render personalized spatial audio, leading to an improved listening experience.
- Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
Ali Vosoughi*, Yongyi Zang*, Qihui Yang, Nathan Paek, Randal Leistikow, Chenliang Xu
University of Rochester, Smule Labs, UC San Diego, Stanford University
We present PromptReverb, the first system capable of generating high-quality room impulse responses (RIRs) from natural language descriptions. Our two-stage framework combines a variational autoencoder for upsampling band-limited RIRs to full-band quality (48 kHz) with a conditional diffusion transformer using rectified flow matching that generates RIRs from text prompts. We developed a caption-then-rewrite pipeline using vision-language models and LLMs to create diverse training data from existing datasets. PromptReverb achieves 8.8% mean RT60 error compared to -37% for existing baselines, enabling practical applications in VR, gaming, and audio production without requiring panoramic imagery, depth estimation, or acoustic expertise.
- Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition
Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen
University of Texas at Dallas
One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM to operate on a discretized audio representation that better reflects the linguistic structure. We further create a “soft discretization” of the audio representation by updating the codebook and performing a weighted sum over the codebook embeddings. Empirical results demonstrate that our proposed method significantly improves upon the LLM-based ASR baseline, particularly in out-of-domain conditions. This work highlights the potential of soft discretization as a modality bridge in LLM-based ASR.
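As described, the "soft discretization" step amounts to a weighted sum over the LLM's embedding table. The sketch below shows one way to write it; the dot-product similarity, softmax weighting, and temperature are assumptions about details the abstract leaves open.

```python
import torch
import torch.nn.functional as F

def soft_discretize(audio_features, llm_embedding_table, temperature=1.0):
    """Map continuous audio-encoder outputs onto the LLM input space as a convex
    combination of its token embeddings.

    audio_features      : (batch, frames, dim) continuous encoder outputs
    llm_embedding_table : (vocab, dim) LLM input embedding matrix (the codebook)
    Similarity measure and temperature are illustrative assumptions.
    """
    logits = audio_features @ llm_embedding_table.T       # (batch, frames, vocab)
    weights = F.softmax(logits / temperature, dim=-1)     # soft code assignment
    return weights @ llm_embedding_table                  # (batch, frames, dim)

# Dummy usage:
feats = torch.randn(2, 50, 1024)
table = torch.randn(32000, 1024)
soft_tokens = soft_discretize(feats, table)
```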




