SANE 2024 - Speech and Audio in the Northeast

October 17, 2024

[Photo: Cambridge skyline]

SANE 2024, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Thursday October 17, 2024 at Google, in Cambridge, MA.

It is the 11th edition in the SANE series of workshops, which started in 2012 and is typically held every year alternately in Boston and New York. Since the first edition, the audience has steadily grown, with a record 200 participants and 51 posters in 2023.

SANE 2024 will feature invited talks by leading researchers from the Northeast as well as from the wider community. It will also feature a lively poster session, open to both students and researchers.

Details

  • Date: Thursday, October 17, 2024
  • Venue: Google, Cambridge, MA.
  • Time: Typically, SANE talks start around 8:30-9am and finish around 5:30-6pm.

Confirmed Speakers

Click on the talk title to jump to the abstract and bio.

Poster Session (Submission deadline: 09/30)

If you would like to present a poster, please send an email to with a brief abstract describing what you plan to present. Poster submission deadline is September 30.
Note that SANE is not a peer-reviewed workshop, so you can present work that has been or will be submitted elsewhere (provided the other venue does not explicitly forbid it). As SANE participants will be a mix of audio signal processing, speech, and machine learning researchers, the poster session will be a great opportunity to foster discussion and to get feedback and comments on your most recent work from a variety of perspectives.

Registration

Registration is free but required. We will only be able to accommodate a limited number of participants, so we encourage those interested in attending this event to register as soon as possible by sending an email to with your name and affiliation.

Directions

The workshop will be hosted at Google, in Cambridge MA. Google Cambridge is located at 325 Main St, right above the Kendall/MIT station on the T (i.e., subway) Red Line.

Please enter the lobby of 325-355 Main St; the entrance to 325 will be on your right. You will be able to pick up your badge there and walk up the stairs to the conference space on the 3rd floor. If you need to use an elevator, please let staff know.

We strongly suggest using public transportation to get to the venue. If you need parking, there are a number of public lots in the area, including the Kendall Center "Green" Garage (90 Broadway, Cambridge, MA 02142) and the Kendall Center "Yellow" Garage (75 Ames Street, Cambridge, MA 02142).

Organizing Committee

 

Sponsors

SANE remains a free workshop thanks to the generous contributions of the sponsors below.

MERL Google

Talks

 

(Title to be announced soon)

Chris Donahue

Carnegie Mellon University

Chris Donahue

Chris Donahue is an Assistant Professor in the Computer Science Department at CMU, and a part-time research scientist at Google DeepMind working on the Magenta project. His research goal is to develop and responsibly deploy generative AI for music and creativity, thereby unlocking and augmenting human creative potential. In practice, this involves improving machine learning methods for controllable generative modeling of music and audio, and deploying real-world interactive systems that allow anyone to harness generative music AI to accomplish their creative goals through intuitive forms of control. Chris’s research has been featured in live performances by professional musicians like The Flaming Lips, and also empowers hundreds of daily users to convert their favorite music into interactive content through his website Beat Sage. His work has also received coverage from MIT Tech Review, The Verge, Business Insider, and Pitchfork. Before CMU, Chris was a postdoctoral scholar in the CS department at Stanford advised by Percy Liang. Chris holds a PhD from UC San Diego where he was jointly advised by Miller Puckette (music) and Julian McAuley (CS).

 

 

Frontiers of Speech Synthesis: Controllability, Expressiveness, and Natural Conversations

Zhiyao Duan

University of Rochester

Speech synthesis research has made profound progress in the last decade. State-of-the-art text-to-speech and voice conversion systems are able to synthesize speech with high quality that is often indistinguishable from bona fide speech to human listeners. However, such systems still lack controllability and expressiveness, and they show limited naturalness in conversational settings. In this talk, I will argue that controllability, expressiveness, and natural conversations are the new frontiers of speech synthesis research, and I will present our recent work on these frontiers. Specifically, I will introduce ControlVC, a voice conversion system that allows users to control pitch and speed dynamically; GTR-Voice, our attempt to extend the definition of expressiveness to articulatory phonetics, as professional voice actors do; and Parakeet, a system that can synthesize conversational speech with natural pauses, interruptions, and nonverbal events.

Zhiyao Duan

Zhiyao Duan is an associate professor in Electrical and Computer Engineering, Computer Science and Data Science at the University of Rochester. He is also a co-founder of Violy, a music tech company for instrument education. He received his B.S. in Automation and M.S. in Control Science and Engineering from Tsinghua University, China, in 2004 and 2008, respectively, and received his Ph.D. in Computer Science from Northwestern University in 2013. His research interest is in computer audition and its connections with computer vision, natural language processing, and augmented and virtual reality. He received a best paper award at SMC 2017, a best paper nomination at ISMIR 2017, and a CAREER award from the National Science Foundation (NSF). His research is funded by NSF, NIH, NIJ, the New York State Center of Excellence in Data Science, and University of Rochester internal awards on AR/VR, health analytics, and data science. He is the President of the International Society for Music Information Retrieval (ISMIR).

 

 

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Mark Hamilton

MIT

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV also outperforms the previous state of the art, ImageBind, on cross-modal retrieval using fewer than half of the parameters.
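
To make the contrast between "global" and "dense" audio-visual comparison concrete, here is a minimal, hypothetical sketch in PyTorch. The tensor shapes, function names, and the particular aggregation (max over pixels, then mean over audio frames) are illustrative assumptions, not DenseAV's actual implementation.

```python
# Hypothetical sketch: "global" vs. "dense" audio-visual similarity.
# Shapes and aggregation choices are illustrative, not DenseAV's actual code.
import torch

def global_similarity(img_feats, aud_feats):
    # img_feats: (B, D, H, W) dense visual features
    # aud_feats: (B, D, T)    dense audio features
    img_global = img_feats.mean(dim=(2, 3))   # (B, D) pooled image embedding
    aud_global = aud_feats.mean(dim=2)        # (B, D) pooled audio embedding
    return img_global @ aud_global.t()        # (B, B) clip-level similarities

def dense_similarity(img_feats, aud_feats):
    # Compare every pixel with every audio frame before pooling, so the model
    # is rewarded for localized matches (a word <-> an object, a sound <-> a region).
    B, D, H, W = img_feats.shape
    img = img_feats.flatten(2)                             # (B, D, H*W)
    sim = torch.einsum('bdn,cdm->bcnm', img, aud_feats)    # (B, B, H*W, T)
    # Aggregate: best-matching pixel for each audio frame, then average over time.
    return sim.max(dim=2).values.mean(dim=2)               # (B, B) clip-level similarities
```

Either similarity matrix can feed a standard contrastive (InfoNCE) loss; the point of the dense variant is that the pre-pooling similarity volume retains a per-pixel, per-frame signal that can be read out for localization, whereas the globally pooled version discards it.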

Mark Hamilton

Mark Hamilton is a PhD student in William T. Freeman's lab at the MIT Computer Science & Artificial Intelligence Laboratory and a Senior Engineering Manager at Microsoft. Mark’s research aims to discover "structure" in complex systems using unsupervised learning and large foundation models. His prior works include STEGO, a system capable of classifying every pixel of the visual world without any human supervision, and FeatUp, an algorithm for increasing the spatial or temporal resolution of any foundation model by 16-32x. He values working on projects for social, cultural, and environmental good and aims to use ML to empower scientific discovery.

 

 

Computational models of auditory and language processing in the human brain

Greta Tuckute

MIT

Advances in machine learning have led to powerful models for audio and language, proficient in tasks like speech recognition and fluent language generation. Beyond their immense utility in engineering applications, these models offer valuable tools for neuroscience. In this talk, I will demonstrate how these artificial neural network models can be used to understand how the human brain processes language. The first part of the talk will focus on how audio neural networks can help us understand how different parts of the human auditory cortex support auditory behavior. The second part will examine the similarities between language processing in large language models and language processing in the human brain, and, critically, how we can leverage these models to gain insights into brain processes that have previously been out of reach.
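
For readers unfamiliar with how model representations are typically related to brain recordings in this line of work, the sketch below shows a generic cross-validated encoding-model analysis (ridge regression from model features to measured responses). It is a common recipe in the field, offered only as an illustration; the function names and data shapes are assumptions, and the talk's specific methods may differ.

```python
# Hypothetical sketch of a standard "encoding model" analysis: predict brain
# responses from ANN features and score held-out prediction accuracy.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_model_score(model_activations, brain_responses, n_splits=5):
    """model_activations: (n_stimuli, n_features) ANN features per stimulus.
    brain_responses:      (n_stimuli, n_voxels) measured responses to the same stimuli.
    Returns the mean held-out Pearson r per voxel."""
    scores = np.zeros(brain_responses.shape[1])
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in kf.split(model_activations):
        # Fit a multi-output ridge regression from model features to all voxels.
        reg = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(
            model_activations[train], brain_responses[train])
        pred = reg.predict(model_activations[test])
        # Accumulate per-voxel correlation between predicted and measured responses.
        for v in range(brain_responses.shape[1]):
            scores[v] += np.corrcoef(pred[:, v], brain_responses[test, v])[0, 1]
    return scores / n_splits
```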

Greta Tuckute

Greta Tuckute is a PhD candidate in the Department of Brain and Cognitive Sciences at MIT. Before joining MIT, she obtained her bachelor’s and master’s degrees at the University of Copenhagen in Denmark. Greta works at the intersection of neuroscience, artificial intelligence, and cognitive science. She is interested in understanding how language is processed in the human brain and how the representations learned by humans compare to those of artificial systems.

 

 

 

 

Speaker diarization at Google: From modularized systems to LLMs

Quan Wang

Google

In this talk, we will introduce the development and evolution of speaker diarization technologies at Google over the past decade, and how they have been deployed in impactful products such as Cloud Speech-to-Text and the Pixel Recorder app. The talk will cover four critical milestones of speaker diarization at Google: (1) leveraging deep speaker embeddings; (2) leveraging supervised clustering; (3) leveraging sequence transducers; and (4) leveraging large language models. The talk will also discuss how speaker diarization will evolve in the new era of multimodal large language models.
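
As background for milestones (1) and (2), the sketch below illustrates the classic embedding-then-cluster diarization recipe in simplified form. The windowing, distance threshold, and clustering choice are illustrative assumptions, not a description of Google's production systems.

```python
# Hypothetical sketch of the embedding-and-clustering diarization recipe.
# Threshold and clustering settings are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(window_embeddings, window_times, distance_threshold=0.7):
    """window_embeddings: (N, D) speaker embeddings (e.g., d-vectors) for
    fixed-length speech windows; window_times: list of (start, end) seconds."""
    # L2-normalize so Euclidean distance tracks cosine distance.
    emb = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    # Cluster windows without knowing the number of speakers in advance.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(emb)
    # Merge consecutive windows with the same label into speaker turns.
    segments = []
    for (start, end), spk in zip(window_times, labels):
        if segments and segments[-1][2] == spk and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], max(segments[-1][1], end), spk)
        else:
            segments.append((start, end, spk))
    return [{"start": s, "end": e, "speaker": f"spk_{k}"} for s, e, k in segments]
```

The later milestones in the talk replace parts of this pipeline, for example by predicting speaker turns directly with sequence transducers or by post-processing transcripts with large language models.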

Quan Wang

Quan Wang is a Senior Staff Software Engineer at Google, leading the Hotword Modeling team and the Speaker, Voice & Language team. Quan is an IEEE Senior Member and a former Machine Learning Scientist on the Amazon Alexa team. He received his B.E. degree from Tsinghua University and his Ph.D. degree from Rensselaer Polytechnic Institute. Quan is the author of the award-winning Chinese textbook "Voice Identity Techniques: From core algorithms to engineering practice", and the instructor of the bestselling course "Speaker Recognition" on Udemy and Udemy Business.