SANE 2016 - Speech and Audio in the Northeast

October 21, 2016


The workshop is now over. Slides for the talks are available through the links in the schedule below.

SANE 2016, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, was held on Friday October 21, 2016 at MIT, in Cambridge, MA.

It was a follow-up to SANE 2012 (Mitsubishi Electric Research Labs - MERL), SANE 2013 (Columbia University), SANE 2014 (MIT CSAIL), and SANE 2015 (Google NY). Since the first edition, the audience has steadily grown, gathering 140 researchers and students in 2015.

SANE 2016 featured invited talks by leading researchers from the Northeast, as well as from the international community. It also featured a lively poster session during lunch time, open to both students and researchers, with 14 posters.


Date: Friday, October 21, 2016
Venue: McGovern Institute for Brain Research at MIT,
MIT Bldg 46, 43 Vassar Street,
Cambridge, MA


8:40-9:10 Registration and Breakfast
9:10-9:15 Welcome [Slides]
9:15-10:00 Jesse Engel (Google)
"Understanding music with deep generative models of sound" [Slides]
10:00-10:45 Juan P. Bello (NYU)
"Towards Multiple Source Identification in Environmental Audio Streams" [Slides]
10:45-11:15 Coffee Break
11:15-12:00 Nima Mesgarani (Columbia University)
"Reverse engineering the neural mechanisms involved in robust speech processing" [Slides]
12:00-12:45 Shinji Watanabe (MERL)
"Pushing the envelope at both ends — beamforming acoustic models and joint CTC/Attention schemes for end-to-end ASR" [Slides]
12:45-3:00 Lunch / Poster Session
3:00-3:45 Dan Ellis (Google)
"Sound Event Recognition at Google" [Slides]
3:45-4:30 Josh McDermott (MIT)
"Statistics of natural reverberation enable perceptual separation of sound and space" [Slides]
4:30-5:15 William T. Freeman (MIT/Google)
"Visually Indicated Sounds" [Slides]
5:15-5:20 Closing remarks
5:20-... Drinks at Cambridge Brewing Company (map)

All talks: Singleton Auditorium (46-3002), 3rd Floor, McGovern Institute for Brain Research at MIT
Lunch/Poster: Atrium, 3rd Floor, McGovern Institute for Brain Research at MIT

Poster Session

  • "Nonnegative tensor factorization with frequency modulation cues for blind audio source separation"
    Elliot Creager, Noah D. Stein, Roland Badeau, Philippe Depalle (Analog Devices Lyric Labs; CNRS, Télécom ParisTech, Université Paris-Saclay; McGill University)
  • "Higher-order acoustic-prosodic entrainment behaviors"
    Min Ma, Rivka Levitan (City University of New York)
  • "Cortical Responses to Natural and ‘Model-Matched’ Sounds Reveal a Computational Hierarchy in Human Auditory Cortex"
    Sam V. Norman-Haignere, Josh H. McDermott (MIT)
  • "CNN-Based Speech Activity Detection over Telephone Conversations"
    Diego Augusto Silva, Luís Gustavo D. Cuozzo, José Augusto Stuchi and Ricardo P. Velloso Violato (CPqD Foundation)
  • "Visual Features for Context-Aware Speech Recognition"
    Abhinav Gupta, Yajie Miao, Leonardo Neves, and Florian Metze (Carnegie Mellon University)
  • "Transcribing Piano Performances into Music Notation"
    Andrea Cogliati, Brendt Wohlberg, Zhiyao Duan (University of Rochester, Los Alamos National Laboratory)
  • "Environmental statistics enable perceptual separation of sound and space"
    James A. Traer and Josh H. McDermott (MIT)
  • "SoundNet: Learning Sound Representations from Unlabeled Video"
    Yusuf Aytar, Carl Vondrick, Antonio Torralba (MIT)
  • "Learning Mid-Level Codes for Natural Sounds"
    Wiktor Mlynarski, Josh H. McDermott (MIT)
  • "An MEG signature of perceived auditory spatial extent"
    Santani Teng, Verena Sommer, Dimitrios Pantazis, Aude Oliva (MIT)
  • "A wearable ultrasonic echolocation-based navigation aid"
    Ian Reynolds, Suma Anand, Temitope Olabinjo, Santani Teng (MIT)
  • "Customizing the Experience for Kids using Speech and Usage"
    Denys Katerenchuk, Craig Murray, Vamsi Potluru (Comcast Research, City University of New York)
  • "Learning a non-intrusive assessor of naturalness-of-speech"
    Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A. Saurous, D. Sculley (Google)
  • "Associating Players to Sound Sources in Musical Performance Videos"
    Bochen Li, Karthik Dinesh, Zhiyao Duan and Gaurav Sharma (University of Rochester)


Registration is free but required. We only have a few slots left, so we encourage those interested in attending this event to register as soon as possible by sending an email with your name and affiliation.


The workshop will be hosted by MIT's Brain and Cognitive Sciences Department, at the McGovern Institute for Brain Research, MIT Bldg 46, 43 Vassar Street, Cambridge, MA 02139. The entrance is located on Main Street between Vassar and Albany Streets.

Organizing Committee









Understanding music with deep generative models of sound

Jesse Engel


Architectural lessons from applying deep convolutional discriminative models to images have in turn led to rapid advances in the generative modeling of images. Until recently, despite the dominance of deep networks in speech recognition tasks, similar improvements in acoustic generative models have not been seen. We explore several recent advancements in generating audio from deep network architectures, including autoregressive models (à la WaveNet), predicting collaborative filter embeddings for acoustic hallucination, and learning latent variable embeddings of individual instruments.

Jesse Engel

Jesse Engel is a Research Scientist at Google Brain in Mountain View, CA. Jesse joined Google from Andrew Ng's group at Baidu's Silicon Valley AI Lab, where he was a key contributor to the Deep Speech 2 end-to-end speech recognition system and a primary author of the lab's deep learning framework. Before that, he obtained his PhD at UC Berkeley in Materials Science, and was a postdoc at UC Berkeley and Stanford University in neuromorphic computing.



Towards Multiple Source Identification in Environmental Audio Streams

Juan P. Bello


Automatic sound source identification is a fundamental task in machine listening with a wide range of applications in environmental sound analysis including the monitoring of urban noise and bird migrations. In this talk I will discuss our efforts at addressing this problem, including data collection, annotation and the systematic exploration of a variety of methods for robust classification. I will discuss how simple feature learning approaches such as spherical k-means significantly outperform off-the-shelf methods based on MFCC, given large codebooks trained with every possible shift of the input representation. I will show how the size of codebooks, and the need for shifting data, can be reduced by using convolutional filters, first by means of the deep scattering spectrum, and then as part of deep convolutional neural networks. As model complexity increases, however, performance is impeded by the scarcity of labeled data, a limitation that we partially overcome with a new framework for audio data augmentation. While promising, these solutions only address simplified versions of the real-world problems we wish to tackle. At the end of the talk, I’ll discuss various steps we’re currently undertaking to close that gap.

Juan P. Bello

Juan Pablo Bello is Associate Professor of Music Technology, and Electrical & Computer Engineering, at New York University, with a courtesy appointment at NYU’s Center for Data Science. In 1998 he received a BEng in Electronics from the Universidad Simón Bolívar in Caracas, Venezuela, and in 2003 he earned a doctorate in Electronic Engineering at Queen Mary, University of London. Juan’s expertise is in digital signal processing, machine listening and music information retrieval, topics that he teaches and in which he has published more than 70 papers and articles in books, journals and conference proceedings. He is the director of the Music and Audio Research Lab (MARL), where he leads research on music and sound informatics. His work has been supported by public and private institutions in Venezuela, the UK, and the US, including a CAREER award from the National Science Foundation and a Fulbright scholar grant for multidisciplinary studies in France.


Reverse engineering the neural mechanisms involved in robust speech processing

Nima Mesgarani

Columbia University

The brain empowers humans with remarkable abilities to navigate their acoustic environment in highly degraded conditions. This seemingly trivial task for normal hearing listeners is extremely challenging for individuals with auditory pathway disorders, and has proven very difficult to model and implement algorithmically in machines. In this talk, I will present the result of an interdisciplinary research effort where invasive and non-invasive neural recordings from human auditory cortex and reverse-engineering methodologies are used to determine the representational and computational properties of speech processing in the human auditory cortex. These findings lead to new biologically informed models incorporating the functional properties of neural mechanisms with potential to decrease the performance gap between biological and artificial computing. A better understanding of the neural mechanisms involved in speech processing can greatly impact the current models of speech perception and lead to human-like automatic speech processing technologies.

Nima Mesgarani

Nima Mesgarani is an assistant professor of Electrical Engineering at Columbia University. He received his Ph.D. from the University of Maryland, where he worked on neuromorphic speech technologies and the neurophysiology of mammalian auditory cortex. He was a postdoctoral scholar at the Center for Language and Speech Processing at Johns Hopkins University and in the neurosurgery department of the University of California San Francisco before joining Columbia in fall 2013. He was named a Pew Scholar for Innovative Biomedical Research in 2015, and received the National Science Foundation Early Career Award in 2016.



Pushing the envelope at both ends — beamforming acoustic models and joint CTC/Attention schemes for end-to-end ASR

Shinji Watanabe


Hand-designed components of conventional ASR systems have, one by one, been superseded by deep learning alternatives. These newer methods simplify the pipeline and allow data-driven optimization of the entire system. They also eliminate the mis-specification of hand-designed models based on assumptions about auditory perception and linguistics. This work investigates steps towards assimilating the remaining components, which occur at both the front and back ends. In the front end, we consider a beamforming acoustic model that replaces microphone array signal processing and acoustic modeling with a single network. The network mimics beamforming, feature extraction, and acoustic model in each layer, and is jointly optimized under the ASR objective function. In the back end, we consider a multi-task learning scheme that combines the advantages of both connectionist temporal classification (CTC) and attention mechanism based sequence-to-sequence recurrent neural networks. We use a common encoder BLSTM network followed by two different decoder networks, one based on CTC and the other on attention mechanisms. We show that both front-end and back-end approaches greatly improve performance, paving the way for combining them into a truly end-to-end multichannel ASR system.

Shinji Watanabe

Shinji Watanabe is a Senior Principal Research Scientist in the Speech and Audio Team at Mitsubishi Electric Research Labs (MERL), in Cambridge, MA. Prior to joining MERL in 2012, Shinji was a research scientist at NTT Communication Science Laboratories in Japan for 10 years, working on Bayesian learning for speech recognition, speaker adaptation, and language modeling. His research interests include speech recognition, spoken language processing, and machine learning.



Sound Event Recognition at Google

Dan Ellis


We are investigating the recognition of "environmental sounds" ranging from laughter to saxophone to doorbell. To be able to apply the latest ideas from image recognition with deep neural networks, we need large amounts of labeled training data. We have investigated using video-level tags, which are both temporally imprecise (whole-soundtrack) and not specifically audio-related, yet, in sufficient quantities, can yield useful classifiers. We are also attempting to define a comprehensive audio-specific vocabulary of sound events which can be used to collect more temporally-precise human annotations.

Dan Ellis

Dan Ellis joined Google as a Research Scientist in 2015 after 15 years leading LabROSA at Columbia University. His current research interests include environmental sound recognition, sound ontologies, and music audio. He is particularly committed to accurate and reproducible research through the use of common datasets and high-quality software development.




Statistics of natural reverberation enable perceptual separation of sound and space

Josh McDermott


In everyday listening, sound reaches our ears directly from a source as well as indirectly via reverberation. Reverberation profoundly distorts the sound from a source, yet humans can both identify sound sources and distinguish environments from the resulting sound, via mechanisms that remain unclear. The core computational challenge is that the acoustic signatures of the source and environment are combined in a single signal received by the ear. We have explored whether our recognition of sound sources and spaces reflects an ability to separate their effects, and whether any such separation is enabled by statistical regularities of real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of 271 spaces sampled from the distribution encountered by humans during daily life. The sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates: mid frequencies reverberated longest while higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans leverage these regularities, we manipulated IR decay characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about spaces around us.

Josh McDermott

Josh McDermott is a perceptual scientist studying sound and hearing in the Department of Brain and Cognitive Sciences at MIT, where he is an Assistant Professor and heads the Laboratory for Computational Audition. His research addresses human and machine audition using tools from experimental psychology, engineering, and neuroscience. McDermott obtained a BA in Brain and Cognitive Science from Harvard, an MPhil in Computational Neuroscience from University College London, a PhD in Brain and Cognitive Science from MIT, and postdoctoral training in psychoacoustics at the University of Minnesota and in computational neuroscience at NYU. He is the recipient of a James S. McDonnell Foundation Scholar Award and an NSF CAREER Award.


Visually Indicated Sounds

William T. Freeman

Massachusetts Institute of Technology and Google

Children may learn about the world by pushing, banging, and manipulating things, watching and listening as materials make their distinctive sounds: dirt makes a thud; ceramic makes a clink. These sounds reveal physical properties of the objects, as well as the force and motion of the physical interaction. We've explored a toy version of that learning-through-interaction by recording audio and video while we hit many things with a drumstick.
We developed an algorithm that predicts sounds from silent videos of the drumstick interactions. The algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We demonstrate that the sounds generated by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that the task of predicting sounds allows our system to learn about material properties in the scene.
Joint work with: Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson

William T. Freeman

William T. Freeman is the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science at MIT, and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) there. He was the Associate Department Head from 2011 to 2014. His current research interests include machine learning applied to computer vision, Bayesian models of visual perception, and computational photography. He received outstanding paper awards at computer vision or machine learning conferences in 1997, 2006, 2009 and 2012, and test-of-time awards for papers from 1990 and 1995. Previous research topics include steerable filters and pyramids, orientation histograms, the generic viewpoint assumption, color constancy, computer vision for computer games, and belief propagation in networks with loops. He is active in the program or organizing committees of computer vision, graphics, and machine learning conferences. He was the program co-chair for ICCV 2005, and for CVPR 2013.