SANE 2015 - Speech and Audio in the Northeast

October 22, 2015

New York City from Google NY Offices

The workshop is now over. Videos and Slides for the talks are available through the links in the schedule below. There is also a YouTube Playlist for all recorded talks.

SANE 2015, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, was held on Thursday October 22, 2015 at Google, in New York City, NY. It broke the attendance record for a SANE event, with 128 participants.

It was a follow-up to SANE 2012, held at Mitsubishi Electric Research Labs (MERL), SANE 2013, held at Columbia University, and SANE 2014, held at MIT, which each gathered 70 to 90 researchers and students.

As in 2013, this year's SANE took place in conjunction with the WASPAA workshop, held October 18-21 in upstate New York. Many WASPAA attendees (47!) also attended SANE.

SANE 2015 featured invited talks by leading researchers from the Northeast, as well as from the international community. It also featured a lively poster session during lunch time, open to both students and researchers, with 20 posters.

Schedule -- Thursday, October 22   [Watch all talks on YouTube]

Click on the talk title to jump to the abstract and bio, and on Poster Session for the list of posters.

8:30-9:10 Registration and Breakfast
9:10-9:15 Welcome [YouTube] [Slides]
9:15-10:00 Ron Weiss (Google) [YouTube] [Slides]
"Training neural network acoustic models on (multichannel) waveforms"
10:00-10:45 Tuomas Virtanen (Tampere University of Technology) [YouTube] [Slides]
"Sound event detection in realistic environments using multilabel deep neural networks"
10:45-11:15 Coffee Break
11:15-12:00 John Hershey (MERL) [YouTube] [Slides]
"Deep clustering: discriminative embeddings for single-channel separation of multiple sources"
12:00-12:45 Pablo Sprechmann (NYU) [YouTube] [Slides]
"Deep learning for solving inverse problems"
12:45-1:00 Poster Setup
1:00-1:45 Lunch (Cafeteria)
1:45-3:30 Poster Session
3:30-4:15 Rohit Prasad (Amazon)
"Spoken Language Understanding for Amazon Echo"
4:15-5:00 Michael Mandel (Brooklyn College, CUNY) [YouTube] [Slides]
"The 2015 Jelinek Workshop on Speech and Language Technology"
5:00-5:45 Paris Smaragdis (UIUC) [YouTube] [Slides]
"NMF? Neural Nets? It’s all the same..."
5:45-6:00 Closing remarks
6:00-... Drinks somewhere nearby


We have reached capacity, but you can ask to be put on the waiting list by sending an email with your name and affiliation to the organizers. SANE is a free event.


The workshop will be hosted at Google, in New York City, NY. Google NY is located at 111 8th Ave, and the closest subway stop is the A, C, and E lines' 14 St station. The entrance is the one to the RIGHT of the apparel shop and the Google logo on this Street View shot.

Organizing Committee



Google MERL






Training neural network acoustic models on (multichannel) waveforms

Dr. Ron Weiss

Google

Instead of starting with mel spectral or similar features, several recent speech recognition results have demonstrated the possibility of training neural network acoustic models directly on the time-domain waveform. Through supervised training, such networks are able to learn a suitable auditory filterbank-like feature representation simultaneously with a discriminative classifier, thereby eliminating the need for hand crafted feature extraction.
In this talk I discuss recent results from training such systems at Google. We have found that integrating long short-term memory layers into the network architecture leads to a 3% relative reduction in word error rate (WER) on noisy data compared to an analogous system trained on mel features. Furthermore, similar waveform acoustic models trained on multichannel waveforms can learn to do spatial filtering and be robust to varying direction of arrival of the target speech signal. Training such a network on inputs captured using multiple microphone array configurations results in a system that is robust to a range of microphone spacings, leading to an 11% relative decrease in WER compared to a single-channel system on data with mismatched spacing.

Ron Weiss

Ron Weiss is a software engineer at Google where he has worked on content-based audio analysis, recommender systems for music, and noise robust speech recognition. Ron completed his Ph.D. in electrical engineering from Columbia University in 2009 where he worked in the Laboratory for the Recognition of Speech and Audio. From 2009 to 2010 he was a postdoctoral researcher in the Music and Audio Research Laboratory at New York University.



Sound event detection in realistic environments using multilabel deep neural networks

Prof. Tuomas Virtanen

Tampere University of Technology

Auditory scenes in our everyday environments such as office, car, street, grocery store, and home consist of a large variety of sound events such as phone ringing, car passing by, footsteps, etc. Computational analysis of sound events has plenty of applications e.g. in context-aware devices, acoustic monitoring, analysis of audio databases, and assistive technologies. This talk describes methods for automatic detection and classification of sound events, which means estimating the start and end times of each event, and assigning them a class label. We focus on realistic everyday environments, where multiple sound sources are often present simultaneously, and therefore polyphonic detection methods need to be used. We present a multilabel deep neural network based system that can be used to directly recognize temporally overlapping sounds. Event detection results on highly realistic acoustic material will be presented, and audio and video demonstrations will be given. We will also introduce the upcoming IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events.
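As a rough illustration of what polyphonic detection involves, the sketch below turns frame-wise, per-class probabilities (the kind of output a multilabel network could produce) into events with start and end times. The abstract does not describe the system's post-processing, so everything here is an assumption: the function name `frames_to_events`, the fixed threshold, the hop size, and the toy probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple


def frames_to_events(
    probs: List[Dict[str, float]],
    threshold: float = 0.5,   # assumed decision threshold, not from the talk
    hop_seconds: float = 0.02,  # assumed frame hop, not from the talk
) -> List[Tuple[str, float, float]]:
    """Convert per-frame class probabilities into (label, onset, offset) events.

    Each class is thresholded independently of the others, so several
    classes may be active in the same frame (overlapping events).
    """
    events: List[Tuple[str, float, float]] = []
    labels = sorted({label for frame in probs for label in frame})
    for label in labels:
        onset = None  # index of the frame where the current event started
        for i, frame in enumerate(probs):
            active = frame.get(label, 0.0) >= threshold
            if active and onset is None:
                onset = i
            elif not active and onset is not None:
                events.append((label, onset * hop_seconds, i * hop_seconds))
                onset = None
        if onset is not None:  # event still active at the end of the clip
            events.append((label, onset * hop_seconds, len(probs) * hop_seconds))
    return events


# Hypothetical network outputs for four one-second frames.
toy_probs = [
    {"footsteps": 0.9, "phone": 0.1},
    {"footsteps": 0.8, "phone": 0.7},  # both classes active: overlap
    {"footsteps": 0.2, "phone": 0.9},
    {"footsteps": 0.1, "phone": 0.6},
]
toy_events = frames_to_events(toy_probs, threshold=0.5, hop_seconds=1.0)
```

In this toy input the two classes overlap in the second frame, which a single-label classifier could not represent.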

Tuomas Virtanen

Tuomas Virtanen is an Academy Research Fellow and Associate Professor (tenure track) at the Department of Signal Processing, Tampere University of Technology (TUT), Finland, where he is leading the Audio Research Group. He received the M.Sc. and Doctor of Science degrees in information technology from TUT in 2001 and 2006, respectively. He has also been working as a research associate at Cambridge University Engineering Department, UK. He is known for his pioneering work on single-channel sound source separation using non-negative matrix factorization based techniques, and their application to noise-robust speech recognition, music content analysis and audio event detection. In addition to the above topics, his research interests include content analysis of audio signals in general and machine learning. He has authored more than 100 scientific publications on the above topics, which have been cited more than 3000 times. He has received the IEEE Signal Processing Society 2012 best paper award for his article "Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria" as well as three other best paper awards. He is an IEEE Senior Member and recipient of the ERC 2014 Starting Grant.


Deep clustering: discriminative embeddings for single-channel separation of multiple sources

Dr. John Hershey

MERL

We address the problem of acoustic source separation in a deep learning framework we call "deep clustering". Previous deep network approaches to source separation have shown promising performance in scenarios where each source belongs to a distinct class of signal, such as a mixture of one speaker and a specific type of noise. However, such "class-based" approaches may not be suitable for "cocktail party" scenarios in which a) the individual sources do not represent distinct classes, and/or b) there are unknown numbers of sources. To approach such cases, we use a deep network to translate the input mixture into a set of contrastive embedding vectors, one for each time-frequency bin of the spectrogram. These embeddings implicitly define a segmentation of the spectrogram that can have any number of sources, and can be trained on arbitrary mixtures, using only segmentation labels. This yields a system closer in spirit to spectral clustering, while avoiding its practical limitations with regard to both learning and computational complexity. The complexity of spectral approaches is avoided by directly training the embeddings to implicitly form a low-rank approximation to an ideal pairwise affinity matrix. Seen another way, this deep clustering objective is equivalent to a k-means objective function, where the embeddings are the data points. The difference from k-means is that the cluster assignments are given by the segmentation, and the embeddings are trained to form compact clusters for each segment. At test time, the segmentation inferred from the input mixtures can be "decoded" in a clustering step, using the same deep clustering objective, this time optimized with respect to the unknown assignments. Preliminary experiments were conducted on single-channel mixtures of speech from multiple speakers. Results show that a speaker-independent model trained on mixtures of two speakers can improve signal quality for mixtures of held-out speakers by an average of 6 dB.
More dramatically, the same model does surprisingly well with three-speaker mixtures despite training only on two-speaker mixtures. Although results are preliminary, we feel the framework has the potential to apply to much more general separation problems, and that applications to different domains, such as microphone array processing or image segmentation, may also be fruitful.
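The low-rank affinity idea in the abstract above can be sketched in a few lines. The abstract gives the objective only in words, so the sketch assumes one common way of writing it: with V the N x D matrix of embeddings (one row per time-frequency bin) and Y the N x C one-hot segmentation matrix, the loss is the squared Frobenius norm of V V^T - Y Y^T. Expanding that norm gives three terms that only need the small matrices V^T V, V^T Y, and Y^T Y, so the N x N affinity matrices are never formed. The function names and the toy data are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def gram(a: Matrix, b: Matrix) -> Matrix:
    """Return a^T b for two matrices that have the same number of rows."""
    return [
        [sum(ra[i] * rb[j] for ra, rb in zip(a, b)) for j in range(len(b[0]))]
        for i in range(len(a[0]))
    ]


def frob2(m: Matrix) -> float:
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(x * x for row in m for x in row)


def deep_clustering_loss(v: Matrix, y: Matrix) -> float:
    """||V V^T - Y Y^T||_F^2, computed from D x D, D x C and C x C products."""
    return frob2(gram(v, v)) - 2.0 * frob2(gram(v, y)) + frob2(gram(y, y))


def naive_loss(v: Matrix, y: Matrix) -> float:
    """Same quantity, comparing every pair of bins explicitly (N^2 terms)."""
    total = 0.0
    for vi, yi in zip(v, y):
        for vj, yj in zip(v, y):
            estimated = sum(p * q for p, q in zip(vi, vj))  # entry of V V^T
            ideal = sum(p * q for p, q in zip(yi, yj))      # entry of Y Y^T
            total += (estimated - ideal) ** 2
    return total


# Toy segmentation: four time-frequency bins, two sources (hypothetical data).
y_toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
v_perfect = [row[:] for row in y_toy]   # embeddings that match the segmentation
v_collapsed = [[1.0, 0.0]] * 4          # every bin mapped to the same point

loss_perfect = deep_clustering_loss(v_perfect, y_toy)
loss_collapsed = deep_clustering_loss(v_collapsed, y_toy)
```

In the toy case, embeddings that reproduce the segmentation give zero loss, while embeddings that collapse all bins onto one point are penalized once for each pair of bins that belongs to different sources.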

John Hershey

Prior to joining MERL in 2010, John spent 5 years at IBM's T.J. Watson Research Center in New York, where he led a team in Noise Robust Speech Recognition. He also spent a year as a visiting researcher in the speech group at Microsoft Research, after obtaining his Ph.D. from UCSD. He is currently working on machine learning for signal separation, speech recognition, language processing, and adaptive user interfaces.



Deep learning for solving inverse problems

Dr. Pablo Sprechmann

NYU

Traditionally, approaches to solve inverse problems in image and signal processing construct models with appropriate priors to regularize the signal estimation. Motivated by the advance of deep learning techniques in classification and recognition tasks, in recent years there has been a shift from this traditional setting towards approaching inverse problems as data-driven non-linear regressions. Given a set of pairs of observed and target signals, these approaches aim at learning a parametric mapping given by a generic neural network architecture with enough capacity to perform the regression. In the first part of the talk I will review this transition in the context of single channel source separation and enhancement. A natural connection between model-based approaches (based on non-negative matrix factorization) and regression-based approaches using deep neural networks will be presented. Some reasons that might explain why this method is particularly successful in this setting will be given and, based on these observations, an approach using multi-scale convolutional networks will be discussed. Despite their great success, these techniques face some limitations when the inverse problem at hand becomes more ill-conditioned. For instance, these limitations emerge when the goal is not just to enhance the observed signal, but to produce a high fidelity version of it, which oftentimes involves multi-modality and uncertainty. The second part of the talk provides an analysis of these limitations and presents a possible alternative to mitigate them by establishing connections with recent work in computer vision.

Pablo Sprechmann

Pablo Sprechmann is currently a postdoctoral researcher in Yann LeCun's group at CILVR lab, Courant Institute for Mathematical Sciences, New York University. He received an MSc degree from the Universidad de la República, Uruguay, in 2009, and a PhD degree in 2012 from the Department of Electrical and Computer Engineering, University of Minnesota. He worked as postdoctoral researcher at the ECE Department, Duke University during 2013. His main research interests include the areas of signal processing, machine learning, and their application to computer vision, as well as audio processing and music information retrieval.


Spoken Language Understanding for Amazon Echo

Dr. Rohit Prasad

Amazon

Amazon Echo is an eyes- and hands-free device designed around voice. Echo connects to Alexa, a cloud-based voice service, to provide information, answer questions, play music, read the news, check sports scores or the weather, control home automation, and more. This talk will give an overview of how Echo works, the key challenges in areas spanning digital signal processing, wake word detection, far-field speech recognition, and natural language understanding, and the algorithmic solutions we have explored for overcoming these challenges.

Rohit Prasad

Rohit Prasad is a Director of Machine Learning at Amazon. He leads a team of scientists, software engineers, and data specialists in advancing speech recognition, natural language understanding, and computer vision technologies to enhance customer interactions with Amazon’s products and services such as Echo, Fire TV, and Dash. Prior to joining Amazon, Rohit was the Deputy Manager and a Sr. Director for the Speech, Language, and Multimedia Business Unit at Raytheon BBN Technologies. In that role, he directed several US Government sponsored R&D efforts and drove technology transition and business development in areas of speech-to-speech translation, psychological health analytics, text classification, document image translation (OCR and handwriting recognition), and STEM learning. Rohit is a named author on over 100 scientific articles and holds several patents.


The 2015 Jelinek Workshop on Speech and Language Technology

Prof. Michael Mandel

Brooklyn College (CUNY)

This past summer, over 50 notable researchers and students spent six weeks at the University of Washington participating in the 20th annual Jelinek Workshop on Speech and Language Technology. They worked in teams to intensively and collaboratively explore four different topics: Far-Field Speech Enhancement and Recognition in Mismatched Settings, Continuous Wide-Band Machine Translation, Probabilistic Transcription of Languages with No Native-Language Transcribers, and Structured Computational Network Architectures for Robust ASR. Previous instances of this workshop have led to such developments in the field as the open source Kaldi and Moses frameworks for automatic speech recognition and machine translation, respectively, the i-vector for speaker identification, and the introduction of weighted finite state transducers and dynamic Bayesian networks into speech recognition. This talk will describe the four projects undertaken in the 2015 workshop and their initial results, with particular focus on the work on far-field speech recognition. Within that project, I will describe a speech enhancement system based on multichannel spatial clustering that reduced word error rates in subsequent automatic speech recognition by 16% (relative) on the CHiME3 task.

Michael Mandel

Michael I Mandel is an Assistant Professor of Computer and Information Science at Brooklyn College (CUNY) working at the intersection of machine learning, signal processing, and psychoacoustics. He earned his BSc in Computer Science from the Massachusetts Institute of Technology in 2004 and his MS and PhD with distinction in Electrical Engineering from Columbia University in 2006 and 2010 as a Fu Foundation School of Engineering and Applied Sciences Presidential Scholar. From 2009 to 2010 he was an FQRNT Postdoctoral Research Fellow in the Machine Learning laboratory at the Université de Montréal. From 2010 to 2012 he was an Algorithm Developer at Audience Inc, a company that has shipped over 350 million noise suppression chips for cell phones. From 2012 to 2015 he was a Research Scientist in Computer Science and Engineering at the Ohio State University, where his work was funded by the National Science Foundation and Google.


NMF? Neural Nets? It’s all the same...

Prof. Paris Smaragdis

UIUC

In this talk I’ll present some new ways to combine concepts from NMF and deep learning-based approaches to source separation. In the process I’ll show the development of a model that operates directly on waveform data in order to perform source separation, and how such an architecture naturally leads to extensions that can potentially result in more powerful approaches.

Paris Smaragdis

Paris is an assistant professor in the Computer Science and the Electrical and Computer Engineering departments of the University of Illinois at Urbana-Champaign, as well as a senior research scientist at Adobe Research. He completed his master's, PhD, and postdoctoral studies at MIT, performing research on computational audition. In 2006 he was selected by MIT’s Technology Review as one of the year’s top young technology innovators for his work on machine listening, and in 2015 he was elevated to an IEEE Fellow for contributions in audio source separation and audio processing.


Poster Session

  • "From raw audio to songbird communication networks: multi-bird machine listening"
    Dan Stowell (Queen Mary University of London)
    • Bird audio data is available in large volumes, giving us the opportunity to make new discoveries about bird sounds - if we can develop fully automatic audio analysis. We can classify bird species automatically - but what about detailed structure of songbird communication networks? We introduce methods that characterise multi-party interaction patterns, using a social network of zebra finches as a case study.
  • "Trans-dimensional Random Fields for Sequential Modeling"
    Zhijian Ou (Tsinghua University)
    • This paper presents the potential of applying random fields for sequential modeling, demonstrated by its success in language modeling. Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is dominant, representing the joint probability in terms of conditionals. Examples include n-gram LMs and neural network LMs. An alternative approach, called the random field (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential benefits, the empirical results of previous WSME models are not satisfactory. In this paper, we revisit the RF approach for language modeling, with a number of innovations. We propose a trans-dimensional RF (TDRF) model and develop a training algorithm using joint stochastic approximation and trans-dimensional mixture sampling. We perform speech recognition experiments on Wall Street Journal data, and find that our TDRF models perform as well as recurrent neural network LMs but are computationally more efficient in computing sentence probability (200x faster).
  • "Estimating Structurally Significant Notes in Music Audio through Music Theoretic Domain Knowledge"
    Johanna C. Devaney (The Ohio State University)
    • Chroma representations of audio have proved useful for the analysis and classification of audio signals, but they do not distinguish between structurally important and unimportant notes. This paper describes the application of music theoretic domain knowledge in the development of a chroma-based predictor of musical audio that identifies the harmonically significant parts of the signal. This is achieved by learning a mapping between standard chroma features and a binary pitch class vector identifying the harmonically significant notes. The goal of this mapping is to facilitate more robust music similarity and indexing. In order to demonstrate this, this paper describes a music similarity experiment with a new open-source dataset of annotated theme and variations piano pieces.
  • "Transient Restoration applied to Informed Source Separation"
    Christian Dittmar, Meinard Mueller (International Audio Laboratories Erlangen)
    • We present a method to improve the restoration of transient signal portions in source separation applications. Specifically, we aim to attenuate the so-called “pre-echos” that can degrade the perceptual quality of audio decompositions. Our method is based on a modification of the iterative LSEE-MSTFTM phase estimation algorithm originally proposed by Griffin and Lim in 1984. An additional step is added in the LSEE-MSTFTM updates that enforces the desired behavior in signal portions preceding transient sound events. The method requires side information on the audio signal, such as the precise location of transient events in time. We apply the method to the separation of solo drum recordings into their constituent drum sounds via score-informed Non-Negative Matrix Factor Deconvolution (NMFD). We provide audio examples for the creative re-use of the separated and restored drum sound events.
  • "Multiple Sound Source Location Estimation and Counting in a Wireless Acoustic Sensor Network"
    Anastasios Alexandridis and Athanasios Mouchtaris (Institute of Computer Science - Foundation for Research and Technology - Hellas (ICS-FORTH))
    • We consider the problem of counting the active sources and estimating their locations in a wireless acoustic sensor network, where each sensor consists of a microphone array. Our method is based on inferring a location estimate for each frequency of the captured signals. A clustering approach, similar to k-means but with the number of clusters also unknown, is then employed. The estimated number of clusters corresponds to the number of active sources, while their centroids correspond to the sources' locations. The efficiency of our proposed method is evaluated through simulations and real recordings in scenarios with up to three simultaneous sound sources for different signal-to-noise ratios and reverberation times.
  • "Horizontal-Acoustic Awareness to Moving Broadband Stimuli: A Compact Assessment Method"
    Dietmar M. Wohlbauer, Albert Rizzo, Stefan Scherer (University Hospital and University of Zürich, University of Southern California)
    • Reproduction and assessment of directional sound is still a highly discussed issue and lacks accessible and flexible low-complexity methods. A considerable demand for cost-efficient solutions in the research community as well as in the medical sector further motivates the present work. Hence, we introduce our approach for the generation and assessment of spatially realistic horizontal-acoustic broadband sound using interaural time and level differences manipulated by reverb, low-pass, and hearing level. In order to assess the performance of the proposed method, a survey is conducted in which binaural data of 19 normal-hearing (NH) subjects and 12 audio patterns is analyzed. The study reveals that the tested subjects are able to distinguish between horizontal-acoustic directions through the interaural axis by using this low-complexity, binaural-parameter-based algorithm for real-life broadband stimuli.
  • "Perceptual Linear Prediction Incorporating Hearing Loss Model"
    Haniyeh Salehi, Vijay Parsa (University of Western Ontario)
    • Objective measures of speech quality are highly desirable in benchmarking and monitoring the performance of hearing aids (HAs). Several speech quality assessment methods for normal hearing (NH) and hearing impaired (HI) applications have been developed, with methods incorporating auditory perception models exhibiting better predictive performance. Perceptual Linear Prediction (PLP) is one such model which incorporates human perceptual phenomena, such as non-uniform filterbank analysis and nonlinear mapping between sound intensity and its perceived loudness, into the linear prediction (LP) feature extraction process. For an LP model to be appropriately used in evaluating the speech signal perceived by an HI listener, it needs to account for the listener’s hearing loss effects on speech perception. In this work, a non-intrusive speech quality metric is proposed by modifying PLP to include HI auditory system effects (HIPLP). In HIPLP, changes to the filterbank analysis and intensity-loudness mapping are applied based on the hearing loss parameters, following the procedures described in the Hearing Aid Speech Quality Index (HASQI) (Kates & Arehart, 2010). The HIPLP coefficients and their statistics are subsequently used to derive the non-intrusive speech quality metric. The HIPLP approach is applied to a custom database of HA speech quality ratings acquired for recordings in different noisy and reverberant environments. Results show that HIPLP performs well in HA speech quality prediction.
  • "Model compression applied to small-footprint keyword spotting"
    Minhua Wu, George Tucker, Ming Sun, Gengshen Fu, Sankaran Panchapagesan, Shiv Vitaladevuni (Amazon)
    • There are several speech products that perform on-device keyword spotting to initiate user interaction. Accurate on-device keyword spotting within a tight CPU budget is crucial for such devices. Motivated by this, we investigated two ways to improve DNN acoustic models used for keyword spotting without increasing CPU usage. First, we used low-rank weight matrices throughout the DNN. This allowed us to increase the number of hidden nodes per layer without changing the total number of multiplications. Second, we used knowledge distilled from a much larger DNN used only during training. Combined, these techniques provide a significant reduction in false alarms and misses without changing CPU usage.
  • "Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Deep Neural Network"
    Andrew J.R. Simpson, Gerard Roma, Mark D. Plumbley (University of Surrey)
    • Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
  • "Bitwise Neural Networks for Source Separation"
    Minje Kim and Paris Smaragdis (UIUC)
    • Bitwise Neural Networks (BNN) are a novel, efficient kind of neural network in which the input, hidden, and output nodes are all binary (+1 and -1), and so are all the weights and biases. This kind of network is spatially and computationally efficient in implementation since (a) we represent a real-valued sample or parameter with a bit, and (b) multiplication and addition between real values correspond to bitwise XNOR and bit-counting, respectively. Therefore, a BNN can be used to deploy a deep learning system on resource-constrained devices without worrying too much about using up the power, memory, CPU clocks, etc. In this work we rely on a straightforward extension of backpropagation to learn those bitwise network parameters. As an application, we show that a denoising autoencoder can be constructed to perform source separation. To this end, we learn such a bitwise denoising autoencoder that takes a discretized noisy spectrum as the input and produces a discretized version of a denoised spectrum. We believe that this kind of approach can be used to enhance the applicability of a deep learning system in general by reducing the complexity of the system.
  • "Discriminating Sound Masses via Spectral Clustering of Sinusoidal Component Parameters"
    David Heise (Lincoln University)
    • This work explores the potential to discriminate sound masses via spectral clustering of sinusoidal parameters derived from an audio mixture. A sound mass is defined here as a portion of an audio mixture that may be perceived as a unit; examples could include a single note of a piano, a phoneme of speech, or a "roar" of audience applause. In this context, a sound mass may be somewhat more general than the notion of a sound event, which might imply detection of the individual sonic events comprising the perceived sound (e.g., individual claps from members of an audience). Our approach extends methods proposed by Martins [see Machine Audition, 2011, pp. 22-60]. We employ high-resolution analysis techniques to extract sinusoidal parameters from the mixture to generate a complete graph of time-frequency nodes, providing an intermediate representation of the original audio. Edge weights are determined by a measure combining amplitude, frequency, and harmonic similarity (as proposed by Martins), potentially with other Gestalt cues (such as those proposed by Bregman and research on perception of sound masses). Spectral clustering is then accomplished by utilizing normalized cuts of the graph, where the resulting sub-graphs (of a minimum size) constitute sound masses of interest.
  • "Room Impulse Response of a Directional Loudspeaker"
    Prasanga N. Samarasinghe, Thushara D. Abhayapala, Mark Poletti, and Terence Betlehem (Australian National University)
    • In this presentation, we wish to introduce a wave domain (spherical harmonics based) model for the Room Impulse Response (RIR) of a directional loudspeaker with higher order directivity. Traditionally, the RIR is defined as the room response to an impulse generated by an ideal point source (omnidirectional) observed at a particular listening point of interest. Even though the knowledge of RIR is a powerful tool in audio signal processing, its successful measurement is often hindered by the inability to realize perfect omnidirectional sources. Commercial loudspeakers are inherently directional and often display frequency-dependent directivity patterns. Here, we introduce a wave domain representation of the RIR for a directional loudspeaker, and discuss how it can be utilized to derive the point-to-point RIR from the room response measurements obtained through a directional loudspeaker.
  • "CHiME-Home: A Dataset for Sound Source Recognition in a Domestic Environment"
    Peter Foster, Siddharth Sigtia, Sacha Krstulovic, Jon Barker, Mark D. Plumbley (University of Surrey)
    • For the task of sound source recognition, we introduce a novel dataset based on 6.8 hours of domestic environment audio recordings. We describe our approach of obtaining annotations for the recordings. Further, we quantify agreement between obtained annotations. Finally, we report baseline results for sound source recognition using the obtained dataset. Our annotation approach associates each 4-second excerpt from the audio recordings with multiple labels, based on a set of 7 labels associated with sound sources in the acoustic environment. With the aid of 3 human annotators, we obtain 3 sets of multi-label annotations, for 4378 4-second audio excerpts. We evaluate agreement between annotators by computing Jaccard indices between sets of label assignments. Observing varying levels of agreement across labels, with a view to obtaining a representation of 'ground truth' in annotations, we refine our dataset to obtain a set of multi-label annotations for 1946 audio excerpts. For the set of 1946 annotated audio excerpts, we predict binary label assignments using Gaussian mixture models estimated on MFCCs. Evaluated using the area under receiver operating characteristic curves, across considered labels we observe performance scores in the range 0.76 to 0.98.
  • "Using Phonemic Inventory Features to Help Improve Dialect Recognition for PRLM Systems"
    Michelle Renee Morales and Andrew Rosenberg (CUNY Queens College)
    • Automatic language recognition is the process of using a computational model to determine the language of an utterance. After many formal evaluations, research suggests that one of the most successful approaches to this task involves using the phonotactic content of the speech signal. In these approaches, phone recognizers are used to tokenize speech into phone sequences, which are then modeled using statistical language models, i.e., phone recognition language modeling (PRLM). This work explores how to exploit the output of phone recognizers to help boost the performance of PRLM systems. The phonemic inventory of a language is its most prominent phonological linguistic feature. Even if we consider geographically close languages, language families, or dialects of the same language, we find great variation across phonemic inventories. For example, Mandarin Chinese has a relatively average-sized consonant inventory (22 +/- 3). However, the Wu dialect, spoken in southeast China, differs from Mandarin Chinese in preserving initial voiced stops (sounds formed with complete closure in the vocal tract), leading it to have a larger consonant inventory. Using telephone and broadcast speech, our system first performs phone recognition. Phone hypotheses are then used to derive features that are meant to represent the phonemic inventory of the language. Derived features include consonant inventory size, vowel inventory size, consonant-vowel ratio, consonant durations, vowel durations, functionals over durations, and an average confidence score. These features alone are then used to train a model to perform language recognition. When distinguishing between 5 dialects of Arabic, we are able to classify dialect correctly 80.8% of the time; between 4 dialects of Chinese, 84.2% of the time; and between 3 dialects of English, 95.5% of the time. Overall, when testing how well these features perform when used to predict languages within 6 different language family clusters, on average we are able to correctly recognize the target language 87.4% of the time. Exploration of our features suggests that the vowel inventory and vowel duration of a language are its most distinguishing phonological features. We find these initial results promising and suggest that the inclusion of these features could be valuable to a PRLM system. We are currently extending this work to test the validity of this claim.
  • "Framework for Learning Audio Representations Using Deep and Temporal Complex-Valued Networks"
    Andy Sarroff, Colin Raffel, Michael Casey, Dan Ellis (Dartmouth College, Columbia University)
    • Many audio machine learning tasks have been addressed using deep or temporal neural networks. The front-end data representation for such models is often the magnitude Fourier spectrum. The phase component of the signal is usually discarded (in the case of classification tasks), or stored for later use (in the case of regression and synthesis tasks). However, the phase component carries important information about the signal. This work presents a framework for building deep and temporal complex-valued networks that contain compositions of holomorphic and non-holomorphic functions with fully complex, split phase-magnitude, and split real-imaginary activation functions. The gradients of a real-valued cost function are back-propagated through the network to the weights by using the mathematical conveniences of Wirtinger calculus.
  • "Acoustic Monitoring of Baleen Whale Vocalization using Hydrophone Array"
    Yu Shiu (Cornell University)
    • Baleen whales, including humpback whales, bowhead whales and endangered North Atlantic right whales, are highly elusive. It is difficult for animal biologists to visually observe their behavior in the underwater marine habitat, where light degrades drastically over distance. Baleen whales communicate acoustically with one another, and the sound recordings made by hydrophones offer a window into their communication behavior, seasonal habitat and migration. In our poster, we will demonstrate the method and results of whale call detection and localization from the recorded audio signals. We will also show how audio signal processing can help the ecological and behavioral study of baleen whales in general, and how automatic algorithms can change the status-quo methods in this field and help deal with large-scale audio data.
  • "Method of Moments Learning for Left-to-Right Hidden Markov Model"
    Y. Cem Subakan, Johannes Traa, Paris Smaragdis, Daniel Hsu (UIUC)
    • We propose a method-of-moments algorithm for parameter learning in Left-to-Right Hidden Markov Models. Compared to the conventional Expectation Maximization approach, the proposed algorithm is computationally more efficient, and hence more appropriate for large datasets. It is also asymptotically guaranteed to estimate the correct parameters. We show the validity of our approach with a synthetic data experiment and a word utterance onset detection experiment.
  • "On the Formation of Phoneme Categories in DNN Acoustic Models"
    Tasha Nagamine and Nima Mesgarani (Columbia University)
    • Deep neural networks (DNNs) have become the dominant technique for acoustic modeling due to their markedly improved performance over other models. Despite this, little is understood about the computation they implement in creating phonemic categories from highly variable acoustic signals. We analyzed a DNN trained for phoneme recognition and characterized its representational properties, both at the single node and population level in each layer. We found strong selectivity to distinct phonetic features in all layers, with selectivity to these features becoming more explicit in deeper layers. We also show that the representation of speech in the population of nodes in subsequent hidden layers becomes increasingly nonlinear and warps the feature space non-uniformly to aid in the discrimination of acoustically similar phones. Finally, we found that individual nodes with similar phonetic feature selectivity were differentially activated to different exemplars of these features. Thus, each node becomes tuned to a particular acoustic manifestation of the same feature, providing an effective representational basis for the formation of invariant phonemic categories.
  • "Piano Music Transcription With Fast Convolutional Sparse Coding"
    Andrea Cogliati (University of Rochester)
    • Automatic music transcription (AMT) is the process of converting an acoustic musical signal into a symbolic musical representation, such as a MIDI file, which contains the pitches, the onsets and offsets of the notes and, possibly, their dynamics and sources (i.e., instruments). Most existing algorithms for AMT operate in the frequency domain, which introduces the well-known time/frequency resolution trade-off of the Short Time Fourier Transform and its variants. We propose a time-domain transcription algorithm based on an efficient convolutional sparse coding algorithm in an instrument-specific scenario, i.e., the dictionary is trained and tested on the same piano. The proposed method outperforms current state-of-the-art AMT methods trained in the same context, and drastically increases both time and frequency resolutions, especially for the lowest octaves of the piano keyboard.
  • "Compact Kernel Models for Acoustic Modeling via Random Feature Selection"
    Avner May, Michael Collins, Daniel Hsu, Brian Kingsbury (Columbia University)
    • A simple but effective method is proposed for learning compact random feature models that approximate non-linear kernel methods, in the context of acoustic modeling. The method is able to explore a large number of non-linear features while maintaining a compact model via feature selection more efficiently than existing approaches. For certain kernels, this random feature selection may be regarded as a means of non-linear feature selection at the level of the raw input features, which motivates additional methods for computational improvements. An empirical evaluation demonstrates the effectiveness of the proposed method relative to the natural baseline method for kernel approximation. In addition, we compare the performance of these compact kernel models with deep neural networks (DNNs). Interestingly, we see that our compact kernel models are able to match DNN performance in terms of the cross-entropy on the test set, but they do worse in terms of test Word Error Rate (WER). This leaves open an important question: how is it that these kernel methods match the DNNs on cross-entropy, but not on WER?
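
As background for the last abstract above, the general idea of approximating a non-linear kernel with random features can be sketched as follows. This is only an illustration of standard random Fourier features for a Gaussian (RBF) kernel, not the authors' method: the feature selection step described in the abstract is not shown, and the function names, the `gamma` parameter, and the choice of kernel are all assumptions made for this example.

```python
import math
import random


def make_random_features(dim, num_features, gamma, seed=0):
    """Draw parameters of random Fourier features for the kernel
    k(x, y) = exp(-gamma * ||x - y||^2). All names are illustrative."""
    rng = random.Random(seed)
    # The spectral density of this kernel is Gaussian with variance 2 * gamma.
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return weights, offsets


def feature_map(x, weights, offsets):
    """Map x to z(x) = sqrt(2 / D) * cos(W x + b), with D random features."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)]


def rbf_kernel(x, y, gamma):
    """Exact Gaussian kernel value, used here only for comparison."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def approx_kernel(x, y, weights, offsets):
    """Approximate k(x, y) by the inner product of the random feature maps."""
    zx = feature_map(x, weights, offsets)
    zy = feature_map(y, weights, offsets)
    return sum(a * b for a, b in zip(zx, zy))


if __name__ == "__main__":
    gamma = 0.5
    weights, offsets = make_random_features(dim=3, num_features=4000, gamma=gamma)
    x, y = [0.1, -0.2, 0.3], [0.4, 0.0, -0.1]
    print("exact :", rbf_kernel(x, y, gamma))
    print("approx:", approx_kernel(x, y, weights, offsets))
```

The approximation error shrinks as the number of random features grows, which is why compact models need some way of keeping only the most useful features. A linear model trained on `feature_map` outputs then stands in for the kernel machine.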