Research Interest
My research is at the nexus of the science and engineering of human communication and information processing. Specific topics include:
- Healthcare AI: Medical ASR, NLP, Summarization, and Search
- Robust speech processing, Automatic Speech Recognition (ASR), Biometrics
- Modeling and measurement of speech production with applications to recognition and clinical assessment of speech
- Computational paralinguistics: Modeling, detection and tracking of emotion, gender, health, etc.
- Machine learning for speech processing: Winners of Interspeech sub-challenges
- Multimodal signal processing, image processing and imaging applications
PhD Research Topics
Computational Paralinguistics for Speech Signals
Articulatory Variability, Emotion, and Linguistic Criticality
This study explores one aspect of the articulatory mechanism underlying emotional speech production: the behavior of linguistically critical and non-critical articulators in the encoding of emotional information. The hypothesis is that the larger kinematic variability of non-critical articulators reveals the underlying emotional expression goals more explicitly than that of the critical articulators, which are strictly controlled in service of linguistic goals and therefore exhibit smaller kinematic variability. We found that, overall, both between- and within-emotion variability in articulatory position is larger for non-critical cases than for critical cases. Simulation experiments suggest that the emotion-dependent postural variation of the non-critical articulators is significantly associated with the control of the critical articulators.
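The variability comparison above can be sketched with a toy decomposition of articulator position into within-emotion variance (spread around each emotion's mean posture) and between-emotion variance (spread of the per-emotion means). Everything below is synthetic and purely illustrative; `sample_positions` and the spread values are my assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D positions (mm) of one articulator across 3 emotions.
# A non-critical articulator is modeled with larger spread than a
# critical one, mirroring the hypothesis in the text.
def sample_positions(spread, n=200):
    # emotion-specific mean postures plus kinematic noise
    means = np.array([0.0, 1.0 * spread, -1.0 * spread])
    return [m + rng.normal(0.0, spread, n) for m in means]

def within_emotion_var(groups):
    # average variance inside each emotion category
    return float(np.mean([g.var() for g in groups]))

def between_emotion_var(groups):
    # variance of the per-emotion mean postures
    return float(np.var([g.mean() for g in groups]))

critical = sample_positions(spread=0.2)      # tightly controlled
non_critical = sample_positions(spread=1.0)  # loosely controlled

print(within_emotion_var(non_critical) > within_emotion_var(critical))    # True
print(between_emotion_var(non_critical) > between_emotion_var(critical))  # True
```

On this toy data both variance measures come out larger for the non-critical articulator, which is the qualitative pattern the study reports.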
Computational Modeling of Emotional Speech Production (ISCA grant, 2012)
Despite the large variability of articulatory movements at the execution level, the Converter/Distributor (C/D) model provides a systematic and comprehensive framework for the prosodic organization of speech production, based on the invariant properties of articulatory movements captured by the concept of the “iceberg” region. This study examines these invariant properties in emotional speech in order to understand emotion-dependent variation patterns of key parameters in the C/D model framework. It also explores the emotion-dependent relationships between the abstract-level C/D model parameters and the surface-level parameters of the invariant articulatory behaviors. The ultimate goal is a computational model of emotional speech production spanning speech planning, execution, articulatory movements, and their acoustic consequences (the speech waveform).
Vocal Tract Shaping of Emotional Speech
This study analyzes midsagittal vocal tract shaping as a function of emotion. The vocal tract parameters (midsagittal distances and vocal tract length) were computed using image segmentation software that I developed. The principal feature analysis technique is applied to the grid-line system in order to find the major movement locations. Results reveal both speaker-dependent and speaker-independent variation patterns. For example, sad speech, a low-arousal emotion, tends to show a smaller front-cavity opening for low vowels than the high-arousal emotions, and does so more consistently than in other regions of the vocal tract. Happiness shows a significantly shorter vocal tract length than anger and sadness for most speakers. Further details of speaker-dependent and speaker-independent variation in emotional speech articulation and their implications are described.
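The core idea behind principal feature analysis, selecting the original measurement channels (here, grid-line distances) that carry the major movement patterns, can be sketched with plain PCA machinery: rank features by the energy of their loadings on the top principal components. This is a simplified stand-in for the full technique, and the data matrix and mixing pattern below are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical matrix: rows = frames, cols = midsagittal grid-line distances.
n_frames, n_gridlines = 500, 12
latent = rng.normal(size=(n_frames, 2))   # two major movement patterns
mixing = np.zeros((2, n_gridlines))
mixing[0, 2] = 1.0   # grid line 2 driven by pattern 0 (e.g., front cavity)
mixing[1, 9] = 1.0   # grid line 9 driven by pattern 1
X = latent @ mixing + 0.05 * rng.normal(size=(n_frames, n_gridlines))

def principal_feature_indices(X, k=2):
    # PCA via SVD on the centered data, then rank the original features
    # by the energy of their loadings on the top-k components
    # (a simplified sketch of principal feature analysis).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    energy = (Vt[:k] ** 2).sum(axis=0)    # per-feature loading energy
    top = np.argsort(energy)[::-1][:k]
    return sorted(int(i) for i in top)

print(principal_feature_indices(X))  # [2, 9]
```

The two grid lines that drive the synthetic movement patterns are recovered as the "major movement locations."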
Speech Disorders, Pathological Speech
1st place, pathological speech challenge, Interspeech 2012
2nd place, Parkinson's condition challenge, Interspeech 2015
Pathological speech usually refers to speech distortion resulting from atypicalities in the voice and/or the articulatory mechanisms owing to disease, illness, or other physical or biological insult to the production system. This study aims at automated analytics of pathological speech. My previous studies include (i) automatic classification of the speech intelligibility of pathological speech and (ii) automatic assessment of the degree of Parkinson's severity based on speech audio.
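The intelligibility classification task can be illustrated with a minimal sketch: summarize each utterance as a small feature vector and assign it to the nearest class centroid. The feature names, class labels, and nearest-centroid rule here are hypothetical stand-ins; the actual challenge systems used far richer acoustic feature sets and classifiers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-utterance features (e.g., jitter, shimmer, speaking
# rate), drawn around a class-specific mean for illustration only.
def make_class(mean, n=100):
    return rng.normal(mean, 0.3, size=(n, 3))

train = {"low": make_class([0.0, 0.0, 0.0]),     # low intelligibility
         "high": make_class([1.0, 1.0, 1.0])}    # high intelligibility

centroids = {label: feats.mean(axis=0) for label, feats in train.items()}

def classify(x):
    # nearest-centroid decision rule in feature space
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([0.9, 1.1, 1.0])))  # prints: high
```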
Speaker Verification using Acoustic-Articulatory Information
This study proposes a practical feature-level and score-level fusion approach that combines acoustic and estimated articulatory information for both text-independent and text-dependent speaker verification. It found that concatenating (measured) articulatory features obtained from speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves performance dramatically. For real-world applications, our system uses articulatory features estimated through acoustic-to-articulatory inversion technology. Speaker verification accuracy is significantly improved with the estimated articulatory features. Our methods were evaluated on the X-ray Microbeam database and the RSR2015 database, achieving a 15% relative reduction in equal error rate.
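Feature-level fusion here is simply frame-wise concatenation of the two time-aligned streams, after which any standard verification back-end applies. The sketch below shows the concatenation plus a cosine-similarity score between utterance-level means; the dimensionalities and random "features" are assumptions for illustration, not the system's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

def fuse(mfcc, artic):
    # Feature-level fusion: frame-wise concatenation of the two streams.
    # mfcc: (n_frames, 13), artic: (n_frames, d); must be time-aligned.
    assert mfcc.shape[0] == artic.shape[0]
    return np.hstack([mfcc, artic])

def cosine_score(enroll, test):
    # simple cosine similarity between utterance-level mean vectors
    a, b = enroll.mean(axis=0), test.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mfcc = rng.normal(size=(300, 13))   # stand-in acoustic frames
artic = rng.normal(size=(300, 6))   # stand-in articulatory trajectories
fused = fuse(mfcc, artic)
print(fused.shape)  # (300, 19)
```

Score-level fusion, by contrast, would combine the scores of separate acoustic-only and articulatory-only systems, e.g. as a weighted sum.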
Co-Registration of EMA and rtMRI Datasets
(Demo webpage), (Open source MATLAB Toolbox)
This study develops a spatio-temporal registration algorithm for speech articulation data obtained from electromagnetic articulography (EMA) and real-time magnetic resonance imaging (rtMRI), motivated by the potential to combine the complementary advantages of the two data types. The registration method is validated on EMA and rtMRI datasets acquired at different times but with the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete midsagittal view (from rtMRI). The co-registration also yields optimal placement of the EMA sensors as articulatory landmarks on the magnetic resonance images, providing richer spatio-temporal information about articulatory dynamics.
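The spatial half of such a co-registration can be sketched as a Procrustes fit: find the scale, rotation, and translation that best map EMA sensor coordinates into the MRI image frame from a few corresponding landmarks. This is a minimal illustrative sketch, not the toolbox's algorithm (which also aligns the data in time); the landmark coordinates below are synthetic.

```python
import numpy as np

def procrustes_align(src, dst):
    """Similarity-transform (scale + rotation + translation) alignment
    of 2-D landmark sets via orthogonal Procrustes; returns a function
    mapping points from the src frame into the dst frame."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(S.T @ D)
    R = U @ Vt                      # optimal rotation
    if np.linalg.det(R) < 0:        # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    scale = sig.sum() / (S ** 2).sum()
    return lambda pts: scale * (pts - mu_s) @ R + mu_d

# hypothetical sensor coordinates in EMA space and their MRI-space targets
rng = np.random.default_rng(4)
ema = rng.normal(size=(5, 2))
theta = 0.3
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
mri = 2.0 * ema @ Rot + np.array([10.0, -3.0])   # known ground truth

align = procrustes_align(ema, mri)
print(np.allclose(align(ema), mri))  # True
```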
Automatic and Robust Parameterization of rtMRI Data
Northern Digital Inc. Excellence Awards
(Demo webpage), (Open source MATLAB Toolbox)
This study develops an algorithm for robust parameterization of the vocal tract in the midsagittal plane. The algorithm performs image quality enhancement, including pixel sensitivity correction and grainy noise reduction, followed by robust estimation of the airway path between the vocal tract walls. The airway path, together with the locations of the lips and the top of the larynx, is found using the Viterbi algorithm. The tissue-airway boundary is then found for each grid line by searching from the estimated airway path for the closest pixel whose intensity exceeds a threshold. The accuracy of the tissue boundary segmentation was evaluated in terms of root-mean-squared error as well as error statistics (mean and standard deviation) for specific regions of the vocal tract. Results suggest that the proposed algorithm yields significantly less estimation error than the state-of-the-art method, especially for the front cavity and the lower boundary.
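The Viterbi step can be illustrated on a toy intensity map: treat pixel positions along each grid line as states, low intensity (dark airway) as low emission cost, and penalize large jumps between adjacent grid lines so the decoded path is smooth. The cost map, penalty weight, and grid sizes below are illustrative assumptions, not the published algorithm's settings.

```python
import numpy as np

def viterbi_airway(cost, jump_penalty=1.0):
    """Find a smooth low-intensity path across grid lines by Viterbi
    decoding. cost[t, s] is the pixel intensity at position s on grid
    line t (airway = dark = low cost); the transition term penalizes
    large position jumps between consecutive grid lines."""
    T, S = cost.shape
    dp = np.zeros((T, S))            # best accumulated cost per state
    back = np.zeros((T, S), dtype=int)
    dp[0] = cost[0]
    pos = np.arange(S)
    for t in range(1, T):
        trans = jump_penalty * np.abs(pos[:, None] - pos[None, :])
        total = dp[t - 1][:, None] + trans   # from-state x to-state
        back[t] = total.argmin(axis=0)
        dp[t] = cost[t] + total.min(axis=0)
    path = [int(dp[-1].argmin())]            # backtrack best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy intensity map: a dark (low-cost) channel drifting across 5 grid lines
cost = np.full((5, 7), 10.0)
for t, s in enumerate([1, 2, 2, 3, 4]):
    cost[t, s] = 0.0

print(viterbi_airway(cost))  # [1, 2, 2, 3, 4]
```

The decoded path follows the dark channel; on real images the same machinery trades off darkness against smoothness instead of tracking exact zeros.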
Rich Acoustic-to-Articulatory Inversion
This study proposes an acoustic-to-articulatory inversion framework that estimates a variety of vocal tract information. The list of articulatory parameters to be estimated includes, but is not limited to, anatomical points of the speech articulators, vocal tract shape, laryngeal elevation, lip protrusion, and frame-level vocal tract length. The co-registration and automatic rtMRI parameterization technologies are employed in this framework.
The co-registered data contain (i) clean speech audio (from the EMA dataset), (ii) 3D tracking of a handful of anatomical landmarks on the vocal tract (from EMA), and (iii) a complete view of the upper airway (from rtMRI). The advantage of learning the inversion mapping on the co-registered data is twofold: (i) the inversion model can predict various kinds of articulatory information, and (ii) the model is more useful for real applications, because clean speech audio (from EMA) can be used directly as the input signal (speech audio in the rtMRI dataset suffers from scanning noise or artifacts from noise cancellation).
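At its core, learning an inversion mapping means regressing articulatory parameters on acoustic frames from paired data. The sketch below uses closed-form ridge regression on synthetic data as a minimal stand-in; real inversion systems use far richer models and features, and every quantity here (dimensions, noise level, `W_true`) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic paired data: acoustic frames (e.g., MFCC-like vectors) and
# articulatory parameters related by an unknown linear map plus noise.
n, d_ac, d_art = 400, 13, 6
W_true = rng.normal(size=(d_ac, d_art))
acoustic = rng.normal(size=(n, d_ac))
artic = acoustic @ W_true + 0.01 * rng.normal(size=(n, d_art))

def fit_ridge(X, Y, lam=1e-3):
    # closed-form ridge regression: W = (X^T X + lam I)^{-1} X^T Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

W = fit_ridge(acoustic, artic)
pred = acoustic @ W                  # estimated articulatory parameters
rmse = float(np.sqrt(np.mean((pred - artic) ** 2)))
print(rmse < 0.05)  # True: near-perfect recovery on this synthetic data
```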
Speech Production Data Acquisition and Processing
The Speech Production and Articulation kNowledge (SPAN) group at USC has pioneered the use of real-time magnetic resonance imaging (rtMRI) to study speech production. The group also measures articulatory movements using electromagnetic articulography (EMA). I have been collecting and processing speech audio for the rtMRI data and assisting with the EMA data collection. I have contributed to the collection, processing, and organization of two comprehensive speech production datasets: USC-EMO-MRI and USC-TIMIT.