Program | Tutorials

September 2, 2018

Forenoon tutorials

(Parallel sessions; 9 AM to 12.30 PM; Coffee/tea break: 10.30 AM – 11.00 AM)


DeLiang Wang, The Ohio State University

Abstract: Speech separation is the task of separating target speech from background interference. In contrast to the traditional signal processing perspective, speech separation can be formulated as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. The recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance, including the first demonstration of substantial speech intelligibility improvements for hearing-impaired listeners in noisy environments. This tutorial provides a comprehensive overview of the research on deep learning-based supervised speech separation in the last several years. We systematically introduce the three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the tutorial will focus on separation algorithms, where we describe monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multi-talker separation), and speech dereverberation, as well as multi-microphone techniques. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
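The ideal ratio mask (IRM) is one common instance of the training targets mentioned above; a minimal NumPy sketch (function and variable names are ours, purely illustrative, not the tutorial's):

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: per time-frequency unit, the ratio of speech
    energy to total (speech + noise) energy, raised to a compression
    exponent beta (0.5 is a common choice)."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

# Toy example: a single time-frequency unit where speech and noise
# have equal magnitude, so the mask is (1/2) ** 0.5, about 0.707.
speech = np.array([[1.0]])
noise = np.array([[1.0]])
mask = ideal_ratio_mask(speech, noise)
```

A network trained toward such a target predicts the mask from noisy acoustic features; the mask is then applied to the noisy spectrogram before resynthesis.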

DeLiang Wang received the B.S. degree in 1983 and the M.S. degree in 1986 from Peking (Beijing) University, Beijing, China, and the Ph.D. degree in 1991 from the University of Southern California, Los Angeles, CA, all in computer science. Since 1991, he has been with the Department of Computer Science & Engineering and the Center for Cognitive and Brain Sciences at Ohio State University, Columbus, OH, where he is currently a Professor and University Distinguished Scholar. He also holds a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi’an, China. He has been a visiting scholar to Harvard University, Oticon A/S (Denmark), and Starkey Hearing Technologies. Wang’s research interests include machine perception and deep neural networks. Among his recognitions are the Office of Naval Research Young Investigator Award in 1996, the 2005 Outstanding Paper Award from IEEE Transactions on Neural Networks, and the 2008 Helmholtz Award from the International Neural Network Society. He serves as Co-Editor-in-Chief of Neural Networks, and on the editorial boards of several journals. He is an IEEE Fellow.


Rohit Prabhavalkar, Google Inc., USA and Tara Sainath, Google Inc., USA

Abstract: Traditional automatic speech recognition (ASR) systems comprise a set of separate components, namely an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM takes acoustic features as input and predicts a distribution over subword units, typically context-dependent phonemes. The PM, which is traditionally a hand-engineered lexicon, maps the sequence of subword units produced by the acoustic model to words. Finally, the LM assigns probabilities to various word hypotheses. In traditional ASR systems, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.
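The factorization described above is conventionally written via Bayes' rule (standard notation, not taken from the tutorial itself):

```latex
W^{*} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid X)
      \;=\; \operatorname*{arg\,max}_{W} \, P(X \mid W)\, P(W),
\qquad
P(X \mid W) \;\approx\; \max_{L \in \mathcal{L}(W)} P(X \mid L)\, P(L \mid W),
```

where L ranges over the pronunciations (subword sequences) the lexicon licenses for the word sequence W: the AM supplies P(X | L), the PM supplies P(L | W), and the LM supplies P(W).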

Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single system. Examples of such systems include attention-based models [1, 6], the recurrent neural transducer [2, 3], the recurrent neural aligner [4], and connectionist temporal classification with word targets [5]. A common feature of all of these models is that they are composed of a single neural network which, given input acoustic frames, directly outputs a probability distribution over grapheme or word hypotheses. In fact, as has been demonstrated in recent work, such end-to-end models can surpass the performance of conventional ASR systems [6].
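Connectionist temporal classification, mentioned above, maps framewise network outputs to a label sequence by merging repeated labels and removing blanks; a minimal sketch of that many-to-one mapping (a greedy best-path decode; the function name and toy path are our illustrative choices):

```python
def ctc_greedy_decode(frame_label_ids, blank=0):
    """Collapse a framewise best path under the standard CTC rule:
    merge consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for lab in frame_label_ids:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Path 1,1,-,1,2,2 (with 0 as the blank) collapses to [1, 1, 2]:
# the blank separates the two 1s, so both survive.
path = [1, 1, 0, 1, 2, 2]
decoded = ctc_greedy_decode(path)
```

Full CTC training sums over all paths that collapse to the reference sequence; the greedy decode above is only the simplest inference rule.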

In this tutorial, we will provide a detailed introduction to the topic of end-to-end modeling in the context of ASR. We will begin by charting out the historical development of these systems, while emphasizing the commonalities and the differences between the various end-to-end approaches that have been considered in the literature. We will then discuss a number of recently introduced innovations that have significantly improved the performance of end-to-end models, allowing these to surpass the performance of conventional ASR systems. The tutorial will then describe some of the exciting applications of this research, along with possible fruitful directions to explore.

Finally, the tutorial will discuss some of the shortcomings of existing end-to-end modeling approaches and discuss ongoing efforts to address these challenges.

[1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” in Proc. ICASSP, 2016.
[2] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICASSP, 2012.
[3] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer,” in Proc. ASRU, 2017.
[4] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence-to-sequence mapping,” in Proc. Interspeech, 2017.
[5] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017.
[6] C.C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski and M. Bacchiani, “State-of-the-art Speech Recognition With Sequence-to-Sequence Models,” in Proc. ICASSP, 2018.

Rohit Prabhavalkar received the B.E. degree in computer engineering from the University of Pune, India, in 2007 and the M.S. and Ph.D. degrees in computer science and engineering from The Ohio State University, USA, in 2012 and 2013, respectively. He is currently a research scientist at Google, having joined in 2013, where his research focuses on various aspects of acoustic modeling with the goal of improving automatic speech recognition technology. His other research interests include deep neural networks, natural language processing, and machine learning.


Tara Sainath received her PhD in Electrical Engineering and Computer Science from MIT in 2009. The main focus of her PhD work was acoustic modeling for noise robust speech recognition. After her PhD, she spent 5 years in the Speech and Language Algorithms group at IBM T.J. Watson Research Center, before joining Google Research. She served as a Program Chair for ICLR in 2017 and 2018, and has co-organized numerous special sessions and workshops, including at Interspeech 2010, ICML 2013, Interspeech 2016 and ICML 2017. In addition, she is a member of the IEEE Speech and Language Processing Technical Committee (SLTC) as well as an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. Her research interests are mainly in acoustic modeling, including deep neural networks, sparse representations and adaptation methods.


Haizhou Li (National University of Singapore, Singapore), Hemant A. Patil (Dhirubhai Ambani Institute of Information and Communication Technology, India), Nicholas Evans (EURECOM, France)

Abstract: Speech is the most natural means of communication between humans. Speech signals carry various levels of information, such as linguistic content, emotion, the acoustic environment, language, the speaker’s identity and their health condition. Automatic speaker recognition technologies aim to verify or identify a speaker using recordings of his/her voice. In practice, automatic speaker verification (ASV) systems should be robust to nuisance variation such as differences in the microphone and transmission channel, intersession variability, acoustic noise, speaker ageing, etc. Significant effort invested over the last three decades has been tremendously successful in developing technologies that compensate for such nuisance variation, thereby improving the reliability of ASV systems in a multitude of diverse application scenarios. In a number of these, specifically those relating to authentication, reliability can still be compromised by spoofing attacks, whereby fraudsters gain illegitimate access to protected resources or facilities by presenting specially crafted speech signals that reflect the characteristics of another, enrolled person’s voice. ASV systems should be resilient to such malicious spoofing attacks. This tutorial presents a treatment of the issues concerning the robustness and security of an ASV system in the face of spoofing attacks. We also discuss current research trends and progress in developing anti-spoofing countermeasures to protect against attacks derived from voice conversion, speech synthesis, replay, twins (a particularly malicious form of attack, also referred to as twins fraud in the biometrics literature) and professional mimics.
The tutorial will give an overview of the risk and technological challenges associated with each form of attack in addition to an overview of the two internationally competitive ASVspoof challenges held as special sessions at INTERSPEECH 2015 and INTERSPEECH 2017. The tutorial will conclude with a summary of the current state-of-the-art in the field and a discussion of future research directions.
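Both ASV systems and spoofing countermeasures are commonly evaluated with the equal error rate (EER), the headline metric of the ASVspoof challenges mentioned above; a minimal sketch of its computation from trial scores (the simple threshold sweep and all score values below are our illustrative simplifications):

```python
def equal_error_rate(genuine, impostor):
    """EER: the operating point where the false-rejection rate on
    genuine trials equals the false-acceptance rate on impostor
    trials.  Candidate thresholds are taken from the scores."""
    best_gap, best_eer = 1.0, None
    for thr in sorted(genuine + impostor):
        frr = sum(s < thr for s in genuine) / len(genuine)
        far = sum(s >= thr for s in impostor) / len(impostor)
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer

# Perfectly separated scores give an EER of 0.0.
eer = equal_error_rate(genuine=[0.9, 0.8, 0.7], impostor=[0.3, 0.2, 0.1])
```

Production evaluations interpolate between thresholds (e.g. via the ROC convex hull), but the crossing point of FRR and FAR is the same idea.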

Core Topics Covered

  • ASV design cycle, research issues and technological challenges
  • Different spoofing attacks: voice conversion, speech synthesis, replay and twins, professional mimicry
  • ASVspoof 2015 Challenge, INTERSPEECH 2015 – Objective, Database and Results
  • ASVspoof 2017 Challenge, INTERSPEECH 2017 – Objective, Database and Results
  • ASVspoof 2019 Roadmap
  • Technological Challenges in Replay Spoof Speech Detection (SSD)
  • Strategies to combine spoofing countermeasures with ASV

Haizhou Li is currently a Professor at the Department of Electrical and Computer Engineering, National University of Singapore. Professor Li’s research interests include speech information processing and natural language processing. He is currently the Editor-in-Chief of the IEEE/ACM Transactions on Audio, Speech and Language Processing (2015‐2018). Professor Li was the President of the International Speech Communication Association (2015‐2017) and the President of the Asia Pacific Signal and Information Processing Association (2015‐2016). Professor Li is a Fellow of the IEEE. He was a recipient of the President’s Technology Award 2013 in Singapore.


Hemant A. Patil is a Professor at DA-IICT Gandhinagar, India. Prof. Patil’s research interests include speaker recognition, spoofing attacks, TTS, and infant cry analysis. He developed the Speech Research Lab at DA-IICT, which is recognized as an ISCA speech lab. Dr. Patil is a member of the IEEE, the IEEE Signal Processing Society, the IEEE Circuits and Systems Society, the International Speech Communication Association (ISCA), and EURASIP, and an affiliate member of the IEEE SLTC. He visited the Department of ECE, University of Minnesota, Minneapolis, USA (May-July 2009) as a short-term scholar. Dr. Patil has taken a lead role in organizing several ISCA-supported events at DA-IICT, and has supervised 3 doctoral and 37 M.Tech. theses. Recently, he offered a joint tutorial with Prof. Haizhou Li at the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2017. He has been selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019, and has delivered APSIPA Distinguished Lectures at the University of Calgary, Canada; SJTU, Shanghai, China; KIIIT Gurgaon, India; and SGGSIE&T Nanded, India.


Nicholas Evans is a Professor at EURECOM, France, where he heads research in Speech and Audio Processing within the Biometrics and Digital Media Research Group of the Digital Security Department. His current research aims to develop new countermeasures to protect automatic speaker verification technology from the threat of spoofing. He is among the co-founders of the ASVspoof initiative and evaluation series (2015 & 2017) and was Lead Guest Editor for the IEEE Transactions on Information Forensics and Security special issue on Biometrics Spoofing and Countermeasures, Lead Guest Editor for the IEEE SPM special issue on Biometric Security and Privacy, and Guest Editor for the IEEE JSTSP special issue on Spoofing and Countermeasures for Automatic Speaker Verification. He delivered an invited talk on Spoofing and Countermeasures for Automatic Speaker Verification at the IEEE SPS Winter School on Security and Privacy Issues in Biometrics and contributed to a tutorial on Spoofing and Anti-Spoofing: A Shared View of Speaker Verification, Speech Synthesis and Voice Conversion at APSIPA ASC 2015. He is currently co-editing the 2nd edition of the Handbook of Biometric Anti-spoofing, to be published in 2018.


Vikram Ramanarayanan, Keelan Evanini, David Suendermann-Oeft (Educational Testing Service R&D, San Francisco, USA)

Abstract: This workshop will introduce participants to the basics of designing conversational applications in the educational domain using spoken and multimodal dialog technology. The increasing maturation of automated conversational technologies in recent years holds much promise for developing intelligent agents that can guide one or more phases of student instruction, learning, and assessment. In language learning applications, spoken dialogue systems (SDS) could be an effective solution for improving conversational skills, because an SDS provides a convenient means for people to both practice and obtain feedback on different aspects of their conversational skills in a new language. Such systems allow learners to make mistakes without feeling incompetent, empowering them to improve their skills for when they do speak with native speakers. From the assessment perspective, well-designed dialog agents have the potential to elicit and evaluate the full range of English speaking skills (such as turn-taking abilities, politeness strategies, and pragmatic competence) that are required for successful communication. Such technologies can potentially personalize education to each learner, providing a natural and practical learning interface that can adapt to their individual strengths and weaknesses in real time so as to increase the efficacy of instruction.

The workshop will assume no prior knowledge of dialog technology or intelligent tutoring systems and will demonstrate the use of open-source software tools in building conversational applications. The first part of the tutorial will cover the state of the art in dialog technologies for educational domain applications, with a particular focus on language learning and assessment. This will include an introduction to the various components of spoken dialog systems and how they can be applied to develop conversational applications in the educational domain, as well as some advanced topics such as methods for speech scoring. The final part of the tutorial will be specifically dedicated to a hands-on application-building session, where participants will have a chance to design and deploy their own dialog application from scratch on the HALEF cloud-based dialog platform using the open-source OpenVXML design toolkit, which will allow a better understanding of how such systems can be designed and built. Participants are required to bring their own laptops for this portion of the workshop (with Windows or Mac operating systems installed; Linux is not currently supported). Additional software installation instructions will be sent prior to the workshop.

Vikram Ramanarayanan is a Research Scientist at Educational Testing Service’s R&D division in San Francisco and also holds an Assistant Adjunct Professor appointment in the Department of Otolaryngology — Head and Neck Surgery at the University of California, San Francisco. Vikram’s research interests lie in applying scientific knowledge to interdisciplinary engineering problems in speech, language and vision, and in turn using engineering approaches to drive scientific understanding. He has authored over 70 papers in peer-reviewed journals and conference proceedings. His work on speech science and technology has won 2 Best Paper Awards, an Editor’s Choice Award and an ETS Presidential Award. He holds M.S. and Ph.D. degrees in Electrical Engineering from the University of Southern California, Los Angeles.


Keelan Evanini is a Research Director at Educational Testing Service in Princeton, NJ. His research interests include automated assessment of non-native spoken English, automated feedback in computer assisted language learning applications, and spoken dialog systems. He leads the research team that develops SpeechRater, the ETS capability for automated spoken language assessment. He also leads the team of research engineers that work on applying state-of-the-art natural language processing, speech processing, and machine learning technology to a wide range of research projects aimed at improving assessment and learning. He received his Ph.D. from the University of Pennsylvania in 2009, has published over 70 papers in peer-reviewed journals and conference proceedings, has been awarded 9 patents, and is a senior member of the IEEE.


David Suendermann-Oeft is director of the Dialog, Multimodal, and Speech (DIAMONDS) research center and manager of the San Francisco Office at Educational Testing Service (ETS). He is also the director, co-founder, and chief scientist of EMR.AI Inc., a company providing AI solutions to the medical sector, headquartered in San Francisco. Throughout his career, he has worked in the fields of spoken language processing, machine learning, and artificial intelligence at multiple industrial and academic institutions, including Siemens (Munich and Mexico City), RWTH (Aachen), UPC (Barcelona), USC (Los Angeles), Columbia University (New York), SpeechCycle (New York), Synchronoss (New York), and ICSI (Berkeley), and has also held appointments as tenured professor of computer science and department head at DHBW Stuttgart. David has authored over 140 publications and patents, including two books and ten book chapters, and serves as an evaluator, rapporteur, and innovation expert for the European Commission.

On-site registrants of T4 are advised to carry a laptop for the hands-on session.

September 2, 2018

Afternoon tutorials

(Parallel sessions; 2 PM to 5.30 PM; Coffee/tea break: 3.30 PM – 4 PM)


Naftali Tishby, Hebrew University of Jerusalem

Abstract: In this tutorial I will present a novel comprehensive theory of large-scale learning with Deep Neural Networks, based on the correspondence between Deep Learning and the Information Bottleneck framework. The new theory has the following components: (1) rethinking learning theory: I will prove a new generalization bound, the input-compression bound, which shows that compression of the representation of the input variable is far more important for good generalization than the dimension of the network hypothesis class, an ill-defined notion for deep learning. (2) I will prove that for large-scale Deep Neural Networks the mutual information between the last hidden layer and the input and output variables provides a complete characterization of the sample complexity and accuracy of the network. This establishes the Information Bottleneck bound for the problem as the optimal trade-off between sample complexity and accuracy with ANY learning algorithm. (3) I will show how Stochastic Gradient Descent, as used in Deep Learning, achieves this optimal bound. In that sense, Deep Learning is a method for solving the Information Bottleneck problem for large-scale supervised learning problems. The theory provides a new computational understanding of the benefit of the hidden layers and gives concrete predictions for the structure of the layers of Deep Neural Networks and their design principles. These turn out to depend solely on the joint distribution of the input and output and on the sample size.
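For reference, the Information Bottleneck objective the tutorial builds on can be stated compactly (standard notation, not taken verbatim from the abstract: T is a compressed representation of the input X that should remain predictive of the label Y):

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y),
```

where the multiplier β > 0 trades compression of the input, I(X;T), against preservation of relevant information, I(T;Y); in the tutorial's view, the hidden layers of a deep network are successive representations T moving along this trade-off curve.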

Based partly on joint works with Ravid Shwartz-Ziv, Noga Zaslavsky, and Amichai Painsky.

Naftali Tishby is a professor of Computer Science, and the incumbent of the Ruth and Stan Flinkman Chair for Brain Research at the Edmond and Lily Safra Center for Brain Science (ELSC) at the Hebrew University of Jerusalem. He is one of the leaders in machine learning research and computational neuroscience in Israel, and his numerous former students serve in key academic and industrial research positions all over the world. Tishby was the founding chair of the new computer-engineering program, and a director of the Leibniz Center for Research in Computer Science at Hebrew University. Tishby received his PhD in theoretical physics from Hebrew University in 1985 and was a research staff member at MIT and Bell Labs from 1985 to 1991. Tishby has been a visiting professor at Princeton NECI, the University of Pennsylvania, UCSB, and IBM Research.


Carol Espy-Wilson (University of Maryland, USA), Mark Tiede (Haskins Lab, USA), Hosung Nam (Korea University, Seoul), Vikramjit Mitra (Apple Inc., USA), Ganesh Sivaraman (Pindrop, Atlanta, USA)

Abstract: Articulatory representations have been studied and applied to various speech technologies for many years. Persistent challenges of articulatory research have been the paucity of reliable articulatory data and the difficulty of scaling to large, subject-independent applications. The aim of this tutorial is to present an overview of research in articulatory representations for large-scale applications. This tutorial will discuss best practices in articulatory data collection and synthesis, present recent developments in subject-independent acoustic-to-articulatory speech inversion, and describe state-of-the-art Convolutional Neural Network (CNN) architectures for large vocabulary continuous speech recognition incorporating both articulatory and acoustic features. The tutorial brings together experts from scientific and engineering backgrounds to present a concise overview of articulatory features, their measurement, synthesis and application to state-of-the-art Automatic Speech Recognition.

Carol Espy-Wilson, the lead organizer, is a Professor in the Department of Electrical and Computer Engineering and the Institute for Systems Research at the University of Maryland, College Park. She has more than three decades of experience leading research endeavors in fundamental speech acoustics, vocal tract modeling, speech and speaker recognition, speech segregation, speech enhancement, and, more recently, speech inversion. One of the tools coming out of Dr. Espy-Wilson’s lab is VTAR, a vocal tract modeling tool that many scientists and engineers have downloaded for research and teaching. Dr. Espy-Wilson has advised a number of PhD and MS students and published many papers in reputed academic conferences and journals.


Mark Tiede is a Senior Scientist at Haskins Laboratories active in speech production research, particularly the study of dyadic interaction. He has more than 20 years’ experience in the use of point-source tracking methods such as electromagnetic articulography (EMA) and is the author of mview, a widely used tool for analyzing such data.


Hosung Nam is an Assistant Professor in the Department of English Language and Literature at Korea University, Seoul, South Korea. He received the M.S. and Ph.D. degrees from the Department of Linguistics at Yale University, New Haven, CT, in 2007. He is a linguist who is an expert in the field of articulatory phonology. His research emphasis is on the link between speech perception and production, speech error, automatic speech recognition, sign language, phonological development, and their computational modeling. He has been a Research Scientist at Haskins Laboratories, New Haven, since 2007.


Vikramjit Mitra is a Research Scientist at Apple Inc. He received his Ph.D. in Electrical Engineering from the University of Maryland, College Park; his M.S. in Electrical Engineering from the University of Denver; and his B.E. in Electrical Engineering from Jadavpur University, India. His research focuses on signal processing for noise/channel/reverberation, speech recognition, production/perception-motivated signal processing, information retrieval, machine learning and speech analytics. He is a senior member of the IEEE, an affiliate member of the SLTC, and has served on NSF panels.


Ganesh Sivaraman is a Research Scientist at Pindrop. He received his M.S. and Ph.D. in Electrical Engineering from the University of Maryland, College Park, and his B.E. in Electrical Engineering from the Birla Institute of Technology and Science, Pilani, India. His research focuses on speaker-independent acoustic-to-articulatory inversion, speaker adaptation, speech enhancement and robust speech recognition.

Bhiksha Raj (Language Technologies Institute, Carnegie-Mellon University, USA) and Joseph Keshet (Bar-Ilan University, Israel)

Abstract: As neural network classifiers become increasingly successful at various tasks ranging from speech recognition and image classification to various natural language processing tasks and even recognizing malware, a second, somewhat disturbing discovery has also been made. It is possible to *fool* these systems with carefully crafted inputs that appear to the lay observer to be natural data, but cause the neural network to misclassify in random or even targeted ways.

In this tutorial we will discuss the problem of designing, identifying, and avoiding attacks by such crafted “adversarial” inputs. In the first part, we will explain how the basic training algorithms for neural networks may be turned around to learn adversarial examples, and explain why such learning is nearly always possible. Subsequently, we will explain several approaches to producing adversarial examples to fool systems such as image classifiers, speech recognition and speaker verification systems, and malware detection systems. We will describe both “glass box” approaches, where one has access to the internals of the classifier, and “black box” approaches where one does not. We will subsequently move on to discuss current approaches to *identifying* such adversarial examples when they are presented to a classifier. Finally, we will discuss recent work on introducing “backdoors” into systems through poisoned training examples, such that the system can be triggered into false behaviors when provided specific types of inputs, but not otherwise.
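As one concrete example of the glass-box constructions described above, the fast gradient sign method perturbs each input coordinate by ±ε in the direction that increases the classifier's loss; a minimal NumPy sketch on a toy linear score (the model, gradient, and numbers are ours, purely illustrative, not from the tutorial):

```python
import numpy as np

def fgsm_perturb(x, grad_loss_wrt_x, epsilon):
    """Fast gradient sign method: move every input coordinate by
    +/- epsilon in the sign direction of the loss gradient."""
    return x + epsilon * np.sign(grad_loss_wrt_x)

# Toy linear score s = w . x.  To push the score down we take the
# loss to be -s, whose gradient w.r.t. x is -w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 0.4])
x_adv = fgsm_perturb(x, grad_loss_wrt_x=-w, epsilon=0.1)
clean_score = float(w @ x)    # 0.2
adv_score = float(w @ x_adv)  # lower than clean_score
```

The perturbation is tiny per coordinate (here 0.1), yet it shifts the score substantially because every coordinate moves in the adversarial direction at once; deep networks are vulnerable to the same construction with the gradient obtained by backpropagation.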

Bhiksha Raj is a professor in the School of Computer Science at Carnegie Mellon University. His areas of interest are automatic speech recognition, audio processing, machine learning, and privacy. Dr. Raj is a fellow of the IEEE.


Joseph Keshet is an assistant professor in the Dept. of Computer Science at Bar-Ilan University. His areas of interest are both machine learning and computational study of human speech and language. In machine learning his research has been focused on deep learning and structured prediction, while his research on speech and language has been focused on speech processing, speech recognition, acoustic phonetics, and pathological speech.


Preeti Rao (Indian Institute of Technology Bombay, Mumbai) and Hema Murthy (Indian Institute of Technology Madras, Chennai)

Abstract: The singing voice is the most flexible of musical instruments and, unsurprisingly, vocal performance has dominated the music of many cultures. Information extraction from the singing voice requires dealing with the structure and semantics of music, which are quite distinct from those of spoken language. Like speech processing, music information retrieval (MIR) is an interdisciplinary field where signal processing and computing experts benefit from interaction with musicologists and psychologists working in music cognition. Extracting musically useful information from vocal music recordings benefits real-world applications such as music recommendation, search and navigation, musicology studies and evaluation of singing skill. The relevant signal-level tasks include vocal activity detection and separation in polyphonic music, song segmentation, extracting voice features related to melody, rhythm and timbre, and establishing models for perceived similarity across possibly culture-specific musical dimensions. In this tutorial, we will present the audio signal processing and machine learning methods that underlie a variety of MIR applications, with examples drawn from Western popular and Indian classical genres. The goal of the tutorial is to show speech researchers how they can contribute fruitfully to MIR problems that are considered topical and rewarding.
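Among the signal-level tasks listed above, melody-related voice features start from frame-level pitch estimates; a minimal autocorrelation pitch sketch in NumPy (the sampling rate, search range, and test tone below are our illustrative choices):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=80.0, fmax=500.0):
    """Estimate F0 as the autocorrelation peak within a plausible
    lag range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    # Keep non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 8000
t = np.arange(int(0.04 * sr)) / sr        # one 40 ms frame
frame = np.sin(2 * np.pi * 220.0 * t)     # a 220 Hz tone
f0 = autocorr_pitch(frame, sr)            # close to 220 Hz
```

Real melody extractors add voicing decisions, octave-error correction and temporal smoothing on top of such frame-level estimates, but the lag-domain periodicity idea is the same.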

Preeti Rao has been on the faculty of the Department of Electrical Engineering, I.I.T. Bombay since 1999. She currently serves as H.A.L. R&D Chair Professor. She received the B.Tech. EE degree from I.I.T. Bombay in 1984, and the Ph.D. from the University of Florida, Gainesville in 1990. After post-doctoral stints at the University of Illinois at Urbana-Champaign and Hitachi Research Labs in Tokyo, she joined the EE department at I.I.T. Kanpur in 1994, before moving to I.I.T. Bombay in 1999. Her research interests lie in speech and audio signal processing. She previously worked on low bit rate speech compression algorithms at sub-1 kbps rates. More recently, she has been involved in computer-aided spoken language training, prosody modeling for Indian languages and multichannel speech enhancement. She took up research in music computing about a decade ago, and this effort received a big boost through engagement with the ERC-funded CompMusic project led by UPF Barcelona in 2011-2016, towards culture-specific music information extraction with a focus on Indian classical genres. Her team of students also participated successfully in melody extraction and vocal separation challenges hosted by MIREX (MIR Evaluations) during this period. She is currently on the Editorial Board of the Journal of New Music Research (JNMR). She has been actively involved in the development of systems for Indian music and spoken language learning applications. She co-founded SensiBol Audio Technologies, a start-up focusing on music learning applications, with her Ph.D. and Masters students in 2011.


Hema Murthy is a Professor in the Department of Computer Science and Engineering, I.I.T Madras. She received the B.E. (Electronics and Communication Engineering) from Osmania University (1980), the M.Eng. (Electrical and Computer Engineering) from McMaster University in 1986, and the Ph.D. (Computer Science and Engineering) from I.I.T Madras in 1992. She worked as a Scientific Officer at the Tata Institute of Fundamental Research from 1980-83, working on computer graphics. She joined the Department of Computer Science and Engineering as a lecturer in 1988. Her primary line of work is in “signal processing directed machine learning,” where the objective is to understand a given domain and apply appropriate signal processing techniques to enable faster convergence of machine learning algorithms. She has worked in various domains including education, text, networks, brain signals, speech and music. Her major focus currently is on speech, music and brain signals. Her involvement in the ERC-funded CompMusic project, where she was responsible for the analysis of Carnatic music, gave an impetus to the effort on Carnatic music. The shifting tonic and significant extempore improvisation in Indian music were some of the genre-specific challenges that were addressed. She also led a consortium effort on text-to-speech synthesis, with the objective of building text-to-speech synthesis for 13 Indian languages. This technology has been transferred to a number of companies.


Petros Maragos (School of E.C.E., National Technical University of Athens, Athens 15773, Greece), Athanasia Zlatintsi (Athena Research Center, Robot Perception and Interaction Unit, Greece)


Abstract: The goal of this tutorial is to provide a concise overview of ideas, methods, and research results in multimodal speech and audio processing, spatio-temporal sensory processing, perception and fusion, with applications in Human-Robot Interaction. Most data nowadays are multimodal, so there is an emerging need to develop multimodal methodologies that also take the visual modality into account to enhance and assist the audio/speech modality. The tutorial will present state-of-the-art work in the major application area, Human-Robot Interaction, for social, edutainment, and healthcare applications, including among others audio-gestural recognition for natural communication with a robotic agent and audio-visual speech synthesis for assistance and for maximizing the naturalness of the interaction. Established results and recent advances from our research in various EU projects in the above areas, as well as on distant-speech interaction for robust home applications, will also be discussed. Additionally, the tutorial will present a secondary application area that also relies on audio-visual processing, in this case covering methodologies for saliency detection and automatic summarization of mono-modal or multimodal data (i.e., audio or video), and for the development of virtual interactive environments where human body motion or hand gestures are used for audio-gestural music synthesis.

Petros Maragos received the M.Eng. Diploma in E.E. from the National Technical University of Athens (NTUA) in 1980 and the M.Sc. and Ph.D. degrees from Georgia Tech, Atlanta, in 1982 and 1985. In 1985, he joined the faculty of the Division of Applied Sciences at Harvard University, Cambridge, MA, where he worked for eight years as a professor of electrical engineering, affiliated with the Harvard Robotics Lab. In 1993, he joined the faculty of the School of ECE at Georgia Tech, affiliated with its Center for Signal and Image Processing. During 1996-98 he also held a joint appointment as director of research at the Institute of Language and Speech Processing in Athens. Since 1999, he has been a professor at the NTUA School of ECE, where he is currently the director of the Intelligent Robotics and Automation Lab. He has held visiting positions at MIT in 2012 and at UPenn in 2016. His research and teaching interests include signal processing, systems theory, machine learning, image processing and computer vision, audio-speech and language processing, and robotics. In these areas he has published numerous papers and book chapters, and has co-edited three Springer research books, one on multimodal processing and two on shape analysis. He has served as a member of the IEEE SPS technical committees on DSP, IMDSP, and MMSP; as Associate Editor for the IEEE Transactions on ASSP and the Transactions on PAMI, as well as editorial board member and guest editor for several journals on signal processing, image analysis, and vision; and as co-organizer of several conferences and workshops, including ECCV 2010 (Program Chair), the IROS 2015 Workshop on Cognitive Mobility Assistance Robots, and EUSIPCO 2017 (General Chair). He has also served as a member of the Greek National Council for Research and Technology.
He is the recipient or co-recipient of several awards for his academic work, including a 1987-1992 US NSF Presidential Young Investigator Award; the 1988 IEEE ASSP Young Author Best Paper Award; the 1994 IEEE SPS Senior Best Paper Award; the 1995 IEEE W.R.G. Baker Prize for the most outstanding original paper; the 1996 Pattern Recognition Society’s Honorable Mention Best Paper Award; and the Best Paper Award from the CVPR 2011 Workshop on Gesture Recognition. In 1995, he was elected an IEEE Fellow for his research contributions. He received the 2007 EURASIP Technical Achievement Award for contributions to nonlinear signal processing, systems theory, and image and speech processing, and in 2010 he was elected a Fellow of EURASIP for his research contributions. He has been elected an IEEE SPS Distinguished Lecturer for 2017-2018.


Athanasia Zlatintsi is a Senior Researcher at the School of Electrical and Computer Engineering, National Technical University of Athens (NTUA), Greece. She received the Ph.D. degree in Electrical and Computer Engineering from NTUA in 2013 and the M.Sc. in Media Engineering from the Royal Institute of Technology (KTH), Stockholm, Sweden, in 2006. Since 2007 she has been a researcher at the Computer Vision, Speech Communication & Signal Processing (CVSP) Group at NTUA, participating in projects funded by the European Commission and the Greek Ministry of Education in the areas of music and audio signal processing, specifically the analysis of music signals using computational methods aimed at their automatic recognition. Her research interests include signal processing and analysis of musical signals; analysis and processing of monomodal and multimodal signals to extract robust representations for the detection of perceptually salient events; speech and multimodal processing for human-computer interaction; and machine learning. Her research contributions include the development of efficient algorithms (including nonlinear methods) for analyzing the structure and characteristics of musical signals, the detection of saliency in audio signals, the creation of multimodal, audio, and music summaries, and the development of multimodal human-computer interaction systems.