Published in: Journal of NeuroEngineering and Rehabilitation 1/2008

Open Access 01-12-2008 | Methodology

Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection

Authors: Patricia Besson, Murat Kunt


Abstract

Background

Speaker detection is an important component of many human-computer interaction applications, such as multimedia indexing or ambient intelligence systems. This work addresses the problem of detecting the current speaker in audio-visual sequences. The detector needs only minimal equipment: a single camera and a single microphone suffice.

Method

A multimodal pattern recognition framework is proposed, with solutions provided for each step of the process: feature generation and extraction, classification, and evaluation of the system performance. The decision is based on an estimate of the synchrony between the audio and video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework that associates confidence levels with the classifier outputs, thereby allowing the performance of the whole multimodal pattern recognition system to be evaluated.
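
As a rough illustration of a synchrony-based decision, the sketch below scores each candidate's video feature against the audio feature with a histogram estimate of mutual information and picks the highest-scoring candidate. This is not the authors' method: the abstract does not specify the synchrony measure, the features, or the decision rule, so the function names (`quantize`, `mutual_information`, `detect_speaker`), the uniform binning, and the toy signals are all assumptions made for illustration.

```python
import math
from collections import Counter


def quantize(values, n_bins):
    """Map real values to integer bin indices with uniform-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant signal carries no information: put everything in one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(audio, video, n_bins=4):
    """Histogram estimate of the mutual information (in bits) between two sequences."""
    a = quantize(audio, n_bins)
    v = quantize(video, n_bins)
    n = len(a)
    joint = Counter(zip(a, v))
    count_a = Counter(a)
    count_v = Counter(v)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_a[i] * count_v[j]))
    return mi


def detect_speaker(audio, candidates, n_bins=4):
    """Return the candidate whose video feature is most synchronous with the audio."""
    return max(candidates, key=lambda name: mutual_information(audio, candidates[name], n_bins))


# Toy data (hypothetical): an audio energy track and two mouth-motion tracks.
audio = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
candidates = {
    "person_a": [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9],  # moves with the audio
    "person_b": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # does not move
}
speaker = detect_speaker(audio, candidates, n_bins=2)
```

With real signals, a histogram estimate of this kind is sensitive to the bin count and the sequence length, which is one reason a confidence level on the decision is useful.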

Results

Through the hypothesis testing approach, the classifier performance can be given as a ratio of detection to false-alarm probabilities. Above all, the hypothesis tests provide a means of measuring the efficiency of the whole pattern recognition process. In particular, the gain offered by the proposed feature extraction step can be evaluated. It is shown that introducing such a feature extraction step increases the ability of the classifier to produce good relative instance scores and, therefore, the performance of the pattern recognition process.
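
To make the detection and false-alarm probabilities concrete, here is a minimal sketch that estimates both empirically from classifier scores at a fixed threshold and forms their ratio. The function name, the threshold rule (declare "speaker" when the score exceeds the threshold), and the toy scores are assumptions; the abstract does not describe how the authors' tests are constructed.

```python
import math


def detection_and_false_alarm(scores, is_speaker, threshold):
    """Estimate P_D and P_FA for the rule 'declare speaker if score > threshold'.

    scores      -- classifier output for each instance
    is_speaker  -- ground-truth flag for each instance (True if speaking)
    """
    positives = [s for s, truth in zip(scores, is_speaker) if truth]
    negatives = [s for s, truth in zip(scores, is_speaker) if not truth]
    # Detection: a true speaker is declared a speaker.
    p_d = sum(s > threshold for s in positives) / len(positives)
    # False alarm: a non-speaker is declared a speaker.
    p_fa = sum(s > threshold for s in negatives) / len(negatives)
    return p_d, p_fa


def detection_to_false_alarm_ratio(p_d, p_fa):
    """Ratio P_D / P_FA; infinite when no false alarm was observed."""
    return math.inf if p_fa == 0 else p_d / p_fa


# Toy scores (hypothetical): three speaking and three non-speaking instances.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
is_speaker = [True, True, True, False, False, False]
p_d, p_fa = detection_and_false_alarm(scores, is_speaker, threshold=0.5)
ratio = detection_to_false_alarm_ratio(p_d, p_fa)
```

Computing the same ratio with and without a feature extraction step, on the same data and threshold, is one simple way to compare the two pipelines.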

Conclusion

The power of hypothesis tests as an evaluation tool is exploited to assess the performance of a multimodal pattern recognition process. In particular, the benefit of performing a feature extraction step prior to classification is evaluated. Although the proposed framework is used here for detecting the speaker in audio-visual sequences, it could be applied to any other classification task involving two spatio-temporally co-occurring signals.
Metadata
Title
Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection
Authors
Patricia Besson
Murat Kunt
Publication date
01-12-2008
Publisher
BioMed Central
Published in
Journal of NeuroEngineering and Rehabilitation / Issue 1/2008
Electronic ISSN: 1743-0003
DOI
https://doi.org/10.1186/1743-0003-5-11