ABSTRACT
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework, allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries) such as moments, peaks, and regression parameters. Post-processing of the features includes statistical classifiers such as support vector machine models, as well as file export for popular toolkits such as Weka or HTK. Available low-level descriptors cover popular speech, music, and video features, including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory-model-based loudness, voice quality, local binary patterns, color, and optical flow histograms. In addition, voice activity detection, pitch tracking, and face detection are supported. openSMILE is implemented in C++ using standard open-source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture that makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
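To illustrate the idea of statistical functionals as feature summaries, the following is a minimal sketch of computing a few of the functionals named above (moments, a regression slope, and a peak count) over a frame-wise low-level descriptor contour such as per-frame energy. The function name and the choice of functionals are illustrative only and do not reflect the openSMILE API or its exact definitions.

```python
# Sketch: summarize a frame-wise low-level descriptor (LLD) contour
# with a few statistical functionals, as openSMILE does per segment.
from statistics import mean, pstdev

def functionals(lld):
    n = len(lld)
    m = mean(lld)
    sd = pstdev(lld)
    # third standardized moment (skewness); guard against sd == 0
    skew = sum((x - m) ** 3 for x in lld) / n / (sd ** 3 if sd else 1.0)
    # least-squares linear regression slope over the frame index
    t = range(n)
    tm = mean(t)
    slope = (sum((ti - tm) * (x - m) for ti, x in zip(t, lld))
             / sum((ti - tm) ** 2 for ti in t))
    # number of local maxima ("peaks") in the contour
    peaks = sum(1 for i in range(1, n - 1)
                if lld[i] > lld[i - 1] and lld[i] > lld[i + 1])
    return {"mean": m, "stddev": sd, "skewness": skew,
            "slope": slope, "numPeaks": peaks}
```

Applied to every LLD channel, such functionals map a variable-length sequence of frames onto a fixed-length feature vector, which is what makes the output directly usable by static classifiers such as support vector machines.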
Recent developments in openSMILE, the Munich open-source multimedia feature extractor