ABSTRACT
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework, allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries) such as moments, peaks, and regression parameters. Post-processing of the features includes statistical classifiers such as support vector machine models, as well as file export for popular toolkits such as Weka or HTK. Available low-level descriptors cover popular speech, music, and video features, including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory-model-based loudness, voice quality, local binary patterns, color, and optical flow histograms. In addition, voice activity detection, pitch tracking, and face detection are supported. openSMILE is implemented in C++ using standard open-source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture that makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from http://opensmile.sourceforge.net/.
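To illustrate the idea of statistical functionals as feature summaries, the following is a minimal sketch of computing a few of the functionals named above (moments, a regression slope, and a peak count) over a frame-wise low-level descriptor contour such as per-frame energy. The function name and the choice of functionals are illustrative only and do not reflect the openSMILE API or its exact definitions.

```python
# Sketch: summarize a frame-wise low-level descriptor (LLD) contour
# with a few statistical functionals, as openSMILE does per segment.
from statistics import mean, pstdev

def functionals(lld):
    n = len(lld)
    m = mean(lld)
    sd = pstdev(lld)
    # third standardized moment (skewness); guard against sd == 0
    skew = sum((x - m) ** 3 for x in lld) / n / (sd ** 3 if sd else 1.0)
    # least-squares linear regression slope over the frame index
    t = range(n)
    tm = mean(t)
    slope = (sum((ti - tm) * (x - m) for ti, x in zip(t, lld))
             / sum((ti - tm) ** 2 for ti in t))
    # number of local maxima ("peaks") in the contour
    peaks = sum(1 for i in range(1, n - 1)
                if lld[i] > lld[i - 1] and lld[i] > lld[i + 1])
    return {"mean": m, "stddev": sd, "skewness": skew,
            "slope": slope, "numPeaks": peaks}
```

Applied to every LLD channel, such functionals map a variable-length sequence of frames onto a fixed-length feature vector, which is what makes the output directly usable by static classifiers such as support vector machines.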
Recent developments in openSMILE, the Munich open-source multimedia feature extractor