Continuous Action Recognition Based on Sequence Alignment

Kulkarni, Kaustubh; Evangelidis, Georgios; Cech, Jan; Horaud, Radu

doi:10.1007/s11263-014-0758-9

Continuous Action Recognition Based on Sequence Alignment

Published: 09 September 2014

Volume 112, pages 90–114, (2015)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Kaustubh Kulkarni¹,
Georgios Evangelidis¹,
Jan Cech² &
…
Radu Horaud¹

1846 Accesses
44 Citations
Explore all metrics

An Erratum to this article was published on 18 November 2014

Abstract

Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried out. We build on the well known dynamic time warping framework and devise a novel visual alignment technique, namely dynamic frame warping (DFW), which performs isolated recognition based on per-frame representation of videos, and on aligning a test sequence with a model sequence. Moreover, we propose two extensions which enable to perform recognition concomitant with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous recognition of speech and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques with a recently released dataset (RAVEL) and with two public-domain datasets widely used in action recognition (Hollywood-1 and Hollywood-2). We also compare the performances of the proposed isolated and continuous recognition algorithms with several recently published methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Natural Action Recognition Using Invariant 3D Motion Encoding

HMDB51: A Large Video Database for Human Motion Recognition

Motion Boundary Trajectory for Human Action Recognition

Notes

References

Alameda-Pineda, X., Sanchez-Riera, J., Wienke, J., Franc, V., Cech, J., Kulkarni, K., et al. (2013). RAVEL: An annotated corpus for training robots with audiovisual abilities. Journal on Multimodal User Interfaces, 7(1–2), 79–91.
Article Google Scholar
Alon, J., Athitsos, V., Yuan, Q., & Sclaroff, S. (2009). A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9), 1685–1699.
Article Google Scholar
Blackburn, J., & Ribeiro, E. (2007). Human motion recognition using isomap and dynamic time warping. Human motion-understanding, modeling, capture and animation (pp. 285–298). Berlin: Springer.
Chapter Google Scholar
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. New York, NY: Cambridge University Press.
Book MATH Google Scholar
Brendel, W., & Todorovic, S. (2010). Activities as time series of human postures. In N. Paragios (Ed.), Computer Vision-ECCV 2010 (pp. 721–734). Berlin: Springer.
Chapter Google Scholar
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Rev, 43(1), 129–159.
Article MATH MathSciNet Google Scholar
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision.
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., Athitsos, V., & Escalante, H. J. (2013). Multi-modal gesture recognition challenge 2013: Dataset and results. In ChaLearn Multi-modal Gesture Recognition Grand Challenge and Workshop, 15th ACM International Conference on Multimodal Interaction.
Evangelidis, G. D., & Psarakis, E. Z. (2008). Parametric image alignment using enhanced correlation coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10), 1858–1865.
Article Google Scholar
Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3), 195–304.
Article Google Scholar
Gill, P. R., Wang, A., & Molnar, A. (2011). The in-crowd algorithm for fast basis pursuit denoising. IEEE Transactions on Signal Processing, 59(10), 4595–4605.
Article MathSciNet Google Scholar
Gong, D., & Medioni, G. (2011) Dynamic manifold warping for view invariant action recognition. In IEEE International Conference on Computer Vision, (pp. 571–578). IEEE.
Hienz, H., Bauer, B., & Kraiss, K. F. (1999). HMM-based continuous sign language recognition using stochastic grammars. In A. Braffort, R. Gherbi, S. Gibet, D. Teil, & J. Richardson (Eds.), Gesture-based communication in human-computer interaction (Vol. 1739, pp. 185–196)., Lecture Notes in Computer Science Berlin: Springer.
Chapter Google Scholar
Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In 2011 IEEE Conference on Computer Vision and Pattern Recognition CVPR. (pp. 3265–3272). IEEE.
Ikizler, N., & Duygulu, P. (2009). Histogram of oriented rectangles: A new pose descriptor for human action recognition. Image and Vision Computing, 27(10), 1515–1526.
Article Google Scholar
Jain, M., Jégou, H., & Bouthémy, P. (2013). Better exploiting motion for better action recognition. In Computer Vision and Pattern Recognition, (pp. 2555–2562). IEEE.
Jiang, Y. G., Dai, Q., Xue, X., Liu, W., & Ngo, C. W. (2012). Trajectory-based modeling of human actions with motion reference points. In European Conference on Computer Vision, (pp. 425–438). Berlin :Springer.
Kulkarni, K., Cherla, S., Kale, A., & Ramasubramanian, V. (2008). A framework for indexing human actions in video. In The 1st International Workshop on Machine Learning for Vision-based Motion Analysis-MLVMA’08.
Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008) Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, (pp. 1–8). IEEE.
Lee, C., & Rabiner, L. (1989). A frame-synchronous network search algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11), 1649–1658.
Article Google Scholar
Liang, R., & Ouhyoung, M. (1998). A real-time continuous gesture recognition system for sign language. In Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, (pp. 558–567). IEEE.
Lv, F., & Nevatia, R. (2006). Recognition and segmentation of 3-d human action using HMM and multi-class AdaBoost. In European Conference on Computer Vision, (pp. 359–372). Berlin: Springer.
Lv, F., & Nevatia, R. (2007). Single view human action recognition using key pose matching and Viterbi path searching. In Computer Vision and Pattern Recognition, 2007. CVPR’07, (pp. 1–8). IEEE.
Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Book MATH Google Scholar
Marszalek, M., Laptev, I., & Schmid, C. (2009) Actions in context. In IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2929–2936). IEEE.
Morency, L., Quattoni, A., & Darrell, T. (2007). Latent-dynamic discriminative models for continuous gesture recognition. In Computer Vision and Pattern Recognition, (pp. 1–8). IEEE.
Mueller, M. (2007). Dynamic time warping. Information retrieval for music and motion (pp. 69–84). Berlin: Springer.
Chapter Google Scholar
Ney, H. (1984). The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2), 263–271.
Article Google Scholar
Ney, H., & Ortmanns, S. (1999). Dynamic programming search for continuous speech recognition. IEEE Signal Processing Magazine, 16(5), 64–83.
Article Google Scholar
Ning, H., Xu, W., Gong, Y., Huang, T. (2008). Latent pose estimator for continuous action recognition. In European Conference on Computer Vision, (pp. 419–433). Springer.
Rabiner, L., & Juang, B. (1993). Fundamentals of speech recognition. Salt Lake: Prentice hall.
Google Scholar
Sakoe, H. (1979). Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Transactions on Acoustic, Speech, and Signal Processing, 27(6), 588–595.
Article Google Scholar
Sanchez-Riera, J., Cech, J., Horaud, R. P. (2012). Action recognition robust to background clutter by using stereo vision. In The Fourth International Workshop on Video Event Categorization, Tagging and Retrieval, LNCS: Springer.
Shi, Q., Wang, L., Cheng, L., & Smola, A. (2011). Discriminative human action segmentation and recognition using SMMs. IJCV, 93(1), 22–32.
Article MATH Google Scholar
Sigal, L., Balan, A., & Black, M. (2010). Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1), 4–27.
Article Google Scholar
Sivic, J., & Zisserman, A. (2009). Efficient visual search of videos cast as text retrieval. IEEE Transactions on PAMI, 31(4), 591–606.
Article Google Scholar
Sminchisescu, C., Kanaujia, A., & Metaxas, D. N. (2006). Conditional models for contextual human motion recognition. CVIU, 104(2–3), 210–220.
Google Scholar
Solmaz, B., Assari, S. M., & Shah, M. (2013). Classifying web videos using a global video descriptor. Machine vision and applications, 24(7), 1473–1485.
Article Google Scholar
Starner, T., Weaver, J., & Pentland, A. (1998). Real-time american sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.
Article Google Scholar
Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12), 4655–4666.
Article MATH MathSciNet Google Scholar
Ullah, M. M., Parizi, S. N,, Laptev, I. (2010). Improving bag-of-features action recognition with non-local cues. In British Machine Vision Conference. (Vol. 10, pp. 95–101).
Vail, D., Veloso, M., & Lafferty, J. (2007). Conditional random fields for activity recognition. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, (p. 235). ACM.
Vintsyuk, T. (1971). Element-wise recognition of continuous speech composed of words from a specified dictionary. Cybernetics and Systems Analysis, 7(2), 361–372.
Vogler, C., & Metaxas, D. (1998). ASL recognition based on a coupling between HMMs and 3D motion analysis. In Sixth International Conference on Computer Vision, (pp. 363–369).
Vogler, C., & Metaxas, D. (2001). A framework for recognizing the simultaneous aspects of american sign language. Computer Vision and Image Understanding, 81(3), 358–384.
Article MATH Google Scholar
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In International Conference on Computer Vision, (pp. 3551–3558). IEEE.
Young, S., Russell, N. H., & Thornton, J. (1989). Token passing: a simple conceptual model for connected speech recognition systems. Technical Report 38, University of Cambridge, Department of Engineering.
Young, S., Woodland, P., & Byrne, W. (1993). HTK: Hidden Markov model toolkit v1. 5. Technical Report, University of Cambridge, Department of Engineering.
Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., et al. (2009). The HTK book. Technical Report: University of Cambridge, Department of Engineering.
Zhou, F., & la Torre, F. D. (2009). Canonical time warping for alignment of human behavior. In Advances in Neural Information Processing Systems, (pp. 2286–2294).

Download references

Acknowledgments

The authors acknowledge support from the European project HUMAVIPS #247525 (2010–2013) and from the ERC Advanced Grant VHIA #340113 (2014–2019). J. Cech acknowledges support from the Czech Science Foundation Project GACR.

Author information

Authors and Affiliations

INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin, France
Kaustubh Kulkarni, Georgios Evangelidis & Radu Horaud
Center for Machine Perception, Czech Technical University in Prague, Prague, Czech Republic
Jan Cech

Authors

Kaustubh Kulkarni
View author publications
You can also search for this author in PubMed Google Scholar
Georgios Evangelidis
View author publications
You can also search for this author in PubMed Google Scholar
Jan Cech
View author publications
You can also search for this author in PubMed Google Scholar
Radu Horaud
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Radu Horaud.

Additional information

Communicated by Ivan Laptev.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kulkarni, K., Evangelidis, G., Cech, J. et al. Continuous Action Recognition Based on Sequence Alignment. Int J Comput Vis 112, 90–114 (2015). https://doi.org/10.1007/s11263-014-0758-9

Download citation

Received: 18 February 2013
Accepted: 27 August 2014
Published: 09 September 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s11263-014-0758-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Continuous Action Recognition Based on Sequence Alignment

Abstract

Access this article

Similar content being viewed by others

Natural Action Recognition Using Invariant 3D Motion Encoding

HMDB51: A Large Video Database for Human Motion Recognition

Motion Boundary Trajectory for Human Action Recognition

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Continuous Action Recognition Based on Sequence Alignment

Abstract

Access this article

Similar content being viewed by others

Natural Action Recognition Using Invariant 3D Motion Encoding

HMDB51: A Large Video Database for Human Motion Recognition

Motion Boundary Trajectory for Human Action Recognition

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation