ABSTRACT
Deep neural networks (DNNs) achieved a significant breakthrough in visual recognition in 2012 and quickly became the leading machine learning algorithm for Big Data based large-scale object recognition applications. The successful deployment of DNN-based applications poses challenges for a cross-platform software framework that enables multiple user scenarios, including offline model training on HPC clusters and online recognition in embedded environments. Existing DNN frameworks are mostly built on closed-format CUDA implementations, which limits the breadth of hardware systems on which DNNs can be deployed.
This paper presents OpenCL™ caffe, which transforms the popular CUDA-based framework caffe [1] to an open-standard OpenCL backend. The goal is to enable a heterogeneous-platform-compatible DNN framework and to achieve competitive performance with the OpenCL tool chain. Due to the high complexity of DNN models, we use a two-phase strategy. First we introduce OpenCL porting strategies that guarantee algorithm convergence; then we analyze OpenCL's performance bottlenecks in the DNN domain and propose several optimization techniques, including a batched data layout and multiple command queues, to better map the problem size onto the existing BLAS library, improve hardware resource utilization, and boost OpenCL runtime efficiency.
We verify OpenCL caffe's successful offline training and online recognition on both server-end and consumer-end GPUs. Experimental results show that the phase-two optimized OpenCL caffe achieves a 4.5x speedup without modifying the BLAS library. Users can directly run mainstream DNN models and achieve the best performance on a specific processor by choosing the optimal batch number according to the hardware properties and input data size.
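The batched data layout mentioned above can be illustrated with a minimal NumPy sketch (a hypothetical single-channel example, not the paper's actual OpenCL kernels): instead of issuing one small GEMM per image, the im2col matrices of a whole batch are concatenated so that a single, larger GEMM keeps the BLAS routine well utilized.

```python
import numpy as np

def im2col(img, k):
    """Unroll all k x k patches of a single-channel image into columns."""
    h, w = img.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = img[i:i + k, j:j + k].ravel()
            idx += 1
    return cols

def conv_per_image(imgs, filt):
    # Naive layout: one small GEMM per image -> poor BLAS utilization.
    k = filt.shape[0]
    return [filt.ravel() @ im2col(img, k) for img in imgs]

def conv_batched(imgs, filt):
    # Batched layout: concatenate the im2col columns of every image in
    # the batch, then run one large GEMM over the combined matrix.
    k = filt.shape[0]
    big = np.hstack([im2col(img, k) for img in imgs])
    out = filt.ravel() @ big
    return np.split(out, len(imgs))

rng = np.random.default_rng(0)
imgs = [rng.standard_normal((8, 8)) for _ in range(4)]
filt = rng.standard_normal((3, 3))
per_image = conv_per_image(imgs, filt)
batched = conv_batched(imgs, filt)
# Both layouts compute identical convolution results.
assert all(np.allclose(a, b) for a, b in zip(per_image, batched))
```

The batched variant trades a larger temporary buffer for fewer, bigger GEMM calls, which is the trade-off the paper tunes via the batch number.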
REFERENCES
- caffe. http://caffe.berkeleyvision.org.
- Cifar. http://www.cs.toronto.edu/~kriz/cifar.html.
- Convnet. https://github.com/sdemyanov/ConvNet.
- Mxnet. https://github.com/dmlc/mxnet.
- Tensor flow. http://www.tensorflow.org/tutorials.
- Threefry random generator. https://www.deshawresearch.com/resources_random123.html.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Pages 1097-1105, December 2012.
- O. Russakovsky, J. Deng, et al. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014.