This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and "war stories" of users that have adopted BigDL to address their challenges(i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).
In ACM Symposium of Cloud Computing conference (SoCC) 2019
Recent breakthroughs in artificial intelligence applications have brought deep learning to the forefront of new generations of data analytics. In this tutorial, we will present the practice and design tradeoffs on building large-scale deep learning applications (such as computer vision and NLP), for production data and workflow on Big Data platforms. We will provide an overview of emerging deep learning frameworks for Big Data (e.g., BigDL, TensorFlowOnSpark, Deep Learning Pipelines for Apache Spark, etc.), and present the underlying distributed systems and algorithms. More importantly, we will show how to build and productionize end-to-end deep learning application pipelines for Big Data (on top of Analytics Zoo, a unified analytics + AI platform for distributed TensorFlow, Keras and BigDL on Apache Spark), using real-world use cases (such as Azure, JD.com, World Bank, Midea/KUKA, etc.)
Tutorial in the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI) 2019
MapReduce and its popular open source implementation, Hadoop, are moving toward ubiquitous for Big Data storage and processing. Therefore, it is essential to quantitatively evaluate and characterize the Hadoop deployment through extensive benchmarking. We present HiBench, a representative and comprehensive benchmark suite forHadoop, which consists of a set of Hadoop programs including both synthetic micro-benchmarks and real-world applications.
In Proceedings of the 26th International Conference on Data Engineering Whokshops (ICDEW) 2010
Modern network processors employs parallel processing engines(PEs) to be organized in a pipelined fashion to enable higher processing throughput and flexibility. In this paper, we present a novel program transformation technique to automatically partitions a sequential packet processing application into coordinated pipelined parallel subtasks, which can be naturally mapped to contemporary high performance network processors. We have implemented the proposed transformation method in an auto-partitioning C compiler product for Intel Network Processors. Experimental results show that our method provides impressive speed up for the commonly used NPF IPv4 forwarding and IP forwarding benchmarks.
In ACM 2005 Sigplan Conference on Programming Language Design and Implementation (PLDI 2005)