Automated ML Workflow for Distributed Big Data Using Analytics Zoo


Jason Dai


2-5PM (Pacific Time), June 19, 2020


Applying machine learning (ML) techniques to distributed big data analytics plays a central role in today's intelligent applications and systems. These problem settings have pushed the field to address issues of data scale that were almost inconceivable to AI researchers even a decade ago. In addition, building machine learning applications for these big data problems can be a laborious and knowledge-intensive process for ML engineers.

To address these challenges, we have open sourced Analytics Zoo, which helps users build and productionize end-to-end ML workflows for distributed big data in an automated fashion. Using Analytics Zoo, users can simply build conventional Python notebooks on their laptops (with possible AutoML support), which can then automatically scale out to large clusters and process large amounts of data in a distributed fashion.
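The "develop locally, scale out transparently" pattern described above can be illustrated with a minimal stand-in sketch. This is not Analytics Zoo's actual API; the cluster backend is simulated here with Python's standard-library thread pool so the example is self-contained, and the function names (preprocess, run_pipeline) are hypothetical.

```python
# Minimal sketch of the laptop-to-cluster pattern: the same user code runs
# single-process locally or fans out across workers, with only the execution
# backend changing. ThreadPoolExecutor stands in for a real distributed backend.
from concurrent.futures import ThreadPoolExecutor

def preprocess(record):
    # Per-record feature preparation; trivially parallel across partitions.
    return record * 2

def run_pipeline(records, scale_out=False):
    if scale_out:
        # "Cluster" mode: the same function is mapped across workers.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(preprocess, records))
    # "Laptop" mode: plain sequential execution of identical user code.
    return [preprocess(r) for r in records]

if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))                   # local run
    print(run_pipeline([1, 2, 3], scale_out=True))   # scaled-out run
```

The point of the pattern is that the user-facing pipeline code is identical in both modes; in Analytics Zoo, switching modes is handled by the framework rather than by the notebook author.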

This tutorial will present how to implement the automated ML workflow for big data (with a focus on supporting computer vision models and pipelines) by seamlessly integrating different technologies, including deep learning frameworks (e.g., TensorFlow, Keras, PyTorch), distributed analytics frameworks (e.g., Apache Spark, Apache Flink, Apache Kafka, Ray), and AutoML techniques (such as hyperparameter optimization). It will also share real-world experience and "war stories" from users who have adopted Analytics Zoo to address their challenges when applying ML techniques to distributed big data analytics.
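To make the AutoML idea concrete, here is a minimal, standalone sketch of random search over a hyperparameter space, the simplest of the search strategies such frameworks build on. It uses only the standard library; the toy objective function is hypothetical and stands in for a real training-and-validation run.

```python
# Minimal random-search sketch for hyperparameter optimization.
import random

def objective(lr, batch_size):
    # Toy stand-in for a validation loss; a real workflow would train a
    # model with these hyperparameters and evaluate it.
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

def random_search(n_trials, seed=0):
    # Sample hyperparameter configurations at random and keep the best one.
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": rng.uniform(1e-4, 1e-1),
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        score = objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

if __name__ == "__main__":
    score, params = random_search(50)
    print(params)
```

In a distributed setting, the trials in the loop are independent and can be dispatched in parallel across a cluster, which is essentially what AutoML layers over backends such as Ray do at scale.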