ScaDL

Scalable Deep Learning over Parallel And Distributed Infrastructures

Welcome to the website of the ScaDL series of workshops!

The ScaDL series of workshops started in 2019, and are generally co-located with the IEEE IPDPS conference.

Scope of the Workshop

Recently, Deep Learning (DL) has received tremendous attention in the research community because of the impressive results obtained for a large number of machine learning problems. The success of state-of-the-art deep learning systems relies on training deep neural networks over a massive amount of training data, which typically requires a large-scale distributed computing infrastructure to run. In order to run these jobs in a scalable and efficient manner, on cloud infrastructure or dedicated HPC systems, several interesting research topics have emerged which are specific to DL. The sheer size and complexity of deep learning models when trained over a large amount of data makes them harder to converge in a reasonable amount of time. It demands advancement along multiple research directions such as, model/data parallelism, model/data compression, distributed optimization algorithms for DL convergence, synchronization strategies, efficient communication and specific hardware acceleration.

In order to provide a few concrete examples, this workshop seeks to advance the following pertinent research directions:

  • Asynchronous and Communication-Efficient SGD: Stochastic gradient descent is at the core of large-scale machine learning. Parallelizing SGD gradient computation across multiple nodes increases the data processed per iteration, but exposes the SGD to communication and synchronization delays and unpredictable node failures in the system. Thus, there is a critical need to design robust and scalable distributed SGD methods to achieve fast error-convergence in spite of such system variabilities.

  • High performance computing aspects: Deep learning is highly compute intensive. Algorithms for kernel computations on commonly used accelerators (e.g. GPUs), efficient techniques for communicating gradients and loading data from storage are critical for training performance.

  • Model and Gradient Compression Techniques: Techniques such as reducing weights and the size of weight tensors help in reducing the compute complexity. Using lower-bit representations allow for more optimal use of memory and communication bandwidth.

This intersection of distributed/parallel computing and deep learning is becoming critical and demands specific attention to address the above topics which some of the broader forums may not be able to provide. The aim of this workshop is to foster collaboration among researchers from distributed/parallel computing and deep learning communities to share the relevant topics as well as results of the current approaches lying at the intersection of these areas.