Speaker: Dan Alistarh (IST Austria / ETH Zurich)

Title: Quantized Stochastic Gradient Descent

Abstract: Parallel implementations of stochastic gradient descent (SGD) have received significant research attention recently, thanks to the good scalability properties of this algorithm. A fundamental barrier to parallelizing large-scale SGD is that the cost of communicating gradient updates between nodes can become very large. Consequently, several compression heuristics have been proposed, in which nodes communicate only quantized, approximate versions of the model updates. Although effective in practice, these heuristics do not always converge, and it is not clear whether they can be improved. In this talk, I will describe Quantized SGD (QSGD), a family of lossy compression techniques that compress the gradient updates at each node while guaranteeing convergence under standard assumptions. Empirical results show that QSGD can significantly reduce communication cost for multi-GPU DNN training, while remaining competitive with standard uncompressed techniques in terms of accuracy on a variety of deep learning tasks. Time permitting, I will also discuss an extension of these techniques that allows SGD to run entirely on compressed, low-precision data representations. For linear models, it is possible to simultaneously quantize the samples, the model, and the gradient updates using as little as one bit per dimension, while maintaining the convergence guarantees. This framework enables an FPGA implementation that is almost an order of magnitude faster than an optimized multi-threaded implementation.
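
For readers unfamiliar with gradient quantization, the following is a minimal Python sketch of the kind of unbiased stochastic quantization that schemes like QSGD apply to a gradient vector: each coordinate is randomly rounded to one of `s` uniform levels scaled by the vector's norm, so the decoded gradient equals the true gradient in expectation. The function names, the level parameter `s`, and the float-array encoding are illustrative assumptions, not the talk's actual implementation (which would pack the integer levels into a few bits per coordinate).

```python
import numpy as np

def quantize(v, s):
    """Illustrative QSGD-style stochastic quantization of a gradient vector v
    onto s uniform magnitude levels per coordinate."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v), 0.0
    # Scale magnitudes into [0, s] and split into integer and fractional parts.
    scaled = np.abs(v) / norm * s
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part, so the
    # quantized value is unbiased in expectation.
    levels = lower + (np.random.rand(*v.shape) < (scaled - lower))
    # Return the signed level fractions plus the norm; a real implementation
    # would entropy-code the sparse integer levels instead of storing floats.
    return np.sign(v) * levels / s, norm

def dequantize(signed_levels, norm):
    """Reconstruct the approximate gradient from its compressed form."""
    return norm * signed_levels

# Hypothetical usage: quantize a gradient with 4 levels, then decode it.
g = np.random.randn(1_000_000).astype(np.float32)
q, n = quantize(g, s=4)
g_hat = dequantize(q, n)  # unbiased estimate of g
```

The key design point this sketch illustrates is the bias/variance trade-off controlled by `s`: fewer levels mean fewer bits per coordinate but higher variance in the gradient estimate, which is what the convergence analysis mentioned in the abstract has to account for.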