To enable the use of probabilistic models at scale at Amazon, we propose to exploit synergies between the Clay probabilistic programming language and Ray, the distributed computation framework currently being developed at UC Berkeley RISELab.
Researchers
- Vinamra Benara, UC Berkeley RISELab, https://people.eecs.berkeley.edu/~vbenara/
- Holger Waechtler, Amazon CoreAI
- Asmus Hetzel, Amazon CoreAI
- Ion Stoica, UC Berkeley RISELab, https://people.eecs.berkeley.edu/~istoica/
Overview
Clay is a tool for probabilistic programming on multi-GPU machines, developed internally by Amazon's CoreAI Berlin team and used in ongoing projects. While its performance and ease of use are very satisfactory, the large datasets currently processed are quickly approaching the limits of multi-GPU instances in the AWS cloud: the amount of data processed is filling the available GPU memory of even the largest available machines. The goal of this project is to integrate RISELab's Ray as a toolkit for distributed computation and scale-out over multiple instances, which will (i) overcome these limits in ongoing projects, where we expect datasets to experience steady natural growth for the foreseeable future, and (ii) enable future use cases working on even larger datasets.
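As a minimal sketch of the scale-out pattern Ray enables (this is not Clay's actual integration; `ShardWorker` and `partial_sum` are illustrative names), a dataset too large for one machine's GPU memory can be sharded across Ray actors, each holding its partition on a GPU of a possibly remote instance, with per-shard results aggregated on the driver:

```python
import ray

ray.init(ignore_reinit_error=True)  # or ray.init(address="auto") to join a multi-node cluster

@ray.remote(num_gpus=1)  # pin each actor to one GPU; drop this on CPU-only machines
class ShardWorker:
    """Holds one shard of the dataset; a real backend would move the
    shard into this worker's GPU memory on construction."""
    def __init__(self, shard):
        self.shard = shard

    def partial_sum(self):
        # Stand-in for whatever per-shard computation the executor dispatches.
        return sum(self.shard)

data = list(range(1_000_000))
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]
workers = [ShardWorker.remote(s) for s in shards]

# Per-shard results are computed in parallel and aggregated on the driver.
total = sum(ray.get([w.partial_sum.remote() for w in workers]))
```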
Update
We are currently addressing this by implementing dedicated distributed inference algorithms and by extending Clay's internal executor to automatically distribute computation over several machines. We have also implemented distributed versions of Hamiltonian Monte Carlo (HMC) and L-BFGS. This involved the design, implementation, and validation of distributed solvers and their integration as a backend of Clay's compute graph compiler, and it provided interesting opportunities for algorithmic research into scalable inference methods.
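A primitive shared by both distributed solvers is a data-parallel gradient of the log posterior: for i.i.d. data the log posterior decomposes into a sum over observations, so each worker can return the gradient contribution of its shard, and the driver sums the contributions before every HMC leapfrog step or L-BFGS line search. The sketch below illustrates this for a toy Gaussian model; `GradWorker`, `grad_log_post`, and the other names are hypothetical and do not reflect Clay's API:

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class GradWorker:
    """Holds one data shard and evaluates its gradient contribution."""
    def __init__(self, shard):
        self.x = np.asarray(shard)

    def grad_log_lik(self, theta):
        # Toy model x_i ~ N(theta, 1): d/dtheta log p(shard | theta) = sum_i (x_i - theta)
        return np.sum(self.x - theta)

def grad_log_post(theta, workers, prior_prec=1.0):
    # Gradient of a N(0, 1/prior_prec) prior plus the per-shard likelihood
    # gradients, evaluated in parallel and summed on the driver.
    parts = ray.get([w.grad_log_lik.remote(theta) for w in workers])
    return -prior_prec * theta + sum(parts)

def leapfrog_step(theta, momentum, eps, workers):
    # One HMC leapfrog step driven by the distributed gradient.
    momentum = momentum + 0.5 * eps * grad_log_post(theta, workers)
    theta = theta + eps * momentum
    momentum = momentum + 0.5 * eps * grad_log_post(theta, workers)
    return theta, momentum

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=100_000)
workers = [GradWorker.remote(s) for s in np.array_split(data, 4)]
theta, momentum = 0.0, rng.normal()
theta, momentum = leapfrog_step(theta, momentum, 1e-5, workers)
```

A distributed L-BFGS backend can reuse the same aggregated-gradient primitive inside its update and line search, which is why the two solvers share most of their scaffolding.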