1. Hydra: Efficient Training for Larger-than-Memory Deep Learning Models
Kabir Nagrecha, Arun Kumar
2. Deep Learning and Scale: A Natural Pairing
GPT-3: 175B parameters
BERT-Large: 345M parameters
Megatron-LM: 1T parameters
3. But how do we train them?
No GPU in the world can train a trillion-parameter model…
But perhaps 1000 GPUs together?
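A quick back-of-envelope calculation (an illustrative sketch, not a figure from the talk) makes the problem concrete, assuming fp32 weights and an Adam-style optimizer that keeps gradients plus two state tensors per parameter:

    # Rough memory estimate for training a 1-trillion-parameter model.
    # Assumptions (illustrative, not from the talk): fp32 weights, Adam-style
    # optimizer state; activations are ignored entirely.
    params = 1e12                 # 1 trillion parameters
    bytes_per_param = 4           # fp32
    multiplier = 4                # weights + gradients + 2 optimizer states
    total_tb = params * bytes_per_param * multiplier / 1e12
    print(f"~{total_tb:.0f} TB needed, vs. 0.08 TB on an 80 GB GPU")

Even under these generous assumptions, training state alone runs to roughly 16 TB, hundreds of times more than any single accelerator offers, which is what motivates pooling devices together.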
Model Parallelism – Combining GPUs into a “super-device”
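As a minimal sketch of the idea (a generic PyTorch example, not Hydra's actual implementation; the device names and layer sizes are hypothetical), model parallelism splits a network's layers across devices so that no single GPU has to hold the full parameter set:

    import torch
    import torch.nn as nn

    class TwoDeviceMLP(nn.Module):
        """Toy model-parallel network: layers split across two GPUs."""
        def __init__(self):
            super().__init__()
            # The first half of the network lives on GPU 0...
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            # ...and the second half on GPU 1, so neither device
            # holds all of the parameters at once.
            self.part2 = nn.Linear(4096, 10).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            # Activations are copied between devices at the split point.
            return self.part2(x.to("cuda:1"))

    model = TwoDeviceMLP()
    logits = model(torch.randn(8, 1024))  # output tensor lives on cuda:1

Together, the two GPUs behave like one larger “super-device”, at the cost of transferring activations across devices at each split point.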