Distributed LLM Training 06 - Where LLM Training Memory Actually Goes
Sizing training runs by parameter count alone leads to bad decisions; training memory is the sum of parameters, gradients, optimizer state, and activations
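To see why parameter count alone misleads, here is a minimal back-of-the-envelope sketch. It assumes the common mixed-precision Adam accounting: fp16 weights and fp16 gradients (2 bytes each per parameter) plus fp32 optimizer state (a master copy of the weights, momentum, and variance at 4 bytes each). Activations are deliberately excluded, since they scale with batch size and sequence length rather than parameter count. The function name and the 7B example are illustrative, not from a specific library:

```python
def training_memory_gib(num_params: float) -> dict:
    """Rough GPU memory estimate for mixed-precision Adam training.

    Assumed accounting per parameter:
      fp16 weights (2 B) + fp16 gradients (2 B)
      + fp32 optimizer state (master weights, momentum, variance: 12 B)
    Activations are left out: they depend on batch size and
    sequence length, not on parameter count.
    """
    GIB = 1024 ** 3
    return {
        "parameters (fp16)": num_params * 2 / GIB,
        "gradients (fp16)": num_params * 2 / GIB,
        "optimizer state (fp32 x3)": num_params * 12 / GIB,
        "total (excl. activations)": num_params * 16 / GIB,
    }

# Under these assumptions, a 7B-parameter model already needs
# ~104 GiB before a single activation is stored:
for component, gib in training_memory_gib(7e9).items():
    print(f"{component}: {gib:.1f} GiB")
```

Under this accounting, the optimizer state alone is three times the size of the fp16 weights, which is why the post breaks memory down component by component instead of reasoning from parameter size.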