Month: January 2023

Weights vs Activations

Why do activations need more bits (16-bit) than weights (8-bit)? Source: https://stackoverflow.com/questions/72397839/why-do-activations-need-more-bits-16bit-than-weights-8bit-in-tensor-flows-n

Answer:

Activations are the actual signals propagating through the network. They have nothing to do with the activation function; that is just a name collision. They can be kept at higher precision because they are not part of the model, so they do not affect storage, download size, or memory usage: if you are not training the model, you never store activations beyond the current one.
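
To make the weight side concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization: the 8-bit weights are what actually gets stored in the model file, and they are expanded back to higher precision only at compute time. The scale formula and the quantize_int8/dequantize helper names are illustrative assumptions, not the exact scheme TensorFlow Lite uses.

import numpy as np

def quantize_int8(w: np.ndarray):
    # Map the largest weight magnitude to 127 (symmetric, per-tensor).
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximate float32 weight matrix for computation.
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes, q.nbytes)                       # 262144 vs 65536 bytes: 4x smaller to store
print(np.abs(w - dequantize(q, scale)).max())   # small rounding error per weight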

For example, for an MLP (multi-layer perceptron) we have something along the lines of

a_1 = relu(W_1 x + b_1)
a_2 = relu(W_2 a_1 + b_2)
...
a_n = W_n a_{n-1} + b_n

where each W and b is an 8-bit parameter, and the activations are a_1, ..., a_n. The point is that you only ever need the previous and current layer: to compute a_t you just need a_{t-1}, not the earlier ones, so keeping them at higher precision during computation is a cheap and worthwhile tradeoff.
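
A minimal sketch of such a forward pass, with int8 weights and float32 activations, might look like the following. The layer sizes, the relu helper, and the forward function are made up for illustration; the thing to notice is that the variable a is simply overwritten each step, so only the current activation is alive at any time.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, layers):
    a = x.astype(np.float32)                  # current activation, higher precision
    for q, scale, b in layers:                # q: int8 weights, scale: dequant factor
        w = q.astype(np.float32) * scale      # dequantize weights for the matmul
        a = relu(w @ a + b)                   # overwrite a: only the current activation is kept
    return a

# Hypothetical 3-layer network: weights stored as int8, biases/activations float32.
rng = np.random.default_rng(0)
layers = []
dims = [32, 64, 64, 10]
for d_in, d_out in zip(dims[:-1], dims[1:]):
    w = rng.standard_normal((d_out, d_in)).astype(np.float32)
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    layers.append((q, scale, rng.standard_normal(d_out).astype(np.float32)))

print(forward(rng.standard_normal(32), layers).shape)   # (10,)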

Where activations are stored:

  • During training, activations are typically kept in GPU memory (for models trained on GPUs), because backpropagation needs them to compute gradients. Given that modern deep learning models can have millions to billions of parameters, storing all of these activations can be memory-intensive.
  • During inference, you only perform a forward pass and do not need to store all activations, only those required to compute the subsequent layers. Once an activation has been used to compute the next layer, it can be discarded; see the sketch after this list.
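
The following sketch contrasts the two cases. The network, the relu helper, and the function names are hypothetical; the point is only what gets kept in memory: training caches every activation for the backward pass, while inference reuses a single buffer.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_training(x, weights):
    cache = [x]                          # keep every activation for backprop
    for w in weights:
        cache.append(relu(w @ cache[-1]))
    return cache                         # memory grows with network depth

def forward_inference(x, weights):
    a = x                                # a single buffer is enough
    for w in weights:
        a = relu(w @ a)                  # the previous activation is discarded here
    return a                             # memory is constant in depth

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) for _ in range(4)]
x = rng.standard_normal(16)
print(len(forward_training(x, weights)))      # 5 stored activations
print(forward_inference(x, weights).shape)    # (16,)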