“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestep.”
“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers.”
The context vector is the output of the encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the information the decoder needs to decode accurately; attention is a mechanism that addresses this problem.
“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”
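The weighted-sum idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's actual implementation; the function and variable names (`attention_context`, `encoder_states`, `decoder_state`) are my own, and dot-product scoring is just one of several possible relevancy measures.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Context vector as a softmax-weighted sum of encoder hidden states.

    decoder_state:  shape (d,)   -- current decoder hidden state
    encoder_states: shape (T, d) -- one hidden state per encoder timestep
    """
    # Relevancy score of each encoder state w.r.t. the current decoder state
    scores = encoder_states @ decoder_state            # shape: (T,)
    # Softmax: normalize scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: weighted sum over all encoder timesteps
    context = weights @ encoder_states                 # shape: (d,)
    return context, weights

# Toy example: 4 encoder timesteps, hidden size 3 (illustrative values)
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 3))
dec = rng.normal(size=(3,))
ctx, w = attention_context(dec, enc)
```

Because every encoder state participates in the sum, a far-away but relevant token can still dominate the context vector if its weight is high, which is exactly the “sharper information about far-away relevant tokens” the quote describes.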
GPT: Generative Pre-trained Transformer. Unlike BERT, it is generative: geared toward writing/generation tasks rather than comprehension/translation/summarization tasks. BERT was a response to GPT, and GPT-2 was in turn a response to BERT. GPT-2 was released in Feb 2019 and was trained on 40 GB of text.
This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal losslessly (my observation). Although implemented differently, it is a way to keep track of, and refer to, global state.