“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”
“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers”
The context vector is the output of the encoder in an encoder-decoder network (EDN). Because that single fixed-length vector has to carry the whole input, EDNs struggle to retain all the information the decoder needs to decode accurately. Attention is a mechanism to solve this problem.
“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”
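To make the "weighted sum of hidden states" idea concrete, here is a minimal sketch in plain Python. It assumes dot-product scoring between the current decoder state and each encoder state, which is only one possible relevance measure; real models learn theirs. The names `attention_context`, `encoder_states` and `decoder_state`, and the toy numbers, are made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(query, states):
    # Relevance of each encoder state to the query (assumed: plain dot product).
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    # Normalise the scores into weights that sum to 1.
    weights = softmax(scores)
    # Context vector = weighted sum of the encoder states.
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states))
               for i in range(dim)]
    return weights, context

# Toy example: three encoder hidden states, one decoder state (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)
```

The first and third states point the same way as the decoder state along the first axis, so they get the larger weights and dominate the context vector.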
GPT: Generative Pre-trained Transformer. Unlike BERT, it is generative: it is geared toward writing/generation tasks rather than comprehension/translation/summarization tasks. BERT is a response to GPT, and GPT-2 is in turn a response to BERT. GPT-2 was released in Feb 2019 and was trained on 40GB of text.
This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal in a lossless manner (just my observation). Although implemented differently, it’s a way to keep track of and refer to global state.
AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html
BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at comprehending meaning from the whole sentence/phrase, whereas GPT is better at generating text.
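One way to picture the bidirectional vs generative difference is through which positions each token is allowed to attend to. The sketch below is my own illustration, not code from either model: `attention_mask` is a hypothetical helper that returns a table where entry [i][j] says whether token i may look at token j. A BERT-style encoder sees the whole sentence, while a GPT-style decoder only sees the current and earlier tokens.

```python
def attention_mask(seq_len, causal):
    # mask[i][j] is True when position i may attend to position j.
    # causal=False: every position sees every other (bidirectional, BERT-style).
    # causal=True: position i sees only positions j <= i (left-to-right, GPT-style).
    return [[(not causal) or (j <= i) for j in range(seq_len)]
            for i in range(seq_len)]

bidirectional = attention_mask(3, causal=False)
left_to_right = attention_mask(3, causal=True)

for row in left_to_right:
    print(["x" if allowed else "." for allowed in row])
```

With the causal mask the first token can only attend to itself, which is what lets the model generate text one token at a time without peeking ahead.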
Bahdanau, 2014 https://arxiv.org/abs/1409.0473
“The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.”
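The scoring in that paper is additive: the alignment score between the previous decoder state and each encoder annotation is v_a^T tanh(W_a s + U_a h), followed by a softmax over source positions. Below is a rough sketch of just that step in plain Python. The parameter values and the two toy "sentences" are invented for illustration; in the real model W_a, U_a and v_a are learned.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    # e_j = v_a^T tanh(W_a s_prev + U_a h_j)
    projected = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return sum(v * math.tanh(p) for v, p in zip(v_a, projected))

def alignment_weights(s_prev, annotations, W_a, U_a, v_a):
    # One score per source position, normalised with a softmax.
    scores = [additive_score(s_prev, h, W_a, U_a, v_a) for h in annotations]
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Toy parameters (made up, not learned).
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
s_prev = [0.5, -0.5]

# The encoder emits one annotation vector per source word, however long the sentence.
short_sentence = [[1.0, 0.0], [0.0, 1.0]]
long_sentence = short_sentence + [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

alphas_short = alignment_weights(s_prev, short_sentence, W_a, U_a, v_a)
alphas_long = alignment_weights(s_prev, long_sentence, W_a, U_a, v_a)
print(len(alphas_short), len(alphas_long))
```

The number of weights grows with the sentence, so nothing has to be squashed into one fixed-length vector. The "subset" the quote mentions is chosen softly: every annotation gets a weight, and the low-weight ones contribute little.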