BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a pre-trained neural network that takes in sentences with some words masked out and predicts those masked words (their vocabulary ids, at least). It is a deep, bidirectional network and can be fine-tuned to a specific question-answering or other language task with a “small” amount of training data. As of this writing, it is part of the ensemble that is winning the SQuAD challenge, the question-answering dataset out of Stanford.
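To make the masking objective concrete, here is a toy sketch (not from the BERT release; the token lists and function name are made up for illustration). Real BERT masks 15% of the WordPiece tokens, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged, yet must still be predicted:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] holds the original
    token where a prediction is required, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_TOKEN)           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))    # 10%: random replacement
            else:
                masked.append(tok)                  # 10%: keep, still predicted
        else:
            labels.append(None)  # no loss computed at this position
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.3)
print(masked)  # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
```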
BERT is meant to be used for transfer learning: taking a network pre-trained on a general task, adding a small output layer on top, then training it on a specific task. (In BERT’s case, the pre-trained weights are typically not frozen; all of the parameters are fine-tuned end-to-end, as in the sketch below.) It is NOT a set of embeddings.
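A minimal fine-tuning sketch, using the Hugging Face `transformers` library as an assumption of convenience (the original BERT release shipped TensorFlow code instead). The checkpoint name and label are placeholders:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # new classification head on top

inputs = tokenizer("BERT transfers well.", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical binary label

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients flow through the whole network,
                         # not just the new head: end-to-end fine-tuning
```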
The model architecture is a multi-layer bidirectional Transformer encoder. We’ll find out what a Transformer encoder is in a moment; a reference implementation ships in the tensor2tensor library, and Harvard NLP’s Annotated Transformer walks through it line by line: http://nlp.seas.harvard.edu/2018/04/03/attention.html
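As a preview, here is what a stack of bidirectional encoder layers looks like using PyTorch’s built-in modules (an illustrative assumption, not BERT’s actual implementation; the sizes shown are BERT-base’s: 12 layers, hidden size 768, 12 attention heads):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

x = torch.randn(1, 16, 768)  # (batch, sequence length, hidden size)
out = encoder(x)             # self-attention lets every position attend to
print(out.shape)             # every other, in both directions
                             # -> torch.Size([1, 16, 768])
```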