Embedding is the process of transforming discrete symbols, such as words or subwords (see Tokenization), into dense vectors of real numbers. These vectors capture the semantic relationships between words and phrases as geometric relationships in a high-dimensional vector space. The resulting vector representations, known as embeddings, are crucial for various natural language processing (NLP) tasks, including machine translation, text summarization, and question answering.
Here’s how the embedding process works:
Tokenization: The first step in the embedding process is to tokenize the input text. This involves breaking the text down into individual tokens, such as words, punctuation marks, or subwords.
Vectorization: Once the text has been tokenized, each token is assigned a unique identifier (ID) from the model's vocabulary. This ID is then used to look up the token's vector representation in an embedding table, a matrix with one row (vector) per vocabulary entry.
Pre-training: The embedding table is trained jointly with a language model on a massive amount of text data. During this training process, the vectors are adjusted so that they capture the semantic relationships between words.
Embedding: Finally, the learned vectors are extracted from the trained language model and used to represent the tokens in the input text (a minimal code sketch of this pipeline follows the list).
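Here is one way to sketch these steps using the Hugging Face transformers library, taking the bert-base-uncased checkpoint as an example model (an assumption here; this requires transformers and PyTorch to be installed, and step 3, pre-training, was already done by the model's authors):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "It is sunny outside"

# Steps 1-2: tokenization and vectorization (token -> ID)
inputs = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# e.g. ['[CLS]', 'it', 'is', 'sunny', 'outside', '[SEP]']

# Step 4: look up each token ID in the pre-trained embedding table
embedding_table = model.get_input_embeddings()        # an nn.Embedding with one 768-dim row per vocabulary entry
token_vectors = embedding_table(inputs["input_ids"])  # shape: (1, number_of_tokens, 768)
print(token_vectors.shape)
```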
Here is an overly simplified example meant to illustrate the concept. In real-world models the vectors have hundreds or thousands of dimensions.
Tokenization:
The input text is divided into individual tokens, such as words or punctuation marks. For example, the sentence “It is sunny outside” might be tokenized as [‘It’, ‘is’, ‘sunny’, ‘outside’].
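For this toy example, a naive whitespace split is enough to reproduce the tokens (real models typically use subword tokenizers such as WordPiece or BPE):

```python
text = "It is sunny outside"
tokens = text.split()  # naive whitespace tokenization
print(tokens)          # ['It', 'is', 'sunny', 'outside']
```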
Vectorization:
Each token is assigned a unique identifier (ID). For example, the token “It” might be assigned the ID 1, the token “is” might be assigned the ID 2, and so on. Each token ID is then mapped to a vector of numbers. For example, the token ID 1 might be mapped to the vector [0.1, 0.2, 0.3], the token ID 2 might be mapped to the vector [0.4, 0.5, 0.6], and so on.
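Continuing the toy example, the two lookups can be sketched with plain Python dictionaries. The vectors for IDs 1 and 2 are the made-up values from the text; those for IDs 3 and 4 are equally made up:

```python
# Toy vocabulary: token -> ID
vocab = {"It": 1, "is": 2, "sunny": 3, "outside": 4}

# Toy embedding table: ID -> 3-dimensional vector (made-up values)
embedding_table = {
    1: [0.1, 0.2, 0.3],
    2: [0.4, 0.5, 0.6],
    3: [0.7, 0.8, 0.9],
    4: [1.0, 1.1, 1.2],
}

tokens = ["It", "is", "sunny", "outside"]
ids = [vocab[t] for t in tokens]             # [1, 2, 3, 4]
vectors = [embedding_table[i] for i in ids]  # one vector per token
print(vectors)
```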
Embedding:
The vectors are then trained to represent the meaning of the tokens. This is done by feeding the vectors into a neural network and adjusting the network's weights, including the vectors themselves, until the network performs well on its training objective, such as predicting the next token. The trained vectors are then used to represent the tokens in the input text. For example, after training, the sentence “It is sunny outside” might be represented by the vectors [0.12, 0.32, 0.13], [0.43, 6.05, 0.36], [1.7, 0.58, 0.29], [1.03, 1.14, 0.2].
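A minimal sketch of this training step, assuming PyTorch and a next-token-prediction objective; the one-sentence corpus, 0-based IDs, and 3-dimensional vectors are purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy corpus and vocabulary (real models train on billions of tokens)
vocab = {"It": 0, "is": 1, "sunny": 2, "outside": 3}
ids = torch.tensor([vocab[t] for t in ["It", "is", "sunny", "outside"]])

# A tiny language model: embedding table + linear layer predicting the next token
class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # the vectors being learned
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        return self.out(self.embed(token_ids))

model = TinyLM(vocab_size=len(vocab), dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Train to predict each next token from the current one
inputs, targets = ids[:-1], ids[1:]
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()   # gradients also flow into the embedding vectors
    optimizer.step()

# The learned vectors that now represent each token
print(model.embed.weight.data)  # shape: (4, 3)
```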
Examples:
Llama-2 (13B) employs a vocabulary of 32,000 tokens with an embedding vector size of 5,120.
BERT (base) employs a vocabulary of roughly 30,000 WordPiece tokens with 768 as vector size. (source: https://arxiv.org/pdf/1810.04805.pdf)
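Figures like these can be read directly from a model's configuration, for example with the Hugging Face transformers library and the bert-base-uncased checkpoint (an assumption here; the released vocabulary is 30,522 entries, slightly more than the rounded 30,000 above):

```python
from transformers import AutoConfig

# The configuration exposes both the vocabulary size and the embedding (hidden) dimension.
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.vocab_size)   # 30522
print(config.hidden_size)  # 768
```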
See Also: Tokenization, Embedding space, Embedding vs Encoding