Tokenization is the process of dividing text into smaller units, known as tokens, which can be words, subwords, or characters. Breaking complex sentences into these manageable units simplifies text analysis in NLP. Tokenization also separates punctuation marks so that they are interpreted correctly by machine learning models, making the text easier to process computationally.
Now you may be wondering why we don’t simply use each word as a token. English has a very large number of unique words: according to HuggingFace, a word-based tokenizer ends up with a vocabulary of 267,735 words, while a subword tokenizer can get by with around 50,000 tokens, because many words can be represented as combinations of a smaller set of subwords. A smaller vocabulary reduces the memory required to store it and can improve the performance of neural networks.
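As a quick sanity check on the subword figure, here is a minimal sketch, assuming the Hugging Face transformers package and the public "gpt2" checkpoint (GPT-2 uses a byte-level BPE subword vocabulary of roughly 50k tokens):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)                # 50257

# A rare word does not need its own vocabulary entry; it is built
# from existing subword pieces instead.
print(tok.tokenize("tokenization"))  # e.g. ['token', 'ization']
```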
There are three main types of tokenization: word-based, character-based, and subword tokenization.
Word-based tokenization is the simplest method: the text is split into words based on spaces or punctuation. However, this approach is problematic for languages that do not use spaces to separate words, and for words that are not found in the tokenizer’s vocabulary.
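For intuition, a naive word-based tokenizer can be written in a couple of lines (this is an illustrative sketch, not how production tokenizers are implemented):

```python
import re

# Naive word-based tokenization: split on whitespace and peel
# punctuation off into separate tokens.
def word_tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("A Titan RTX has 24GB of VRAM."))
# ['A', 'Titan', 'RTX', 'has', '24GB', 'of', 'VRAM', '.']
```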
Character-based tokenization is more robust: because the vocabulary is just the character set, almost any word can be represented, including words the tokenizer has never seen. However, it produces much longer sequences, and individual characters carry little meaning on their own, which makes it less efficient.
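The same sentence under character-based tokenization (a sketch; a real character tokenizer would also map each character to an ID):

```python
# Character-based tokenization: the vocabulary is tiny (just the
# character set), but the sequence gets much longer.
text = "A Titan RTX has 24GB of VRAM"
tokens = list(text)
print(len(tokens))   # 28 tokens for a 7-word sentence
print(tokens[:8])    # ['A', ' ', 'T', 'i', 't', 'a', 'n', ' ']
```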
Subword tokenization is a compromise between word-based and character-based tokenization: frequent words are kept whole, while rare or unseen words are split into smaller subwords from the vocabulary. This keeps sequences shorter than character-based tokenization while still handling words that are not found in the tokenizer’s vocabulary.
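To see the compromise in action, here is a sketch using a WordPiece subword tokenizer, assuming the Hugging Face transformers package and the public "bert-base-cased" checkpoint (the same tokenizer used in the worked example further below):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")

# Words missing from the vocabulary are split into known subword pieces
# (the "##" prefix marks a piece that continues the previous one),
# rather than being collapsed into a single [UNK] token.
print(tok.tokenize("VRAM"))   # ['V', '##RA', '##M']
print(tok.tokenize("24GB"))   # ['24', '##GB']
```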
Here are some of the commonly used tokenization algorithms:
Byte-Pair Encoding (BPE): BPE pre-tokenizes the training data into words and builds a base vocabulary from all the symbols (characters) in the data. It then learns merge rules that combine the most frequent symbol pairs into new symbols until the desired vocabulary size is reached (a toy sketch of this merge loop appears after this list).
WordPiece: Similar to BPE, WordPiece starts with a base vocabulary and learns merge rules. However, it selects symbol pairs that maximize the likelihood of the training data, rather than just frequency.
Unigram: This method starts with a large base vocabulary and progressively trims it down. The Unigram model calculates the impact on overall loss for each symbol removal, prioritizing the removal of symbols that least affect the overall loss.
SentencePiece: SentencePiece treats the input as a raw stream, including spaces as characters. It can use BPE or Unigram algorithms for vocabulary construction, making it suitable for languages that don’t use spaces to separate words.
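To make the BPE item above concrete, here is a toy sketch of the merge-learning loop. The word frequencies and number of merges are made up for illustration; real implementations add details such as end-of-word markers, byte-level handling, and pre-tokenization.

```python
from collections import Counter

def learn_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    # Start from a character-level representation of each word.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the new merge rule everywhere in the corpus.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = new_corpus.get(tuple(merged), 0) + count
        corpus = new_corpus
    return merges

# Hypothetical word frequencies from a tiny training corpus.
print(learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2))
# [('u', 'g'), ('h', 'ug')]
```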
GPT uses Byte-Pair Encoding (BPE), LLaMA uses BPE with SentencePiece, and PaLM employs a 256k token SentencePiece vocabulary.
Example:
Source: https://huggingface.co/transformers/v4.4.2/glossary.html#input-ids
Let us consider the language model BERT, which uses a WordPiece tokenizer.
Sequence = “A Titan RTX has 24GB of VRAM”
Tokenized Sequence:
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
Encoded Sequence:
[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
Note that the tokenizer automatically adds “special tokens” (if the associated model relies on them), which are special IDs the model uses; for BERT these are [CLS] and [SEP].
Decoded Sequence:
[CLS] A Titan RTX has 24GB of VRAM [SEP]
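The example above can be reproduced with the Hugging Face transformers library, assuming the "bert-base-cased" checkpoint used in the linked source:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"

# WordPiece tokenization of the raw text.
print(tokenizer.tokenize(sequence))
# ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']

# Token IDs, with the [CLS] (101) and [SEP] (102) special tokens added.
input_ids = tokenizer(sequence)["input_ids"]
print(input_ids)
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

# Decoding maps the IDs back to text, including the special tokens.
print(tokenizer.decode(input_ids))
# [CLS] A Titan RTX has 24GB of VRAM [SEP]
```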
See Also: Embedding, Embedding space, Embedding vs Encoding