[Notes] Text Tokenization in LLM #9

heathersherry commented Mar 5, 2024

We can divide text tokenization into three stages:

  • Step 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note: the earlier BPE was already byte-level). A minimal sketch of the byte-level BPE loop follows below.
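A minimal sketch of the byte-level BPE training loop (my own illustration of the general algorithm, not any particular library's implementation; the toy string and merge count are made up):

```python
# Minimal byte-level BPE sketch: start from raw UTF-8 bytes and repeatedly
# merge the most frequent adjacent pair of ids into a new id.
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent id pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))  # ids 0..255 are the raw byte values
    merges = {}                       # (id, id) -> new id, in merge order
    for k in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + k              # new token ids start after the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

text = "low lower lowest low low"    # toy corpus
merges, ids = train_bpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens after merges")
```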

  • Step 2: more advanced subword-based algorithms (mainly the new BPE implementation that Karpathy, after leaving OpenAI, proposed in February 2024: https://github.com/karpathy/minbpe . Having skimmed it, minbpe reads more like a tutorial, offering minimal and clean code). Related notes:

  • Let's build the GPT Tokenizer (Video by Karpathy): https://www.youtube.com/watch?v=zduSFxRajkE
  • A larger vocabulary means fewer tokens for the same text, i.e., a larger compression ratio (note: the vocabulary size is a hyperparameter, roughly 100k in GPT-4's cl100k_base)
  • Forced splits using a regex (as in GPT-2, to avoid vocabulary entries such as dog., dog?, dog!) - see the regex sketch after this list
  • tiktoken, officially provided by OpenAI; you can choose the gpt2 encoding (does not merge spaces) or cl100k_base (GPT-4, merges spaces) - see the tiktoken sketch after this list
  • There are special tokens; for example, <|endoftext|> in GPT-2. If you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT...
  • minbpe: you can follow exercise.md for the 5 steps required for building a tokenizer.
  • SentencePiece: it can both train and run inference for BPE tokenizers (used in the Llama and Mistral series). Note: it has a lot of training arguments - see the SentencePiece sketch after this list
  • Important trainer settings: byte_fallback=True (characters not in the vocabulary fall back to raw bytes instead of <unk>) and add_dummy_prefix=True (a dummy leading space is prepended before tokenization)
  • About vocab_size: a very important hyperparameter
  • Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023) - introduces new tokens that compress prompts into a smaller set of "gist" tokens, keeping model performance the same as with the original long prompts.
  • Other modalities also benefit from this kind of tokenization (so that you can use the same Transformer architecture on other modalities)! E.g., (1) VQGAN - both hard tokens (integers) and soft tokens (which do not have to be discrete, but have bottlenecks as in autoencoders)
  • (Sora) you can either process discrete tokens with autoregressive models or soft tokens with diffusion models.
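Sketch for the "forced splits" bullet above: GPT-2's pre-tokenization pattern (as reproduced in minbpe) chunks the text before BPE so that merges never cross chunk boundaries. Note it needs the third-party regex package rather than the stdlib re, because of the \p{...} Unicode classes.

```python
# Split text into chunks with the GPT-2 pattern before running BPE,
# so tokens like "dog.", "dog?", "dog!" can never be learned.
import regex  # pip install regex (stdlib `re` does not support \p{...})

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT2_SPLIT_PATTERN, "The dog. The dog? The dog!")
print(chunks)
# ['The', ' dog', '.', ' The', ' dog', '?', ' The', ' dog', '!']
# BPE then runs inside each chunk separately, so "dog" and "." never merge.
```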
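Sketch for the tiktoken bullet above, comparing the gpt2 and cl100k_base encodings and showing that special tokens have to be explicitly allowed when encoding (the example text is arbitrary):

```python
# pip install tiktoken
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")            # GPT-2 BPE
cl100k = tiktoken.get_encoding("cl100k_base")   # GPT-4 BPE

text = "    hello world"
print(len(gpt2.encode(text)), gpt2.encode(text))      # gpt2 does not merge runs of spaces
print(len(cl100k.encode(text)), cl100k.encode(text))  # cl100k_base merges spaces -> fewer tokens

# Special tokens raise an error unless explicitly allowed:
print(cl100k.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```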
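Sketch for the SentencePiece bullets above: training a tiny BPE model with byte_fallback and add_dummy_prefix enabled (corpus.txt and the small vocab_size are hypothetical placeholders; real runs use much larger corpora and vocabularies):

```python
# pip install sentencepiece
import sentencepiece as spm

# Train: writes toy_bpe.model and toy_bpe.vocab to the working directory.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical text file, one sentence per line
    model_prefix="toy_bpe",
    model_type="bpe",
    vocab_size=400,            # vocab_size is the key hyperparameter
    byte_fallback=True,        # out-of-vocabulary characters fall back to raw bytes
    add_dummy_prefix=True,     # prepend a dummy space before tokenizing
)

# Inference with the trained model.
sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
print(sp.encode("hello world", out_type=str))  # subword pieces ('▁' marks the dummy prefix space)
```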
  • Step 3: an even more advanced take on text tokenization, represented by MegaByte, https://github.com/lucidrains/MEGABYTE-pytorch (a new tokenization approach mainly from Meta; in their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling). A conceptual byte-patching sketch follows below.
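A conceptual sketch of the byte-patching idea behind MEGABYTE (my own illustration, not the MEGABYTE-pytorch API): the byte stream is grouped into fixed-size patches, a global model attends over the patches, and a small local model predicts the bytes inside each patch, which is what makes byte-level long-context modeling tractable.

```python
# Group a raw byte stream into fixed-size patches (the patch size here is made up).
PATCH_SIZE = 8
PAD = 0  # padding byte

text = "MEGABYTE operates directly on raw bytes."
ids = list(text.encode("utf-8"))

# Pad to a multiple of the patch size, then slice into patches.
ids += [PAD] * ((-len(ids)) % PATCH_SIZE)
patches = [ids[i:i + PATCH_SIZE] for i in range(0, len(ids), PATCH_SIZE)]

print(len(ids), "bytes ->", len(patches), "patches of", PATCH_SIZE, "bytes each")
# The global transformer sees len(patches) positions instead of len(ids) positions.
```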