larger vocabulary, fewer tokens per text, larger compression ratio (note: the vocabulary size is a hyperparameter, roughly 100,000 in GPT-4)
Forced splits using regex patterns (as in GPT-2, to avoid vocabulary entries such as `dog.`, `dog?`, `dog!`)
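A minimal sketch of these forced splits, using the split pattern from OpenAI's released GPT-2 encoder (it needs the `regex` package, since `\p{...}` classes are not supported by the stdlib `re` module):

```python
# GPT-2's split pattern: text is first chopped into chunks, and BPE merges
# then run only *within* each chunk, so 'dog' never merges with the
# following punctuation into a token like 'dog.'.
import regex

GPT2_SPLIT_PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_SPLIT_PAT.findall("dog. dog? dog!"))
# -> ['dog', '.', ' dog', '?', ' dog', '!']
```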
tiktoken, officially provided by OpenAI; you can choose gpt2 (does not merge spaces) or cl100k_base (GPT-4, merges spaces)
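A minimal sketch of tiktoken usage (`pip install tiktoken`; the example string is illustrative):

```python
import tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 encoding
enc_gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

text = "hello    world"
print(len(enc_gpt2.encode(text)))  # typically more tokens: GPT-2 keeps spaces separate
print(len(enc_gpt4.encode(text)))  # fewer tokens: cl100k_base merges the run of spaces
```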
There are special tokens. For example, GPT-2 defines only <|endoftext|>; if you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT...
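A hedged sketch of registering extra special tokens on top of cl100k_base, following the pattern shown in tiktoken's README; the token names and ids below (100264/100265) are illustrative:

```python
import tiktoken

base = tiktoken.get_encoding("cl100k_base")
enc = tiktoken.Encoding(
    name="cl100k_base_custom",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|im_start|>": 100264,  # illustrative new special tokens
        "<|im_end|>": 100265,
    },
)
print(enc.encode("<|im_start|>hi", allowed_special="all"))
```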
minbpe: you can follow its exercise.md for the five steps required to build a tokenizer.
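A condensed sketch of the core BPE training loop in the spirit of minbpe (illustrative code, not minbpe's exact API):

```python
def get_stats(ids):
    """Count occurrences of each adjacent pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with new token id `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
merges = {}
num_merges = 3
for idx in range(256, 256 + num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # most frequent adjacent pair
    ids = merge(ids, top_pair, idx)       # mint a new token for it
    merges[top_pair] = idx
print(ids, merges)
```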
SentencePiece: it can both train and run inference for BPE tokenizers (used in the Llama and Mistral series) - Note: it has a lot of training arguments
- e.g., Llama 2 trains with byte_fallback=True and add_dummy_prefix=True (see the sketch below)
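A hedged sketch of training a Llama-style BPE tokenizer with SentencePiece; the file names are placeholders, and these are only a few of its many training arguments:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path to raw training text
    model_prefix="tok",      # writes tok.model / tok.vocab
    model_type="bpe",
    vocab_size=32000,        # Llama 2 uses a 32k vocabulary
    byte_fallback=True,      # unknown characters fall back to raw bytes
    add_dummy_prefix=True,   # prepend a space so word-initial pieces match
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("hello world", out_type=str))
```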
About vocab_size: a very important hyperparameter - a larger vocabulary shortens sequences but leaves rare tokens undertrained, and it sets the size of both the embedding table and the final softmax (see the sketch below)
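A minimal PyTorch sketch of where vocab_size appears in a GPT-style model; the class and dimension names are illustrative:

```python
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, n_embd: int = 768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)    # token embedding table
        # ... transformer blocks would go here ...
        self.lm_head = nn.Linear(n_embd, vocab_size)   # logits over the vocab
```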
Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023) - introduces new tokens that compress prompts into a smaller set of "gist" tokens, keeping model performance the same as with the full long prompt.
Other modalities also benefit from this tokenization (so that you can use the same Transformer architecture on other modalities)! e.g., (1) VQGAN - both hard tokens (integers) and soft tokens (do not have to be discrete, but pass through bottlenecks as in autoencoders)
(Sora) you can either process discrete tokens with autoregressive models or soft tokens with diffusion models.
We can divide text tokenization into three stages:

Step 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note that classic BPE already operated at the byte level).

Step 2: a refined subword-based approach (mainly the new BPE tooling proposed this February by karpathy, who left OpenAI: https://github.com/karpathy/minbpe . From a quick look, minbpe reads more like a tutorial, offering minimal and clean code).

Step 3: a more advanced form of text tokenization, represented by MEGABYTE, https://github.com/lucidrains/MEGABYTE-pytorch (a new tokenization method mainly from Meta. In their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling).