[Notes] Text Tokenization in LLM #9

heathersherry commented Mar 5, 2024

We can divide text tokenization into three stages:

  • Step 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note: the earlier BPE was already byte-level). A minimal sketch of the byte-level BPE loop follows below.
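A minimal sketch of the byte-level BPE training loop (my own illustration of the general algorithm, not any particular library's implementation; the toy string and merge count are made up):

```python
# Minimal byte-level BPE sketch: start from raw UTF-8 bytes and repeatedly
# merge the most frequent adjacent pair of ids into a new id.
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent id pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))  # ids 0..255 are the raw byte values
    merges = {}                       # (id, id) -> new id, in merge order
    for k in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + k              # new token ids start after the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

text = "low lower lowest low low"    # toy corpus
merges, ids = train_bpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens after merges")
```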

  • Step 2: more advanced subword-based algorithms (mainly the new BPE implementation that Karpathy, after leaving OpenAI, proposed in February 2024: https://github.com/karpathy/minbpe . Having skimmed it, minbpe reads more like a tutorial, offering minimal and clean code). Related notes:

  • Let's build the GPT Tokenizer (Video by Karpathy): https://www.youtube.com/watch?v=zduSFxRajkE
  • A larger vocabulary means fewer tokens for the same text, i.e., a larger compression ratio (note: the vocabulary size is a hyperparameter, roughly 100k in GPT-4's cl100k_base)
  • Forced splits using a regex (as in GPT-2, to avoid vocabulary entries such as dog., dog?, dog!) - see the regex sketch after this list
  • tiktoken, officially provided by OpenAI; you can choose the gpt2 encoding (does not merge spaces) or cl100k_base (GPT-4, merges spaces) - see the tiktoken sketch after this list
  • There are special tokens; for example, <|endoftext|> in GPT-2. If you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT...
  • minbpe: you can follow exercise.md for the 5 steps required for building a tokenizer.
  • SentencePiece: it can both train and run inference for BPE tokenizers (used in the Llama and Mistral series). Note: it has a lot of training arguments - see the SentencePiece sketch after this list
  • Important trainer settings: byte_fallback=True (characters not in the vocabulary fall back to raw bytes instead of <unk>) and add_dummy_prefix=True (a dummy leading space is prepended before tokenization)
  • About vocab_size: a very important hyperparameter
  • Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023) - introduces new tokens that compress prompts into a smaller set of "gist" tokens, keeping model performance the same as with the original long prompts.
  • Other modalities also benefit from this kind of tokenization (so that you can use the same Transformer architecture on other modalities)! E.g., (1) VQGAN - both hard tokens (integers) and soft tokens (which do not have to be discrete, but have bottlenecks as in autoencoders)
  • (Sora) you can either process discrete tokens with autoregressive models or soft tokens with diffusion models.
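Sketch for the "forced splits" bullet above: GPT-2's pre-tokenization pattern (as reproduced in minbpe) chunks the text before BPE so that merges never cross chunk boundaries. Note it needs the third-party regex package rather than the stdlib re, because of the \p{...} Unicode classes.

```python
# Split text into chunks with the GPT-2 pattern before running BPE,
# so tokens like "dog.", "dog?", "dog!" can never be learned.
import regex  # pip install regex (stdlib `re` does not support \p{...})

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT2_SPLIT_PATTERN, "The dog. The dog? The dog!")
print(chunks)
# ['The', ' dog', '.', ' The', ' dog', '?', ' The', ' dog', '!']
# BPE then runs inside each chunk separately, so "dog" and "." never merge.
```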
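Sketch for the tiktoken bullet above, comparing the gpt2 and cl100k_base encodings and showing that special tokens have to be explicitly allowed when encoding (the example text is arbitrary):

```python
# pip install tiktoken
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")            # GPT-2 BPE
cl100k = tiktoken.get_encoding("cl100k_base")   # GPT-4 BPE

text = "    hello world"
print(len(gpt2.encode(text)), gpt2.encode(text))      # gpt2 does not merge runs of spaces
print(len(cl100k.encode(text)), cl100k.encode(text))  # cl100k_base merges spaces -> fewer tokens

# Special tokens raise an error unless explicitly allowed:
print(cl100k.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```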
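Sketch for the SentencePiece bullets above: training a tiny BPE model with byte_fallback and add_dummy_prefix enabled (corpus.txt and the small vocab_size are hypothetical placeholders; real runs use much larger corpora and vocabularies):

```python
# pip install sentencepiece
import sentencepiece as spm

# Train: writes toy_bpe.model and toy_bpe.vocab to the working directory.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical text file, one sentence per line
    model_prefix="toy_bpe",
    model_type="bpe",
    vocab_size=400,            # vocab_size is the key hyperparameter
    byte_fallback=True,        # out-of-vocabulary characters fall back to raw bytes
    add_dummy_prefix=True,     # prepend a dummy space before tokenizing
)

# Inference with the trained model.
sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
print(sp.encode("hello world", out_type=str))  # subword pieces ('▁' marks the dummy prefix space)
```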
  • Step 3: an even more advanced take on text tokenization, represented by MegaByte, https://github.com/lucidrains/MEGABYTE-pytorch (a new tokenization approach mainly from Meta; in their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling). A conceptual byte-patching sketch follows below.
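A conceptual sketch of the byte-patching idea behind MEGABYTE (my own illustration, not the MEGABYTE-pytorch API): the byte stream is grouped into fixed-size patches, a global model attends over the patches, and a small local model predicts the bytes inside each patch, which is what makes byte-level long-context modeling tractable.

```python
# Group a raw byte stream into fixed-size patches (the patch size here is made up).
PATCH_SIZE = 8
PAD = 0  # padding byte

text = "MEGABYTE operates directly on raw bytes."
ids = list(text.encode("utf-8"))

# Pad to a multiple of the patch size, then slice into patches.
ids += [PAD] * ((-len(ids)) % PATCH_SIZE)
patches = [ids[i:i + PATCH_SIZE] for i in range(0, len(ids), PATCH_SIZE)]

print(len(ids), "bytes ->", len(patches), "patches of", PATCH_SIZE, "bytes each")
# The global transformer sees len(patches) positions instead of len(ids) positions.
```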