Reduce vocab size for BPE tokenizer #1668

Open
fzyzcjy opened this issue Oct 29, 2024 · 8 comments
Comments

fzyzcjy (Contributor) commented Oct 29, 2024

Hi, thanks for the library! I am using, for example, Llama 3.1's tokenizer, but its 128k vocab size is too large for my field. Thus, to make training faster, I would like to reduce the tokenizer's vocab size by removing tokens I will never use (e.g. words outside my field). However, it seems tokenizers does not provide a convenient method for this.

ArthurZucker (Collaborator) commented

Hey! I'll add the feature request, as indeed we don't provide this out of the box.
You also need to re-map the ids of the model embeddings, so it's a bit more involved.

If you directly modify the tokenizer.json, this can be achieved fairly easily though!
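
Roughly something along these lines (a sketch, not an official API), assuming the usual serialized BPE layout where model.vocab maps tokens to ids and model.merges lists the merge rules; keep_tokens is a placeholder:

```python
import json

# Hypothetical keep-set; in practice this comes from whatever selection you do.
keep_tokens = {"h", "e", "l", "o", "he", "ll", "llo", "hello"}

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]     # token -> id
merges = tok["model"]["merges"]   # merge rules

# Re-assign contiguous ids to surviving tokens, preserving their original order.
kept = sorted((t for t in vocab if t in keep_tokens), key=lambda t: vocab[t])
new_vocab = {t: i for i, t in enumerate(kept)}

# Keep only merges whose two parts and whose result all survive the pruning.
def keep_merge(m):
    left, right = m.split(" ", 1) if isinstance(m, str) else m  # old vs. new serialization
    return left in new_vocab and right in new_vocab and left + right in new_vocab

tok["model"]["vocab"] = new_vocab
tok["model"]["merges"] = [m for m in merges if keep_merge(m)]
# Caveat: added_tokens / special tokens elsewhere in the file still carry the old
# ids and need the same old-id -> new-id remapping, as do the model embeddings.

with open("tokenizer_pruned.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)
```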

fzyzcjy (Contributor, Author) commented Oct 29, 2024

@ArthurZucker Thank you! Could you please provide a bit more detail? I was thinking about modifying tokenizer.json, but I got worried about the following:

For example, suppose I am only interested in the token `hello`, and suppose the merges are `h e`, `l l`, `ll o`, `he llo` (or something similar). If I throw away tokens like `h`, `e`, `ll`, ..., or throw away those merges, then I am worried I will never be able to produce the `hello` token.

My naive thought is to keep all "parent" tokens (`h`, `e`, `ll`, ...) and their merges, but that keeps the vocab quite large. Is there a better way?
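
Just to make the "keep all parent tokens" idea concrete, here is a tiny illustrative helper (not part of tokenizers, names are made up) that walks the merges backwards and collects every intermediate token and merge a target token depends on:

```python
# Illustrative only: which tokens and merges must survive so that `target`
# can still be assembled by BPE merges?
def required_closure(target, merges):
    produced_by = {left + right: (left, right) for left, right in merges}
    needed_tokens, needed_merges = set(), []
    stack = [target]
    while stack:
        tok = stack.pop()
        if tok in needed_tokens:
            continue
        needed_tokens.add(tok)
        if tok in produced_by:  # composite token: keep its merge, recurse into the parts
            left, right = produced_by[tok]
            needed_merges.append((left, right))
            stack.extend([left, right])
    return needed_tokens, needed_merges

merges = [("h", "e"), ("l", "l"), ("ll", "o"), ("he", "llo")]
tokens, kept_merges = required_closure("hello", merges)
print(sorted(tokens))    # every "parent" down to the single characters is kept
print(kept_merges)
```

Run over many target tokens, this closure pulls in lots of short tokens, which is exactly the blow-up I am worried about.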

ArthurZucker (Collaborator) commented

Well, you could switch to a non-BPE tokenizer, for example.
One way to achieve that is to use added_tokens: you can remove `h`, `e`, `ll`, etc. and add `hello` as an added token.
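
A minimal sketch with the tokenizers Python bindings (the single_word flag is just one possible choice):

```python
from tokenizers import Tokenizer, AddedToken

# Added tokens are matched directly on the raw text and bypass the BPE merges,
# so `hello` survives even if its intermediate tokens / merges are removed.
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.add_tokens([AddedToken("hello", single_word=True)])

print(tokenizer.encode("hello world").tokens)
```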

fzyzcjy (Contributor, Author) commented Oct 30, 2024

@ArthurZucker Thank you! However, I am afraid the tokenization of the same sentence may then be quite different. I am using a pretrained Llama (or something like it) and doing some SFT, so I would rather not make the tokenized results so wildly different that the model gets confused.

ArthurZucker (Collaborator) commented

Yeah, completely get it.
It's kind of an open problem: how to effectively compress a tokenizer!
The main issues are that:

  1. Not all the merges are part of the vocab.
  2. All tokens should be reachable through merges.
  3. You don't necessarily need all of the vocab's merges for your language.

Here is what I would do (see the sketch below):

  1. Train a new tokenizer on your language, with the vocab-size limit you want.
  2. From the newly created vocab and merges, you know which tokens are needed for your language.
  3. Remap these tokens / embeddings.
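
A rough sketch of steps 1-2 with the tokenizers training API; corpus_iterator and the vocab size are placeholders, and the byte-level setup is an assumption meant to mirror Llama-style tokenizers:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Step 1: train a small byte-level BPE tokenizer on the in-domain corpus.
new_tok = Tokenizer(models.BPE())
new_tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
new_tok.train_from_iterator(corpus_iterator(), trainer=trainer)  # corpus_iterator: your data

# Step 2: intersect the new vocab with the original one to decide which original
# token ids to keep; this list then drives the embedding remap in step 3.
old_tok = Tokenizer.from_file("original_tokenizer.json")
old_vocab = old_tok.get_vocab()   # token -> original id
kept_old_ids = sorted(old_vocab[t] for t in new_tok.get_vocab() if t in old_vocab)
```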

fzyzcjy (Contributor, Author) commented Oct 30, 2024

Thanks!

> remap these tokens / embeddings

I would appreciate it if I could learn a bit more. I am currently thinking of reducing the tokenizer by, e.g., picking 10,000 tokens out of the original 128,000-token vocab. Then we can pick the corresponding rows of the embedding / lm_head. It seems you are suggesting something even more complex: choosing tokens that may not even appear in the original vocab. In that case, I wonder how we should make use of the original embedding / lm_head.
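
To make concrete what I mean by picking the corresponding rows, here is a rough PyTorch / transformers sketch; the model name is only an example, kept_old_ids is the hypothetical list of original token ids to keep (its order defines the new ids 0..N-1), and it assumes untied input/output embeddings:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example only
idx = torch.tensor(kept_old_ids)  # original ids to keep; new id = position in this list

with torch.no_grad():
    in_rows = model.get_input_embeddings().weight[idx].clone()    # (N, hidden)
    out_rows = model.get_output_embeddings().weight[idx].clone()  # lm_head rows

    # Shrink both matrices to the new vocab size, then write the kept rows back.
    model.resize_token_embeddings(len(kept_old_ids))
    model.get_input_embeddings().weight.copy_(in_rows)
    model.get_output_embeddings().weight.copy_(out_rows)
```

The reduced tokenizer's ids would of course have to be assigned in the same order as kept_old_ids, so that token i in the new vocab lines up with row i here.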

ArthurZucker (Collaborator) commented

What I am suggesting is the same as what you describe, but with a concrete way of selecting those 10,000 tokens (training a new tokenizer on a relevant corpus), which should yield roughly the vocab you need, or at least be a good start.

fzyzcjy (Contributor, Author) commented Oct 30, 2024

Thank you!
