Reduce vocab size for BPE tokenizer #1668
Hey! I'll add the feature request, as indeed we don't provide this out of the box. If you directly modify the …
@ArthurZucker Thank you! Could you please provide a bit more detail? I was thinking about modifying tokenizer.json, but got worried about the following: for example, suppose I am only interested in token … My naive thought is to keep all "parent" tokens ( …
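For context, here is a minimal sketch (not from the thread; the file path and the version handling are assumptions) of what the "parent token" worry looks like in practice: in a BPE `tokenizer.json`, every merge rule builds a longer token out of two existing ones, so deleting an intermediate token silently breaks every merge that depends on it.

```python
import json

# Minimal sketch: inspect a BPE tokenizer.json to see why "parent" tokens matter.
# The file path is a placeholder; the merge format differs across tokenizers versions.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]    # token -> id
merges = data["model"]["merges"]  # merge rules, either "left right" strings or [left, right] pairs

def as_pairs(merges):
    return [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]

# Each merge consumes two existing tokens and produces their concatenation, so a
# long token is only reachable if all of its intermediate ("parent") tokens and
# merges are kept as well.
for left, right in as_pairs(merges)[:10]:
    produced = left + right
    print(f"{left!r} + {right!r} -> {produced!r} (id {vocab.get(produced)})")
```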
Well, you can switch to using a non-BPE tokenizer, for example.
@ArthurZucker Thank you! However, I am afraid the tokenization of the same sentence may then be quite different. I am using a pretrained Llama (or something like that) and doing some SFT, so I hope not to make the tokenized results so wildly different that the model gets confused.
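One way to check this concern empirically (a sketch under assumed file paths, not something from the thread) is to tokenize a sample of your own domain text with both tokenizers and compare the outputs:

```python
from tokenizers import Tokenizer

# Placeholder paths: the original tokenizer and a hypothetical reduced one.
old_tok = Tokenizer.from_file("original/tokenizer.json")
new_tok = Tokenizer.from_file("reduced/tokenizer.json")

# Replace with sentences sampled from your own domain corpus.
samples = ["An example sentence from the target domain."]

identical = 0
length_ratios = []
for text in samples:
    a = old_tok.encode(text).tokens
    b = new_tok.encode(text).tokens
    identical += int(a == b)
    length_ratios.append(len(b) / len(a))

print(f"identical tokenizations: {identical}/{len(samples)}")
print(f"average sequence-length ratio (new/old): {sum(length_ratios) / len(length_ratios):.2f}")
```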
Yeah completely get it.
Here is what I would do:
Train a new tokenizer on a corpus that is representative of your domain, and use its vocabulary to decide which tokens to keep from the original one.
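A rough sketch of that suggestion using the library's standard training API (the corpus path, vocab size, and special tokens below are placeholders, not values from the thread):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Sketch: train a fresh byte-level BPE tokenizer on a domain corpus.
# File names, vocab size, and special tokens are placeholders.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=10_000,  # the reduced size discussed above
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # keep full byte coverage
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # adjust to your model
)
tokenizer.train(files=["my_domain_corpus.txt"], trainer=trainer)
tokenizer.save("reduced/tokenizer.json")
```

The resulting vocabulary can then be intersected with the original one to decide which tokens (and which embedding rows) to keep.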
Thanks!
I would appreciate it if I could learn a bit more. I am currently thinking about shrinking the tokenizer, e.g. picking 10,000 tokens from the original 128,000-token vocabulary. Then we can pick the corresponding rows of the embedding/lm_head. It seems you are suggesting something even more involved: choosing tokens that may not even appear in the original vocab. In that case, I wonder how we should reuse the original embedding/lm_head.
What I am suggesting is the same as what you describe, but with a way of selecting those 10,000 tokens (training a new tokenizer on a relevant corpus), which should yield roughly the vocab you need, or should at least be a good start.
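For the embedding/lm_head part, here is one possible sketch (assuming a PyTorch transformers model; the model name, file paths, and the keep-only-shared-tokens policy are all assumptions): map each kept token back to its id in the original vocab and slice those rows out of the input embeddings and the output projection.

```python
import torch
from tokenizers import Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder names throughout; this only illustrates the slicing step.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
new_tok = Tokenizer.from_file("reduced/tokenizer.json")

# Keep tokens the reduced tokenizer shares with the original vocab, plus the
# original special tokens.
old_vocab = old_tok.get_vocab()
kept = [t for t in new_tok.get_vocab() if t in old_vocab]
kept += [t for t in old_tok.all_special_tokens if t not in kept]
kept_ids = torch.tensor([old_vocab[t] for t in kept])

# Slice the corresponding rows of the embedding and lm_head weights.
emb = model.get_input_embeddings().weight.data[kept_ids].clone()
head = model.get_output_embeddings().weight.data[kept_ids].clone()

model.resize_token_embeddings(len(kept_ids))
model.get_input_embeddings().weight.data.copy_(emb)
model.get_output_embeddings().weight.data.copy_(head)

# Note: a complete solution also has to make the reduced tokenizer's ids line up
# with these rows, e.g. by rebuilding its vocab in the order of `kept`.
```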
Thank you!
Hi, thanks for the library! I am using e.g. Llama 3.1's tokenizer, but its 128k vocab size is too large for my field. Thus, to make training faster, I would like to reduce the tokenizer's vocab size by removing tokens that I will never use (e.g. words outside of my field). However, it seems tokenizers does not provide a convenient method for this.