Mixtral Instruct tokenizer from Colab notebook doesn't work. #38

jmuntaner-smd opened this issue Jul 8, 2024 · 2 comments

jmuntaner-smd commented Jul 8, 2024

When running the Google Colab notebook, an error is raised while loading the Mixtral Instruct tokenizer:

```
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    109         elif fast_tokenizer_file is not None and not from_slow:
    110             # We have a serialization from tokenizers which let us directly build the backend
--> 111             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112         elif slow_tokenizer is not None:
    113             # We need to convert a slow tokenizer to build the backend

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
```

This appears to be a bug related to the transformers and tokenizers versions (see: huggingface/transformers#31789), so requirements.txt probably needs to be updated, but I haven't been able to fix it properly. I changed the tokenizer to the base Mixtral model, but that's not a proper solution.
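
If the root cause is the same as in the linked transformers issue, the fix is probably just bumping the pinned tokenizer stack. A minimal sketch of what that might look like; the version numbers below are my assumption, not values taken from this repo's requirements.txt:

```python
# Assumed fix: upgrade the tokenizer stack so it can parse the re-serialized
# tokenizer.json. Version pins are illustrative, not from requirements.txt:
#   pip install --upgrade "transformers>=4.42" "tokenizers>=0.19"

from transformers import AutoTokenizer

# With a new enough `tokenizers`, the Instruct tokenizer should load again.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
```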


kaushikacharya commented Jul 11, 2024

> I changed the tokenizer to the base Mixtral model, but it's not the proper solution.

What is the tokenizer version that you are using? I am also facing a similar issue.

The issue seems to be due to recent commits in
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main
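
If those commits are indeed the cause, one possible workaround is to pin the tokenizer to an earlier revision of the repo. A sketch only; the revision string below is a placeholder, not an actual commit hash:

```python
from transformers import AutoTokenizer

# Possible workaround: load the tokenizer from a repo revision that predates the
# recent commits. "COMMIT_SHA_BEFORE_CHANGE" is a placeholder; pick a real hash
# from https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="COMMIT_SHA_BEFORE_CHANGE",
)
```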

jmuntaner-smd (Author) commented

I just changed the Google Colab line to this: `tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")`
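
For reference, a minimal sketch of that workaround in context, assuming the rest of the notebook still loads the Instruct checkpoint (the base repo's tokenizer config may not carry the Instruct chat template, which is presumably why it's not a proper solution):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Workaround: tokenizer from the base repo (whose tokenizer.json still parses),
# model weights from the Instruct repo as before.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
```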
