
Fix tokenizer mismatch bug between model and tokenizer for THUDM/glm-4-9b example #2672

Open · wants to merge 1 commit into main

Conversation

darkSuperman

Fix tokenizer mismatch bug between model and tokenizer for THUDM/glm-4-9b example

@LaurentMazare
Collaborator

Did you try out the change? There doesn't seem to be a tokenizer.json file in the repo that you've switched to, as far as I can tell: https://huggingface.co/THUDM/glm-4-9b/tree/main
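
For reference, one way to check this from code is to try fetching the file through the hf-hub crate. This is only a sketch with simplified error handling; the repo id is taken from the link above:

use hf_hub::api::sync::Api;

fn main() -> anyhow::Result<()> {
    let api = Api::new()?;
    let repo = api.model("THUDM/glm-4-9b".to_string());
    // get() downloads the file into the local cache, or fails if the repo does not have it.
    match repo.get("tokenizer.json") {
        Ok(path) => println!("tokenizer.json cached at {path:?}"),
        Err(err) => println!("tokenizer.json not available: {err}"),
    }
    Ok(())
}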

@darkSuperman
Author

Sorry, I misread that. I also found this branch, which is used natively within Transformers and also provides a tokenizer.json, but loading the model requires some changes. Would you be interested in using this branch to modify the glm4 example? https://huggingface.co/THUDM/glm-4-9b/tree/refs%2Fpr%2F15

@LaurentMazare
Collaborator

The main reason why this model uses the tokenizer from codegeex4 is that the two tokenizers should be identical. When you look at the two sentencepiece models they have the same hash (here and here), so I think the current version is actually fine, or maybe I'm missing something here?
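
As an illustration of that check, here is a minimal sketch that compares the two sentencepiece model files with the sha2 crate (the local paths are hypothetical; the files would first be downloaded from the two hub repos):

use sha2::{Digest, Sha256};
use std::path::Path;

fn sha256_hex(path: &Path) -> std::io::Result<String> {
    // Hash the raw bytes of the sentencepiece model file.
    Ok(format!("{:x}", Sha256::digest(std::fs::read(path)?)))
}

fn main() -> std::io::Result<()> {
    // Hypothetical local copies of the two tokenizer.model files.
    let glm4 = sha256_hex(Path::new("glm-4-9b/tokenizer.model"))?;
    let codegeex4 = sha256_hex(Path::new("codegeex4-all-9b/tokenizer.model"))?;
    println!("glm-4-9b:         {glm4}");
    println!("codegeex4-all-9b: {codegeex4}");
    println!("identical: {}", glm4 == codegeex4);
    Ok(())
}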

@darkSuperman
Author

Yes, they are the same. Also, when I ran inference I ran into some problems: the generation never seemed to finish. I'm still testing and will give you feedback once I have more information.

@darkSuperman
Author

darkSuperman commented Dec 29, 2024

I tried the glm4 example and used the tokenizer.json from THUDM/codegeex4-all-9b, but the model never produced the eos_token. Generation only stopped once sample_len was reached, and the output tokens started repeating. The code is the same as the glm4 example; the only difference is that I load the safetensors and tokenizer.json files from a local huggingface-cli download.

// Load the 10 safetensors shards from local disk
let filenames: Vec<PathBuf> = (1..=10)
    .map(|i| format!("/home/neptune/glm4/model-{i:05}-of-00010.safetensors"))
    .map(|path| Path::new(&path).to_path_buf())
    .collect();

let tokenizer_filename = PathBuf::from("/home/neptune/temp/tokenizer.json");
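
For comparison, a hedged sketch of resolving the same files through the hf-hub crate instead of hard-coded local paths (assuming the tokenizer still comes from THUDM/codegeex4-all-9b, as in the current example; error handling is simplified):

use hf_hub::api::sync::Api;
use std::path::PathBuf;

let api = Api::new()?;
let model_repo = api.model("THUDM/glm-4-9b".to_string());
let tokenizer_repo = api.model("THUDM/codegeex4-all-9b".to_string());

// Download (or reuse from the local cache) the tokenizer and the 10 safetensors shards.
let tokenizer_filename = tokenizer_repo.get("tokenizer.json")?;
let filenames: Vec<PathBuf> = (1..=10)
    .map(|i| model_repo.get(&format!("model-{i:05}-of-00010.safetensors")))
    .collect::<Result<_, _>>()?;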

In addition, I saw that GLM officially provides an HF-format version of glm4, https://huggingface.co/THUDM/glm-4-9b-chat-hf. Can candle be used to run inference with it at present?

Could you provide some help or troubleshooting suggestions? Thanks @LaurentMazare

@LaurentMazare
Collaborator

I'm not sure I understand what your problem is exactly. I've refactored the glm4 example a bit so that it is closer to the other examples, and from what I see most generations properly end with an eos token being produced, e.g.

$ cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "
    Finished `release` profile [optimized] target(s) in 0.81s
     Running `target/release/examples/glm4 --prompt 'This is a test. '`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 9.378198ms
loaded the model in 3.518224561s
starting the inference loop
This is a test.  This is only a test.
: The Federal Emergency Management Agency has issued an alert that the U.S. East Coast could be hit by high winds and heavy rains from Hurricane Irene, which was upgraded to Category 3 storm on Friday morning. The hurricane's eye wall had been expected to pass well east of Florida late Saturday night or early Sunday morning, but it now appears likely to make landfall in North Carolina sometime between Monday afternoon and Tuesday.
 the National Weather Service said. "The center of Irene is forecast to move near or over portions of eastern Cuba tonight," according to a statement issued by the NWS at 8:00 AM ET on Saturday. The storm's maximum sustained winds have increased to near 125 mph, with higher gusts, and it has been upgraded from Category 2 to Category 3 hurricane.
The U.S. National Hurricane Center said in its latest advisory. Irene is expected to remain a powerful hurricane through tonight; some fluctuations in intensity are possible on Sunday as the center of the storm moves over eastern Cuba. On the forecast track, the core of Irene will move near or over portions of eastern Cuba today and approach the southeastern United States coast late Monday and Tuesday.
Posted by dave at 9:00 AM
251 tokens generated (48.56 token/s)

Maybe you can provide more details, ideally with a simple way to reproduce the issue?

@darkSuperman
Author

The output after I run it is as follows. Look at the end of the last line: the same token keeps being generated until sample_len reaches 2048, so I stopped it early:

avx: false, neon: false, simd128: false, f16c: false
temp: 0.60 repeat-penalty: 1.20 repeat-last-n: 64
retrieved the files in 17.502µs
loaded the model in 3.467558855s
starting the inference loop
This is a test.
 This is only a test.
If you were listening to the Emergency Broadcast System in 1963, this would have been your warning that an emergency was imminent and that it might be necessary for you to take shelter immediately.
The EBS was designed as part of President John F. Kennedy's civil defense program. The system was intended to provide a means by which the public could receive official information about air raid warnings or other emergencies.
The Emergency Broadcast System (EBS) is an emergency warning system that was used in the United States from 1963 until it was decommissioned on December 31, 2011.
During its operational life, the EBS consisted of a series of transmitters located throughout the country. These transmitters were connected to a central control center located at the Federal Emergency Management Agency (FEMA) headquarters in Washington, D.C.
When: When an emergency situation was detected or predicted by appropriate authorities, they would activate the EBS and send out a warning message over television and radio stations across the United States.
How to listen for warnings:
- Turn on your TV or radio
- Tune it to one of the designated Emergency Broadcast System (EBS) channels
- Listen carefully for any emergency messages that may be broadcasted. If you hear an EBS tone, this indicates that a warning message is about to follow.
What: The EBS was designed so that it could send out warnings over both television and radio stations across the United States as well as through cable TV systems throughout North America including Canada Mexico
The Emergency Broadcast System (EBS) was decommissioned on December 31st 2011. This means that there are no longer any designated EBS channels for broadcast of emergency messages.
What: The decommissioning of the EBS meant that:
- There were no longer any designated EBS channels for broadcast of emergency messages
- Emergency management agencies had to rely on other methods such as social media Twitter Facebook Instagram TikToktok YouTube Snapchat Reddit LinkedIn WeChatGPTelegrammGrammGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGramGram^C

This is the code copied from the glm4 example with almost no changes; only one line was added to print token information:

fn run(&mut self, sample_len: usize) -> anyhow::Result<()> {
        use std::io::BufRead;
        use std::io::BufReader;
        use std::io::Write;
        println!("starting the inference loop");
        let stdin = std::io::stdin();
        let reader = BufReader::new(stdin);
        for line in reader.lines() {
            let line = line.expect("Failed to read line");
            let tokens = self.tokenizer.encode(line, true).expect("tokens error");
            if tokens.is_empty() {
                panic!("Empty prompts are not supported in the chatglm model.")
            }
            if self.verbose_prompt {
                for (token, id) in tokens.get_tokens().iter().zip(tokens.get_ids().iter()) {
                    let token = token.replace('▁', " ").replace("<0x0A>", "\n");
                    println!("{id:7} -> '{token}'");
                }
            }
            let eos_token = match self.tokenizer.get_vocab(true).get("<|endoftext|>") {
                Some(token) => *token,
                None => panic!("cannot find the endoftext token"),
            };
            let mut tokens = tokens.get_ids().to_vec();
            let mut generated_tokens = 0usize;

            std::io::stdout().flush().expect("output flush error");
            let start_gen = std::time::Instant::now();

            let mut count = 0;
            let mut result = vec![];
            for index in 0..sample_len {
                count += 1;
                let context_size = if index > 0 { 1 } else { tokens.len() };
                let ctxt = &tokens[tokens.len().saturating_sub(context_size)..];
                let input = Tensor::new(ctxt, &self.device)?.unsqueeze(0)?;
                let logits = self.model.forward(&input)?;
                let logits = logits.squeeze(0)?.to_dtype(self.dtype)?;
                let logits = if self.repeat_penalty == 1. {
                    logits
                } else {
                    let start_at = tokens.len().saturating_sub(self.repeat_last_n);
                    candle_transformers::utils::apply_repeat_penalty(
                        &logits,
                        self.repeat_penalty,
                        &tokens[start_at..],
                    )?
                };

                let next_token = self.logits_processor.sample(&logits)?;
                tokens.push(next_token);
                generated_tokens += 1;
                if next_token == eos_token {
                    break;
                }
                let token: String = self
                    .tokenizer
                    .decode(&[next_token], true)
                    .expect("Token error");
                if self.verbose_prompt {
                    println!(
                        "[Count: {}] [Raw Token: {}] [Decode Token: {}]",
                        count, next_token, token
                    );
                }
                print!("{}", token); //Added this line to print
                result.push(token);
                std::io::stdout().flush()?;
            }
            let dt = start_gen.elapsed();
            println!(
                "\n{generated_tokens} tokens generated ({:.2} token/s)",
                generated_tokens as f64 / dt.as_secs_f64(),
            );
            println!("Result:");
            for tokens in result {
                print!("{tokens}");
            }
            self.model.reset_kv_cache(); // clean the cache
        }
        Ok(())
    }
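
One side note on the decode(&[next_token], true) call above: decoding each sampled id in isolation can garble pieces that only form valid UTF-8 together with their neighbours. A hedged sketch of how the printing could instead go through candle_examples::token_output_stream::TokenOutputStream (assuming its next_token/decode_rest API; the sampling part of the loop is unchanged and elided here):

use candle_examples::token_output_stream::TokenOutputStream;
use std::io::Write;

// Prints generated ids incrementally, buffering pieces that are not yet valid
// UTF-8 on their own. `sampled_ids` stands in for the ids produced by the loop above.
fn stream_print(tokenizer: tokenizers::Tokenizer, sampled_ids: &[u32]) -> anyhow::Result<()> {
    let mut tos = TokenOutputStream::new(tokenizer);
    for &id in sampled_ids {
        if let Some(text) = tos.next_token(id)? {
            print!("{text}");
            std::io::stdout().flush()?;
        }
    }
    // Flush whatever is still buffered once generation ends.
    if let Some(rest) = tos.decode_rest()? {
        print!("{rest}");
    }
    Ok(())
}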

This is the loading code, with some changes:

pub fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    println!(
        "avx: {}, neon: {}, simd128: {}, f16c: {}",
        candle_core::utils::with_avx(),
        candle_core::utils::with_neon(),
        candle_core::utils::with_simd128(),
        candle_core::utils::with_f16c()
    );
    println!(
        "temp: {:.2} repeat-penalty: {:.2} repeat-last-n: {}",
        args.temperature.unwrap_or(0.6),
        args.repeat_penalty,
        args.repeat_last_n
    );

    let start = std::time::Instant::now();

    let filenames: Vec<PathBuf> = (1..=10)
        .map(|i| {
            format!(
                "/home/paibo/neptune/huggingface/model-{0:05}-of-00010.safetensors",
                i
            )
        })
        .map(|path| Path::new(&path).to_path_buf())
        .collect();

    println!("retrieved the files in {:?}", start.elapsed());
    let tokenizer_filename =
        std::path::PathBuf::from("/home/paibo/neptune/huggingface/tokenizer.json");
    let tokenizer = Tokenizer::from_file(tokenizer_filename).expect("Tokenizer Error");

    let start = std::time::Instant::now();
    let config = Config::glm4();

    let device = candle_examples::device(false)?;
    let dtype = if device.is_cuda() {
        DType::BF16
    } else {
        DType::F32
    };
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&filenames, dtype, &device)? };
    let model = Model::new(&config, vb).unwrap();
    println!("loaded the model in {:?}", start.elapsed());

    let mut pipeline = TextGeneration::new(
        model,
        tokenizer,
        args.seed,
        args.temperature,
        args.top_p,
        args.repeat_penalty,
        args.repeat_last_n,
        args.verbose_prompt,
        &device,
        dtype,
    );
    pipeline.run(args.sample_len).unwrap();
    Ok(())
}
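
Since the symptom is that the eos token is never produced, one quick sanity check is to look at the id the tokenizer assigns to "<|endoftext|>", the token the glm4 example stops on. A minimal sketch, reusing the tokenizer path from the snippet above:

use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    let tokenizer = Tokenizer::from_file("/home/paibo/neptune/huggingface/tokenizer.json")
        .map_err(anyhow::Error::msg)?;
    let vocab = tokenizer.get_vocab(true);
    // Print the id that the generation loop compares against, plus the vocab size.
    match vocab.get("<|endoftext|>") {
        Some(id) => println!("<|endoftext|> -> {id}, vocab size = {}", vocab.len()),
        None => println!("<|endoftext|> not found in this tokenizer"),
    }
    Ok(())
}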

@LaurentMazare
Copy link
Collaborator

Could you try running the same code that I ran (so the glm4 example from the current github version) and see if it behaves differently compared to what I got?

cargo run --features cuda -r --example glm4 -- --prompt "This is a test. "
