
Answer extraction logic for Math Lvl 5 (Open LLM Leaderboard 2) may be too strict #2539

Open
suhara opened this issue Dec 5, 2024 · 1 comment

Comments

@suhara

suhara commented Dec 5, 2024

I've been observing a huge discrepancy between models on Math Lvl 5 on the Open LLM Leaderboard 2. For example:

  • Qwen/Qwen2.5-72B-Instruct: 1.21
  • Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-3B-Instruct: 0
  • meta-llama/Llama-3.2-3B-Instruct: 17.15

It's hard to believe that the Qwen2.5 models, which show very strong scores on GSM8K, would achieve zero or near-zero scores here.

I was looking at the answer validation logic and found that a generation must follow the exact response format ("Final Answer: The final answer is ... I hope it is correct.") to be extracted by the regular expression:

import re

def get_unnormalized_answer(text: str) -> str:
    end_seq = "I hope it is correct."
    text += end_seq
    match = re.search(
        r"Final Answer: The final answer is(.*?). I hope it is correct.",
        text,
    )
    if match:
        return match.group(1).strip()
    else:
        return INVALID_ANSWER

I'm afraid that's too strict: simply omitting "The final answer is" from the response turns a perfect score into zero.
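To make the failure mode concrete, here is a small runnable sketch. The strict function reproduces the extraction logic quoted above; the `INVALID_ANSWER` value and the `get_answer_relaxed` fallback are my own illustrative assumptions, not harness code. A correct answer that skips the 5-shot template is scored invalid by the strict regex, while a relaxed pattern still recovers it.

```python
import re

# Sentinel for failed extraction; the harness uses a constant like this
# (the exact value here is an assumption for illustration).
INVALID_ANSWER = "[invalidanswer]"

def get_unnormalized_answer(text: str) -> str:
    # Strict extraction quoted above, reproduced for a runnable comparison.
    end_seq = "I hope it is correct."
    text += end_seq
    match = re.search(
        r"Final Answer: The final answer is(.*?). I hope it is correct.",
        text,
    )
    if match:
        return match.group(1).strip()
    return INVALID_ANSWER

def get_answer_relaxed(text: str) -> str:
    # Hypothetical relaxed fallback: accept any "final answer is ..." clause
    # or a \boxed{...} expression, which chat-tuned models commonly emit.
    match = re.search(r"(?i)final answer is\s*(.+?)(?:\.\s|\.$|$)", text)
    if match:
        return match.group(1).strip()
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    if match:
        return match.group(1).strip()
    return INVALID_ANSWER

# A fully templated response is extracted by the strict regex...
print(get_unnormalized_answer(
    "Final Answer: The final answer is $42$. I hope it is correct."))   # $42$
# ...but a correct answer without the template is scored invalid,
# while the relaxed fallback still recovers it.
print(get_unnormalized_answer("The final answer is 42."))               # [invalidanswer]
print(get_answer_relaxed("The final answer is 42."))                    # 42
print(get_answer_relaxed("Thus \\boxed{7} is the answer."))             # 7
```
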

  • Q1. Is this a known issue (i.e., has it been discussed somewhere)?
  • Q2. Is there a plan to update the regular expression, or to provide another version of Math Lvl 5 with more relaxed extraction logic?
@clefourrier
Contributor

Hi!
Yes, that's a known issue, but on our side it's a feature, not a bug: since we provide the prompt with 5-shot examples, we expect models to learn the format in context instead of overfitting to the format from their instruction fine-tuning data.
