I've been observing a huge discrepancy between models on Math Lvl 5 on the Open LLM Leaderboard 2. For example:

Qwen/Qwen2.5-72B-Instruct: 1.21
Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-3B-Instruct: 0
meta-llama/Llama-3.2-3B-Instruct: 17.15

It's hard to believe that Qwen2.5 models, which show very strong scores on GSM8K, achieve zero or very low scores here.
I was looking at the answer validation logic and found that the generation must follow the exact response format to be extracted by the regular expression (i.e., it must contain "The final answer is"):

lm-evaluation-harness/lm_eval/tasks/leaderboard/math/utils.py, line 199 at f49b037:
import re

INVALID_ANSWER = "[invalidanswer]"  # constant defined at module level in utils.py


def get_unnormalized_answer(text: str) -> str:
    # Append the stop sequence so generations truncated at it can still match.
    end_seq = "I hope it is correct."
    text += end_seq
    # Only the exact Minerva-style phrasing is accepted by this pattern.
    match = re.search(
        r"Final Answer: The final answer is(.*?). I hope it is correct.",
        text,
    )
    if match:
        return match.group(1).strip()
    else:
        return INVALID_ANSWER
I'm afraid that's too strict. Just missing "The final answer is" in the response can turn a perfect score into zero.
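For illustration, here is a minimal check with the function above (the model outputs are made up, not taken from the leaderboard):

good = "Final Answer: The final answer is $4$. I hope it is correct."
bad = "The answer is $4$."  # same value, but without the required phrasing

print(get_unnormalized_answer(good))  # -> "$4$"
print(get_unnormalized_answer(bad))   # -> "[invalidanswer]", graded as wrong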
Q1. Is this a known issue (i.e., has it been discussed somewhere)?
Q2. Is there a plan to update the regular expression, or to add another version of Math Lvl 5 with more relaxed extraction logic?
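To make Q2 concrete, a relaxed extractor could keep the strict pattern as the first attempt and fall back to other common answer markers. This is only a sketch of the idea, not anything that exists in the harness; the fallback choices (a \boxed{...} expression, then a trailing "answer is ..." phrase) are my own illustration:

def get_answer_relaxed(text: str) -> str:
    # 1) Try the strict Minerva format first.
    answer = get_unnormalized_answer(text)
    if answer != INVALID_ANSWER:
        return answer
    # 2) Fall back to the last \boxed{...} expression (nested braces not handled).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    # 3) Fall back to a trailing "answer is ..." phrase (deliberately loose).
    match = re.search(r"answer is\s*(.+?)(?:\.\s*$|$)", text, flags=re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return INVALID_ANSWER

With this, get_answer_relaxed(r"So the answer is \boxed{4}.") would return "4" instead of "[invalidanswer]".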
Hi!
Yes, that's a known issue, but on our side it's a feature, not a bug: since we provide the prompt with 5-shot examples, we expect models to learn the required formatting in context instead of overfitting to the format from their instruction fine-tuning data.
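Concretely, the answer line that the few-shot examples are expected to demonstrate looks like the following (reconstructed from the regex above; the exact wording of the exemplars in the harness may differ, and the value is made up):

answer_line = "Final Answer: The final answer is $\\frac{1}{2}$. I hope it is correct."
print(get_unnormalized_answer(answer_line))  # -> "$\frac{1}{2}$"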