I've been observing a huge discrepancy between models on Math Lvl 5 on the Open LLM Leaderboard 2. For example:

Qwen/Qwen2.5-72B-Instruct: 1.21
Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-3B-Instruct: 0
meta-llama/Llama-3.2-3B-Instruct: 17.15

It's hard to believe that Qwen2.5 models, which show very strong scores on GSM8K, achieve zero or very low scores here.
I was looking at the answer validation logic and found that the generation must follow the exact response format to be extracted by the regular expression (i.e., it must contain "The final answer is"):

lm-evaluation-harness/lm_eval/tasks/leaderboard/math/utils.py, line 199 at f49b037:
import re

INVALID_ANSWER = "[invalidanswer]"  # constant defined at module level in utils.py


def get_unnormalized_answer(text: str) -> str:
    # Append the stop sequence so generations truncated at it can still match.
    end_seq = "I hope it is correct."
    text += end_seq
    # Only the exact Minerva-style phrasing is accepted by this pattern.
    match = re.search(
        r"Final Answer: The final answer is(.*?). I hope it is correct.",
        text,
    )
    if match:
        return match.group(1).strip()
    else:
        return INVALID_ANSWER
I'm afraid that's too strict. Just missing "The final answer is" in the response can turn a perfect score into zero.
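For illustration, here is a minimal check with the function above (the model outputs are made up, not taken from the leaderboard):

good = "Final Answer: The final answer is $4$. I hope it is correct."
bad = "The answer is $4$."  # same value, but without the required phrasing

print(get_unnormalized_answer(good))  # -> "$4$"
print(get_unnormalized_answer(bad))   # -> "[invalidanswer]", graded as wrong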
Q1. Is this a known issue (i.e., has it been discussed somewhere)?
Q2. Is there a plan to update the regular expression, or to add another version of Math Lvl 5 with more relaxed extraction logic?
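To make Q2 concrete, a relaxed extractor could keep the strict pattern as the first attempt and fall back to other common answer markers. This is only a sketch of the idea, not anything that exists in the harness; the fallback choices (a \boxed{...} expression, then a trailing "answer is ..." phrase) are my own illustration:

def get_answer_relaxed(text: str) -> str:
    # 1) Try the strict Minerva format first.
    answer = get_unnormalized_answer(text)
    if answer != INVALID_ANSWER:
        return answer
    # 2) Fall back to the last \boxed{...} expression (nested braces not handled).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    # 3) Fall back to a trailing "answer is ..." phrase (deliberately loose).
    match = re.search(r"answer is\s*(.+?)(?:\.\s*$|$)", text, flags=re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return INVALID_ANSWER

With this, get_answer_relaxed(r"So the answer is \boxed{4}.") would return "4" instead of "[invalidanswer]".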
Hi!
Yes, that's a known issue, but on our side it's a feature, not a bug: since we provide the prompt with 5-shot examples, we expect models to learn the required formatting in context instead of overfitting to the format from their instruction fine-tuning data.
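Concretely, the answer line that the few-shot examples are expected to demonstrate looks like the following (reconstructed from the regex above; the exact wording of the exemplars in the harness may differ, and the value is made up):

answer_line = "Final Answer: The final answer is $\\frac{1}{2}$. I hope it is correct."
print(get_unnormalized_answer(answer_line))  # -> "$\frac{1}{2}$"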