Description
The regular expression used to extract answers for MMLU in common.py fails when the pattern "Answer: LETTER" appears more than once in the LLM output. The wrong letter is extracted, which lowers the measured score of the model.
Example
The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.
Explanation
The regular expression considers only the first occurrence of "Answer: LETTER" and ignores any later ones.
# All the different ways "Answer" is written in different languages
MULTILINGUAL_ANSWER_REGEXES = [
"Answer\s*:",
"Answer\s*:", # Korean invisible character
"উত্তর\s*:",
"उत्तर\s*:",
"উত্তরঃ",
"উত্তর\s*:",
"Antwort\s*:",
"답변\s*:",
"정답\s*:",
"답\s*:",
"答案\s*:",
"答案\s*:",
"答\s*:",
"答\s*:",
"答复\s*:",
"答曰\s*:",
"الإجابة:",
"الجواب:",
"إجابة:",
"الإجابة النهائية:",
"الإجابة الصحيحة:",
"الإجابة الصحيحة هي:",
"الإجابة هي:",
"Respuesta\s*:",
"Risposta\s*:",
"答え\s*:",
"答え\s*:",
"回答\s*:",
"回答\s*:",
"解答\s*:",
"Jawaban\s*:",
"Réponse\s*:",
"Resposta\s*:",
"Jibu\s*:",
"Idahun\s*:",
"Ìdáhùn\s*:",
"Idáhùn\s*:",
"Àmọ̀nà\s*:",
"Àdáhùn\s*:",
"Ànúgọ\s*:",
"Àṣàyàn\s*:",
]
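In the German example above, the regex extracts the answer "A" from "Antwort:\n\nAntwort: C". The match starts at the first "Antwort:", and the only "A" that can follow it is the initial letter of the second "Antwort", so that letter is apparently what gets captured instead of the intended "C".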
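The behaviour described above can be reproduced with a minimal sketch. The template below is an assumption made for illustration (case-insensitive, with arbitrary whitespace allowed between the colon and the letter); it is not copied from common.py. The function names `extract_first` and `extract_last` are hypothetical, and `extract_last` is only one possible mitigation, not a confirmed fix.

```python
import re
from typing import Optional

# Assumed template for illustration; the actual template in common.py may differ.
ASSUMED_TEMPLATE = r"(?i){}\s*([A-D])"
ANSWER_PREFIX = r"Antwort\s*:"  # German entry from the list above


def extract_first(text: str) -> Optional[str]:
    """Return the letter captured by the first match (the behaviour reported here)."""
    match = re.search(ASSUMED_TEMPLATE.format(ANSWER_PREFIX), text)
    return match.group(1) if match else None


def extract_last(text: str) -> Optional[str]:
    """Hypothetical alternative: allow only spaces or tabs after the colon,
    require the letter to stand alone, and keep the last match."""
    pattern = r"(?i){}[ \t]*([A-D])\b".format(ANSWER_PREFIX)
    matches = re.findall(pattern, text)
    return matches[-1] if matches else None


output = "Antwort:\n\nAntwort: C"
print(extract_first(output))  # A
print(extract_last(output))   # C
```

Under these assumptions, the first variant skips the two newlines and captures the "A" of the second "Antwort", while the second variant cannot cross a line break and therefore only matches "Antwort: C".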
Impact
This bug significantly distorts the evaluation results for certain languages. In my experiments, roughly 20% of the German samples and roughly 4% of the Indonesian samples were affected. Other languages seem less affected.
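One rough way to estimate such a rate for a set of model outputs is to count the responses where a first-match extraction and a stricter last-match extraction disagree. This is a sketch under stated assumptions: both patterns are illustrative and not taken from common.py, the helper `affected_fraction` is hypothetical, and the sample strings are made up, not data from the experiments above.

```python
import re
from typing import Iterable

# Assumed patterns for illustration only; not copied from common.py.
FIRST_MATCH = re.compile(r"(?i)Antwort\s*:\s*([A-D])")
STRICT_MATCH = re.compile(r"(?i)Antwort\s*:[ \t]*([A-D])\b")


def affected_fraction(outputs: Iterable[str]) -> float:
    """Fraction of outputs where the two extraction strategies disagree."""
    total = 0
    affected = 0
    for text in outputs:
        total += 1
        first = FIRST_MATCH.search(text)
        strict = STRICT_MATCH.findall(text)
        first_letter = first.group(1).upper() if first else None
        last_letter = strict[-1].upper() if strict else None
        if first_letter != last_letter:
            affected += 1
    return affected / total if total else 0.0


# Made-up sample outputs, for demonstration only.
samples = ["Antwort:\n\nAntwort: C", "Antwort: B", "Die Antwort: D"]
print(affected_fraction(samples))
```

A disagreement does not prove which extraction is right, so flagged outputs would still need a manual look before drawing conclusions.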
The list of answer prefixes quoted above comes from simple-evals/common.py, lines 25 to 71 at commit a8e85cc.