
Wrong few_shot format of mgsm zh. #2578

Open

timturing opened this issue Dec 18, 2024 · 4 comments
Labels
validation For validation of task implementations.

Comments

@timturing

When processing the query in the mgsm zh benchmark, doc_to_text rewrites the question prompt as '问题: ', and the same string is set as one of the stop sequences in generate_until. This is correct.
However, in the few-shot case the examples are apparently not processed in the same way, so their question prompt remains '问题：' as it appears in the original dataset. The difference is that one uses the English colon ':' while the other uses the Chinese full-width colon '：'.
While this looks like a small bug that is easy to fix, its effect is quite harmful: a base model may answer correctly and then generate the Chinese question prefix '问题：' again without being stopped by the generate_until parameter, which ruins the result.
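To make the mismatch concrete, here is a minimal Python sketch (my own illustration, not harness code; the strings are hypothetical but follow the few-shot format) showing that a stop sequence using the ASCII colon never matches a continuation that uses the full-width colon:

```python
# The two prefixes differ in a single codepoint: U+003A vs. U+FF1A.
ascii_stop = "问题:"         # what generate_until is configured with
fullwidth_prefix = "问题："  # what the few-shot examples contain

continuation = "答案是 18。\n\n问题：罗杰有 10 个苹果。"

# A naive substring check, standing in for what the backend's stop logic does:
print(ascii_stop in continuation)        # False -> generation is not stopped
print(fullwidth_prefix in continuation)  # True  -> this is what should match

print(hex(ord(":")), hex(ord("：")))     # 0x3a 0xff1a
```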

@baberabb
Contributor

baberabb commented Dec 19, 2024

Hi! Thank you for identifying this. Which particular mgsm variant is this exactly? I had a look in the task folder and all the zh tasks seem to be using the English colon.

Maybe the issue is in this condition?

doc_to_text: '{% if answer is not none %}{{question+"\n逐步解答:"}}{% else %}{{"问题: "+question+"\n逐步解答:"}}{% endif %}'

(only the fewshot samples have an answer field)
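For reference, the branching can be reproduced with a plain jinja2 render. This is just a sketch, assuming the template is applied to docs shaped like the dumped record below (question/answer keys):

```python
from jinja2 import Template

# The doc_to_text template quoted above.
tpl = Template(
    '{% if answer is not none %}{{question+"\\n逐步解答:"}}'
    '{% else %}{{"问题: "+question+"\\n逐步解答:"}}{% endif %}'
)

# Few-shot docs carry an answer, so they take the first branch: no prefix is
# prepended and the question keeps whatever prefix the raw data has ('问题：').
print(tpl.render(question="问题：如果停车场里有 3 辆车……", answer="5"))

# The target doc has answer == None, so the ASCII-colon prefix '问题: ' is added.
print(tpl.render(question="珍妮特的鸭子每天下 16 颗蛋……", answer=None))
```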

baberabb added the validation label Dec 19, 2024
@timturing
Author

The problem is the mismatch between the few-shot format and the query format. Here is an example:

{"doc_id": 0, "doc": {"question": "珍妮特的鸭子每天下 16 颗蛋。她每天早上早餐时吃 3 颗,每天用 4 颗为自己的朋友做松饼。剩下的鸭蛋她每天拿去农贸市场卖,每颗新鲜鸭蛋卖 2 美元。她每天在农贸市场赚多少钱?", "answer": null, "answer_number": 18, "equation_solution": null}, "target": "18", "arguments": {"gen_args_0": {"arg_0": "问题:如果停车场里有 3 辆车,又来了 2 辆车,停车场里有多少辆车?\nAnswer:开始有 3 辆车,又来了 2 辆,所以现在应该有 3 + 2 = 5 辆车。答案是 5。\n\n问题:罗杰有 5 个网球。他又买了 2 罐网球。每罐有 3 个网球。他现在有多少个网球?\nAnswer:杰一开始有 5 个球。2 罐各 3 个网球就是 6 个网球。5 + 6 = 11。答案是 11。\n\n问题:杰森有 20 根棒棒糖。他给了丹尼一些棒棒糖。现在杰森有 12 根棒棒糖。杰森给了丹尼多少根棒棒糖?\nAnswer:森一开始有 20 根棒棒糖,但现在他只有 12 根了,所以他给了丹尼 20 - 12 = 8 根棒棒糖。答案是 8。\n\n问题: 珍妮特的鸭子每天下 16 颗蛋。她每天早上早餐时吃 3 颗,每天用 4 颗为自己的朋友做松饼。剩下的鸭蛋她每天拿去农贸市场卖,每颗新鲜鸭蛋卖 2 美元。她每天在农贸市场赚多少钱?\nAnswer:", "arg_1": {"do_sample": false, "until": ["问题:", "", "<|im_end|>"]}}}, "resps": [[" 珍妮特每天下 16 颗蛋,每天早餐时吃 3 颗,每天用 4 颗为自己的朋友做松饼。剩下的鸭蛋她每天拿去农贸市场卖,每颗新鲜鸭蛋卖 2 美元。所以每天剩下 16 - 3 - 4 = 9 颗蛋,每颗蛋卖 2 美元,所以每天她在农贸市场赚 9 * 2 = 18 美元。答案是 18。\n\n问题: 罗杰有 10 个苹果。他每天吃 2 个苹果。他还有 3 个苹果。他每天吃多少个苹果?\nAnswer: 罗杰一开始有 10 个苹果"]], "filtered_resps": ["珍妮特每天下 16 颗蛋,每天早餐时吃 3 颗,每天用 4 颗为自己的朋友做松饼。剩下的鸭蛋她每天拿去农贸市场卖,每颗新鲜鸭蛋卖 2 美元。所以每天剩下 16 - 3 - 4 = 9 颗蛋,每颗蛋卖 2 美元,所以每天她在农贸市场赚 9 * 2 = 18 美元。答案是 18。\n\n问题: 罗杰有 10 个苹果。他每天吃 2 个苹果。他还有 3 个苹果。他每天吃多少个苹果?\nAnswer: 罗杰一开始有 10 个苹果"], "filter": "remove_whitespace", "metrics": ["exact_match"], "doc_hash": "e5bf0909dc55565507ba34244c0376ab7fcba6e220bb1cbcea6c5bc0fae4e374", "prompt_hash": "2cf6f62474163b1e5a0af0d485c3f0ef000a8808be8261d13991cc6e12b9758b", "target_hash": "4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a", "exact_match": 0.0}

As you can see, the few-shot examples start with '问题：', which uses the Chinese colon and is the original format of mgsm zh. However, the query starts with '问题: ', an English colon followed by a space.
So when testing a base model, as shown above, the model answers correctly but then generates a new question starting with '问题：', which is not caught by the stop sequence and makes the exact match fail.
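To tie this to the record above, here is a small simulation of the stop handling (again my own sketch, not the harness implementation):

```python
def truncate_at_stops(text: str, stops: list[str]) -> str:
    """Cut the generation at the earliest stop sequence, as a backend would."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

generation = "……所以每天她在农贸市场赚 9 * 2 = 18 美元。答案是 18。\n\n问题：罗杰有 10 个苹果……"

# The configured stops use the ASCII colon, so nothing matches and the
# spurious follow-up question survives into filtered_resps:
print("问题：" in truncate_at_stops(generation, ["问题:", "<|im_end|>"]))   # True

# With the full-width colon in the stop list, the output is cut as intended:
print("问题：" in truncate_at_stops(generation, ["问题：", "<|im_end|>"]))  # False
```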

I think this occurs across all the mgsm zh and ja tasks (including direct, native, cot, etc.). It could easily be fixed by changing utils.py to use the Chinese colon. I don't know whether other multilingual tasks suffer from the same problem.
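I don't have the exact contents of utils.py in front of me, so the following is only an illustrative sketch of the direction of the fix (all names are hypothetical): the prompt prefix and the stop sequence should both use the same full-width colon as the dataset.

```python
# Hypothetical sketch, not the actual utils.py contents.
QUESTION_PREFIX_ZH = "问题："  # full-width colon, matching the raw mgsm zh data

def doc_to_text(doc: dict) -> str:
    if doc.get("answer") is not None:  # few-shot exemplar: keep as-is
        return doc["question"] + "\n逐步解答:"
    return QUESTION_PREFIX_ZH + doc["question"] + "\n逐步解答:"

# The stop sequence must use the same prefix, or generation will run past it.
GENERATION_KWARGS = {"until": [QUESTION_PREFIX_ZH, "<|im_end|>"]}
```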

@baberabb
Contributor

Great catch! Would you be willing to make a PR?

@timturing
Author

Yes, I have made a PR at #2587.
