Hello,
I have been working on implementing the CaseHOLD task within the lm-evaluation-harness framework, as introduced in Pull Request #2570. While I managed to create the necessary files (casehold.yaml and utils.py) to preprocess and evaluate the dataset, I ran into significant problems during evaluation. Specifically, all models I tested consistently performed at or below random chance, which points to an issue I have been unable to resolve.
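For illustration, below is a minimal sketch of what a utils.py for a multiple-choice CaseHOLD task might look like. The field names (citing_prompt, holding_0 through holding_4, label) are assumptions taken from the Hugging Face casehold/casehold dataset and may not match the actual files in the PR:

```python
# Hypothetical utils.py sketch for a multiple-choice CaseHOLD task.
# Field names are assumptions based on the casehold/casehold dataset on
# the Hugging Face Hub and may differ from the implementation in the PR.

def doc_to_text(doc) -> str:
    # Show the citing passage followed by the five candidate holdings,
    # labelled A-E, then prompt for an answer.
    choices = "\n".join(
        f"{chr(ord('A') + i)}. {doc[f'holding_{i}']}" for i in range(5)
    )
    return f"{doc['citing_prompt']}\n{choices}\nAnswer:"


def doc_to_choice(doc) -> list[str]:
    # The answer space is the five letter labels.
    return ["A", "B", "C", "D", "E"]


def doc_to_target(doc) -> int:
    # The label is stored as a string index ("0".."4"), so cast it to an
    # index into the choice list.
    return int(doc["label"])
```

In casehold.yaml, helpers like these would typically be wired up through the doc_to_text, doc_to_choice, and doc_to_target keys with output_type: multiple_choice, though the exact configuration depends on the PR.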
Below are the results of an example run using `lm_eval` with Llama-3.2-1B-Instruct:

I've created the pull request to share my implementation and to seek help with identifying and fixing the root cause of these subpar results. If you have experience working with the CaseHOLD dataset or lm-evaluation-harness, your insights would be greatly appreciated.
For context, here are the resources relevant to the task:
- The CaseHOLD dataset
- Paper on CaseHOLD
Any guidance or feedback would be greatly appreciated. Thank you in advance for your support!
Best regards,
David