CaseHOLD Task Implementation #2571

Open
zolastro opened this issue Dec 16, 2024 · 0 comments

zolastro commented Dec 16, 2024

Hello,

I have been working on implementing the CaseHOLD task in the lm-evaluation-harness framework, as introduced in Pull Request #2570. I created the necessary files (casehold.yaml and utils.py) to preprocess and evaluate the dataset, but ran into significant problems during evaluation: every model I tested scores at or below random chance (0.20 for CaseHOLD's five answer choices), which indicates an issue I have been unable to resolve.

Below are the results of an example run of lm_eval with Llama-3.2-1B-Instruct:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct --task casehold --device cuda:1 --batch_size 1 --apply_chat_template

| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|casehold|      1|none  |     0|acc     |↑  |0.1472|±  |0.0049|
|        |       |none  |     0|acc_norm|↑  |0.2633|±  |0.0060|
|        |       |none  |     0|f1      |↑  |0.1472|±  |   N/A|
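To make the comparison with chance concrete, the scores in the table above can be expressed as distances from the chance baseline in units of the reported standard error. This is only a minimal sketch: the helper name `z_vs_chance` is made up for illustration, and the 0.20 baseline assumes five equally likely answer choices per CaseHOLD example.

```python
# Sketch: how far is each reported score from random chance, measured in
# standard errors? Values are copied from the results table above.
# ASSUMPTION: chance = 1/5, i.e. five answer choices per example.

def z_vs_chance(value: float, stderr: float, chance: float = 0.2) -> float:
    """Signed distance from the chance baseline, in standard errors."""
    return (value - chance) / stderr


reported = {
    "acc": (0.1472, 0.0049),
    "acc_norm": (0.2633, 0.0060),
}

for metric, (value, stderr) in reported.items():
    z = z_vs_chance(value, stderr)
    side = "below" if z < 0 else "above"
    print(f"{metric}: {value:.4f} is {abs(z):.2f} standard errors {side} chance")
```

Under that assumption, acc sits more than ten standard errors below chance while acc_norm sits about as far above it. A gap that large and that consistent looks less like noise and more like something systematic in how the choices or labels are scored, although the numbers alone do not say what.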

I’ve opened the pull request to share my implementation and to get help identifying and fixing the root cause of these results. If you have experience with the CaseHOLD dataset or lm-evaluation-harness, your insights would be greatly appreciated.
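One generic check that may help narrow this down is to validate the preprocessed documents before they reach the scorer. The sketch below is not the code from the pull request: it assumes a hypothetical processed-document shape with a `choices` list and an integer `gold` index, and the function name `check_docs` is invented for illustration.

```python
from collections import Counter

# Sketch of a sanity check on preprocessed multiple-choice documents.
# ASSUMPTIONS (hypothetical, not taken from the PR): each doc is a dict
# with "choices" (list of answer strings) and "gold" (0-based int index).


def check_docs(docs, num_choices=5):
    """Return (problems, gold_counts) for a list of processed docs."""
    problems = []
    gold_counts = Counter()
    for i, doc in enumerate(docs):
        choices = doc.get("choices")
        gold = doc.get("gold")
        if not isinstance(choices, list) or len(choices) != num_choices:
            problems.append((i, "unexpected number of choices"))
        # A label stored as a string such as "2" is an easy mistake to miss.
        if isinstance(gold, bool) or not isinstance(gold, int):
            problems.append((i, f"gold is {type(gold).__name__}, expected int"))
        elif not 0 <= gold < num_choices:
            problems.append((i, "gold index out of range"))
        else:
            gold_counts[gold] += 1
    return problems, gold_counts


if __name__ == "__main__":
    sample = [
        {"choices": ["a", "b", "c", "d", "e"], "gold": 2},
        {"choices": ["a", "b", "c", "d", "e"], "gold": "2"},
    ]
    problems, counts = check_docs(sample)
    print(problems)
    print(dict(counts))
```

If a check like this comes back clean, the prompt text and the way the answer strings are appended to it would be the next things to inspect.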

For context, here are the resources relevant to the task:

- The CaseHOLD dataset
- Paper on CaseHOLD

Any guidance or feedback would be greatly appreciated. Thank you in advance for your support!

Best regards,
David

zolastro changed the title from "caseHOLD Task Implementation" to "CaseHOLD Task Implementation" on Dec 16, 2024