Hello,
I have been working on implementing the CaseHOLD task within the lm-evaluation-harness framework, as introduced in Pull Request #2570. While I managed to create the necessary files (casehold.yaml and utils.py) to preprocess and evaluate the dataset, I ran into significant problems during evaluation. Specifically, all models I tested consistently performed at or below random chance, which points to an issue I have been unable to resolve.
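For illustration, below is a minimal sketch of what a utils.py for a multiple-choice CaseHOLD task might look like. The field names (citing_prompt, holding_0 through holding_4, label) are assumptions taken from the Hugging Face casehold/casehold dataset and may not match the actual files in the PR:

```python
# Hypothetical utils.py sketch for a multiple-choice CaseHOLD task.
# Field names are assumptions based on the casehold/casehold dataset on
# the Hugging Face Hub and may differ from the implementation in the PR.

def doc_to_text(doc) -> str:
    # Show the citing passage followed by the five candidate holdings,
    # labelled A-E, then prompt for an answer.
    choices = "\n".join(
        f"{chr(ord('A') + i)}. {doc[f'holding_{i}']}" for i in range(5)
    )
    return f"{doc['citing_prompt']}\n{choices}\nAnswer:"


def doc_to_choice(doc) -> list[str]:
    # The answer space is the five letter labels.
    return ["A", "B", "C", "D", "E"]


def doc_to_target(doc) -> int:
    # The label is stored as a string index ("0".."4"), so cast it to an
    # index into the choice list.
    return int(doc["label"])
```

In casehold.yaml, helpers like these would typically be wired up through the doc_to_text, doc_to_choice, and doc_to_target keys with output_type: multiple_choice, though the exact configuration depends on the PR.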
Below are the results of an example run using `lm_eval` with Llama-3.2-1B-Instruct:

I've created the pull request to share my implementation and to seek help with identifying and fixing the root cause of these subpar results. If you have experience working with the CaseHOLD dataset or lm-evaluation-harness, your insights would be greatly appreciated.
For context, here are the resources relevant to the task:
- The CaseHOLD dataset
- Paper on CaseHOLD
Any guidance or feedback would be greatly appreciated. Thank you in advance for your support!
Best regards,
David