
Incorrect max_steps calculation with sample_packing and multi gpu case #2203

Closed · 6 of 8 tasks

chiragjn opened this issue Dec 19, 2024 · 1 comment
Labels: bug (Something isn't working), waiting for reporter

Comments

@chiragjn (Contributor)

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

max_steps should be correctly calculated

Current behaviour

I am running into dataloader exhaustion and then a crash when the HF trainer tries to gather batch sizes from all ranks (see the sketch after the traceback below).
The dataset I am using is private, so give me some time to figure out how to reproduce this. The setup:

  • 2 GPU devices
  • sample packing enabled
  • completion dataset - so using the axolotl.prompt_strategies.user_defined strategy
  • micro_batch_size 1
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 34, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 47, in do_train
[rank0]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/train.py", line 191, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 5142, in get_batch_samples
[rank0]:     num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/accelerator.py", line 2458, in gather
[rank0]:     return gather(tensor)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 376, in wrapper
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 437, in gather
[rank0]:     return _gpu_gather(tensor)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank0]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
[rank0]:     raise TypeError(
[rank0]: TypeError: Unsupported types (<class 'NoneType'>) passed to `_gpu_gather_one`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
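
For context, here is a simplified sketch of the failing path. This is illustrative only, not the actual transformers implementation: when the per-rank epoch iterator runs dry before the requested number of batches has been pulled, num_items_in_batch is never assigned, stays None, and gathering None across ranks raises the TypeError above.

```python
# Illustrative sketch only -- not the actual transformers code.
# Shows how an exhausted per-rank dataloader can lead to gathering None.
def get_batch_samples_sketch(epoch_iterator, num_batches, accelerator):
    batch_samples = []
    num_items_in_batch = None
    for _ in range(num_batches):
        try:
            batch_samples.append(next(epoch_iterator))
        except StopIteration:
            # Dataloader exhausted early on this rank, e.g. when max_steps
            # over-estimates how many packed batches are actually available.
            break
    if batch_samples and "labels" in batch_samples[0]:
        # Count non-ignored label tokens across the collected micro-batches.
        num_items_in_batch = sum(
            batch["labels"].ne(-100).sum() for batch in batch_samples
        )
    # If nothing was computed above, num_items_in_batch is still None here,
    # and accelerator.gather(None) raises:
    #   TypeError: Unsupported types (<class 'NoneType'>) passed to _gpu_gather_one ...
    num_items_in_batch = accelerator.gather(num_items_in_batch).sum().item()
    return batch_samples, num_items_in_batch
```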

Steps to reproduce

TODO

Config yaml

{}

Possible solution

TODO

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
chiragjn added the bug (Something isn't working) label on Dec 19, 2024
@chiragjn (Contributor, Author)

Closing. This is an issue with the HF trainer under specific params and has nothing to do with sample packing: huggingface/transformers#35387
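
For anyone hitting the same crash before a fixed transformers release, a minimal defensive guard along these lines (a sketch, not the actual upstream patch; it reuses the variable names from the sketch above) avoids gathering a None:

```python
# Sketch of a workaround, not the actual transformers fix: only gather
# num_items_in_batch across ranks when it was actually computed.
if num_items_in_batch is not None:
    num_items_in_batch = accelerator.gather(num_items_in_batch).sum().item()
```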
