
Incorrect max_steps calculation with sample_packing and multi gpu case #2203

Closed · 6 of 8 tasks

chiragjn opened this issue Dec 19, 2024 · 1 comment
Labels: bug (Something isn't working), waiting for reporter

Comments

@chiragjn (Contributor)

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

max_steps should be correctly calculated

Current behaviour

I am running into dataloader exhaustion and then a crash when the HF trainer tries to gather batch sizes from all ranks (see the sketch after the traceback below).
The dataset I am using is private, so give me some time to figure out how to reproduce this. The setup:

  • 2 GPU devices
  • sample packing enabled
  • completion dataset - so using the axolotl.prompt_strategies.user_defined strategy
  • micro_batch_size 1
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 34, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/cli/train.py", line 47, in do_train
[rank0]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/axolotl/train.py", line 191, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 2473, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/transformers/trainer.py", line 5142, in get_batch_samples
[rank0]:     num_items_in_batch = self.accelerator.gather(num_items_in_batch).sum().item()
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/accelerator.py", line 2458, in gather
[rank0]:     return gather(tensor)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 376, in wrapper
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 437, in gather
[rank0]:     return _gpu_gather(tensor)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank0]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jovyan/.conda/envs/jupyter-base/lib/python3.11/site-packages/accelerate/utils/operations.py", line 129, in recursively_apply
[rank0]:     raise TypeError(
[rank0]: TypeError: Unsupported types (<class 'NoneType'>) passed to `_gpu_gather_one`. Only nested list/tuple/dicts of objects that are valid for `is_torch_tensor` should be passed.
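
For context, here is a simplified sketch of the failing path. This is illustrative only, not the actual transformers implementation: when the per-rank epoch iterator runs dry before the requested number of batches has been pulled, num_items_in_batch is never assigned, stays None, and gathering None across ranks raises the TypeError above.

```python
# Illustrative sketch only -- not the actual transformers code.
# Shows how an exhausted per-rank dataloader can lead to gathering None.
def get_batch_samples_sketch(epoch_iterator, num_batches, accelerator):
    batch_samples = []
    num_items_in_batch = None
    for _ in range(num_batches):
        try:
            batch_samples.append(next(epoch_iterator))
        except StopIteration:
            # Dataloader exhausted early on this rank, e.g. when max_steps
            # over-estimates how many packed batches are actually available.
            break
    if batch_samples and "labels" in batch_samples[0]:
        # Count non-ignored label tokens across the collected micro-batches.
        num_items_in_batch = sum(
            batch["labels"].ne(-100).sum() for batch in batch_samples
        )
    # If nothing was computed above, num_items_in_batch is still None here,
    # and accelerator.gather(None) raises:
    #   TypeError: Unsupported types (<class 'NoneType'>) passed to _gpu_gather_one ...
    num_items_in_batch = accelerator.gather(num_items_in_batch).sum().item()
    return batch_samples, num_items_in_batch
```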

Steps to reproduce

TODO

Config yaml

{}

Possible solution

TODO

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
chiragjn added the bug (Something isn't working) label on Dec 19, 2024
@chiragjn (Contributor, Author)

Closing. This is an issue with the HF trainer under specific params and has nothing to do with sample packing: huggingface/transformers#35387
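
For anyone hitting the same crash before a fixed transformers release, a minimal defensive guard along these lines (a sketch, not the actual upstream patch; it reuses the variable names from the sketch above) avoids gathering a None:

```python
# Sketch of a workaround, not the actual transformers fix: only gather
# num_items_in_batch across ranks when it was actually computed.
if num_items_in_batch is not None:
    num_items_in_batch = accelerator.gather(num_items_in_batch).sum().item()
```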
