
Fix: forbid repeated deepspeed.initialize on training objects #6874

Open · wants to merge 3 commits into master from fix-6848-forbid-repeated-init

Conversation

@traincheck-team commented Dec 16, 2024

Previously closed PR:
#6848

Partially fixes #6772, #6771, and #6770 by forbidding repeated initialization.

What is changed:

  1. Mark 'model', 'optimizer', and 'lr_scheduler' in the arguments of deepspeed.initialize with the flag ds_is_inited = True.
  2. Mark 'engine', 'engine.optimizer', and 'engine.lr_scheduler' in the return values of deepspeed.initialize with the flag ds_is_inited = True.
  3. When deepspeed.initialize is called, raise an exception if ds_is_inited == True is detected on the input model, optimizer, or lr_scheduler.

Expected Behavior:
Forbid repeated deepspeed.initialize invocations on model, optimizer, and lr_scheduler objects (see the sketch below).
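A minimal sketch of the mechanism, assuming the helper names _is_initialized, _assert_trainobjs_not_inited, and _mark_trainobjs_initialized that appear in the diff below; the bodies here are illustrative, not the PR's exact code:

    DS_INIT_FLAG = "ds_is_inited"

    def _is_initialized(obj) -> bool:
        # An absent flag means the object has not been through deepspeed.initialize.
        return getattr(obj, DS_INIT_FLAG, False)

    def _mark_trainobjs_initialized(model, optimizer=None, lr_scheduler=None):
        # Steps 1 and 2: stamp the input objects, and later the returned engine objects.
        for obj in (model, optimizer, lr_scheduler):
            if obj is not None:
                setattr(obj, DS_INIT_FLAG, True)

    def _assert_trainobjs_not_inited(model, optimizer=None, lr_scheduler=None):
        # Step 3: reject any object that already carries the flag.
        if _is_initialized(model):
            raise ValueError("Model has already been initialized, please make sure "
                             "to only call deepspeed.initialize on a model once.")
        if optimizer is not None and _is_initialized(optimizer):
            raise ValueError("Optimizer has already been initialized.")
        if lr_scheduler is not None and _is_initialized(lr_scheduler):
            raise ValueError("LR scheduler has already been initialized.")

With this in place, a second deepspeed.initialize call on an already-initialized model raises a ValueError instead of silently re-wrapping it.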

@traincheck-team (Author) commented:

@microsoft-github-policy-service agree

@traincheck-team (Author) commented:

This fix still interferes with some existing unit tests. Let me double-check before we proceed.

Commit: …peedEngine propagates flag from the internal model
traincheck-team force-pushed the fix-6848-forbid-repeated-init branch from dc81325 to d1e7777 on December 16, 2024, 21:02
@traincheck-team (Author) commented Dec 16, 2024

The unit tests in tests/unit/runtime/test_ds_initialize.py all pass.
The PR is ready for review @tjruwase.

I am not able to run the other unit tests due to GPU memory constraints.

(Two earlier review threads on deepspeed/__init__.py were marked outdated and resolved.)
Review thread on deepspeed/__init__.py:

    if _is_initialized(model):
        raise ValueError(
            "Model has already been initialized, please make sure to only call deepspeed.initialize on a model once.")
    if optimizer is not None and _is_initialized(optimizer):

Reviewer (Contributor): Note that optimizer could be a Callable, not an object:

    optimizer: Optional[Union[Optimizer, DeepSpeedOptimizerCallable]] = None,
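For context, a hedged example of the callable form (the factory body, model, and ds_config names are illustrative; per the signature above, DeepSpeed invokes a DeepSpeedOptimizerCallable on the model parameters to build the optimizer):

    import torch
    import deepspeed

    # No optimizer object exists before deepspeed.initialize runs, so there is
    # nothing the flag-based check could mark or inspect in this form.
    def opt_callable(params):
        return torch.optim.AdamW(params, lr=1e-4)

    engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                   optimizer=opt_callable,
                                                   config=ds_config)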

        raise ValueError(
            "Optimizer has already been initialized, please make sure to only call deepspeed.initialize on an optimizer once."
        )
    if lr_scheduler is not None and _is_initialized(lr_scheduler):

Reviewer (Contributor): Ditto for lr_scheduler:

    lr_scheduler: Optional[Union[_LRScheduler, DeepSpeedSchedulerCallable]] = None,

@@ -137,6 +181,10 @@ def initialize(args=None,

    zero.partition_parameters.shutdown_init_context()

    assert model is not None, "deepspeed.initialize requires a model"
    # enforce that model, optimizer, and lr_scheduler have not been used in a previous deepspeed.initialize call
    _assert_trainobjs_not_inited(model, optimizer, lr_scheduler)

Reviewer (Contributor): I think this call should be moved into `_mark_trainobjs_initialized()`.

@traincheck-team (Author) commented Dec 19, 2024

Thanks for the review @tjruwase.

  • I added handling for callable types for optimizer and lr_scheduler: only objects are marked with ds_is_inited, not callables, since callables are stateless and reusing them should be allowed (see the sketch below).
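
A minimal sketch of that handling, reusing the torch types from the signatures quoted above (the body is illustrative, not the PR's exact code):

    from torch.optim import Optimizer
    from torch.optim.lr_scheduler import _LRScheduler

    def _mark_trainobjs_initialized(model, optimizer=None, lr_scheduler=None):
        model.ds_is_inited = True
        # A DeepSpeedOptimizerCallable / DeepSpeedSchedulerCallable is a stateless
        # factory that can safely be reused, so only concrete objects get the flag.
        if isinstance(optimizer, Optimizer):
            optimizer.ds_is_inited = True
        if isinstance(lr_scheduler, _LRScheduler):
            lr_scheduler.ds_is_inited = True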

Regarding,

I think this call should be moved into _mark_trainobjs_initialized()

I think _assert_trainobjs_not_inited should stay separate from _mark_trainobjs_initialized, because _mark_trainobjs_initialized is also called on the wrapped model and optimizer just before deepspeed.initialize returns. The wrapped model may already have ds_is_inited set to True, since DeepSpeedEngine passes the internal model's flags through to the wrapper.

If we still want to keep _assert_trainobjs_not_inited inside _mark_trainobjs_initialized, we can do one of the following three things (see the sketch below):

  1. add a flag to _mark_trainobjs_initialized indicating whether the not-inited check should be performed
  2. set/check the flag via __dict__ rather than setattr/getattr
  3. use type information when checking whether an object is initialized, i.e., for DeepSpeedEngine instances directly return inited == True instead of checking flags.
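
A hedged sketch of options 2 and 3 combined (DeepSpeedEngine's attribute passthrough is as described above; this illustrates the idea rather than the PR's code):

    def _is_initialized(obj) -> bool:
        from deepspeed.runtime.engine import DeepSpeedEngine

        # Option 3: an engine is by construction already initialized.
        if isinstance(obj, DeepSpeedEngine):
            return True
        # Option 2: consult only the instance's own __dict__, so a ds_is_inited
        # flag forwarded from the wrapped model via attribute passthrough does
        # not register as a flag set on the wrapper itself.
        return obj.__dict__.get("ds_is_inited", False)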

Linked issue that merging this pull request may close:
[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs