Add shuffling in Nanotron for subsequent epochs when data is repeated #247

Lauler · 2024-11-24T13:32:12Z

Nanoset's index builder does not re-shuffle the dataset and sample indices when training for a second, third, or later epoch. Instead, it concatenates a copy of the same indices for any repeated data. This PR adds a distinct within-epoch shuffle for each epoch.

See the following issue: #237
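
The difference described above can be sketched in a few lines. This is a minimal illustration, not Nanoset's actual index builder: the function names `repeated_index` and `reshuffled_index`, and the use of a plain range of sample ids, are assumptions made for the example.

```python
import numpy as np


def repeated_index(num_samples: int, num_epochs: int, seed: int) -> np.ndarray:
    """Hypothetical 'before' behaviour: shuffle once, then copy that order per epoch."""
    rng = np.random.RandomState(seed)
    order = np.arange(num_samples)
    rng.shuffle(order)
    # Every epoch replays exactly the same order.
    return np.tile(order, num_epochs)


def reshuffled_index(num_samples: int, num_epochs: int, seed: int) -> np.ndarray:
    """Hypothetical 'after' behaviour: a fresh permutation for each epoch."""
    epochs = []
    for epoch in range(num_epochs):
        # A different, but reproducible, seed for every epoch.
        rng = np.random.RandomState(seed + epoch)
        epochs.append(rng.permutation(num_samples))
    return np.concatenate(epochs)
```

With `repeated_index`, the model sees the samples in an identical order in every epoch. With `reshuffled_index`, each epoch still covers every sample exactly once, but in its own order.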

I ran the tests in tests/nanotron:

=========================================================================== warnings summary ===========================================================================
tests/helpers/context.py:7: 35 warnings
  /home/faton/projects/text/nanotron_dev/nanotron/tests/helpers/context.py:7: PytestCollectionWarning: cannot collect test class 'TestContext' because it has a __init__ constructor (from: nanoset/test_build_nanoset_dataloader.py)
    class TestContext:

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================== 12 passed, 35 warnings in 44.55s ===================================================================

@TJ-Solergibert

NouamaneTazi

Nice catch! LGTM 🤗
Minor suggestion before merging.

NouamaneTazi · 2024-11-25T09:27:20Z

src/nanotron/data/nanoset.py

+
+        # Shuffle indices in each epoch with different random seeds and concatenate them
+        r = np.random.RandomState(self.random_seed)
+        epoch_random_seeds = r.randint(0, 2**32 - 1, num_epochs)


Can we just use self.random_seed + num_epoch for easier reproducibility?

Lauler · Nov 28, 2024

Good suggestion. It makes the code a lot simpler and clearer about what's going on. I've changed the code to incorporate this in the latest commit:

nanotron/src/nanotron/data/nanoset.py

Lines 116 to 123 in f060414

    dataset_indices = []
    dataset_sample_indices = []
    for num_epoch in range(num_epochs):
        # Shuffle the sample and dataset indices in epoch with a given seed
        numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
        numpy_random_state.shuffle(dataset_index)
        numpy_random_state = np.random.RandomState(self.random_seed + num_epoch)
        numpy_random_state.shuffle(dataset_sample_index)
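
For readers following the snippet above, here is a self-contained sketch of how the loop might be completed. The function name `build_epoch_indices`, the copy-and-append step, and the final `np.concatenate` are assumptions, since the quoted lines stop before that point. Only the seeding and shuffling calls come from the PR. Re-creating the `RandomState` with the same seed before the second shuffle applies the same permutation to both equal-length 1-D arrays, so each (dataset, sample) pair stays together.

```python
import numpy as np


def build_epoch_indices(dataset_index, dataset_sample_index, num_epochs, random_seed):
    """Hypothetical helper: shuffle two aligned index arrays once per epoch."""
    # Work on copies so the caller's arrays are left untouched.
    dataset_index = np.array(dataset_index, copy=True)
    dataset_sample_index = np.array(dataset_sample_index, copy=True)

    dataset_indices = []
    dataset_sample_indices = []
    for num_epoch in range(num_epochs):
        # Same seed for both shuffles, so both arrays get the same permutation.
        numpy_random_state = np.random.RandomState(random_seed + num_epoch)
        numpy_random_state.shuffle(dataset_index)
        numpy_random_state = np.random.RandomState(random_seed + num_epoch)
        numpy_random_state.shuffle(dataset_sample_index)

        # Assumed step: store a snapshot of this epoch's order.
        dataset_indices.append(dataset_index.copy())
        dataset_sample_indices.append(dataset_sample_index.copy())

    return np.concatenate(dataset_indices), np.concatenate(dataset_sample_indices)
```

Because the seed is `random_seed + num_epoch`, the order for any given epoch can be reproduced from the base seed and the epoch number alone.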

NouamaneTazi approved these changes Nov 25, 2024

Lauler added 2 commits November 28, 2024 09:18

Add shuffling for subsequent epochs when data is repeated

eab4770

Simplify random seed in epoch data for reproducibility

f060414

Lauler force-pushed the nanoset-shuffle-epochs-data branch from fe4aa4e to f060414 Compare November 28, 2024 08:18
