ImageNet finetuning exploding #69

Open
ds2268 opened this issue Dec 5, 2023 · 10 comments

ds2268 commented Dec 5, 2023

I have pre-trained the resnet50 model for 800 epochs. The loss looks fine:

image

I then used the pre-trained model for ImageNet fine-tuning, and the loss pretty much always "exploded" (see the log below).

I am using the original hyperparameters defined in HP_DEFAULT_VALUES over 32 x A100 GPUs with a default batch_size=4096.

Any clues @keyu-tian?

[12-06 03:28:44] (nstream_imagenet/util.py, line  98)=> [optimizer=<class 'timm.optim.lamb.Lamb'>]
[12-06 03:28:44] (nstream_imagenet/util.py, line 110)=> [loss_fn] BinaryCrossEntropy(smoothing=0, target_threshold=None, reduction=mean)
[12-06 03:28:44] (nstream_imagenet/util.py, line 111)=> [mixup_fn] <mixup.BatchMixup object at 0x7fde57918b80>
[12-06 03:28:44] (nstream_imagenet/util.py, line 119)=> [try to resume from file `/ceph/hpc/data/MFIP/outputs/SparK/8_node__GPU_ID_9099653_a_resnet50_b_4096_e_800_train//resnet50_1kpretrained_timm_style.pth`]
[12-06 03:28:45] (nstream_imagenet/util.py, line 125)=> [load_checkpoint] missing_keys=['fc.weight', 'fc.bias']
[12-06 03:28:45] (nstream_imagenet/util.py, line 126)=> [load_checkpoint] unexpected_keys=[]
[12-06 03:28:45] (nstream_imagenet/util.py, line 127)=> [load_checkpoint] ep_start=0, performance_desc=[no performance_desc]
[12-06 03:28:45] (nstream_imagenet/data.py, line  99)=> Transform [train] = 
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> RandomResizedCropAndInterpolation(size=(224, 224), scale=(0.08, 1.0), ratio=(0.75, 1.3333), interpolation=bicubic)
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> RandomHorizontalFlip(p=0.5)
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> TrivialAugmentWide(num_magnitude_bins=31, interpolation=InterpolationMode.BICUBIC, fill=None)
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> ToTensor()
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> Normalize(mean=tensor([0.4850, 0.4560, 0.4060]), std=tensor([0.2290, 0.2240, 0.2250]))
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> RandomErasing(p=0.25, mode=pixel, count=(1, 1))
[12-06 03:28:45] (nstream_imagenet/data.py, line 102)=> ---------------------------

[12-06 03:28:45] (nstream_imagenet/data.py, line  99)=> Transform [val] = 
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> Resize(size=235, interpolation=bicubic, max_size=None, antialias=None)
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> CenterCrop(size=(224, 224))
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> ToTensor()
[12-06 03:28:45] (nstream_imagenet/data.py, line 101)=> Normalize(mean=tensor([0.4850, 0.4560, 0.4060]), std=tensor([0.2290, 0.2240, 0.2250]))
[12-06 03:28:45] (nstream_imagenet/data.py, line 102)=> ---------------------------

[12-06 03:30:00] (nstream_imagenet/data.py, line  75)=> [dataset: train] bs=32x128=4096, num_iters=313
[12-06 03:30:00] (nstream_imagenet/data.py, line  84)=> [dataset: val] bs=32x256=8192, num_iters=196
[12-06 03:30:53] (nstream_imagenet/main.py, line  47)=> [fine-tune] initial acc=0.11, ema=0.11
[12-06 03:30:53] (nstream_imagenet/main.py, line  50)=> [FT start] ep_eval=[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299] 
[12-06 03:30:53] (nstream_imagenet/main.py, line  51)=> [FT start] from ep0
[12-06 03:30:53] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(0)]
[12-06 03:32:05] (nstream_imagenet/main.py, line 174)=> [ep0 it  3/313]    L: 0.7049    Acc: 0.78    lr: 1.8e-05~2.2e-04    Remain: 1:32:40
[12-06 03:33:43] (nstream_imagenet/main.py, line 174)=> [ep0 it156/313]    L: 0.0081    Acc: 0.00    lr: 2.7e-04~3.3e-03    Remain: 0:02:48
[12-06 03:35:20] (nstream_imagenet/main.py, line 174)=> [ep0 it312/313]    L: 0.0078    Acc: 0.00    lr: 5.4e-04~6.5e-03    Remain: 0:00:00
[12-06 03:35:46] (nstream_imagenet/main.py, line  84)=> [ep0/300]    Max (Last) Acc: 0.48 (0.48 o 50000.0)    EMA: 0.12 (0.12 o 50000.0)    Ep cost: 293.86s,   Ev cost: 9.6,    Remain: 1 day, 0:24:24,    Finish @ 12-06 21:00
[12-06 03:35:48] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(1)]
[12-06 03:35:55] (nstream_imagenet/main.py, line 174)=> [ep1 it  3/313]    L: 0.0074    Acc: 1.56    lr: 5.4e-04~6.6e-03    Remain: 0:08:53
[12-06 03:37:41] (nstream_imagenet/main.py, line 174)=> [ep1 it156/313]    L: 0.0074    Acc: 1.56    lr: 8.0e-04~9.7e-03    Remain: 0:01:53
[12-06 03:39:19] (nstream_imagenet/main.py, line 174)=> [ep1 it312/313]    L: 0.0058    Acc: 17.00    lr: 1.1e-03~1.3e-02    Remain: 0:00:00
[12-06 03:39:19] (nstream_imagenet/main.py, line  84)=> [ep1/300]    Max (Last) Acc: 0.48 (0.48 o 50000.0)    EMA: 0.12 (0.12 o 50000.0)    Ep cost: 212.48s,   Ev cost: -,    Remain: 17:35:19,    Finish @ 12-06 14:14
[12-06 03:39:20] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(2)]
[12-06 03:39:37] (nstream_imagenet/main.py, line 174)=> [ep2 it  3/313]    L: 0.0063    Acc: 21.09    lr: 1.1e-03~1.3e-02    Remain: 0:20:45
[12-06 03:41:24] (nstream_imagenet/main.py, line 174)=> [ep2 it156/313]    L: 0.0054    Acc: 21.88    lr: 1.3e-03~1.6e-02    Remain: 0:02:03
[12-06 03:43:04] (nstream_imagenet/main.py, line 174)=> [ep2 it312/313]    L: 0.0060    Acc: 29.00    lr: 1.6e-03~1.9e-02    Remain: 0:00:00
[12-06 03:43:04] (nstream_imagenet/main.py, line  84)=> [ep2/300]    Max (Last) Acc: 0.48 (0.48 o 50000.0)    EMA: 0.12 (0.12 o 50000.0)    Ep cost: 224.47s,   Ev cost: -,    Remain: 18:31:08,    Finish @ 12-06 15:14
[12-06 03:43:09] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(3)]
[12-06 03:43:21] (nstream_imagenet/main.py, line 174)=> [ep3 it  3/313]    L: 0.0065    Acc: 21.09    lr: 1.6e-03~1.9e-02    Remain: 0:16:12
[12-06 03:45:12] (nstream_imagenet/main.py, line 174)=> [ep3 it156/313]    L: 0.0063    Acc: 26.56    lr: 1.8e-03~2.2e-02    Remain: 0:02:02
[12-06 03:46:55] (nstream_imagenet/main.py, line 174)=> [ep3 it312/313]    L: 0.0059    Acc: 27.00    lr: 2.1e-03~2.6e-02    Remain: 0:00:00
[12-06 03:46:55] (nstream_imagenet/main.py, line  84)=> [ep3/300]    Max (Last) Acc: 0.48 (0.48 o 50000.0)    EMA: 0.12 (0.12 o 50000.0)    Ep cost: 226.84s,   Ev cost: -,    Remain: 18:39:05,    Finish @ 12-06 15:25
[12-06 03:47:12] (nstream_imagenet/main.py, line 174)=> [ep4 it  3/313]    L: 0.0064    Acc: 28.12    lr: 2.1e-03~2.6e-02    Remain: 0:20:50
[12-06 03:49:02] (nstream_imagenet/main.py, line 174)=> [ep4 it156/313]    L: 0.0054    Acc: 24.22    lr: 2.4e-03~2.9e-02    Remain: 0:02:05
[12-06 03:50:47] (nstream_imagenet/main.py, line 174)=> [ep4 it312/313]    L: 0.0061    Acc: 31.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 03:50:47] (nstream_imagenet/main.py, line  84)=> [ep4/300]    Max (Last) Acc: 0.48 (0.48 o 50000.0)    EMA: 0.12 (0.12 o 50000.0)    Ep cost: 232.49s,   Ev cost: -,    Remain: 19:03:05,    Finish @ 12-06 15:53
[12-06 03:51:05] (nstream_imagenet/main.py, line 174)=> [ep5 it  3/313]    L: 0.0068    Acc: 23.44    lr: 2.6e-03~3.2e-02    Remain: 0:20:43
[12-06 03:52:55] (nstream_imagenet/main.py, line 174)=> [ep5 it156/313]    L: 0.0053    Acc: 30.47    lr: 2.6e-03~3.2e-02    Remain: 0:02:05
[12-06 03:54:38] (nstream_imagenet/main.py, line 174)=> [ep5 it312/313]    L: 0.0065    Acc: 17.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 03:54:56] (nstream_imagenet/main.py, line  84)=> [ep5/300]    Max (Last) Acc: 19.08 (19.08 o 50000.0)    EMA: 0.12 (0.00 o 50000.0)    Ep cost: 248.84s,   Ev cost: 8.36,    Remain: 20:19:19,    Finish @ 12-06 17:14
[12-06 03:55:05] (nstream_imagenet/main.py, line 174)=> [ep6 it  3/313]    L: 0.0068    Acc: 14.06    lr: 2.6e-03~3.2e-02    Remain: 0:09:57
[12-06 03:56:55] (nstream_imagenet/main.py, line 174)=> [ep6 it156/313]    L: 0.0058    Acc: 13.28    lr: 2.6e-03~3.2e-02    Remain: 0:01:56
[12-06 03:58:38] (nstream_imagenet/main.py, line 174)=> [ep6 it312/313]    L: 0.0065    Acc: 11.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 03:58:38] (nstream_imagenet/main.py, line  84)=> [ep6/300]    Max (Last) Acc: 19.08 (19.08 o 50000.0)    EMA: 0.12 (0.00 o 50000.0)    Ep cost: 220.9s,   Ev cost: -,    Remain: 17:58:44,    Finish @ 12-06 14:57
[12-06 03:58:55] (nstream_imagenet/main.py, line 174)=> [ep7 it  3/313]    L: 0.0065    Acc: 10.94    lr: 2.6e-03~3.2e-02    Remain: 0:20:51
[12-06 04:00:44] (nstream_imagenet/main.py, line 174)=> [ep7 it156/313]    L: 0.0086    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:02:05
[12-06 04:02:28] (nstream_imagenet/main.py, line 174)=> [ep7 it312/313]    L: 1.7164    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:02:28] (nstream_imagenet/main.py, line  84)=> [ep7/300]    Max (Last) Acc: 19.08 (19.08 o 50000.0)    EMA: 0.12 (0.00 o 50000.0)    Ep cost: 230.42s,   Ev cost: -,    Remain: 18:41:23,    Finish @ 12-06 15:43
[12-06 04:02:46] (nstream_imagenet/main.py, line 174)=> [ep8 it  3/313]    L: 1.6271    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:21:01
[12-06 04:04:33] (nstream_imagenet/main.py, line 174)=> [ep8 it156/313]    L: 0.8356    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:02:03
[12-06 04:06:15] (nstream_imagenet/main.py, line 174)=> [ep8 it312/313]    L: 25.5185    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:06:15] (nstream_imagenet/main.py, line  84)=> [ep8/300]    Max (Last) Acc: 19.08 (19.08 o 50000.0)    EMA: 0.12 (0.00 o 50000.0)    Ep cost: 226.79s,   Ev cost: -,    Remain: 18:19:56,    Finish @ 12-06 15:26
[12-06 04:06:32] (nstream_imagenet/main.py, line 174)=> [ep9 it  3/313]    L: 52.2942    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:20:42
[12-06 04:08:21] (nstream_imagenet/main.py, line 174)=> [ep9 it156/313]    L: 555.4008    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:02:04
[12-06 04:10:02] (nstream_imagenet/main.py, line 174)=> [ep9 it312/313]    L: 4506.8662    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:10:02] (nstream_imagenet/main.py, line  84)=> [ep9/300]    Max (Last) Acc: 19.08 (19.08 o 50000.0)    EMA: 0.12 (0.00 o 50000.0)    Ep cost: 226.59s,   Ev cost: -,    Remain: 18:15:11,    Finish @ 12-06 15:25
[12-06 04:10:19] (nstream_imagenet/main.py, line 174)=> [ep10 it  3/313]    L: 4193.1646    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:20:30
[12-06 04:12:08] (nstream_imagenet/main.py, line 174)=> [ep10 it156/313]    L: 19934.4492    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:04
[12-06 04:13:50] (nstream_imagenet/main.py, line 174)=> [ep10 it312/313]    L: 74546.9453    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:14:07] (nstream_imagenet/main.py, line  84)=> [ep10/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 244.51s,   Ev cost: 10.77,    Remain: 19:37:43,    Finish @ 12-06 16:51
[12-06 04:14:18] (nstream_imagenet/main.py, line 174)=> [ep11 it  3/313]    L: 78996.7109    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:13:17
[12-06 04:16:12] (nstream_imagenet/main.py, line 174)=> [ep11 it156/313]    L: 331006.5312    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:03
[12-06 04:17:52] (nstream_imagenet/main.py, line 174)=> [ep11 it312/313]    L: 1176097.8750    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:17:52] (nstream_imagenet/main.py, line  84)=> [ep11/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 224.82s,   Ev cost: -,    Remain: 17:59:08,    Finish @ 12-06 15:17
[12-06 04:18:09] (nstream_imagenet/main.py, line 174)=> [ep12 it  3/313]    L: 1121348.3750    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:20:45
[12-06 04:19:58] (nstream_imagenet/main.py, line 174)=> [ep12 it156/313]    L: 3709660.7500    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:04
[12-06 04:21:42] (nstream_imagenet/main.py, line 174)=> [ep12 it312/313]    L: 8070643.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:21:42] (nstream_imagenet/main.py, line  84)=> [ep12/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 230.11s,   Ev cost: -,    Remain: 18:20:42,    Finish @ 12-06 15:42
[12-06 04:21:59] (nstream_imagenet/main.py, line 174)=> [ep13 it  3/313]    L: 9853109.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:20:37
[12-06 04:23:48] (nstream_imagenet/main.py, line 174)=> [ep13 it156/313]    L: 34221128.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:04
[12-06 04:25:29] (nstream_imagenet/main.py, line 174)=> [ep13 it312/313]    L: 107099432.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:25:29] (nstream_imagenet/main.py, line  84)=> [ep13/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 226.71s,   Ev cost: -,    Remain: 18:00:39,    Finish @ 12-06 15:26
[12-06 04:25:48] (nstream_imagenet/main.py, line 174)=> [ep14 it  3/313]    L: 112121688.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:22:23
[12-06 04:27:36] (nstream_imagenet/main.py, line 174)=> [ep14 it156/313]    L: 468563456.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:05
[12-06 04:29:20] (nstream_imagenet/main.py, line 174)=> [ep14 it312/313]    L: 16655913984.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:29:20] (nstream_imagenet/main.py, line  84)=> [ep14/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 230.8s,   Ev cost: -,    Remain: 18:16:18,    Finish @ 12-06 15:45
[12-06 04:29:38] (nstream_imagenet/main.py, line 174)=> [ep15 it  3/313]    L: 14278984704.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:21:09
[12-06 04:31:26] (nstream_imagenet/main.py, line 174)=> [ep15 it156/313]    L: 17783138304.0000    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:04
[12-06 04:33:06] (nstream_imagenet/main.py, line 174)=> [ep15 it312/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:33:23] (nstream_imagenet/main.py, line  84)=> [ep15/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 242.62s,   Ev cost: 12.07,    Remain: 19:08:24,    Finish @ 12-06 16:41
[12-06 04:33:36] (nstream_imagenet/main.py, line 174)=> [ep16 it  3/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:14:41
[12-06 04:35:28] (nstream_imagenet/main.py, line 174)=> [ep16 it156/313]    L: nan    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:02:03
[12-06 04:37:08] (nstream_imagenet/main.py, line 174)=> [ep16 it312/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
[12-06 04:37:08] (nstream_imagenet/main.py, line  84)=> [ep16/300]    Max (Last) Acc: 19.08 (0.10 o 50000.0)    EMA: 0.12 (0.10 o 50000.0)    Ep cost: 225.16s,   Ev cost: -,    Remain: 17:42:00,    Finish @ 12-06 15:19
[12-06 04:37:25] (nstream_imagenet/main.py, line 174)=> [ep17 it  3/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:20:18
[12-06 04:39:13] (nstream_imagenet/main.py, line 174)=> [ep17 it156/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:02:03
[12-06 04:40:53] (nstream_imagenet/main.py, line 174)=> [ep17 it312/313]    L: nan    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00
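
To locate where a run like the one above starts to diverge, the `L:` values can be scanned automatically. This is a minimal sketch, assuming the per-iteration log format shown in this issue; the function name `first_divergence` and the threshold of 1.0 are hypothetical choices, not part of the SparK code.

```python
import math
import re

# Matches per-iteration lines such as "[ep7 it156/313]    L: 0.0086 ...".
# Per-epoch summary lines like "[ep7/300] ..." do not match and are skipped.
LINE_RE = re.compile(r"\[ep(\d+) it\s*(\d+)/(\d+)\]\s+L: (\S+)")


def first_divergence(lines, threshold=1.0):
    """Return (epoch, iteration, loss) for the first loss that is non-finite
    or above `threshold`, or None if no such line exists.

    `threshold` is an arbitrary heuristic chosen for this sketch.
    """
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        epoch, iteration = int(match.group(1)), int(match.group(2))
        loss = float(match.group(4))  # float("nan") parses fine
        if not math.isfinite(loss) or loss > threshold:
            return epoch, iteration, loss
    return None


# Three lines copied from the log in this issue.
sample_lines = [
    "[12-06 03:32:05] (nstream_imagenet/main.py, line 174)=> [ep0 it  3/313]    L: 0.7049    Acc: 0.78    lr: 1.8e-05~2.2e-04    Remain: 1:32:40",
    "[12-06 04:00:44] (nstream_imagenet/main.py, line 174)=> [ep7 it156/313]    L: 0.0086    Acc: 1.56    lr: 2.6e-03~3.2e-02    Remain: 0:02:05",
    "[12-06 04:02:28] (nstream_imagenet/main.py, line 174)=> [ep7 it312/313]    L: 1.7164    Acc: 0.00    lr: 2.6e-03~3.2e-02    Remain: 0:00:00",
]

print(first_divergence(sample_lines))
```

For the sample lines, the first flagged entry is the last iteration of epoch 7, where the loss jumps from 0.0086 to 1.7164.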
@ds2268
Author

ds2268 commented Dec 8, 2023

I have also tried the parameters from the paper (batch size 2048, lr=3e-8, etc.). The fine-tuning still explodes (the loss quickly drops to near 0 and then becomes NaN).

[12-07 18:37:04] (nstream_imagenet/main.py, line 174)=> [ep0 it  3/626]    L: 0.6937    Acc: 0.00    lr: 3.1e-05~3.8e-04    Remain: 3:26:47
[12-07 18:40:10] (nstream_imagenet/main.py, line 174)=> [ep0 it313/626]    L: 0.0078    Acc: 0.00    lr: 5.5e-04~6.7e-03    Remain: 0:04:24
[12-07 18:43:23] (nstream_imagenet/main.py, line 174)=> [ep0 it625/626]    L: 0.0059    Acc: 9.72    lr: 1.1e-03~1.3e-02    Remain: 0:00:00
[12-07 18:44:04] (nstream_imagenet/main.py, line  84)=> [ep0/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 500.25s,   Ev cost: 23.38,    Remain: 1 day, 17:32:55,    Finish @ 12-09 05:16
[12-07 18:44:06] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(1)]
[12-07 18:44:13] (nstream_imagenet/main.py, line 174)=> [ep1 it  3/626]    L: 0.0059    Acc: 15.62    lr: 1.1e-03~1.3e-02    Remain: 0:18:02
[12-07 18:47:18] (nstream_imagenet/main.py, line 174)=> [ep1 it313/626]    L: 0.0055    Acc: 21.09    lr: 1.6e-03~1.9e-02    Remain: 0:03:11
[12-07 18:50:15] (nstream_imagenet/main.py, line 174)=> [ep1 it625/626]    L: 0.0056    Acc: 23.61    lr: 2.1e-03~2.6e-02    Remain: 0:00:00
[12-07 18:50:15] (nstream_imagenet/main.py, line  84)=> [ep1/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 370.16s,   Ev cost: -,    Remain: 1 day, 6:38:28,    Finish @ 12-08 18:28
[12-07 18:50:17] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(2)]
[12-07 18:50:28] (nstream_imagenet/main.py, line 174)=> [ep2 it  3/626]    L: 0.0055    Acc: 23.44    lr: 2.1e-03~2.6e-02    Remain: 0:29:35
[12-07 18:53:36] (nstream_imagenet/main.py, line 174)=> [ep2 it313/626]    L: 0.0071    Acc: 13.28    lr: 2.6e-03~3.2e-02    Remain: 0:03:18
[12-07 18:56:33] (nstream_imagenet/main.py, line 174)=> [ep2 it625/626]    L: 0.0069    Acc: 5.56    lr: 3.2e-03~3.9e-02    Remain: 0:00:00
[12-07 18:56:33] (nstream_imagenet/main.py, line  84)=> [ep2/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 376.92s,   Ev cost: -,    Remain: 1 day, 7:05:45,    Finish @ 12-08 19:02
[12-07 18:56:34] (nstream_imagenet/main.py, line  60)=> [loader_train.sampler.set_epoch(3)]
[12-07 18:56:48] (nstream_imagenet/main.py, line 174)=> [ep3 it  3/626]    L: 0.0077    Acc: 0.78    lr: 3.2e-03~3.9e-02    Remain: 0:34:59
[12-07 18:59:55] (nstream_imagenet/main.py, line 174)=> [ep3 it313/626]    L: 62.9384    Acc: 0.00    lr: 3.7e-03~4.5e-02    Remain: 0:03:20
[12-07 19:02:52] (nstream_imagenet/main.py, line 174)=> [ep3 it625/626]    L: 317.5974    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:00:00
[12-07 19:02:52] (nstream_imagenet/main.py, line  84)=> [ep3/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 378.86s,   Ev cost: -,    Remain: 1 day, 7:09:03,    Finish @ 12-08 19:11
[12-07 19:03:08] (nstream_imagenet/main.py, line 174)=> [ep4 it  3/626]    L: 267.8481    Acc: 0.00    lr: 4.2e-03~5.1e-02    Remain: 0:38:13
[12-07 19:06:16] (nstream_imagenet/main.py, line 174)=> [ep4 it313/626]    L: 352016.5938    Acc: 0.00    lr: 4.7e-03~5.8e-02    Remain: 0:03:21
[12-07 19:09:15] (nstream_imagenet/main.py, line 174)=> [ep4 it625/626]    L: 3266225152.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00
[12-07 19:09:15] (nstream_imagenet/main.py, line  84)=> [ep4/300]    Max (Last) Acc: 8.97 (8.97 o 50000.0)    EMA: 0.13 (0.01 o 50000.0)    Ep cost: 382.58s,   Ev cost: -,    Remain: 1 day, 7:21:01,    Finish @ 12-08 19:30
[12-07 19:09:31] (nstream_imagenet/main.py, line 174)=> [ep5 it  3/626]    L: 3494824192.0000    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:38:32
[12-07 19:12:40] (nstream_imagenet/main.py, line 174)=> [ep5 it313/626]    L: nan    Acc: 1.56    lr: 5.3e-03~6.4e-02    Remain: 0:03:22
[12-07 19:15:39] (nstream_imagenet/main.py, line 174)=> [ep5 it625/626]    L: nan    Acc: 0.00    lr: 5.3e-03~6.4e-02    Remain: 0:00:00

@keyu-tian
Owner

keyu-tian commented Dec 8, 2023

Hi @ds2268, the 800-epoch pre-training looks normal. The fine-tuning loss before the explosion (about 5e-3, close to zero) is also expected, since we use BCE loss instead of CE. (PS: we never observed a loss explosion in any of our fine-tuning experiments.)

Have you used mixed precision?

I also found that the default batch size should be 2048; maybe you can try that as well.
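
The BCE point can be checked with a little arithmetic. This is a minimal sketch, assuming the loss is averaged over all 1000 class outputs (as the `reduction=mean` in the log suggests) and that the target is one-hot, ignoring mixup; `bce_mean` is a hypothetical helper, not the timm implementation.

```python
import math


def bce_mean(probs, target_index):
    """Binary cross-entropy averaged over all classes for a one-hot target."""
    total = 0.0
    for i, p in enumerate(probs):
        y = 1.0 if i == target_index else 0.0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


num_classes = 1000

# Untrained head: every logit is 0, so every sigmoid output is 0.5.
loss_init = bce_mean([0.5] * num_classes, target_index=0)

# Head that has only learned "almost every class is negative":
# every sigmoid output is 1/1000, including the true class.
loss_all_negative = bce_mean([1.0 / num_classes] * num_classes, target_index=0)

print(f"{loss_init:.4f}")
print(f"{loss_all_negative:.4f}")
```

Under these assumptions the two values are ln 2 (about 0.693) and about 0.0079, in line with the 0.7049 and 0.6937 logged at the first iterations and the 0.0078 logged shortly afterwards. A loss this small therefore does not by itself mean the model classifies well.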

@ds2268
Author

ds2268 commented Dec 8, 2023

I have tried the batch-size-2048 config from the paper, with no success. I think the downstream ImageNet code does not use mixed precision; I could only find apex libs in the downstream mmdet code.

@keyu-tian
Owner

Could you try running with timm==0.5.4?

@ds2268
Author

ds2268 commented Dec 9, 2023

I am already running with:

timm 0.5.44
torch 1.12.0
torchvision 0.13.1
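
A quick way to report installed versions like the ones listed above is to query the package metadata from Python. This is a small sketch using only the standard library (Python 3.8+); `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["timm", "torch", "torchvision"]))
```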

@ds2268
Author

ds2268 commented Dec 9, 2023

Looks like the issue with ResNet-50 is related to #27

@keyu-tian
Owner

Honestly I have no idea what the problem is with the fine-tuning code (yes #27 is similar). Maybe you can try again with base_lr < 0.002. I will run this too.
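
For reference, here is a sketch of how a base learning rate translates into the peak learning rate, assuming the common linear scaling rule `lr = base_lr * batch_size / 256`. The rule is an assumption here (this thread does not show `args.py`), although 0.002 * 4096 / 256 = 0.032 agrees with the 3.2e-02 upper value in the first log. `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(base_lr, global_batch_size, reference_batch_size=256):
    """Linear scaling rule (assumed): lr grows proportionally with batch size."""
    return base_lr * global_batch_size / reference_batch_size


# Peak learning rates for a few candidate base_lr values at batch size 4096.
for base_lr in (0.002, 0.001, 0.0005):
    print(f"base_lr={base_lr:g} -> peak lr={scaled_lr(base_lr, 4096):g}")
```

If the rule holds, halving `base_lr` to 0.001 would cap the peak at 0.016 for a global batch size of 4096.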

@ds2268
Author

ds2268 commented Dec 15, 2023

@keyu-tian, I have now pre-trained a ConvNext-S model (800 epochs) and run ImageNet fine-tuning:

image

It's not finished yet (140 of 200 epochs), but it looks like it's working for ConvNext-S. The reported result for ConvNext-S is 84.1. I will probably not reach that by 200 epochs, likely because I pre-trained for only 800 epochs.

image

The problem is then really just with the Resnet-50 stability.

@keyu-tian
Owner

keyu-tian commented Dec 15, 2023

@ds2268 thanks for your verification. So it should be LAMB or BCE causing the problem.

I currently don't have enough GPUs or time to debug further. You can start with ConvNeXt, try a smaller fine-tuning learning rate for ResNet-50, or try ResNet-101.

ps: it is always recommended to use the default hyperparameters in downstream_imagenet/args.py, not from the paper (which may be old) or elsewhere.

@sdreamforchen

Will you upgrade the algorithm further so that it works with a smaller batch size while having the same effect?
