ImageNet finetuning exploding #69
I have also tried the parameters from the paper (batch size 2048, lr=3e-8, etc.). The fine-tuning is still exploding (the loss quickly drops toward 0 and then becomes NaN).
Hi @ds2268, the 800-epoch pre-training looks normal. The fine-tuning loss before the explosion (5e-3, close to zero) is also as expected, since we use BCE loss instead of CE. (PS: we never observed any loss explosion in any of our fine-tuning experiments.) Have you used mixed precision? I also found that the default batch size should be 2048; maybe you can try that as well.
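For context, here is a minimal illustration (plain PyTorch, not the repo's actual loss code) of why a BCE-based classification loss reads much smaller than CE for the same predictions: BCE is averaged over all 1000 class logits, most of which are easy negatives.

```python
import torch
import torch.nn.functional as F

# Toy logits for 8 samples over 1000 classes, where the model is already
# fairly confident about the correct class.
num_classes = 1000
targets = torch.randint(0, num_classes, (8,))
logits = torch.full((8, num_classes), -4.0)
logits[torch.arange(8), targets] = 4.0

ce = F.cross_entropy(logits, targets)
one_hot = F.one_hot(targets, num_classes).float()
bce = F.binary_cross_entropy_with_logits(logits, one_hot)  # mean over 8 * 1000 elements

print(f"CE:  {ce.item():.4f}")   # ~0.29 for these toy logits
print(f"BCE: {bce.item():.4f}")  # ~0.018: averaged over all 1000 logits, so "close to zero" is expected
```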
I have tried the 2048 config from the paper, with no success. I think the downstream ImageNet code does not use mixed precision; I could only find the apex libraries in the downstream mmdet code.
Could you try running with timm==0.5.4?
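(A quick way to confirm the installed version before re-running; nothing repo-specific:)

```python
import timm
print(timm.__version__)
# if it differs from the suggested pin:  pip install "timm==0.5.4"
```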
I am already running with timm 0.5.4.
Looks like the issue with ResNet-50 is related to #27.
Honestly, I have no idea what the problem is with the fine-tuning code (yes, #27 looks similar). Maybe you can try again with base_lr < 0.002; I will run this too.
@keyu-tian, I have now pre-trained a ConvNeXt-S model (800 epochs) and performed ImageNet fine-tuning. It's not yet finished (140 / 200 epochs), but it looks like it's working for ConvNeXt-S. The reported result for ConvNeXt-S is 84.1; I will probably not reach it within 200 epochs, but that is likely due to only 800 epochs of pre-training. The problem then really is just with ResNet-50 stability.
@ds2268 thanks for your verification. So it should be LAMB or BCE causing the problem. Currently I don't have enough GPUs or time to debug further; you can start with ConvNeXt, try a smaller fine-tuning learning rate for ResNet-50, or try ResNet-101. PS: it is always recommended to use the default hyperparameters in downstream_imagenet/args.py, not those from the paper (which may be outdated) or elsewhere.
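To narrow this down, a minimal ablation sketch (plain PyTorch/timm, not SparK's actual fine-tuning loop; the helper name, defaults, and the timm import path are assumptions) that swaps the two suspected components one at a time:

```python
import torch
import torch.nn as nn
from timm.optim import Lamb  # LAMB implementation bundled with timm (import path assumed)

def build_opt_and_loss(model, use_lamb=True, use_bce=True, lr=2e-3, wd=0.02):
    """Hypothetical helper: toggle LAMB -> AdamW and BCE -> CE to isolate the NaN source."""
    if use_lamb:
        opt = Lamb(model.parameters(), lr=lr, weight_decay=wd)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    # BCE expects one-hot float targets, CE expects class indices.
    crit = nn.BCEWithLogitsLoss() if use_bce else nn.CrossEntropyLoss()
    return opt, crit
```

If LAMB turns out to be the trigger, gradient clipping (torch.nn.utils.clip_grad_norm_ before opt.step()) is another cheap thing to try.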
I have pre-trained the ResNet-50 model for 800 epochs, and the pre-training loss looks fine.
I have then used the pre-trained model for ImageNet fine-tuning, and the loss pretty much always "exploded" (see below).
I am using the original hyperparameters defined in HP_DEFAULT_VALUES, over 32 x A100 GPUs with the default batch_size=4096.
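As a sanity check on the effective step size, a hypothetical back-of-the-envelope calculation (the base_lr value and the per-256 linear scaling rule are assumptions, not necessarily what args.py actually does):

```python
# Hypothetical linear LR scaling check; numbers are illustrative only.
base_lr_per_256 = 0.002      # assumed base rate per 256 samples
global_batch = 32 * 128      # 32 A100s x 128 per GPU = 4096
lr = base_lr_per_256 * global_batch / 256
print(lr)                    # 0.032; a 2048-batch run under the same rule would get 0.016
```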
Any clues @keyu-tian?