recovery after remote repository temporarily unavailable #80

Open
peter-sk opened this issue Jun 11, 2024 · 0 comments
What happened?
We use a remote repository to collect data on related runs from multiple GPU nodes. This had worked fine until last night, when the remote repository was unreachable for a few minutes (likely due to a network issue). Tracked data continued to be collected during and after the incident, but at some point the queue filled up and the processes blocked.

What should have happened?
After the incident, Aim should have reconnected to the remote repository and delivered the queued values.

What is the impact of the issue?
Long runs have to be resumed from the last checkpoint (if available) or restarted from scratch (if resuming is not an option). In this case, we lost approximately 200 GPU hours in total.

How might the issue be mitigated?
During and after an outage, the system should periodically try to reestablish the connection. If someone points me to the right place in the code, I could try to implement this myself, if manpower is an issue. For us, this is a showstopper for using remote repositories.
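
To make the proposed mitigation concrete, here is a minimal sketch of periodic reconnection with exponential backoff around a bounded buffer. It is not Aim's code and does not use Aim's API: `BufferedSender`, `send`, `reconnect`, and all parameters are hypothetical names chosen for illustration, and the sketch is single-threaded for brevity. The key points are that `track` never blocks the training process when the buffer is full, and that `flush` keeps the unsent item at the head of the buffer until delivery succeeds.

```python
import time
from collections import deque
from typing import Any, Callable, Deque


class BufferedSender:
    """Hypothetical helper (not part of Aim): buffer items, retry delivery."""

    def __init__(
        self,
        send: Callable[[Any], None],      # assumed to raise ConnectionError on failure
        reconnect: Callable[[], None],    # assumed to re-open the remote connection
        max_items: int = 10_000,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._send = send
        self._reconnect = reconnect
        self._buffer: Deque[Any] = deque()
        self._max_items = max_items
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep

    def track(self, item: Any) -> bool:
        """Buffer an item without blocking; return False if the buffer is full."""
        if len(self._buffer) >= self._max_items:
            return False
        self._buffer.append(item)
        return True

    def flush(self, max_attempts: int = 5) -> bool:
        """Deliver buffered items in order, backing off between failed attempts.

        Returns True once the buffer is empty, or False after `max_attempts`
        consecutive failures (the undelivered items stay buffered).
        """
        delay = self._base_delay
        failures = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])  # peek; remove only after success
            except ConnectionError:
                failures += 1
                if failures >= max_attempts:
                    return False
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay)
                try:
                    self._reconnect()
                except ConnectionError:
                    pass  # still unreachable; the next send will fail and back off
                continue
            self._buffer.popleft()
            failures = 0
            delay = self._base_delay
        return True
```

In a real client, `flush` would presumably run in a background thread so that a long outage only delays delivery instead of stalling the tracked process; what to do when `track` returns `False` (drop, spill to disk, or block) is a separate design decision.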
