recovery after remote repository temporarily unavailable #80

peter-sk · 2024-06-11T08:33:50Z

What happened?
We use a remote repository to collect data on related runs from multiple GPU nodes. This has worked fine so far, but last night the remote repository was unreachable for a few minutes (likely due to a network issue). Tracked data kept being collected during and after the incident, but at some point the queue filled up and the processes blocked.

What should have happened?
After the incident, aim should have reconnected to the remote repository and delivered the values that had accumulated in the queue.

What is the impact of the issue?
Long runs might be resumed from the last checkpoint (if available) or have to be restarted (if resuming is not an option). In this case, we lost a total of approx. 200 GPU hours.

How might the issue be mitigated?
During an outage, the system should periodically try to reestablish the connection. If manpower is an issue and someone points me to the right place in the code, I could try to implement this myself. This is a showstopper for using remote repositories.
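
To make the proposed behaviour concrete, here is a minimal sketch of periodic reconnection with a bounded local buffer. It is not aim's actual client code: `ReconnectingSender`, the `send` callable, and all parameters are hypothetical names chosen for illustration, and dropping the oldest item when the buffer is full is just one possible policy (spilling to disk would be another).

```python
import time
from collections import deque


class ReconnectingSender:
    """Buffers items locally and retries delivery with exponential backoff.

    Hypothetical illustration only; `send` stands in for whatever call
    delivers one tracked value to the remote repository and is assumed to
    raise ConnectionError while the remote is unreachable.
    """

    def __init__(self, send, max_buffer=10_000, base_delay=1.0,
                 max_delay=60.0, clock=time.monotonic):
        self._send = send
        self._buffer = deque()
        self._max_buffer = max_buffer
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._clock = clock
        self._delay = base_delay      # current backoff interval in seconds
        self._next_attempt = 0.0      # earliest time of the next delivery attempt
        self.dropped = 0              # number of items discarded due to overflow

    def track(self, item):
        """Enqueue an item without ever blocking the calling process."""
        if len(self._buffer) >= self._max_buffer:
            self._buffer.popleft()    # assumed policy: drop oldest instead of blocking
            self.dropped += 1
        self._buffer.append(item)
        self.flush()

    def flush(self):
        """Try to deliver buffered items; return True if the buffer is empty afterwards."""
        if self._clock() < self._next_attempt:
            return False              # still backing off, do not touch the network
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                # Schedule the next attempt and double the wait, up to max_delay.
                self._next_attempt = self._clock() + self._delay
                self._delay = min(self._delay * 2, self._max_delay)
                return False
            self._buffer.popleft()    # remove only after successful delivery
        self._delay = self._base_delay
        self._next_attempt = 0.0
        return True
```

With something along these lines, a short outage would only delay delivery: `track()` returns immediately, and the backlog is sent on the first `flush()` after the remote becomes reachable again. A real implementation would likely call `flush()` from a background thread or timer as well, so that delivery resumes even when no new values are being tracked.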
