What happened?
We use a remote repository to collect data on related runs from multiple GPU nodes. This had worked fine until last night, when the remote repository was unreachable for a few minutes (likely due to a network issue). Data continued to be tracked during and after the incident, but at some point the queue filled up and the processes blocked.
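To illustrate the failure mode I suspect, here is a minimal sketch using only Python's standard library. It is not aim's actual code, and `fill_until_blocked` is a hypothetical helper. It shows that a bounded queue with no consumer draining it accepts items only up to its capacity, after which `put()` blocks (here it times out instead, so the example terminates).

```python
import queue


def fill_until_blocked(maxsize: int, timeout: float = 0.05) -> int:
    """Put items on a bounded queue that nobody drains.

    Returns how many items were accepted before put() would block.
    Hypothetical illustration, not taken from aim.
    """
    q: "queue.Queue[int]" = queue.Queue(maxsize=maxsize)
    accepted = 0
    while True:
        try:
            # Without a timeout this call would block forever once the
            # queue is full, which matches the hang we observed.
            q.put(accepted, timeout=timeout)
        except queue.Full:
            return accepted
        accepted += 1


if __name__ == "__main__":
    print(fill_until_blocked(maxsize=3))
```

If the tracking client uses a similar bounded queue and the sender stops draining it after the connection drops, the training process would stall on its next tracked value.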
What should have happened?
After the incident, aim should have reconnected to the remote repository and delivered the values that had accumulated in the queue.
What is the impact of the issue?
Long runs have to be resumed from the last checkpoint (if available) or restarted from scratch (if resuming is not an option). In this case, we lost approximately 200 GPU hours in total.
How might the issue be mitigated?
During an outage, the system should periodically try to re-establish the connection. If manpower is an issue and someone points me to the right place in the code, I could try to implement this myself. This is a showstopper for using remote repositories.
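A rough sketch of the behaviour I have in mind, in plain Python. None of these names come from aim: `drain_with_reconnect`, `send`, and `connect` are hypothetical placeholders for whatever the client actually uses. On a connection error the sender waits with exponential backoff, tries to reconnect, and resends the same item, so nothing already queued is dropped.

```python
import time
from collections import deque
from typing import Any, Callable, Deque


def drain_with_reconnect(
    pending: Deque[Any],
    send: Callable[[Any], None],
    connect: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send all pending items, reconnecting with backoff on failure.

    Returns True once the queue is empty, or False after `max_attempts`
    consecutive failures on the same item. Hypothetical sketch only.
    """
    attempts = 0
    delay = base_delay
    while pending:
        try:
            send(pending[0])      # keep the item queued until it is delivered
            pending.popleft()
            attempts = 0          # success resets the backoff state
            delay = base_delay
        except ConnectionError:
            attempts += 1
            if attempts >= max_attempts:
                return False      # give up; remaining items stay in `pending`
            sleep(delay)
            delay = min(delay * 2, max_delay)
            try:
                connect()
            except ConnectionError:
                pass              # the next send fails and triggers another retry
    return True
```

A real fix would probably also need a policy for what happens while the queue is full (block, drop, or spill to disk), which this sketch does not address.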