Paginate Replication Jobs Async #21330
Comments
Thanks for reporting the issue, @ianseyer. What timeout did you encounter during the replication process? The proposed logic change would fundamentally rework the core replication functionality. We first need to identify the root cause and explore potential enhancements or fixes within the current design. If that doesn't resolve the problem, we can then consider modifying the core flow.
I have a very large docker registry. I spun up a replication job, and it was timing out during the "stage" workflow, presumably because of the sheer number of repositories in the source registry. I modified the code to run the filter per page of the catalog, which has eliminated the timeout. However, I have now been staging images for 16 hours (with 2.8M images to be replicated), and I expect it will crash when staging finishes, since it will then have to fetch tags for millions of repositories. I think the replication job should be reworked to be incremental: it should run a replication job per page of responses from the catalog API.
Not all replications call the catalog API. Harbor v2.12.0 already fixed this issue by changing the pageSize to 100 in this PR: #21081 - you could try Harbor v2.12.0.
Yes, I am using v2.12.0. A page size adjustment does not solve the issue here, because it does not run a replication job per page; it still collects all pages into one big array. That collation phase is what is not viable for large registries (2M+ repositories). This would be for the native adapter, though I see no reason why you wouldn't want replication to happen per page for all adapters. The behavior I am seeing is that Harbor first builds a list of all images in the source registry, as seen here: https://github.com/goharbor/harbor/blob/main/src/pkg/registry/client.go#L171 Only then does it run the filter against that result and begin replicating images. For very large registries, this collation can take literal days, is not resumable, and is fragile (if it fails for any reason, all that time is wasted). While this list is being built, no replication is happening. The proposal is to have each page be replicated as it is received. So rather than:

- fetch every page of the catalog into one big list
- run the filters against the complete list
- begin replicating

have it be:

- fetch one page of the catalog
- filter that page and replicate its matches
- move on to the next page (see the sketch below)
Presuming you have …
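For illustration, here is a minimal sketch of that per-page loop against a plain distribution registry. The `n`/`last` pagination parameters on `/v2/_catalog` are part of the distribution API; `filterRepos` and `replicateRepos` are hypothetical stand-ins for Harbor's filter evaluation and image transfer, not its actual code:

```go
// Sketch only: walk the source catalog one page at a time and hand each
// filtered page straight to replication instead of accumulating the full list.
package catalogwalk

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type catalogPage struct {
	Repositories []string `json:"repositories"`
}

func replicatePerPage(registryURL string, pageSize int) error {
	last := ""
	for {
		endpoint := fmt.Sprintf("%s/v2/_catalog?n=%d", registryURL, pageSize)
		if last != "" {
			endpoint += "&last=" + url.QueryEscape(last)
		}
		resp, err := http.Get(endpoint)
		if err != nil {
			return err
		}
		var page catalogPage
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return err
		}
		if len(page.Repositories) == 0 {
			return nil // no more pages
		}
		// Filter and replicate this page immediately, then move on.
		if err := replicateRepos(filterRepos(page.Repositories)); err != nil {
			return err
		}
		last = page.Repositories[len(page.Repositories)-1]
	}
}

// Placeholders for Harbor's existing filter evaluation and transfer steps.
func filterRepos(repos []string) []string { return repos }
func replicateRepos(repos []string) error { return nil }
```

Because each iteration is self-contained, a failure loses at most one page of work rather than the entire catalog walk.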
@ianseyer let me clarify: you're replicating from a self-hosted docker registry to your Harbor? I understand your situation. The root cause seems to be a performance bottleneck in distribution's catalog API, as it takes time to traverse the entire structure of the registry file system. Your suggestion would fundamentally alter the replication flow, impacting all adapters. Additionally, such a large number of replication candidates on a self-hosted Docker registry isn't a common use case. Given this, I'm considering whether there might be a workaround or enhancement that could be implemented within the native adapter itself.
Yes - exactly. One feature I have considered would be the ability to rehydrate a Harbor instance from an existing storage engine, e.g. traverse the image storage (S3, GCS, etc.) and populate Harbor from it by generating the appropriate API calls. But that seems difficult and/or messy. It also assumes that the Harbor operator also owns the source registry.
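For what it's worth, the enumeration half of that idea is fairly mechanical when the source registry uses the standard distribution storage layout (`docker/registry/v2/repositories/<repo>/_manifests/tags/<tag>/current/link`). Below is a rough, assumption-heavy sketch against S3 using the v1 AWS SDK; the bucket name and what you then do with each repo/tag pair (pulls/pushes or Harbor API calls) are left open, and none of this is existing Harbor functionality:

```go
// Sketch only: enumerate repositories and tags by walking the distribution
// storage layout in S3 instead of calling the catalog API.
package rehydrate

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func listTagsFromStorage(bucket string) (map[string][]string, error) {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	tags := map[string][]string{} // repo -> tags
	prefix := "docker/registry/v2/repositories/"
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	}, func(page *s3.ListObjectsV2Output, _ bool) bool {
		for _, obj := range page.Contents {
			key := strings.TrimPrefix(aws.StringValue(obj.Key), prefix)
			// Tag references live at <repo>/_manifests/tags/<tag>/current/link.
			idx := strings.Index(key, "/_manifests/tags/")
			if idx < 0 || !strings.HasSuffix(key, "/current/link") {
				continue
			}
			repo := key[:idx]
			tag := strings.TrimSuffix(key[idx+len("/_manifests/tags/"):], "/current/link")
			tags[repo] = append(tags[repo], tag)
		}
		return true // keep paging
	})
	if err != nil {
		return nil, err
	}
	return tags, nil
}
```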
As for affecting all replication flows, I am still curious why running replications per page wouldn't be preferable across the board. Presumably, all adapters would benefit from processing candidates page by page rather than collecting everything up front.
@ianseyer Did you confirm that the timeout occurred at the stage of calling the catalog API? I believe there are two potential workarounds outside of Harbor that could help resolve this issue:
It would appear that Harbor attempts to build a list of every repository in a replication source registry. This is not scalable, as some registries have a very large number of repositories. Replication jobs sit idle, potentially allocating an array of more than 100,000 repositories, or simply time out.
Unfortunately, distribution provides no way to filter on _catalog, so even a replication with a very specific filter requires pulling a list of every single repository in the registry. Is there any work planned to make this process more scalable? I am currently trying a (sloppy) fork that runs the filters per page of results rather than ever populating a list of every single repository, but this is still ultimately single-threaded and not very efficient.
I see this as a potential solution:

- fetch one page of the catalog at a time
- filter that page and dispatch a replication job for it to the workers
- move on to the next page while the workers transfer the matched images
This way, progress is being made inside workers rather than waiting on core for an unknown amount of time before an unknown number of repositories can be replicated.
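A minimal sketch of that shape, using an in-process worker pool for brevity; `fetchCatalogPage`, `filterRepos`, and `replicatePage` are hypothetical stand-ins, and in Harbor the per-page tasks would presumably be dispatched to jobservice workers rather than goroutines:

```go
// Sketch only: one replication task per catalog page, dispatched asynchronously
// so transfers start while the catalog is still being paged through.
package pagedreplication

import (
	"log"
	"sync"
)

func replicateAsync(pageSize, workers int) {
	pages := make(chan []string, workers)
	var wg sync.WaitGroup

	// Workers replicate pages as soon as they arrive.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for repos := range pages {
				if err := replicatePage(repos); err != nil {
					log.Printf("page failed: %v", err)
				}
			}
		}()
	}

	// Producer: walk the catalog one page at a time and hand each
	// filtered page off immediately, instead of collecting everything.
	last := ""
	for {
		repos, next, err := fetchCatalogPage(pageSize, last)
		if err != nil || len(repos) == 0 {
			break
		}
		pages <- filterRepos(repos)
		last = next
	}
	close(pages)
	wg.Wait()
}

// Hypothetical helpers standing in for the catalog client, the replication
// filters, and the actual image transfer.
func fetchCatalogPage(n int, last string) ([]string, string, error) { return nil, "", nil }
func filterRepos(repos []string) []string                           { return repos }
func replicatePage(repos []string) error                            { return nil }
```

Each page is an independent unit of work, so a failed page could be retried without redoing the whole catalog walk.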