
Paginate Replication Jobs Async #21330

Open
ianseyer opened this issue Dec 17, 2024 · 8 comments

ianseyer commented Dec 17, 2024

It would appear that Harbor attempts to build a list of every repository in a replication source registry. This is not scalable, as some registries have a very large number of repositories. As a result, replication jobs sit idle, potentially allocating an array of >100,000 repositories, or time out.

However, distribution unfortunately provides no way to filter on _catalog, so even a replication with a very specific filter requires pulling a list of every single repository in the registry. Is there any work planned to make this process more scalable? I am currently trying a (sloppy) fork that runs filters per page of results, rather than ever populating a list of every single repository. However, this is still ultimately single-threaded and not very efficient.
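For context, the _catalog endpoint accepts only a page size (n) and a cursor (last); there is no filter parameter, so matching even a single project still means paging through everything. The request/response shape looks like this (the host and repository names are illustrative):

GET /v2/_catalog?n=100 HTTP/1.1
Host: registry.example.com

HTTP/1.1 200 OK
Link: </v2/_catalog?n=100&last=library/app-0100>; rel="next"

{"repositories": ["library/app-0001", "library/app-0002", "..."]}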

I see this as a potential solution:

  • in core: grab a page of _catalog (Client.Catalog() already does pagination)
  • pass that page to a worker so that it can asynchronously:
    • filter that page
    • spawn subJobs to replicate results of the filter for that page
  • in core: grab the next page, repeat

This way, progress is being made inside workers rather than core blocking for an unknown amount of time on an unknown number of repositories. A sketch of this flow is below.
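A minimal sketch of that split in Go, under stated assumptions: CatalogPage, matchFilter, and enqueueSubJob are hypothetical stand-ins for Harbor's registry client, filter logic, and job scheduler, not real Harbor APIs.

package main

import (
	"context"
	"fmt"
	"sync"
)

// Hypothetical stand-ins for Harbor's registry client, filter, and scheduler.
func CatalogPage(ctx context.Context, last string, n int) (repos []string, next string, err error) {
	return nil, "", nil // would call GET /v2/_catalog?n=<n>&last=<last>
}
func matchFilter(repo string) bool                   { return true }
func enqueueSubJob(ctx context.Context, repo string) { fmt.Println("replicate", repo) }

func replicateIncrementally(ctx context.Context) error {
	pages := make(chan []string, 4) // small buffer lets core stay a few pages ahead

	// Workers: filter each page and spawn sub-jobs as soon as the page arrives.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range pages {
				for _, repo := range page {
					if matchFilter(repo) {
						enqueueSubJob(ctx, repo) // replication starts before the catalog walk finishes
					}
				}
			}
		}()
	}

	// Core: walk the catalog one page at a time and hand each page off.
	last := ""
	for {
		page, next, err := CatalogPage(ctx, last, 100)
		if err != nil {
			close(pages)
			wg.Wait()
			return err
		}
		pages <- page
		if next == "" {
			break
		}
		last = next
	}
	close(pages)
	wg.Wait()
	return nil
}

func main() {
	_ = replicateIncrementally(context.Background())
}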

ianseyer changed the title from "Paginate Replication Jobs" to "Paginate Replication Jobs Async" on Dec 17, 2024
@wy65701436 (Contributor)

Thanks for reporting the issue, @ianseyer.

What timeout did you encounter during the replication process? The proposed logic change would fundamentally rework the core replication functionality. We need to first identify the root cause of the issue and explore potential enhancements or fixes based on the current design. If that doesn't resolve the problem, we can then consider modifying the core flow.

ianseyer (Author) commented Dec 18, 2024

I have a very large Docker registry. I spun up a replication job, and it was timing out while in the "stage" workflow. I imagine this is because of the sheer number of repositories in the source registry.

I modified the code to run the filter per page of the catalog, which has eliminated the timeout issue. However, I have now been staging images for 16 hours (with 2.8M repositories to be replicated). I think it will crash when it is done staging, as it will have to fetch tags for millions of repositories.

I think the replication job should be reworked to be incremental: it should run a replication job per page of responses from /_catalog, rather than attempting to iterate through the entire registry first. I don't believe that would introduce any novel duplication issues, and I cannot think of a downside to this approach.

@Vad1mo added the label kind/requirement (new feature or idea on top of Harbor) on Dec 18, 2024
@stonezdj (Contributor)

Not all replications call the /_catalog API; if the replication source is Harbor, it calls the list-artifact API instead. The default page size (15) will slow down the stage process.

Harbor v2.12.0 already addressed this by changing the pageSize to 100 in PR #21081; you could try Harbor v2.12.0.

ianseyer (Author) commented Dec 19, 2024

Yes, I am using v2.12.0. A page-size adjustment does not solve the issue here: it still does not run a replication job per page, but instead collects all pages into one big array. That collection phase is what is not viable for large registries (2M+ repositories).

This would be for the native adapter, though I see no reason why you wouldn't want replication to happen per page for all adapters.

The behavior I am seeing is that Harbor attempts to build a list of all images in the source registry, as seen here: https://github.com/goharbor/harbor/blob/main/src/pkg/registry/client.go#L171

Only then does it run the filter against that result and begin replicating images. For very large registries, this collation can literally take days, is not resumable, and is fragile (if it fails for any reason, all that time is wasted). While this list is being built, no replication is happening. The proposal is to have each page replicated as it is received.

So rather than:

repositories = []
while catalogHasPages():
  page = catalog.getNextPage()
  repositories += page  # extend with the page's entries (not append the page as one element)

addReplicationTasks(filter(repositories))

have it be:

while catalogHasPages():
  page = catalog.getNextPage()
  addReplicationTasks(filter(page))  # dispatch replication as each page arrives

Presuming you have override set to false, duplicates would be marked as a no-op; a rough sketch of that check follows.
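For illustration only: destinationHas is a hypothetical lookup against the target Harbor (the real no-op decision compares digests on the destination when override is false), and enqueueSubJob is the same hypothetical scheduler as in the earlier sketch.

// Hypothetical helper: would check whether the artifact already exists
// on the destination (e.g. by digest).
func destinationHas(ctx context.Context, repo string) bool { return false }

func replicatePage(ctx context.Context, page []string) {
	for _, repo := range page {
		if destinationHas(ctx, repo) { // already present and override=false
			continue // duplicate becomes a no-op, so re-dispatching a page is safe
		}
		enqueueSubJob(ctx, repo)
	}
}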

@wy65701436 self-assigned this on Dec 19, 2024
wy65701436 (Contributor) commented Dec 19, 2024

@ianseyer let me clarify: you're replicating from a self-hosted Docker registry to your Harbor instance?

I understand your situation. The root cause seems to be the performance bottleneck of distribution's catalog API, which takes time to traverse the entire structure of the registry filesystem.

Your suggestion would fundamentally alter the replication flow, impacting all adapters. Additionally, such a large number of replication candidates on a self-hosted Docker registry isn't a common use case. Given this, I'm considering whether there might be a workaround or enhancement that could be implemented within the native adapter itself.

ianseyer (Author) commented Dec 19, 2024

Yes, exactly. The catalog is definitely the limiting factor here. I don't understand why it doesn't allow jumping to arbitrary points in the catalog without first having seen the previous page of results; it forces you to proceed linearly rather than letting you divide and conquer.
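To make the linearity concrete: the last cursor must be the final repository name from the previous response, so page N cannot be requested until page N-1 has returned (names illustrative):

GET /v2/_catalog?n=100
  Link: </v2/_catalog?n=100&last=library/app-0099>; rel="next"

GET /v2/_catalog?n=100&last=library/app-0099
  Link: </v2/_catalog?n=100&last=library/app-0199>; rel="next"

Since there is no offset-style parameter, pages cannot be fetched in parallel.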

One feature I have considered is the ability to rehydrate a Harbor instance from an existing storage engine, e.g. traverse the image storage (S3, GCS, etc.) and populate Harbor from it by generating the appropriate API calls. But that seems difficult and/or messy, and it also assumes the Harbor operator owns the source registry.

ianseyer (Author) commented Dec 20, 2024

As for affecting all replication flows: I am still curious why running replication per page wouldn't be preferable. Presumably, all Catalog implementations are paginated, and partial progress is better than none.

@wy65701436 (Contributor)

@ianseyer Did you notice whether the timeout occurred during the stage that calls the catalog API? I believe there are two potential workarounds outside of Harbor that could help resolve this issue:

  • Proxy-cache: You can configure a proxy-cache project so that all your users pull images through Harbor while Harbor caches images from the upstream distribution. Additionally, you could set a timeline for your organization to shut down the distribution, ensuring that all active images are available in your local Harbor instance.

  • Custom script: You can write a script to pull and push all the images from the distribution to Harbor, instead of relying on Harbor's built-in replication feature. (One possible shape is sketched below.)
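As one illustrative shape for such a script: skopeo is just one of several tools that could do the copies, and the registry hosts and the repos.txt list are placeholders.

#!/bin/sh
# Illustrative only: copy every repository (all tags) listed in repos.txt
# from the source registry into a Harbor project using skopeo sync.
while read -r repo; do
  skopeo sync --src docker --dest docker \
    "registry.example.com/${repo}" \
    "harbor.example.com/library"
done < repos.txt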
