
Paginate Replication Jobs Async #21330

Open
ianseyer opened this issue Dec 17, 2024 · 8 comments

ianseyer commented Dec 17, 2024

It would appear that Harbor attempts to build a list of every repository in a replication source registry. This is not scalable, as some registries have a very large number of repositories. As a result, replication jobs sit idle, potentially allocating an array of >100,000 repositories, or time out.

However, distribution unfortunately provides no way to filter on _catalog, so even a replication with a very specific filter requires pulling a list of every single repository in the registry. Is there any work planned to make this process more scalable? I am currently trying a (sloppy) fork that runs filters per page of results, rather than ever populating a list of every single repository. However, this is still ultimately single-threaded and not very efficient.
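For context, the _catalog endpoint accepts only a page size (n) and a cursor (last); there is no filter parameter, so matching even a single project still means paging through everything. The request/response shape looks like this (the host and repository names are illustrative):

GET /v2/_catalog?n=100 HTTP/1.1
Host: registry.example.com

HTTP/1.1 200 OK
Link: </v2/_catalog?n=100&last=library/app-0100>; rel="next"

{"repositories": ["library/app-0001", "library/app-0002", "..."]}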

I see this as a potential solution:

  • in core: grab a page of _catalog (Client.Catalog() already does pagination)
  • pass that page to a worker so that it can asynchronously:
    • filter that page
    • spawn subJobs to replicate results of the filter for that page
  • in core: grab the next page, repeat

This way, progress is being made inside workers rather than core blocking for an unknown amount of time on an unknown number of repositories. A sketch of this flow is below.
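A minimal sketch of that split in Go, under stated assumptions: CatalogPage, matchFilter, and enqueueSubJob are hypothetical stand-ins for Harbor's registry client, filter logic, and job scheduler, not real Harbor APIs.

package main

import (
	"context"
	"fmt"
	"sync"
)

// Hypothetical stand-ins for Harbor's registry client, filter, and scheduler.
func CatalogPage(ctx context.Context, last string, n int) (repos []string, next string, err error) {
	return nil, "", nil // would call GET /v2/_catalog?n=<n>&last=<last>
}
func matchFilter(repo string) bool                   { return true }
func enqueueSubJob(ctx context.Context, repo string) { fmt.Println("replicate", repo) }

func replicateIncrementally(ctx context.Context) error {
	pages := make(chan []string, 4) // small buffer lets core stay a few pages ahead

	// Workers: filter each page and spawn sub-jobs as soon as the page arrives.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range pages {
				for _, repo := range page {
					if matchFilter(repo) {
						enqueueSubJob(ctx, repo) // replication starts before the catalog walk finishes
					}
				}
			}
		}()
	}

	// Core: walk the catalog one page at a time and hand each page off.
	last := ""
	for {
		page, next, err := CatalogPage(ctx, last, 100)
		if err != nil {
			close(pages)
			wg.Wait()
			return err
		}
		pages <- page
		if next == "" {
			break
		}
		last = next
	}
	close(pages)
	wg.Wait()
	return nil
}

func main() {
	_ = replicateIncrementally(context.Background())
}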

ianseyer changed the title from "Paginate Replication Jobs" to "Paginate Replication Jobs Async" on Dec 17, 2024
@wy65701436 (Contributor)

Thanks for reporting the issue, @ianseyer.

What timeout did you encounter during the replication process? The proposed logic change would fundamentally rework the core replication functionality. We need to first identify the root cause of the issue and explore potential enhancements or fixes based on the current design. If that doesn't resolve the problem, we can then consider modifying the core flow.

ianseyer (Author) commented Dec 18, 2024

I have a very large Docker registry. I spun up a replication job, and it was timing out while in the "stage" workflow. I imagine this is because of the sheer number of repositories in the source registry.

I modified the code to run the filter per page of the catalog, which has eliminated the timeout issue. However, I have now been staging images for 16 hours (with 2.8M repositories to be replicated). I think it will crash when it is done staging, as it will have to fetch tags for millions of repositories.

I think the replication job should be reworked to be incremental: it should run a replication job per page of responses from /_catalog, rather than attempting to iterate through the entire registry first. I don't believe that would introduce any novel duplication issues, and I cannot think of a downside to this approach.

@Vad1mo added the label kind/requirement (new feature or idea on top of Harbor) on Dec 18, 2024
@stonezdj (Contributor)

Not all replications call the /_catalog API; if the replication source is Harbor, it calls the list-artifact API instead. The default page size (15) will slow down the stage process.

Harbor v2.12.0 already addressed this by changing the pageSize to 100 in PR #21081; you could try Harbor v2.12.0.

ianseyer (Author) commented Dec 19, 2024

Yes, I am using v2.12.0. A page-size adjustment does not solve the issue here: it still does not run a replication job per page, but instead collects all pages into one big array. That collection phase is what is not viable for large registries (2M+ repositories).

This would be for the native adapter, though I see no reason why you wouldn't want replication to happen per page for all adapters.

The behavior I am seeing is that Harbor attempts to build a list of all images in the source registry, as seen here: https://github.com/goharbor/harbor/blob/main/src/pkg/registry/client.go#L171

Only then does it run the filter against that result and begin replicating images. For very large registries, this collation can literally take days, is not resumable, and is fragile (if it fails for any reason, all that time is wasted). While this list is being built, no replication is happening. The proposal is to have each page replicated as it is received.

So rather than:

repositories = []
while catalogHasPages():
  page = catalog.getNextPage()
  repositories += page  # extend with the page's entries (not append the page as one element)

addReplicationTasks(filter(repositories))

have it be:

while catalogHasPages():
  page = catalog.getNextPage()
  addReplicationTasks(filter(page))  # dispatch replication as each page arrives

Presuming you have override set to false, duplicates would be marked as a no-op; a rough sketch of that check follows.
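For illustration only: destinationHas is a hypothetical lookup against the target Harbor (the real no-op decision compares digests on the destination when override is false), and enqueueSubJob is the same hypothetical scheduler as in the earlier sketch.

// Hypothetical helper: would check whether the artifact already exists
// on the destination (e.g. by digest).
func destinationHas(ctx context.Context, repo string) bool { return false }

func replicatePage(ctx context.Context, page []string) {
	for _, repo := range page {
		if destinationHas(ctx, repo) { // already present and override=false
			continue // duplicate becomes a no-op, so re-dispatching a page is safe
		}
		enqueueSubJob(ctx, repo)
	}
}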

@wy65701436 self-assigned this on Dec 19, 2024
wy65701436 (Contributor) commented Dec 19, 2024

@ianseyer let me clarify: you're replicating from a self-hosted Docker registry to your Harbor instance?

I understand your situation. The root cause seems to be the performance bottleneck of distribution's catalog API, which takes time to traverse the entire structure of the registry filesystem.

Your suggestion would fundamentally alter the replication flow, impacting all adapters. Additionally, such a large number of replication candidates on a self-hosted Docker registry isn't a common use case. Given this, I'm considering whether there might be a workaround or enhancement that could be implemented within the native adapter itself.

ianseyer (Author) commented Dec 19, 2024

Yes, exactly. The catalog is definitely the limiting factor here. I don't understand why it doesn't allow jumping to arbitrary points in the catalog without first having seen the previous page of results; it forces you to proceed linearly rather than letting you divide and conquer.
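To make the linearity concrete: the last cursor must be the final repository name from the previous response, so page N cannot be requested until page N-1 has returned (names illustrative):

GET /v2/_catalog?n=100
  Link: </v2/_catalog?n=100&last=library/app-0099>; rel="next"

GET /v2/_catalog?n=100&last=library/app-0099
  Link: </v2/_catalog?n=100&last=library/app-0199>; rel="next"

Since there is no offset-style parameter, pages cannot be fetched in parallel.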

One feature I have considered is the ability to rehydrate a Harbor instance from an existing storage engine, e.g. traverse the image storage (S3, GCS, etc.) and populate Harbor from it by generating the appropriate API calls. But that seems difficult and/or messy, and it also assumes the Harbor operator owns the source registry.

ianseyer (Author) commented Dec 20, 2024

As for affecting all replication flows: I am still curious why running replication per page wouldn't be preferable. Presumably, all Catalog implementations are paginated, and partial progress is better than none.

@wy65701436 (Contributor)

@ianseyer Did you notice whether the timeout occurred during the stage that calls the catalog API? I believe there are two potential workarounds outside of Harbor that could help resolve this issue:

  • Proxy-cache: You can configure a proxy-cache project so that all your users pull images through Harbor while Harbor caches images from the upstream distribution. Additionally, you could set a timeline for your organization to shut down the distribution, ensuring that all active images are available in your local Harbor instance.

  • Custom script: You can write a script to pull and push all the images from the distribution to Harbor, instead of relying on Harbor's built-in replication feature. (One possible shape is sketched below.)
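As one illustrative shape for such a script: skopeo is just one of several tools that could do the copies, and the registry hosts and the repos.txt list are placeholders.

#!/bin/sh
# Illustrative only: copy every repository (all tags) listed in repos.txt
# from the source registry into a Harbor project using skopeo sync.
while read -r repo; do
  skopeo sync --src docker --dest docker \
    "registry.example.com/${repo}" \
    "harbor.example.com/library"
done < repos.txt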
