Replies: 1 comment
- Sadly, I lack space for such huge datasets; a mixed corpus of about 2 GB is the best I can do.
Right now testing largely depends on me running my private corpus of tens of thousands of PNGs and about as many JPEGs. I cannot share the corpus because it contains every image that has ever been on my drive and may include personal information. Relying on a non-shareable corpus is not a good strategy in the long term.
One alternative is to use the Danbooru dataset, which is available as an rsync mirror, so it's easy to download a subset of the 5 million files and pick out the ones with the right extensions.
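Something along these lines should work for pulling a filtered subset; this is only a sketch, the mirror address and destination below are placeholders, and the directory layout varies between mirrors:

```python
import subprocess

# Placeholder address -- substitute the actual Danbooru rsync mirror. To keep
# the download to a subset, point it at one of the mirror's subdirectories
# rather than the whole dataset.
MIRROR = "rsync://example.org/danbooru/original/"
DEST = "corpus/danbooru/"

# -a: archive mode, -v: verbose, -m: prune directories left empty by the
# filters. The include/exclude rules keep only PNG/JPEG files while still
# recursing into subdirectories.
subprocess.run(
    [
        "rsync", "-avm",
        "--include=*/",
        "--include=*.png",
        "--include=*.jpg",
        "--include=*.jpeg",
        "--exclude=*",
        MIRROR,
        DEST,
    ],
    check=True,
)
```

Filtering by extension on the rsync side avoids transferring files that would be discarded anyway.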
Another is to scrape public images from the web. I used CommonCrawl to discover image URLs. They conveniently provide `.wat` files with just the metadata, such as links, which makes it easy. The images themselves can then be scraped with any tool of your choosing, e.g. `curl` or `wget`.

I've already scraped something like 40,000 JPEGs and 60,000 PNGs. You can download a smaller (23k) scraped JPEG corpus and the scraped PNG corpus as .tar files, but I suggest doing that soon because I won't keep them in my cloud storage indefinitely.
I also suggest supplementing these with the files I previously reported as handled incorrectly - they showcase various edge cases that may be rare in a randomly selected corpus.