Flickr results do not use "raw" (human readable) tags #4906

Open
zackkrida opened this issue Sep 10, 2024 · 5 comments
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents ☁️ provider: images Image provider 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server

Comments

@zackkrida
Member

Description

Images ingested into Openverse from Flickr are using Flickr tags in a non-optimal way. Observe the following Openverse result's tags:

https://openverse.org/image/ea4dff9b-7337-47ab-9fac-c9c4bd7860a9

[Screenshot from 2024-09-10 11-14-16: the tags shown on the Openverse result page]

As you can plainly see, many of the tags are multi-word phrases that are compressed into single words with spaces removed. For example:

  • thegrapesofwrath => the grapes of wrath
  • cottondress => cotton dress

When viewing the result on Flickr, the tags look correct:

[Image: the same photo's tags as displayed on Flickr]

So, what is going on?

Well, the search endpoint in Flickr, which we use in our Flickr DAG, returns the "cleaned" version of the tags. This is the version used in URLs and as identifiers on Flickr, as documented here:

https://www.flickr.com/services/api/misc.tags.html

When querying the single result for an image with Flickr's flickr.photos.getInfo method, like so:

http https://api.flickr.com/services/rest method==flickr.photos.getInfo api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1 | jq '.photo.tags.tag[].raw'

You can see that the "raw" human-readable tags are available:

"great depression"
"national archives"
"recession"
"depression"
"cardboard house"
"cotton dress"
"poor"
"financial ruin"
"economic disaster"
"sharecroppers"
"the grapes of wrath"
"Tom Joad"
"the crisis"
"le crise"
"la crisis"
"coca-cola"
"1930"
"Farm Security Administration-Office of War Information Collection"
"FSA-OWI"
"Jack Whinery"
"homesteaders"
"Pie Town, New Mexico"
"Evan Lawrence Bench"

It is these tags we should be using in Openverse.

This presents a technical challenge to us in that these tags are only accessible via single results.

Here is the payload for a single tag, from the list of tags returned by flickr.photos.getInfo:

id	"2045382-2750282427-19380346"
author	"19762676@N00"
authorname	"austinevan"
raw	"Pie Town, New Mexico"
_content	"pietownnewmexico"
machine_tag	0

Edit: I also just noticed that flickr.tags.getListPhoto might be a better endpoint to use, as it only returns tags:

http https://api.flickr.com/services/rest method==flickr.tags.getListPhoto api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1
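
For anyone who wants to try this from Python rather than the command line, here is a minimal sketch of the same request. The function names are made up for illustration, and it assumes the JSON response from flickr.tags.getListPhoto nests tags under `photo.tags.tag[]` with a `raw` field, the same shape the `jq` path above reads from flickr.photos.getInfo. Worth verifying against a live response before relying on it.

```python
import json
import urllib.parse
import urllib.request

FLICKR_REST_URL = "https://api.flickr.com/services/rest"


def build_tag_request_url(photo_id: str, api_key: str) -> str:
    """Build the same flickr.tags.getListPhoto request as the command above."""
    params = {
        "method": "flickr.tags.getListPhoto",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": "1",
    }
    return f"{FLICKR_REST_URL}?{urllib.parse.urlencode(params)}"


def extract_raw_tags(payload: dict) -> list[str]:
    """Return the human-readable `raw` value of every tag in a response.

    Assumes the `photo.tags.tag[]` shape; missing keys yield an empty list.
    """
    tags = payload.get("photo", {}).get("tags", {}).get("tag", [])
    return [tag["raw"] for tag in tags if tag.get("raw")]


def fetch_raw_tags(photo_id: str, api_key: str) -> list[str]:
    """Fetch one photo's raw tags. This costs one API request per photo."""
    url = build_tag_request_url(photo_id, api_key)
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_raw_tags(json.load(response))
```

Keeping the parsing separate from the network call means `extract_raw_tags` can be exercised against saved responses without spending any API quota.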

@zackkrida zackkrida added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: catalog Related to the catalog and Airflow DAGs ☁️ provider: images Image provider 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Sep 10, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Sep 10, 2024
@sarayourfriend
Collaborator

This presents a technical challenge to us in that these tags are only accessible via single results.

Sounds like it might be relevant to the #4452 work, where we are specifically building the ability to pull from the single results endpoint in Flickr. After that, we will be able to backfill for existing works...

Otherwise, would pulling the tags list per result at ingestion time be the way to go for newly ingested works?

@zackkrida
Member Author

@sarayourfriend it is most certainly relevant! I think it's very likely that whatever solution is adopted in #4452 would be the most appropriate way to solve this problem. So much of the thinking there is applicable here, including the importance of preserving the original tags, in some fashion.

Otherwise, if we did want to fix this particular issue at ingestion time, we would need to decide whether the number of API calls to Flickr would be appropriate. Here are some quick stats from the last run of the Flickr provider DAG, keeping in mind our limit of 3600 requests per hour from Flickr:

  • ~20 minutes to run pull_image_data.
  • 607 requests to the Flickr API.
  • 105,794 records from Flickr.

We'd then need to make 105,794 individual requests, which would take ~30 hrs, assuming I understand Flickr's rate limiting correctly. If we naively assumed that each day we give 1 hr to the Flickr DAG and 23 hrs to retrieving tags, we could retrieve tags for 82,800 (3600 API calls * 23 hrs) records a day.
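
The arithmetic behind those two figures, as a quick sketch. The constants are the numbers quoted in this comment; the one-hour reservation for the DAG is the naive assumption stated above, not a measured value.

```python
REQUESTS_PER_HOUR = 3600    # rate limit quoted in this thread
RECORDS_LAST_RUN = 105_794  # records from the last Flickr DAG run
TAG_HOURS_PER_DAY = 23      # assumes 1 hour per day is reserved for the DAG

# One request per record, at the full hourly rate.
hours_for_one_run = RECORDS_LAST_RUN / REQUESTS_PER_HOUR

# Tag lookups that fit into the remaining hours of a day.
daily_tag_capacity = REQUESTS_PER_HOUR * TAG_HOURS_PER_DAY

print(f"{hours_for_one_run:.1f} hours of requests for one run")
print(f"{daily_tag_capacity:,} tag lookups possible per day")
```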

It's probably worth connecting with Flickr to confirm the rate limiting; I can't recall if we have any unique permissions or anything like that.

@zackkrida
Member Author

Oh and of course, I'm very curious to hear from @WordPress/openverse-catalog here.

@sarayourfriend
Collaborator

sarayourfriend commented Sep 12, 2024

We'd then need to make 105,794 individual requests, which would take ~30 hrs, assuming I understand Flickr's rate limiting correctly. If we naively assumed that each day we give 1 hr to the Flickr DAG and 23 hrs to retrieving tags, we could retrieve tags for 82,800 (3600 API calls * 23 hrs) records a day.

In other words, to clarify, we would have an accumulating deficit of roughly 23k works (105,794 - 82,800) to pull tags for each day. Said another way, we would be roughly 6.4 * N hours of API quota behind on Flickr, perpetually, where N is the number of days since we started ingesting raw tags.
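
To make the growth of that backlog concrete, here is a rough sketch. It assumes every day looks like the last DAG run (105,794 records in, 82,800 lookups possible), which is a simplification. The function names are illustrative only.

```python
def backlog_after(
    days: int,
    records_per_day: int = 105_794,
    lookups_per_day: int = 82_800,
) -> int:
    """Records still waiting for a tag lookup after `days` days."""
    # The shortfall cannot be negative: spare capacity does not bank up.
    daily_shortfall = max(records_per_day - lookups_per_day, 0)
    return daily_shortfall * days


def hours_behind(days: int, requests_per_hour: int = 3600) -> float:
    """Hours of full-rate API usage needed to clear the backlog."""
    return backlog_after(days) / requests_per_hour


for n in (1, 7, 30):
    print(n, backlog_after(n), round(hours_behind(n), 1))
```

Under these assumptions the shortfall is 22,994 records per day, so the backlog grows linearly rather than ever catching up.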

It would be really great to know if there's some way Flickr could enable access to the raw tags in the bulk endpoints. It doesn't seem tenable to make an individual request per image, either for a backfill using the tools from #4452 or during initial ingestion of new works. It would prevent any other Flickr operations from happening (like targeted reingestion), because we'd be eating up our API quota at all times on pulling tags.


For posterity, there is also flickr.tags.getListPhoto for getting just the list of tags for a photo (rather than the image's full info, which may be excessive).

Example from the image you shared:

<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
  <photo id="2750282427">
    <tags>
      <tag id="2045382-2750282427-56404" author="19762676@N00" authorname="austinevan" raw="great depression" machine_tag="0">greatdepression</tag>
      <tag id="2045382-2750282427-130348" author="19762676@N00" authorname="austinevan" raw="national archives" machine_tag="0">nationalarchives</tag>
      <tag id="2045382-2750282427-378739" author="19762676@N00" authorname="austinevan" raw="recession" machine_tag="0">recession</tag>
      <tag id="2045382-2750282427-16400" author="19762676@N00" authorname="austinevan" raw="depression" machine_tag="0">depression</tag>
      <tag id="2045382-2750282427-953073" author="19762676@N00" authorname="austinevan" raw="cardboard house" machine_tag="0">cardboardhouse</tag>
      <tag id="2045382-2750282427-3045287" author="19762676@N00" authorname="austinevan" raw="cotton dress" machine_tag="0">cottondress</tag>
      <tag id="2045382-2750282427-6925" author="19762676@N00" authorname="austinevan" raw="poor" machine_tag="0">poor</tag>
      <tag id="2045382-2750282427-7870191" author="19762676@N00" authorname="austinevan" raw="financial ruin" machine_tag="0">financialruin</tag>
      <tag id="2045382-2750282427-10644475" author="19762676@N00" authorname="austinevan" raw="economic disaster" machine_tag="0">economicdisaster</tag>
      <tag id="2045382-2750282427-3141545" author="19762676@N00" authorname="austinevan" raw="sharecroppers" machine_tag="0">sharecroppers</tag>
      <tag id="2045382-2750282427-1169161" author="19762676@N00" authorname="austinevan" raw="the grapes of wrath" machine_tag="0">thegrapesofwrath</tag>
      <tag id="2045382-2750282427-3332918" author="19762676@N00" authorname="austinevan" raw="Tom Joad" machine_tag="0">tomjoad</tag>
      <tag id="2045382-2750282427-6365467" author="19762676@N00" authorname="austinevan" raw="the crisis" machine_tag="0">thecrisis</tag>
      <tag id="2045382-2750282427-36940796" author="19762676@N00" authorname="austinevan" raw="le crise" machine_tag="0">lecrise</tag>
      <tag id="2045382-2750282427-24814089" author="19762676@N00" authorname="austinevan" raw="la crisis" machine_tag="0">lacrisis</tag>
      <tag id="2045382-2750282427-23464" author="19762676@N00" authorname="austinevan" raw="coca-cola" machine_tag="0">cocacola</tag>
      <tag id="2045382-2750282427-123582" author="19762676@N00" authorname="austinevan" raw="1930" machine_tag="0">1930</tag>
      <tag id="2045382-2750282427-55789592" author="19762676@N00" authorname="austinevan" raw="Farm Security Administration-Office of War Information Collection" machine_tag="0">farmsecurityadministrationofficeofwarinformationcollection</tag>
      <tag id="2045382-2750282427-1778336" author="19762676@N00" authorname="austinevan" raw="FSA-OWI" machine_tag="0">fsaowi</tag>
      <tag id="2045382-2750282427-14992932" author="19762676@N00" authorname="austinevan" raw="Jack Whinery" machine_tag="0">jackwhinery</tag>
      <tag id="2045382-2750282427-4174958" author="19762676@N00" authorname="austinevan" raw="homesteaders" machine_tag="0">homesteaders</tag>
      <tag id="2045382-2750282427-19380346" author="19762676@N00" authorname="austinevan" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
      <tag id="2045382-2750282427-132732812" author="19762676@N00" authorname="austinevan" raw="Evan Lawrence Bench" machine_tag="0">evanlawrencebench</tag>
    </tags>
  </photo>
</rsp>
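
A small sketch of turning a response like the one above into a cleaned-to-raw mapping, using only the standard library. The function name is made up. The element and attribute names (`tag`, `raw`) are taken directly from the XML shown here.

```python
import xml.etree.ElementTree as ET


def cleaned_to_raw(xml_text: str) -> dict[str, str]:
    """Map each cleaned tag (element text) to its raw form (`raw` attribute)."""
    root = ET.fromstring(xml_text)
    return {
        tag.text: tag.attrib["raw"]
        for tag in root.iter("tag")
        if tag.text and "raw" in tag.attrib
    }


sample = """<rsp stat="ok">
  <photo id="2750282427">
    <tags>
      <tag id="2045382-2750282427-3045287" raw="cotton dress" machine_tag="0">cottondress</tag>
      <tag id="2045382-2750282427-19380346" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
    </tags>
  </photo>
</rsp>"""

mapping = cleaned_to_raw(sample)
print(mapping["pietownnewmexico"])
```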

Could we maintain our own reverse index of tags? TL;DR: Nope, not without major drawbacks.

I was thinking of how reliable it would be if we maintained our own reverse index of Flickr's processed tags to raw tags. If it were reliable, then we'd only need to request tags for a photo if one of its tags was not already in our own index. Of course, that relies on a reverse index being reliable, and the fact that apparently "the grapes of wrath" and a mistyped "thegrapes of wrath" would both turn into "thegrapesofwrath" calls that reliability into question.

If you look beyond the English language, then I'm sure there are a lot of examples where entirely different tags in different languages normalise to the same Flickr-processed tag. Even in English: "a moral behaviour" and "amoral behaviour" are (to a large extent) opposites!
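
To illustrate the collision problem, here is a sketch of such a reverse index. `approximate_clean` is only a guess at Flickr's processing (lowercase, then drop everything that is not a letter or digit). It reproduces the examples in this thread but is not Flickr's documented algorithm. All names are hypothetical.

```python
import re
from collections import defaultdict


def approximate_clean(raw: str) -> str:
    """Approximate Flickr's tag cleaning. An assumption, not the real rule."""
    return re.sub(r"[\W_]+", "", raw.lower())


def build_reverse_index(raw_tags: list[str]) -> dict[str, set[str]]:
    """Group raw tags under the cleaned form they collapse to."""
    index: dict[str, set[str]] = defaultdict(set)
    for raw in raw_tags:
        index[approximate_clean(raw)].add(raw)
    return index


def ambiguous_entries(index: dict[str, set[str]]) -> dict[str, set[str]]:
    """Cleaned tags that map back to more than one raw tag."""
    return {cleaned: raws for cleaned, raws in index.items() if len(raws) > 1}


index = build_reverse_index([
    "the grapes of wrath",
    "thegrapes of wrath",
    "a moral behaviour",
    "amoral behaviour",
    "cotton dress",
])
for cleaned, raws in sorted(ambiguous_entries(index).items()):
    print(cleaned, sorted(raws))
```

Any cleaned tag with more than one raw form cannot be reversed safely, so an index like this would at least need to flag those entries rather than pick one silently.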

There are probably ways of deciding the language of a work's tags based on other indications, but the example photo has tags in English, French, and Spanish.

The trade-offs would be huge. But if it's the only way we could do it without getting a generous grant from Flickr to be able to pull tags more rapidly, maybe we'd need to accept those trade-offs for the benefits we would see "most of the time". The worst case scenario is potentially very bad though, and maybe even worse than the current situation.


@zackkrida maybe a good opportunity to reach out to our Flickr contacts via CC? Maybe there are ways other than regular API calls we could get access to tags, which wouldn't play into our rate limit.

@stacimc
Collaborator

stacimc commented Sep 16, 2024

Some quick thoughts:

  • I've raised concern about the Flickr rate limiting in the past too, which is why we don't have Flickr reingestion turned on. I would be very happy if we could get confirmation that we can turn that reingestion DAG back on in addition to this.
  • The Flickr DAG works by splitting the ingestion day up into time intervals and requesting all images last updated in each interval. For reasons we don't understand, if you use a large time range (the full 24 hours being the obvious choice), the API will return only a small fraction of the actual results. If you shrink the time range you get more results but loads of duplicates. The consequence is that the Flickr DAG ingests tons of duplicates -- for the most recent run we got ~67k unique records and discarded ~86k duplicates! If we do this we absolutely need to track those at ingestion time and make sure we only hit the single image endpoint for unique results to avoid hundreds of thousands of unnecessary calls.
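
A minimal sketch of the "only hit the single image endpoint for unique results" idea. It assumes each record from the search endpoint is a dict with an `id` key, and the function name is made up. The real DAG's record shape may differ.

```python
from collections.abc import Iterable, Iterator


def unique_photo_ids(records: Iterable[dict]) -> Iterator[str]:
    """Yield each photo id once, in first-seen order, across all intervals."""
    seen: set[str] = set()
    for record in records:
        photo_id = record["id"]  # assumed field name
        if photo_id in seen:
            continue  # duplicate from an overlapping time interval
        seen.add(photo_id)
        yield photo_id


# Records as they might arrive from overlapping time intervals.
records = [{"id": "1"}, {"id": "2"}, {"id": "1"}, {"id": "3"}, {"id": "2"}]
print(list(unique_photo_ids(records)))
```

The `seen` set has to span the whole run, not a single time interval, since the duplicates come from overlapping intervals. With the ~67k unique records cited above, the per-image lookups would fit under the 82,800-per-day figure estimated earlier in the thread, assuming that estimate holds.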

...Actually, if we go down this route I think we could delete the Flickr reingestion DAG and instead only do backfills like #4452 🤔 As in, rather than trying to run ingestion for past ingestion dates (which we know is not effective for backfilling because of the issues with Flickr's API), we run reingestion on sets of image ids from our own catalog.

Projects
Status: 📋 Backlog
Development

No branches or pull requests

3 participants