📚 Inaccurate pre-trained model predictions master thread #3052

Open
ines opened this issue Dec 14, 2018 · 141 comments
Labels
models (Issues related to the statistical models), perf / accuracy (Performance: accuracy)

Comments

@ines
Member

ines commented Dec 14, 2018

This thread is a master thread for collecting problems and reports related to incorrect and/or problematic predictions of the pre-trained models.

Why a master thread instead of separate issues?

GitHub now supports pinned issues, which lets us create master threads more easily without them getting buried.

Users often report issues that come down to incorrect predictions made by the pre-trained statistical models. Those are all good and valid, and can include very useful test cases. However, having a lot of open issues around minor incorrect predictions across various languages also makes it more difficult to keep track of the reports. Unlike bug reports, they're much more difficult to act on. Sometimes, mistakes a model makes can indicate deeper problems that occurred during training or when preprocessing the data. Sometimes they can give us ideas for how to use data augmentation to make the models less sensitive to very small variations like punctuation or capitalisation.

Other times, it's just something we have to accept. A model that's 90% accurate will, on average, make a mistake on one in every 10 predictions. A model that's 99% accurate will be wrong about once in every 100 predictions.
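
To make the arithmetic concrete, here is a small sketch (not from the spaCy codebase) of how per-prediction accuracy translates into expected errors. It assumes, as a simplification, that errors on individual predictions are independent; the function names are illustrative.

```python
def expected_errors(accuracy: float, n_predictions: int) -> float:
    """Average number of wrong predictions out of n_predictions."""
    return (1.0 - accuracy) * n_predictions


def prob_at_least_one_error(accuracy: float, n_predictions: int) -> float:
    """Chance that at least one of n_predictions is wrong.

    Assumes errors are independent, which real models only approximate.
    """
    return 1.0 - accuracy ** n_predictions


for acc in (0.90, 0.99):
    # Expected errors per 100 predictions, and the chance of at least
    # one error in a 20-token sentence.
    print(acc, expected_errors(acc, 100), round(prob_at_least_one_error(acc, 20), 3))
```

Under that independence assumption, even a 99% accurate tagger gets at least one token wrong in about 18% of 20-token sentences.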

The main reason we distribute pre-trained models is that it makes it easier for users to build their own systems by fine-tuning pre-trained models on their data. Of course, we want them to be as good as possible, and we're always optimising for the best compromise of speed, size and accuracy. But we won't be able to ship pre-trained models that are always correct on all data ever.

For many languages, we're also limited by the resources available, especially when it comes to data for named entity recognition. We've already made substantial investments into licensing training corpora, and we'll continue doing so (including running our own annotation projects with Prodigy ✨) – but this will take some time.

Reporting incorrect predictions in this thread

If you've come across suspicious predictions in the pre-trained models (tagger, parser, entity recognizer) or you want to contribute test cases for a given language, feel free to submit them here. (Test cases should be "fair" and useful for measuring the model's general accuracy, so single words, significant typos and very ambiguous parses aren't usually that helpful.)

You can check out our new models test suite for spaCy v2.1.0 to see the tests we're currently running.
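
As a rough illustration of what a contributed test case could look like, the sketch below compares expected annotations with what a pipeline predicted. It is not the format used by the models test suite: `find_mismatches` is a hypothetical helper, and the stand-in tokens only mimic the `text` and `pos_` attributes of spaCy tokens so that the example runs without a model installed.

```python
from types import SimpleNamespace


def find_mismatches(tokens, expected, attr="pos_"):
    """Return (text, expected, predicted) for every token whose attribute differs.

    `tokens` is any sequence of objects with `.text` and the given attribute,
    for example a spaCy Doc. `expected` holds one value per token, in order.
    """
    if len(tokens) != len(expected):
        raise ValueError("expected one value per token")
    return [
        (tok.text, want, getattr(tok, attr))
        for tok, want in zip(tokens, expected)
        if getattr(tok, attr) != want
    ]


# Stand-in for a parsed Doc (hypothetical predictions, for illustration only).
doc = [
    SimpleNamespace(text="Book", pos_="NOUN"),
    SimpleNamespace(text="cheap", pos_="ADJ"),
    SimpleNamespace(text="flights", pos_="NOUN"),
    SimpleNamespace(text=".", pos_="PUNCT"),
]
print(find_mismatches(doc, ["VERB", "ADJ", "NOUN", "PUNCT"]))
```

Reporting the sentence together with the expected and predicted values in this way makes a case easy to turn into a regression test later.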

ines added the labels "models" (Issues related to the statistical models) and "perf / accuracy" (Performance: accuracy) on Dec 14, 2018
ines pinned this issue on Dec 14, 2018
ines changed the title from "💫 Inaccurate predictions master thread" to "💫 Inaccurate pre-trained model predictions master thread" on Dec 14, 2018
@adrianeboyd
Contributor

@Woodchucks: We also noticed this, and it appears to be a problem related to the whitespace augmentation in the training settings for a tagger that's trained on its own rather than with a shared tok2vec, where Polish is the only language in the provided trained pipelines with a completely independent tagger component.

To be honest the behavior is pretty bizarre and surprising. It doesn't show up (at least not enough to lead to much lower TAG scores) in evaluations of the dev data, which might be due to fewer unseen tokens in the dev data from the same corpus, and it's still possible there's an underlying bug. We haven't noticed this for other languages, so it seems like training a tagger with a shared tok2vec (with a morphologizer, lemmatizer, and/or parser) prevents the model from predicting that unseen tokens might be _SP, but in this case, the tagger on its own seems to lump whitespace tokens and unseen tokens into the same category.

The upcoming v3.5.0 trained pipelines for Polish should improve this by adding IS_SPACE as a feature so that the model has enough information to differentiate whitespace tokens from other tokens.

@Woodchucks

@adrianeboyd Thank you for the fast reply. I didn't notice your response, so I deleted my comment and published it again as issue #12002. Sorry for the inconvenience. Glad to hear that the new version will have the IS_SPACE feature implemented.

@stefan-veezoo

Hi, I encountered an issue where in German the token "20-Plus" is wrongly tagged as "SPACE", which could hint towards a data issue:


https://demos.explosion.ai/displacy?text=Kunden%20mit%20dem%20Produkt%2020-Plus&model=de_core_news_sm&cpu=1&cph=1

@adrianeboyd
Contributor

This is related to the same underlying issue as #12002, where data augmentation involving whitespace seems to sometimes lead to unknown words being tagged as SPACE.

Maybe we should just add IS_SPACE to all the models now and consider updating SHAPE to normalize spaces in v4 so that we can drop IS_SPACE, since there's a slight speed hit.
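
A quick way to spot this symptom in your own data is to look for tokens that are labelled as whitespace even though their text is not whitespace. This is only a diagnostic sketch: `suspicious_space_tokens` is a hypothetical helper, it assumes tokens expose `text`, `pos_` and `tag_` as spaCy tokens do, and the stand-in tokens below replace a real pipeline so the example runs on its own.

```python
from types import SimpleNamespace


def suspicious_space_tokens(tokens):
    """Return texts of tokens labelled SPACE/_SP whose text is not whitespace."""
    return [
        tok.text
        for tok in tokens
        if (tok.pos_ == "SPACE" or tok.tag_ == "_SP") and not tok.text.isspace()
    ]


# Stand-in tokens mirroring the report above (values are illustrative).
tokens = [
    SimpleNamespace(text="Produkt", pos_="NOUN", tag_="NN"),
    SimpleNamespace(text="20-Plus", pos_="SPACE", tag_="_SP"),
    SimpleNamespace(text=" ", pos_="SPACE", tag_="_SP"),
]
print(suspicious_space_tokens(tokens))
```

With a loaded pipeline, the same function could be called on `nlp(text)` directly, since a `Doc` iterates over its tokens.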

@probavee
Contributor

probavee commented Feb 7, 2023

Hello! Following the answer I got in this discussion, I'm reposting my issue on this master thread.
I'm using the French transformer model fr_dep_news_trf.
When processing the sentence "Je vais skier dans les Alpes de France cet hiver.", the model correctly predicts that "Alpes" is a PROPN.
But when I duplicate the sentence, as in "Je vais skier dans les Alpes de France cet hiver. Je vais skier dans les Alpes de France cet hiver.", it tags "Alpes" as a NOUN.

Here are two examples with different versions of the model, run in a Linux environment with Python 3.10.

spacy-transformers == 1.2.0
spacy == 3.5.0
fr_dep_news_trf == 3.5.0
> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " *10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN')]

With another version, there are far fewer wrong predictions, but some still appear.

spacy-transformers == 1.1.9
spacy == 3.4.4
fr_dep_news_trf == 3.4.0
> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " *10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]

[('alpe', 'NOUN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('alpe', 'NOUN')]

I'd like to know whether this is expected from the model. For example, is it just because I'm not giving it enough context, or is it something else?
The word France in these sentences is always tagged correctly.
It seems that there is always a threshold number of tokens beyond which the predictions go wrong.

Thank you for your help!

@postnubilaphoebus

spaCy's English named entity recognition has issues with apostrophes.
Using spaCy 3.5.0, please try the following code:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
doc = nlp("That had been Megan's plan when she got him dressed earlier.")
labels = [ent.label_ for ent in doc.ents]
entity_text = [ent.text for ent in doc.ents]
print(labels) 
print(entity_text)

This returns [ORG] for Megan instead of [PERSON]. Similar issues occur with, for example, the word "Applebee's".

@rmitsch
Contributor

rmitsch commented Feb 9, 2023

Thanks for reporting this, @postnubilaphoebus. The small model doesn't do that well with names that don't occur often enough in the training data. I recommend giving en_core_web_md a shot (it infers the correct entity label in your example).

@stestagg

Hi!

We've spotted some NSUBJ/DOBJ mixups with parsing sentences using en_core_web_trf (3.5) that start with Make:

For example:

import spacy
print(f'Spacy={spacy.__version__}')
en = spacy.load('en_core_web_trf')
print(f'Lang={en.path.name}')
sent = en('Make the compression used between map reduce tasks configurable.')
' '.join([f'{t}({t.dep_})' for t in sent])

Outputs:

Spacy=3.5.0
Lang=en_core_web_trf-3.5.0

'Make(ROOT) the(det) compression(nsubj) used(acl) between(prep) map(nmod) reduce(compound) tasks(pobj) configurable(ccomp) .(punct)'

There should not be an nsubj in this sentence.
This should be:

'Make(ROOT) the(det) compression(dobj) used(acl) between(prep) map(nmod) reduce(compound) tasks(pobj) configurable(ccomp) .(punct)'

Other examples include:

Make the output of the reduce side plan optimized by the correlation optimizer more reader-friendly.
Make ZooKeeper easier to test - support simulating a connection loss
Make compaction more robust when stats update fails
...

All of these put an nsubj where there should be a dobj.

Note: I tested 3.3.4 and 3.4.4, and they seemed to do the same thing.

@adrianeboyd
Contributor

Imperatives and questions are two very common things that most of our trained pipelines perform poorly on because they are rare in typical newspaper training data.

@cbowdon

cbowdon commented May 5, 2023

What is the NER training data for English please? I see some models (e.g. German) are trained on WikiNER but none of the referenced sources for English models (e.g. here) are related to NER.

Apologies if this is the wrong place to ask, I was drawn here from other related issues.

@adrianeboyd
Contributor

Hi @cbowdon, OntoNotes does contain NER annotation, see: https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

@cbowdon

cbowdon commented May 5, 2023

@adrianeboyd Thank you!

@giova-p

giova-p commented Jul 28, 2023

Hi there!

I've come across an anomaly in the parsing component of the 'en_core_web_sm' model. Specifically, I've noticed that the verb 'need' is sometimes labeled as the root of the sentence, while in other cases, it's labeled as an 'aux'.

Even more strangely, when the same sentence is repeated twice or more, the behavior of the parsing component becomes erratic. Take this example: "the member states need not do something. the member states need not do something." In the first sentence, the subject is a "child" of the root verb 'do', while in the second sentence (which is identical!), the subject is the child of the 'aux'.

I've tried to replicate this behavior with other examples, but the anomaly is not always present. I'd appreciate any insights or suggestions on whether you think this could arise in other circumstances as well.

Thanks!
atb
g.

@adrianeboyd
Contributor

Hi @giova-p, yes, the predictions of the statistical models depend on a context window that can go beyond a single sentence, so you will see differences like this in practice.

A pipeline should output the same predictions for the exact same input text string every time, but if anything is modified in the text, even adding whitespace, you may see different predictions.
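
To see how much the predictions for one surface form vary across a longer text, it can help to tally the labels each form receives. The sketch below is an illustration only: `label_variation` is a hypothetical helper, and the input pairs reproduce the "Alpes" counts from the French report earlier in this thread rather than calling a model. The PROPN label for "France" is an assumption, since that report only says the word is always tagged well.

```python
from collections import Counter, defaultdict


def label_variation(pairs):
    """Map each surface form to a Counter of its labels.

    Only forms that received more than one distinct label are returned.
    """
    seen = defaultdict(Counter)
    for text, label in pairs:
        seen[text][label] += 1
    return {text: counts for text, counts in seen.items() if len(counts) > 1}


# (text, pos) pairs for the sentence repeated ten times, as reported above.
pairs = (
    [("Alpes", "NOUN")]
    + [("Alpes", "PROPN")] * 8
    + [("Alpes", "NOUN")]
    + [("France", "PROPN")] * 10  # assumed label, see the note above
)
print(label_variation(pairs))
```

With a loaded pipeline, the pairs could be built as `[(t.text, t.pos_) for t in doc]`.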

@Arjuman23

I have identified a discrepancy in the entities detected by the "en_ner_bc5cdr_md-0.5.1" model between results obtained on a Windows system and on an Ubuntu system. According to the readme file of the "en_ner_bc5cdr_md-0.5.1" model, it is trained for spaCy versions up to 3.5.0. This holds true on the Windows system: whenever I set the spaCy version to a value above 3.5.0, named entity recognition (NER) results are no longer produced. The model en_ner_bc5cdr_md-0.5.0 worked irrespective of the spaCy version.

However, things were different when I ran the same experiment on an Ubuntu system. There, the "en_ner_bc5cdr_md-0.5.1" model generated NER outputs regardless of the spaCy version I used. I tested it with versions such as 3.6.1 as well as lower ones.

This leads me to the question: why does the behavior differ between the Windows and Ubuntu systems? Is this a known issue? Am I missing something?

@svlandeg
Member

Hi @Arjuman23,

If I understand you correctly, both en_ner_bc5cdr_md-0.5.0 and en_ner_bc5cdr_md-0.5.1 work fine on Ubuntu & Windows within the spaCy ranges specified for these models, right?

From the release notes, I gather that the 0.5.0 models were trained with 3.2.3 and the 0.5.1 models with 3.4.x. Note that we don't actually train or maintain these models - AllenAI does.

In general, you can run python -m spacy validate to double check whether a model in your environment is compatible with the spaCy version. If it's not, I'm afraid we can't really make any guarantees about its behaviour.
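
For illustration only, the sketch below shows the kind of comparison such a check involves: a model package records a version range for spaCy (the `spacy_version` field of its `meta.json`), and the installed version has to fall inside it. The parser here is deliberately simplified and hypothetical. It only understands comma-separated `>=` and `<` constraints with plain numeric versions, and it is not how spaCy itself implements the check.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.4.1' into (3, 4, 1). Only plain numeric versions are handled."""
    return tuple(int(part) for part in version.strip().split("."))


def in_range(installed: str, constraints: str) -> bool:
    """Check `installed` against constraints such as '>=3.4.1,<3.5.0'."""
    current = parse_version(installed)
    for constraint in constraints.split(","):
        constraint = constraint.strip()
        if constraint.startswith(">="):
            if current < parse_version(constraint[2:]):
                return False
        elif constraint.startswith("<") and not constraint.startswith("<="):
            if current >= parse_version(constraint[1:]):
                return False
        else:
            # Anything else (==, <=, ~=, ...) is outside this sketch.
            raise ValueError(f"unsupported constraint: {constraint!r}")
    return True


# The range below is an example value, not taken from any particular model.
print(in_range("3.4.4", ">=3.4.1,<3.5.0"))
print(in_range("3.6.1", ">=3.4.1,<3.5.0"))
```

In practice, `python -m spacy validate` remains the more reliable way to check an environment.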

https://github.com/allenai/scispacy/issues

@Arjuman23

Hi @svlandeg,
Thank you for your response. Much appreciated.
You've got it right: both models work fine within the spaCy ranges specified in their readme files, but only on Windows. On Ubuntu, they also work with the latest spaCy versions (e.g. 3.6.1) without any hassle.
I totally agree that AllenAI maintains them, but I didn't know how to report this to them. Hence I came down to its roots :P
If you can connect me to them, it would be helpful.

@svlandeg
Member

You could contact them through their issue tracker, but to be honest I'm not sure there's a bug to be solved here. The expected behaviour is that the models work within their range, and not outside of it. They might accidentally work on some systems outside of the "correct" spaCy range, for various reasons I'm not sure of. Again, you can ask them / report this to them, but I don't think there's something to be fixed here (I agree it's weird behaviour though).

@Mindful

Mindful commented Oct 29, 2023

I'm not sure if this counts as a pre-trained model prediction given that the tokenizer is rule-based, but it looks like spaCy's English tokenizer splits the verb "wed". See below:
https://demos.explosion.ai/displacy?text=The%20couple%20was%20wed%20yesterday.&model=en_core_web_sm&cpu=1&cph=1

If this isn't a mistake, I can imagine it might be a way to deal with common typos of we'd as wed, but it's a little inconvenient.

edit: the same thing happens with the noun cant. I'm not sure if there's a good way to fix this, it seems like you would need POS or syntax information to make judgements about whether something was likely to be a typo or not.


@cyriaka90

cyriaka90 commented Dec 21, 2023

Hey, here are some inaccurate parses I encountered (all using spacy version 3.7.2):

  • Portuguese (pt_core_news_sm):

      1. doc = nlp("Reserve voos baratos.")
        print(doc.to_json())

        {'text': 'Reserve voos baratos.', 'ents': [{'start': 0, 'end': 7, 'label': 'LOC'}], 'sents': [{'start': 0, 'end': 21}], 'tokens': [{'id': 0, 'start': 0, 'end': 7, 'tag': 'PROPN', 'pos': 'PROPN', 'morph': 'Gender=Fem|Number=Sing', 'lemma': 'Reserve', 'dep': 'ROOT', 'head': 0}, {'id': 1, 'start': 8, 'end': 12, 'tag': 'NOUN', 'pos': 'NOUN', 'morph': 'Gender=Fem|Number=Plur', 'lemma': 'voos', 'dep': 'nsubj', 'head': 0}, {'id': 2, 'start': 13, 'end': 20, 'tag': 'ADJ', 'pos': 'ADJ', 'morph': 'Gender=Masc|Number=Plur', 'lemma': 'barato', 'dep': 'amod', 'head': 1}, {'id': 3, 'start': 20, 'end': 21, 'tag': 'PUNCT', 'pos': 'PUNCT', 'morph': '', 'lemma': '.', 'dep': 'punct', 'head': 0}]}

        The verb Reserve is parsed as PROPN and the lemma for voos is given as voos, but should be voo.

      2. doc = nlp("..., e a maioria das novidades já foram reveladas através de fotos vazadas.")

        ....{{'id': 14, 'start': 69, 'end': 74, 'tag': 'AUX', 'pos': 'AUX', 'morph': 'Mood=Ind|Number=Plur|Person=3|VerbForm=Fin', 'lemma': 'ser', 'dep': 'aux:pass', 'head': 15},...

        The verb foram should be parsed as past tense.

  • Greek (el_core_news_sm):
    doc = nlp(" Αφήστε τον εαυτό σας να εκπλαγείτε από τις συναρπαστικές δυνατότητες!")

    {'text': 'Αφήστε τον εαυτό σας να εκπλαγείτε από τις συναρπαστικές δυνατότητες!', 'ents': [], 'sents': [{'start': 0, 'end': 69}], 'tokens': [{'id': 0, 'start': 0, 'end': 6, 'tag': 'VERB', 'pos': 'VERB', 'morph': 'Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin|Voice=Pass', 'lemma': 'Αφήστε', 'dep': 'ROOT', 'head': 0}, ....

    The verb Αφήστε is parsed with mood Ind instead of Imp, also the lemma should be αφήνω.

  • English (en_core_web_sm):

      1. doc = nlp("90 % of Australians like him, the most of any country.")

        {'text': '90 % of Australians like him, the most of any country.', 'ents': [{'start': 0, 'end': 4, 'label': 'PERCENT'}, {'start': 8, 'end': 19, 'label': 'NORP'}], 'sents': [{'start': 0, 'end': 54}], 'tokens': [{'id': 0, 'start': 0, 'end': 2, 'tag': 'CD', 'pos': 'NUM', 'morph': 'NumType=Card', 'lemma': '90', 'dep': 'nummod', 'head': 1}, {'id': 1, 'start': 3, 'end': 4, 'tag': 'NN', 'pos': 'NOUN', 'morph': 'Number=Sing', 'lemma': '%', 'dep': 'ROOT', 'head': 1}, {'id': 2, 'start': 5, 'end': 7, 'tag': 'IN', 'pos': 'ADP', 'morph': '', 'lemma': 'of', 'dep': 'prep', 'head': 1}, {'id': 3, 'start': 8, 'end': 19, 'tag': 'NNPS', 'pos': 'PROPN', 'morph': 'Number=Plur', 'lemma': 'Australians', 'dep': 'pobj', 'head': 2}, ....

        Lemma for Australians should be Australian.

      2. doc = nlp("Then, as if to show that he could, he collapsed.")

        ...{ {'id': 8, 'start': 28, 'end': 33, 'tag': 'MD', 'pos': 'AUX', 'morph': 'VerbForm=Fin', 'lemma': 'could', 'dep': 'ccomp', 'head': 5}, ....

        Lemma for verb could should be can with tense=past.

  • German (de_core_news_sm):

      1. doc = nlp("Ein Reifen, der sich für längere Strecken genauso gut eignet wie für den Alltag.")

        ... {'id': 10, 'start': 54, 'end': 60, 'tag': 'VVPP', 'pos': 'VERB', 'morph': 'VerbForm=Part', 'lemma': 'eignen', 'dep': 'rc', 'head': 1}, {'id': 11, 'start': 61, 'end': 64, 'tag': 'KOKOM', 'pos': 'ADP', 'morph': '', 'lemma': 'wie', 'dep': 'cm', 'head': 12}, {'id': 12, 'start': 65, 'end': 68, 'tag': 'APPR', 'pos': 'ADP', 'morph': '', 'lemma': 'für', 'dep': 'cc', 'head': 5}, {'id': 13, 'start': 69, 'end': 72, 'tag': 'ART', 'pos': 'DET', 'morph': 'Case=Acc|Definite=Def|Gender=Masc|Number=Sing|PronType=Art', 'lemma': 'der', 'dep': 'nk', 'head': 14}, {'id': 14, 'start': 73, 'end': 79, 'tag': 'NN', 'pos': 'NOUN', 'morph': 'Case=Acc|Gender=Masc|Number=Sing', 'lemma': 'Alltag', 'dep': 'nk', 'head': 12}, {'id': 15, 'start': 79, 'end': 80, 'tag': '$.', 'pos': 'PUNCT', 'morph': '', 'lemma': '--', 'dep': 'punct', 'head': 1}]}

        Verb eignet should be present tense and not participle.

      2. doc = nlp("Das Epad lässt sich problemlos zum Picknick mitnehmen.")

        {'text': 'Das Epad lässt sich problemlos zum Picknick mitnehmen.', 'ents': [], 'sents': [{'start': 0, 'end': 54}], 'tokens': [{'id': 0, 'start': 0, 'end': 3, 'tag': 'ART', 'pos': 'DET', 'morph': 'Case=Nom|Definite=Def|Gender=Neut|Number=Sing|PronType=Art', 'lemma': 'der', 'dep': 'nk', 'head': 1}, {'id': 1, 'start': 4, 'end': 8, 'tag': 'NN', 'pos': 'NOUN', 'morph': 'Case=Nom|Gender=Neut|Number=Sing', 'lemma': 'Epad', 'dep': 'sb', 'head': 2}, {'id': 2, 'start': 9, 'end': 14, 'tag': 'VVFIN', 'pos': 'VERB', 'morph': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', 'lemma': 'lässn', 'dep': 'ROOT', 'head': 2}, ...

        The lemma for verb lässt should be lassen, not lässn.

      3. doc = nlp("So findet auch der stressigste Tag einen leckeren und entspannten Abschluss.")

        ....{'id': 7, 'start': 41, 'end': 49, 'tag': 'ADJA', 'pos': 'ADJ', 'morph': 'Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing', 'lemma': 'leck', 'dep': 'nk', 'head': 10}, {'id': 8, 'start': 50, 'end': 53, 'tag': 'KON', 'pos': 'CCONJ', 'morph': '', 'lemma': 'und', 'dep': 'cd', 'head': 7}, {'id': 9, 'start': 54, 'end': 65, 'tag': 'ADJA', 'pos': 'ADJ', 'morph': 'Case=Acc|Degree=Pos|Gender=Masc|Number=Sing', 'lemma': 'entspannt', 'dep': 'cj', 'head': 8}, {'id': 10, 'start': 66, 'end': 75, 'tag': 'NN', 'pos': 'NOUN', 'morph': 'Case=Dat|Gender=Masc|Number=Plur', 'lemma': 'Abschluss', 'dep': 'oa', 'head': 1}, {'id': 11, 'start': 75, 'end': 76, 'tag': '$.', 'pos': 'PUNCT', 'morph': '', 'lemma': '--', 'dep': 'punct', 'head': 1}]}

        The lemma for adjective leckeren should be lecker, not leck.

  • Croatian (hr_core_news_sm):
    doc = nlp("Kupi jabuku i knjigu.")

    {'text': 'Kupi jabuku i knjigu.', 'ents': [], 'sents': [{'start': 0, 'end': 21}], 'tokens': [{'id': 0, 'start': 0, 'end': 4, 'tag': 'Vmr3s', 'pos': 'VERB', 'morph': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', 'lemma': 'Kupi', 'dep': 'ROOT', 'head': 0},...

    Lemma for verb kupi should be kupiti.

  • Italian (it_core_news_sm):
    doc = nlp("Prenota voli economici.")

    {'text': 'Prenota voli economici.', 'ents': [{'start': 0, 'end': 7, 'label': 'MISC'}], 'sents': [{'start': 0, 'end': 23}], 'tokens': [{'id': 0, 'start': 0, 'end': 7, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Fem|Number=Sing', 'lemma': 'prenota', 'dep': 'nmod', 'head': 1}, {'id': 1, 'start': 8, 'end': 12, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Masc|Number=Plur', 'lemma': 'volo', 'dep': 'ROOT', 'head': 1}, {'id': 2, 'start': 13, 'end': 22, 'tag': 'A', 'pos': 'ADJ', 'morph': 'Gender=Masc|Number=Plur', 'lemma': 'economico', 'dep': 'amod', 'head': 1}, {'id': 3, 'start': 22, 'end': 23, 'tag': 'FS', 'pos': 'PUNCT', 'morph': '', 'lemma': '.', 'dep': 'punct', 'head': 1}]}

    The verb Prenota is parsed as NOUN instead of VERB.
    

@glangford

glangford commented Jan 15, 2024

The following Portuguese sentences, which all have a verb capitalized to start the sentence, result in an incorrect lemma for the verb (pt_core_news_lg, spacy 3.7.2)

"Trabalharam com honra e dignidade e estiveram entre os melhores."
"Fale só um bocadinho sobre o Festival."
"Surge detrás das cortinas."
"Encontrei as chaves."
"Reserve voos baratos." (this one is from the earlier comment #3052 (comment))

In each case, the lemma of the first word is given as the word unchanged.

If the first word is lower cased, the correct lemmas are produced (trabalhar, falar, surgir, encontrar, reservar).
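
To reproduce the comparison described above, one can feed the pipeline both the original sentence and a variant with only the first character lowercased. A minimal sketch follows; `lowercase_first` is a hypothetical helper, and lowercasing is meant as a diagnostic rather than a fix, since it would also lowercase sentence-initial proper nouns.

```python
def lowercase_first(sentence: str) -> str:
    """Lowercase only the first character, leaving the rest untouched."""
    return sentence[:1].lower() + sentence[1:]


sentences = [
    "Fale só um bocadinho sobre o Festival.",
    "Encontrei as chaves.",
]
for original in sentences:
    print(original, "->", lowercase_first(original))
```

With a loaded pipeline, comparing `nlp(original)[0].lemma_` with `nlp(lowercase_first(original))[0].lemma_` would show whether capitalization alone changes the lemma.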

@joprice

joprice commented Aug 21, 2024

The Portuguese word compartilharemos gets the wrong lemma in the sentence "Nós compartilharemos.": it produces compartilharemo when the sentence starts with a capital letter, and compartilharer when the initial letter is lowercased. The expected lemma is compartilhar.

@jomra

jomra commented Sep 13, 2024

In all the Spanish models I’ve tried, from small to large, the lemma of tendientes is resolved as tendient, which isn’t actually a word (it should be the singular form tendiente). The singular tendiente resolves correctly to tendiente. In all the cases I’ve tried, the word was in lower case, though at various positions in the original string.

@ivan-kleshnin

ivan-kleshnin commented Dec 27, 2024

Common Ex- prefix interpretation is completely broken (tested with multiple examples):

text: 'A director.'

[{'dep': 'det', 'head': director, 'pos': 'DET', 'tag': 'DT', 'token': A},
 {'dep': 'ROOT',
  'head': director,
  'pos': 'NOUN',
  'tag': 'NN',
  'token': director},
 {'dep': 'punct', 'head': director, 'pos': 'PUNCT', 'tag': '.', 'token': .}]

noun_chunks: [A director]
text: 'An ex-director.'

[{'dep': 'det', 'head': ex, 'pos': 'DET', 'tag': 'DT', 'token': An},
 {'dep': 'ROOT', 'head': ex, 'pos': 'NOUN', 'tag': 'NN', 'token': ex},
 {'dep': 'dobj', 'head': ex, 'pos': 'NOUN', 'tag': 'NN', 'token': -},
 {'dep': 'npadvmod', 'head': ex, 'pos': 'NOUN', 'tag': 'NN', 'token': director},
 {'dep': 'punct', 'head': ex, 'pos': 'PUNCT', 'tag': '.', 'token': .}]

noun_chunks: [An ex, -]

It is a systematic error, ex- is persistently treated as the main word in all ex-something combinations I tested.
Sm/md/lg models make no difference here.


Something similar happens with co-. E.g. co-founder is interpreted as co<-founder instead of co->founder (arrows point to heads). If I hack the tokenization to merge the above to a single token, especially with ex-, it significantly increases the number of nouns I should care about (in my heuristics). And other dash-split words are normally separated. So it's not a solution.
