Releases · stanfordnlp/stanza

29 Dec 06:54

v1.10.1

af3d42b

In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.

Other notable changes:

Include a contextual lemmatizer in English for 's -> be or have in the default_accurate package. A HI model is also built, and others may follow. Now with fewer bugs at startup. #1422
Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
Pytorch compatibility: set weights_only=True when loading models. #1430 #1429
augment MWT tokenization to accommodate unexpected ' characters, including " used in "s - #1437 #1436
when training the lemmatizer, take advantage of CorrectForm annotations in the UD treebanks dbdf429
add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
add VLSP 2023 constituency dataset: 1159d0d

Bugfixes:

raise_for_status earlier when failing to download something, so that the proper error gets displayed. Thank you @pattersam #1432
Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
missing text for a Document no longer causes the NER model to crash: 0732628 #1428
tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423

Contributors

pattersam

23 Dec 04:27

AngledLuffa

v1.10.0

ad17b27

v1.10.0 - rebuild with UD 2.15

Other notable changes:

Include a contextual lemmatizer in English for 's -> be or have in the default_accurate package. A HI model is also built, and others may follow. #1422
Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
Pytorch compatibility: set weights_only=True when loading models. #1430 #1429
augment MWT tokenization to accommodate unexpected ' characters, including " used in "s - #1437 #1436
when training the lemmatizer, take advantage of CorrectForm annotations in the UD treebanks dbdf429
add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
add VLSP 2023 constituency dataset: 1159d0d

Bugfixes:

raise_for_status earlier when failing to download something, so that the proper error gets displayed. Thank you @pattersam #1432
Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
missing text for a Document no longer causes the NER model to crash: 0732628 #1428
tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423

Contributors

pattersam

12 Sep 23:17

AngledLuffa

v1.9.2

539760c

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
actually include the visualization: #1421 thank you @bollwyvl
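
The punctuation-dropping idea behind the tokenizer fix can be sketched in a few lines. This is an illustration only, not Stanza's training code: the function name `maybe_drop_final_punct`, the `drop_prob` parameter, and the punctuation set are all assumptions made for the example.

```python
import random

# Hypothetical punctuation set, chosen only for this illustration.
FINAL_PUNCT = {".", "!", "?"}

def maybe_drop_final_punct(batch, drop_prob=0.1, rng=random):
    """Return `batch` (a list of token lists), occasionally without the
    sentence-final punctuation of its last sentence.

    The input is never modified in place.
    """
    if not batch or not batch[-1]:
        return batch
    last = batch[-1]
    if len(last) > 1 and last[-1] in FINAL_PUNCT and rng.random() < drop_prob:
        # Build a new batch whose last sentence lacks its final token.
        return batch[:-1] + [last[:-1]]
    return batch

batch = [["Hola", "."], ["Adiós", "."]]
always_dropped = maybe_drop_final_punct(batch, drop_prob=1.0)
never_dropped = maybe_drop_final_punct(batch, drop_prob=0.0)
```

Seeing some batches that end without punctuation gives the model evidence that the last character is not always its own token.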

Contributors

bollwyvl

12 Sep 19:40

AngledLuffa

v1.9.1

174768a

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
actually include the visualization: #1421 thank you @bollwyvl

Contributors

bollwyvl

12 Sep 07:23

AngledLuffa

v1.9.0

b999102

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0

20 Apr 18:58

AngledLuffa

v1.8.2

6e442a6

Old English, MWT improvements, and better memory management of Peft

Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.

Old English

Add Old English (ANG) annotation! Thank you to @dmetola #1365

MWT improvements

Fix words ending with -nna split into MWT stanfordnlp/handparsed-treebank@2c48d40 #1366
Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) #1371 #1378
Mark start_char and end_char on an MWT if it is composed of exactly its subwords 2384089 #1361
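
The "pieces add up to the whole" constraint for English can be illustrated as a simple consistency check. This is a minimal sketch under assumptions: the helper names are hypothetical, and whether Stanza compares case-sensitively or applies the check in exactly this way is not stated in these notes.

```python
def pieces_add_up(token_text, pieces):
    """True if the predicted word pieces concatenate back to the token."""
    return "".join(pieces) == token_text

def accept_split(token_text, pieces):
    """Keep a predicted split only when it is consistent with the token;
    otherwise fall back to treating the token as a single word."""
    if pieces_add_up(token_text, pieces):
        return list(pieces)
    return [token_text]

good = accept_split("cannot", ["can", "not"])   # pieces rebuild the token
bad = accept_split("don't", ["do", "not"])      # "donot" != "don't", so rejected
```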

Peft memory management

Previous versions were loading multiple copies of the transformer in order to use adapters. To save memory, we can use Peft's capacity to attach multiple adapters to the same transformer instead as long as they have different names. This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models. huggingface/peft#1523 #1381 #1384

Other bugfixes and minor upgrades

Fix crash when trying to load previously unknown language #1360 381736f
Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: d180ae0 #1367
Try to avoid OOM in the POS in the Pipeline by reducing its max batch length 4271813
Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) 597d48f
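
The `isatty` guard mentioned above follows a common defensive pattern, sketched here with the standard library only. The helper name `stream_is_tty` and the `PatchedStream` class are hypothetical and are not taken from Stanza's source.

```python
import io
import sys

def stream_is_tty(stream):
    """Return True only if `stream` has a callable isatty() that reports a terminal."""
    isatty = getattr(stream, "isatty", None)
    return callable(isatty) and bool(isatty())

class PatchedStream:
    """Stand-in for a monkeypatched stderr that lacks isatty()."""
    def write(self, text):
        pass

# Only enable terminal-specific progress bar behaviour when it is safe to do so.
use_fancy_progress = stream_is_tty(sys.stderr)
no_tty_for_patched = stream_is_tty(PatchedStream())   # no isatty attribute
no_tty_for_buffer = stream_is_tty(io.StringIO())      # has isatty, but not a terminal
```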

Other upgrades

Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation 57bfa8b #1356 (comment)
Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: 4048cae 15b136b
Add a constituency model for German 7a4f48c 86ddaab #1368

Contributors

Jemoka and dmetola

01 Mar 06:47

AngledLuffa

v1.8.1

c2d72bd

PEFT Integration (with bugfixes)

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results #1320
Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. #1335
NER also trained with peft: unfortunately, no consistent improvements to scores #1336
depparse includes peft: no consistent improvements yet #1337 #1344
Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser #1341
Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. #1347
Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. #1348
Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. #1346 #1345
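
As a rough illustration of the goeswith filtering, the sketch below drops split words from a sentence before it would be used as lemmatizer training data. It assumes each word is a dict with CoNLL-U style `id`, `head`, and `deprel` keys, and it assumes the word that a goeswith piece attaches to is dropped as well; neither detail is confirmed by these notes.

```python
def drop_goeswith(sentence):
    """Remove every word attached with goeswith, plus the word it attaches
    to, since together they form one surface word with one lemma."""
    split_heads = {w["head"] for w in sentence if w["deprel"] == "goeswith"}
    return [
        w for w in sentence
        if w["deprel"] != "goeswith" and w["id"] not in split_heads
    ]

# A made-up sentence in which an email address was split into two words.
sentence = [
    {"id": 1, "text": "mail", "head": 0, "deprel": "root"},
    {"id": 2, "text": "jane@", "head": 1, "deprel": "obj"},
    {"id": 3, "text": "example.com", "head": 2, "deprel": "goeswith"},
]
kept = [w["text"] for w in drop_goeswith(sentence)]
```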

Features

Include SpacesAfter annotations on words in the CoNLL output of documents: #1315 #1322
Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
wandb support for coref #1338
Coref annotator breaks length ties using POS if available #1326 c4c3de5

Bugfixes

Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
Coref training rounding error #1342
Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice #1354
V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. #1350 #1294
Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
Clarify error when a language is only partially handled: da01644 #1310
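
The simplest form of chopping a long input to fit a transformer's length limit is fixed-size windowing, sketched below. This is not Stanza's implementation: the function name and the non-overlapping window strategy are assumptions, and a real system might overlap windows or split at sentence boundaries instead.

```python
def chop_into_windows(tokens, max_len):
    """Split `tokens` into consecutive chunks of at most `max_len` items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

windows = chop_into_windows(["a", "b", "c", "d", "e"], max_len=2)
```

Each window can then be encoded separately and the per-token outputs concatenated, at the cost of losing context across window boundaries.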

Additional 1.8.1 Bugfixes

Older POS models not loaded correctly... need to use .get() 13ee3d5 #1357
Debug logging for the Constituency retag pipeline to better support someone working on Icelandic 6e2520f #1356
device arg in MultilingualPipeline would crash if device was passed for an individual Pipeline: 44058a0

Contributors

ider-zh and sterliakov

25 Feb 07:38

AngledLuffa

v1.8.0

17eb6fc

PEFT integration

Integrating PEFT into several different annotators

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results #1320
Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. #1335
NER also trained with peft: unfortunately, no consistent improvements to scores #1336
depparse includes peft: no consistent improvements yet #1337 #1344
Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser #1341
Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. #1347
Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. #1348
Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. #1346 #1345

Features

Include SpacesAfter annotations on words in the CoNLL output of documents: #1315 #1322
Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
wandb support for coref #1338
Coref annotator breaks length ties using POS if available #1326 c4c3de5

Bugfixes

Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
Coref training rounding error #1342
Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice #1354
V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. #1350 #1294
Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
Clarify error when a language is only partially handled: da01644 #1310

Contributors

ider-zh and sterliakov

03 Dec 06:47

AngledLuffa

v1.7.0

5948c9f

v1.7.0: Neural coref!

Neural coref processor added!

Conjunction-Aware Word-Level Coreference Resolution
https://arxiv.org/abs/2310.06165
original implementation: https://github.com/KarelDO/wl-coref/tree/master

Updated form of Word-Level Coreference Resolution
https://aclanthology.org/2021.emnlp-main.605/
original implementation: https://github.com/vdobrovolskii/wl-coref

If you use Stanza's coref module in your work, please be sure to cite both of the above papers.

Special thanks to vdobrovolskii, who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.

Currently there is one model provided, a transformer based English model trained from OntoNotes. The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.

Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models

#1309

Interface change: English MWT

English now has an MWT model by default. Text such as won't is now marked as a single token, split into two words, will and not. Previously it was expected to be tokenized into two pieces, but the Sentence object containing that text would not have a single Token object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.

Code that used to operate with for word in sentence.words will continue to work as before, but for token in sentence.tokens will now produce one object for MWT such as won't, cannot, Stanza's, etc.

Pipeline creation will not change, as MWT is automatically (but not silently) added at Pipeline creation time if the language and package includes MWT.

f22dceb 27983ae
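
The difference between iterating over tokens and iterating over words can be pictured with a toy data model. The classes below are stand-ins written for this illustration and are not Stanza's `Token`, `Word`, or `Sentence` classes; only the token/word relationship described above is modeled.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:            # toy stand-in, not stanza's Word class
    text: str

@dataclass
class Token:           # toy stand-in, not stanza's Token class
    text: str
    words: List[Word] = field(default_factory=list)

@dataclass
class Sentence:        # toy stand-in, not stanza's Sentence class
    tokens: List[Token]

    @property
    def words(self):
        # Flatten the words of every token, in order.
        return [w for t in self.tokens for w in t.words]

sentence = Sentence([
    Token("I", [Word("I")]),
    Token("won't", [Word("will"), Word("not")]),   # one MWT token, two words
    Token("go", [Word("go")]),
])
token_texts = [t.text for t in sentence.tokens]
word_texts = [w.text for w in sentence.words]
```

Code that loops over words sees four items here, while code that loops over tokens sees three, one of which covers two words.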

Other updates

NetworkX representation of enhanced dependencies. Allows for easier usage of Semgrex on enhanced dependencies - searching over enhanced dependencies requires CoreNLP >= 4.5.6 #1295 #1298
Sentence ending punct tags improved for English to avoid labeling non-punct as punct (and POS is switched to using a DataLoader) #1000 #1303
Optional rewriting of MWT after the MWT processing step - will give the user more control over fixing common errors. Although we still encourage posting issues on github so we can fix them for everyone! #1302
Remove deprecated output methods such as conll_as_string and doc2conll_text. Use "{:C}".format(doc) instead e01650f
Mixed OntoNotes and WW NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
Sentences now have a doc_id field if the document they are created from has a doc_id. 8e2201f
Optional processors added in cases where the user may not want the model we have run by default. For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model) 3d90d2b
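
The `"{:C}".format(doc)` idiom relies on Python's `__format__` hook, which receives the text after the colon as a format spec. The toy class below shows the mechanism only; `ToyDoc` and its tab-separated output are invented for the example and do not reproduce Stanza's `Document` or its actual CoNLL output.

```python
class ToyDoc:
    """Toy stand-in for a document; not stanza's Document class."""

    def __init__(self, rows):
        self.rows = rows  # list of (index, text) pairs

    def __format__(self, spec):
        # In this toy, the "C" spec selects a tab-separated, CoNLL-like rendering.
        if spec == "C":
            return "\n".join(f"{i}\t{text}" for i, text in self.rows)
        return str(self.rows)

doc = ToyDoc([(1, "Hello"), (2, "world")])
conll_text = "{:C}".format(doc)
same_text = f"{doc:C}"   # f-strings route through the same hook
```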

Updated requirements

Support dropped for python 3.6 and 3.7. The peft module used for finetuning the transformer used in the coref processor does not support those versions.
Added peft as an optional dependency to transformer based installations
Added networkx as a dependency for reading enhanced dependencies. Added toml as a dependency for reading the coref config.

Contributors

Jemoka and KarelDO

06 Oct 05:16

AngledLuffa

v1.6.1

c65b669

Multiple default models and a combined EN NER model

V1.6.1 is a patch of a bug in the Arabic POS tagger.

We also mark Python 3.11 as supported in the setup.py classifiers. This will be the last release that supports Python 3.6.

Multiple model levels

The package parameter for building the Pipeline now has three default settings:

default, the same as before, where POS, depparse, and NER use the charlm, but lemma does not
default_fast, where POS and depparse are built without the charlm, making them substantially faster on CPU. Some languages currently have non-charlm NER as well
default_accurate, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language. Suggestions for more transformers to use are welcome

Furthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into -fast and -accurate versions for each UD dataset.

PR: #1287

addresses #1259 and #1284

Multiple output heads for one NER model

The NER models now can learn multiple output layers at once.

#1289

Theoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected. The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.

Results of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:

                                   OntoNotes  WorldWide
original ontonotes on worldwide:   88.71      69.29
simplify-separate                  88.24      75.75
simplify-connected                 88.32      75.47

We also produced combined models for nocharlm and with Electra as the input encoding. The new English NER models are the packages ontonotes-combined_nocharlm, ontonotes-combined_charlm, and ontonotes-combined_electra-large.

Future plans include using multiple NER datasets for other models as well.

Other features

Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (ty @Jemoka). When creating a Pipeline, you can now provide a callable via the tokenize_postprocessor parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the Pipeline #1290
Finetuning for transformers in the NER models: have not yet found helpful settings, though 45ef544
SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code #1279 88cd0df
charlm for PT (improves accuracy on non-transformer models): c10763d
build models with transformers for a few additional languages: MR, AR, PT, JA 45b3875 0f3761e c55472a c10763d
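
A callable suitable for the `tokenize_postprocessor` parameter might look like the sketch below. It assumes the callable receives and returns a list of sentences, each a list of token strings; that contract is an assumption here, so check the Stanza tokenization documentation for the exact input and output format. The function name `merge_hashtags` is hypothetical.

```python
def merge_hashtags(sentences):
    """Rejoin a lone '#' token with the token that follows it."""
    fixed = []
    for tokens in sentences:
        merged = []
        for tok in tokens:
            if merged and merged[-1] == "#":
                # Glue this token onto the preceding bare '#'.
                merged[-1] = "#" + tok
            else:
                merged.append(tok)
        fixed.append(merged)
    return fixed

candidate = [["I", "love", "#", "nlp", "!"]]
adjusted = merge_hashtags(candidate)
```

Under the same assumption, the function would be supplied when building the Pipeline, for example as `tokenize_postprocessor=merge_hashtags`.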

Bugfixes

V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: b56f442
Scenegraph CoreNLP connection needed to be checked before sending messages: stanfordnlp/CoreNLP#1346 (comment) c71bf3f
run_ete.py was not correctly processing the charlm, meaning the whole thing wouldn't actually run 16f29f3
Chinese NER model was pointing to the wrong pretrain #1285 82a0215

Contributors

Jemoka
