In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.
Other notable changes:
- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package. Also built is a HI model. Others potentially to follow. Now with fewer bugs at startup. #1422
- Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
- PyTorch compatibility: set `weights_only=True` when loading models. #1430 #1429
- augment MWT tokenization to accommodate unexpected `'` characters, including `"` used in `"s` #1437 #1436
- when training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks dbdf429
- add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
- add VLSP 2023 constituency dataset: 1159d0d
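For readers unfamiliar with the `weights_only=True` change above: `torch.load(..., weights_only=True)` asks PyTorch to refuse arbitrary pickled objects and accept only tensors and plain containers. The sketch below is a standard-library analogy of that idea, not PyTorch's or Stanza's actual implementation; `RestrictedUnpickler`, `ALLOWED`, and `restricted_loads` are hypothetical names introduced only for illustration.

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that only resolves an explicit allowlist of globals.

    Illustrative stand-in for the idea behind weights_only=True;
    this is NOT how PyTorch implements it.
    """

    # Hypothetical allowlist: plain containers only.
    ALLOWED = {("collections", "OrderedDict")}

    def find_class(self, module, name):
        # Called whenever the pickle stream references a global object.
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Load pickled bytes, rejecting any global not on the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain data (dicts, lists, numbers) needs no globals and loads normally.
state = restricted_loads(pickle.dumps({"weight": [0.1, 0.2], "step": 3}))

# A payload that references an arbitrary callable is rejected.
try:
    restricted_loads(pickle.dumps(print))
except pickle.UnpicklingError as err:
    rejected = str(err)
```

The practical consequence for saved models is the same in spirit: a checkpoint that contains only weights and simple metadata still loads, while one that embeds arbitrary Python objects does not.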
Bugfixes:
- `raise_for_status` earlier when failing to download something, so that the proper error gets displayed. Thank you @pattersam #1432
- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
- reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
- similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
- missing text for a Document does not cause the NER model to crash: 0732628 #1428
- tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423
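To illustrate the `raise_for_status` fix listed above, here is a minimal sketch of why checking the HTTP status before touching the response body gives a clearer error. It assumes nothing about Stanza's actual download code: `FakeResponse` is a hypothetical stand-in that only mimics the `status_code`, `content`, and `raise_for_status()` members of a `requests` response, and `save_download` is an invented helper.

```python
class FakeResponse:
    """Hypothetical stand-in for an HTTP response (not the real requests class)."""

    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content

    def raise_for_status(self):
        # Mimics the real method's contract: raise on 4xx/5xx, else do nothing.
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP error {self.status_code}")


def save_download(response, path):
    """Write the response body to path, failing early on HTTP errors."""
    # Check the status first, so a failed request is reported as an HTTP
    # error instead of surfacing later as a corrupt or unreadable file.
    response.raise_for_status()
    with open(path, "wb") as fout:
        fout.write(response.content)
    return path
```

With this ordering, a 404 never leaves an error page on disk masquerading as a model file; the caller sees the HTTP failure directly.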