Releases: explosion/spaCy
v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish
✨ New features and improvements
- Improved speeds for many components, see speed benchmarks for trained pipelines:
  - Speed up parser and NER by using constant-time head lookups (#10048).
  - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
  - Speed up parser projectivization functions (#10241).
  - Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
  - Improve `Matcher` speed (#10659).
  - Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
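The edit-tree idea behind the trainable lemmatizer can be illustrated with a small standalone sketch. This is not spaCy's implementation: the tuple-based tree layout and the helper names `build_tree` and `apply_tree` are hypothetical, and the sketch assumes the common formulation in which a tree keeps the longest common substring of form and lemma and recursively rewrites what is left on either side.

```python
from difflib import SequenceMatcher


def build_tree(form, lemma):
    """Build a simplified edit tree that rewrites `form` into `lemma`."""
    match = SequenceMatcher(None, form, lemma, autojunk=False).find_longest_match(
        0, len(form), 0, len(lemma)
    )
    if match.size == 0:
        # Nothing in common: store a plain substitution as a leaf node.
        return ("subst", form, lemma)
    # Interior node: remember how many characters of the form lie before and
    # after the shared substring, and recurse on both sides.
    prefix_len = match.a
    suffix_len = len(form) - (match.a + match.size)
    left = build_tree(form[: match.a], lemma[: match.b])
    right = build_tree(form[match.a + match.size :], lemma[match.b + match.size :])
    return ("match", prefix_len, suffix_len, left, right)


def apply_tree(tree, form):
    """Apply an edit tree to a form; return None if the tree does not fit."""
    if tree[0] == "subst":
        _, original, replacement = tree
        return replacement if form == original else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    split = len(form) - suffix_len
    new_prefix = apply_tree(left, form[:prefix_len])
    new_suffix = apply_tree(right, form[split:])
    if new_prefix is None or new_suffix is None:
        return None
    # The middle part of the form is copied unchanged.
    return new_prefix + form[prefix_len:split] + new_suffix


tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))   # talk
print(apply_tree(tree, "running"))  # None, the "ed" suffix rule does not apply
```

Because a tree learned from one pair (`walked` to `walk`) also applies to unseen forms with the same pattern (`talked`), a classifier only has to pick the right tree for each token rather than generate the lemma character by character.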
📦 Trained pipelines
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
---|---|---|
da_core_news_md | 84.9 | 94.8 |
de_core_news_md | 73.4 | 97.7 |
el_core_news_md | 56.5 | 88.9 |
fi_core_news_md | - | 86.2 |
it_core_news_md | 86.6 | 97.2 |
ko_core_news_md | - | 90.0 |
lt_core_news_md | 71.1 | 84.8 |
nb_core_news_md | 76.7 | 97.1 |
nl_core_news_md | 81.5 | 94.0 |
pl_core_news_md | 87.1 | 93.7 |
pt_core_news_md | 76.7 | 96.9 |
ro_core_news_md | 81.8 | 95.5 |
sv_core_news_md | - | 95.5 |
🔴 Bug fixes
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
🚀 Notes about upgrading from v3.2
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
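The `Tagger` speed-up in the first note relies on the unnormalized softmax support listed under #10197 above. A minimal sketch of why skipping normalization is safe at inference time: softmax preserves the ordering of its inputs, so the highest-scoring label is the same with or without it. The `softmax` and `argmax` helpers below are illustrative only and are not part of spaCy or Thinc.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for four tags of a single token.
raw_scores = [1.2, -0.3, 3.4, 0.5]

# The predicted tag is identical, so the exp/sum/divide work can be skipped
# whenever only the best label is needed.
assert argmax(raw_scores) == argmax(softmax(raw_scores))
print(argmax(raw_scores))  # 2
```

The normalized probabilities are still needed during training to compute the loss, which is why the shortcut applies to inference only.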
📖 Documentation and examples
- spaCy universe additions:
  - classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
  - Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
  - Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
  - EDS-NLP: spaCy components to extract information from clinical notes written in French.
  - HuSpaCy: Industrial-strength Hungarian natural language processing.
  - Klayers: spaCy as an AWS Lambda Layer.
  - Named Entity Recognition (NER) using spaCy (video).
  - Scrubadub: Remove personally identifiable information from text using spaCy.
  - spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
  - tmtoolkit: Text mining and topic modeling toolkit.
👥 Contributors
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996
v3.1.6: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
👥 Contributors
v3.2.4: Workaround for Click/Typer issues
🔴 Bug fixes
- Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.
👥 Contributors
v3.2.3: Fix Tok2Vec for empty batches
v3.1.5: Bug fixes for Tok2Vec, SpanCategorizer, and more
🔴 Bug fixes
- Fix issue #9593: Use metaclass to subclass errors for easier pickling.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9979: Fix type of `Lexeme.rank`.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
👥 Contributors
@adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz
v3.0.8: Fix Tok2Vec for empty batches
v3.2.2: Improved NER and parser speeds, bug fixes and more
✨ New features and improvements
- Improved `parser` and `ner` speeds on long documents (see technical details in #10019).
- Support for `spancat` components in `debug data`.
- Support for `ENT_IOB` as a `Matcher` token pattern key.
- Extended and improved types for many classes.
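As an illustration of the new `ENT_IOB` pattern key, the sketch below matches the first token of each entity. It assumes spaCy v3.2.2 or later is installed, sets the entities by hand so that no trained pipeline is needed, and uses the string value `"B"`, which relies on the string IOB support from #9738 in this release. The rule name `ENTITY_START` is arbitrary.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in San Francisco")
# Annotate entities manually; tokens outside these spans are marked as "O".
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 8, label="GPE")]

matcher = Matcher(nlp.vocab)
# "B" marks the token that begins an entity ("I" = inside, "O" = outside).
matcher.add("ENTITY_START", [[{"ENT_IOB": "B"}]])

entity_starts = [doc[start:end].text for _, start, end in matcher(doc)]
print(entity_starts)  # ['Apple', 'San']
```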
🔴 Bug fixes
- Fix issue #9735: Make floret murmurhash endian-neutral.
- Fix issue #9738: Support string IOB values for `ENT_IOB`.
- Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
- Fix issue #9960: Warn about entities that cross sentence boundaries in `debug data`.
- Fix issue #9979: Fix type for `Lexeme.rank`.
- Fix issue #10026: Check for 0-size assets in `spacy project`.
- Fix issue #10051: Consistently return scalars from similarity methods.
- Fix issue #10052: Fix spaces in `Doc.from_docs()` for empty docs.
- Fix issue #10079: Fix label detection in `debug data` for components with custom names.
- Fix issue #10109: Add types to `Underscore` and `DependencyMatcher` and improve types in `Language`, `Matcher` and `PhraseMatcher`.
- Fix issue #10130: Fix `Tokenizer.explain` when infixes appear as prefixes.
- Fix issue #10143: Use simple suggester in `spancat` initialization.
- Fix issue #10164: Support `IS_SENT_END` in `Doc.has_annotation`.
- Fix issue #10192: Detect invalid package names in `spacy package`.
- Fix issue #10223: Support mixed case in package names.
- Fix issue #10234: Fix type in `PhraseMatcher`.
📖 Documentation and examples
- Various documentation updates.
- New spaCy version tags in spaCy universe.
- New `Dockerfile` for repeatable website builds and easier local development.
- New additions to spaCy universe:
  - Augmenty: a text augmentation library
  - Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
  - spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
  - spacypdfreader: easy PDF to text to spaCy text extraction
  - textnets: text analysis with networks
👥 Contributors
@adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav
v3.2.1: doc_cleaner component, new Matcher attributes, bug fixes and more
✨ New features and improvements
- NEW: `doc_cleaner` component for removing `doc.tensor`, `doc._.trf_data` or other `Doc` attributes at the end of the pipeline to reduce size of output docs.
- NEW: `ENT_ID` and `ENT_KB_ID` to `Matcher` pattern attributes.
- Support `kb_id` for entities in displaCy from `Doc` input.
- Add `Span.sents` property for spans spanning over more than one sentence.
- Add `EntityRuler.remove` to remove patterns by `id`.
- Make the `Tagger` `neg_prefix` configurable.
- Use `Language.pipe` in `Language.evaluate` for more efficient processing.
- Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
🔴 Bug fixes
- Fix issue #9638: Make `JsonlCorpus` path optional again.
- Fix issue #9654: Fix `spancat` for empty docs and zero suggestions.
- Fix issue #9658: Improve error message for incorrect `.jsonl` paths in `EntityRuler`.
- Fix issue #9674: Fix language-specific factory handling in package CLI.
- Fix issue #9694: Convert labels to strings for README in package CLI.
- Fix issue #9697: Exclude strings from source vector checks.
- Fix issue #9701: Allow `Scorer.score_spans` to handle predicted docs with missing annotation.
- Fix issue #9722: Initialize `parser` from reference parse rather than aligned example.
- Fix issue #9764: Set annotations more efficiently in `tagger` and `morphologizer`.
📖 Documentation and examples
- Various documentation updates: `init_tok2vec` after pretraining, batch contract for listeners.
- New additions to the spaCy universe:
  - `eng-spacysentiment`: Sentiment analysis for English.
  - Applied Language Technology course: NLP for newcomers using spaCy and Stanza.
👥 Contributors
@adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar
v3.2.0: Registered scoring functions, Doc input, floret vectors and more
✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for `French` and `zh-Hans` for `Chinese`.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
For more details, see the New in v3.2 usage guide.
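A minimal sketch of passing a `Doc` to `nlp()`, assuming spaCy v3.2 or later. The words and the `user_data` entry are made up for illustration; the point is that a document built with custom tokenization can be handed to the pipeline without being re-tokenized.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Build a Doc with custom tokenization instead of passing raw text.
words = ["Hello", "world", "!", "Custom", "tokens", "here", "."]
doc = Doc(nlp.vocab, words=words)
doc.user_data["source"] = "example"  # arbitrary metadata set before processing

# The pipeline components run on the existing Doc; the tokenizer is skipped.
processed = nlp(doc)

print([token.text for token in processed] == words)  # True
print(len(list(processed.sents)))                    # 2
```

`nlp.pipe()` accepts an iterable of `Doc` objects in the same way.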
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipelines/floret_vectors_demo`: basic floret vector training and importing.
  - `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
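To give a rough idea of how floret vectors combine subwords with Bloom embeddings, the sketch below hashes each character n-gram of a word into a small shared table several times and averages the selected rows. It is a conceptual illustration only, not floret's implementation: the helper names, table size, n-gram range and the use of `hashlib.blake2b` (floret itself uses MurmurHash) are all assumptions made for the example, and the table holds random values instead of trained ones.

```python
import hashlib
import random


def char_ngrams(word, min_n=3, max_n=5):
    """The word with boundary markers plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i : i + n])
    return grams


def bucket_ids(gram, n_buckets, n_hashes):
    """Hash one n-gram into `n_hashes` rows of the table (Bloom-style)."""
    ids = []
    for seed in range(n_hashes):
        digest = hashlib.blake2b(
            gram.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        ids.append(int.from_bytes(digest, "little") % n_buckets)
    return ids


def floret_vector(word, table, n_hashes=2):
    """Average the table rows selected by all n-grams of the word."""
    dim = len(table[0])
    total = [0.0] * dim
    grams = char_ngrams(word)
    for gram in grams:
        for row in bucket_ids(gram, len(table), n_hashes):
            for k in range(dim):
                total[k] += table[row][k]
    count = len(grams) * n_hashes
    return [value / count for value in total]


# A tiny stand-in table: 50 rows of 4 dimensions with random values.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]

# Any string gets a vector, including words never seen before.
vector = floret_vector("lemmatizer", table)
print(len(vector))  # 4
```

Since every word is looked up through hashed subwords, the table can stay small and there are no out-of-vocabulary words, which is the trade-off the demo projects compare against standard vectors.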
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker
v3.1.4: Python 3.10 wheels and support for AppleOps
✨ New features and improvements
- NEW: Binary wheels for Python 3.10.
- NEW: Improve performance on Apple M1 with `AppleOps`: `pip install spacy[apple]`.
- GPU profiling with `spacy.models_with_nvtx_range.v1`.
- Full `mypy` integration in the CI and many type fixes across the code base.
- Added custom `Protocol` classes in `ty.py` to define behavior of pipeline components.
- Support for entity linking visualization in `displacy`.
- Allow overriding vars in `spacy project assets`.
- Standalone `train` function to run the training from Python scripts just like the `spacy train` CLI.
- Support for `spacy-transformers>=1.1.0` with improved IO.
- Support for `thinc>=8.0.11` with improved gradient clipping.
🔴 Bug fixes
- Fix issue #5507: Improve UX for multiprocessing on GPU.
- Fix issue #9137: Fix serialization for `KnowledgeBase.set_entities`.
- Fix issue #9244: Fix vectors for 0-length spans.
- Fix issue #9247: Improve UX for the `DocBin` constructor.
- Fix issue #9254: Allow unicode in a `spacy project` title.
- Fix issue #9263: Make added patterns consistent in the `DependencyMatcher`.
- Fix issue #9305: Restore tokenization timing during evaluation.
- Fix issue #9335: Sync vocab in vectors and sourced components.
- Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
- Fix issue #9404: Create consistent default `textcat` and `textcat_multilabel` configurations.
- Fix issue #9437: Improve UX around `Doc` object creation.
- Fix issue #9465: Fix minor issues with `convert` CLI.
- Fix issue #9500: Include `.pyi` files in the distributed package.
📖 Documentation and examples
- Various updates to the documentation.
- New additions to the spaCy universe:
  - `deplacy`: CUI-based dependency visualizer
  - `ipymarkup`: Visualizations for NER and syntax trees
  - `PhruzzMatcher`: Find fuzzy matches
  - `spacy-huggingface-hub`: Push spaCy pipelines to the Hugging Face Hub
  - `spaCyOpenTapioca`: Entity Linking on Wikidata
  - `spacy-clausie`: Clause-based information extraction system
  - "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
  - "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly
👥 Contributors
@adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker