Releases: explosion/spaCy
v3.6.0: New span finder component and pipelines for Slovenian
✨ New features and improvements
- NEW: `span_finder` pipeline component to identify overlapping, unlabeled spans (#12507).
- Language updates:
- Add option to return scores separately keyed by component name with `spacy evaluate --per-component`, `Language.evaluate(per_component=True)` and `Scorer.score(per_component=True)` (#12540).
- Support custom token/lexeme attribute for vectors (#12625).
- Support `spancat_singlelabel` in `spacy debug data` CLI (#12749).
- Typing updates for `PhraseMatcher` and `SpanGroup` (#12642, #12714).
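A minimal sketch of the per-component scoring option, assuming spaCy v3.6 or later is installed. The blank English pipeline, the `sentencizer` component and the example texts are illustrative choices rather than anything from the release itself, and the empty annotation dicts only keep the example short.

```python
import spacy
from spacy.training import Example

# Blank English pipeline with a rule-based sentencizer; no trained model needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

texts = ["This is a sentence. This is another one.", "Short text."]
# Empty annotation dicts keep the sketch small; real evaluation data
# would carry gold-standard annotations.
examples = [Example.from_dict(nlp.make_doc(text), {}) for text in texts]

# Default behavior: scores from all components are merged into one flat dict.
flat_scores = nlp.evaluate(examples)

# With per_component=True, scores are grouped under each component's name.
grouped_scores = nlp.evaluate(examples, per_component=True)

print(sorted(flat_scores))
print(sorted(grouped_scores))
```

Grouping by component name is mainly useful when two components report scores under the same key, since the flat dict can only hold one value per key.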
🔴 Bug fixes
- #12569: Require that all `SpanGroup` spans come from the current doc.
📦 Trained pipelines updates
We have added new pipelines for Slovenian that use the trainable lemmatizer and floret vectors.
Package | UPOS | Parser LAS | NER F |
---|---|---|---|
`sl_core_news_sm` | 96.9 | 82.1 | 62.9 |
`sl_core_news_md` | 97.6 | 84.3 | 73.5 |
`sl_core_news_lg` | 97.7 | 84.3 | 79.0 |
`sl_core_news_trf` | 99.0 | 91.7 | 90.0 |
- 🙏 Special thanks to @orglce for help with the new pipelines!
The English pipelines have been updated to improve handling of contractions with various apostrophes and to lemmatize "get" as a passive auxiliary.
The Danish pipeline `da_core_news_trf` has been updated to use `vesteinn/DanskBERT` with performance improvements across the board.
⚠️ Backwards incompatibilities
`SpanGroup` spans are now required to be from the same doc. When initializing a `SpanGroup`, there is a new check to verify that all added spans refer to the current doc. Without this check, it was possible to run into string store or other errors.
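The new check can be illustrated with a short sketch, assuming spaCy v3.6 or later. The two example docs and the group names are made up for illustration, and catching `ValueError` reflects how spaCy usually reports this kind of problem rather than something the notes above state.

```python
import spacy
from spacy.tokens import SpanGroup

nlp = spacy.blank("en")
doc_a = nlp("Alice visited Paris.")
doc_b = nlp("Bob stayed home.")

# Spans taken from the same doc as the group are accepted.
group = SpanGroup(doc_a, name="places", spans=[doc_a[2:3]])

# A span from a different doc is rejected when the group is created.
try:
    SpanGroup(doc_a, name="mixed", spans=[doc_b[0:1]])
    raised = False
except ValueError as err:
    raised = True
    print(f"Rejected: {err}")
```

Code that previously collected spans from several docs into one group needs to build one `SpanGroup` per doc instead.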
📖 Documentation and examples
- Various documentation corrections and updates.
- New additions to spaCy Universe:
👥 Contributors
@adrianeboyd, @bdura, @danieldk, @davidberenstein1957, @diyclassics, @essenmitsosse, @honnibal, @ines, @isabelizimm, @jmyerston, @kadarakos, @KennethEnevoldsen, @khursani8, @ljvmiranda921, @rmitsch, @shadeMe, @svlandeg, @tomaarsen, @victorialslocum, @vin-ivar, @ZiadAmerr
v3.5.4: Bug fixes for overrides with registered functions and sourced components with listeners
v3.3.3: Bug fixes for Pydantic and pip
This bug fix release is primarily to address Pydantic incompatibility with `typing_extensions>=4.6.0`.
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
🔴 Bug fixes
- Add `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- Remove `#egg` from download URLs due to future deprecation in `pip`.
v3.2.6: Bug fixes for Pydantic and pip
This bug fix release is primarily to address Pydantic incompatibility with `typing_extensions>=4.6.0`.
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
🔴 Bug fixes
- Add `typing_extensions` requirement due to Pydantic incompatibility with `typing_extensions>=4.6.0`.
- Remove `#egg` from download URLs due to future deprecation in `pip`.
v3.5.3: Speed improvements, bug fixes and more
✨ New features and improvements
- Huge speed improvements for `spancat`, in particular on GPU (~10x-30x faster) (#12577).
- Improve speed for child operators (`>+`, `>-`, `>++`, `>--`) for the dependency matcher (#12528).
- Improve loading speed for tokenizers with a large number of exceptions (#12553).
- Support `doc.spans` for displaCy output in `spacy benchmark accuracy`/`spacy evaluate` (#12575).
- Add `MorphAnalysis.get(default=)` argument for user-provided default values similar to `dict` (#12545).
- Only perform vectors checks during initialization if there are sourced components (#12607).
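The `default` argument of `MorphAnalysis.get` can be sketched as follows, assuming spaCy v3.5.3 or later. The one-token doc and the `"Unknown"` placeholder value are illustrative only.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a one-token doc with a hand-written morphological analysis.
doc = Doc(Vocab(), words=["cats"], morphs=["Number=Plur"])
morph = doc[0].morph

# A field that is present returns its list of values.
number = morph.get("Number")

# A missing field returns an empty list unless a default is supplied.
case_without_default = morph.get("Case")
case_with_default = morph.get("Case", default=["Unknown"])

print(number, case_without_default, case_with_default)
```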
🔴 Bug fixes
- #12567: Remove `#egg` from download URLs due to future deprecation in `pip`.
📖 Documentation and examples
- Various documentation corrections and updates.
- New additions to spaCy Universe:
👥 Contributors
@adrianeboyd, @andyjessen, @bdura, @davidberenstein1957, @diyclassics, @honnibal, @ines, @kadarakos, @KennethEnevoldsen, @ljvmiranda921, @moxley01, @royashcenazi, @svlandeg, @tanloong, @victorialslocum
v3.5.2: Pretraining improvements, bug fixes for spans and spancat and more
✨ New features and improvements
- Add support for floret vectors in `spacy pretrain` (#12435).
- Save final model as `model-last.bin` for `spacy pretrain` (#12459).
- Support `Span` input for `displacy.parse_deps` (#12477).
- Extend support to CuPy 12.0 for `cupy` install extras.
🔴 Bug fixes
- #12398: Fix entity linker failure on sentence-crossing entities.
- #12405: Fix sentence indexing bug in `Span.sents`.
- #12469: Fix scores attribute for `spancat_singlelabel`.
- #12484: Fix `Span.sents` when the final sentence is the last token in a `Doc`.
- #12486: Fix pickle for the ngram suggester.
- #12493: Include `Span.kb_id` and `Span.id` strings in `Doc` and `DocBin` serialization.
📖 Documentation and examples
- Various documentation corrections and updates.
- New addition to spaCy Universe:
👥 Contributors
@adrianeboyd, @BLKSerene, @honnibal, @ines, @kadarakos, @prajakta-1527, @rmitsch, @shadeMe, @sloev, @svlandeg, @thomashacker, @willfrey
v3.5.1: spancat for multi-class labeling, fixes for textcat+transformers and more
✨ New features and improvements
- NEW: `spancat_singlelabel` pipeline component for multi-class and non-overlapping span classification. The `spancat_singlelabel` component predicts at most one label for each suggested span and adds a new setting `allow_overlap` to restrict the output to non-overlapping spans (#11365).
- Extend to mypy v1.0 (#12245).
- Use `transformer` + CNN for efficient GPU `textcat` with `spacy init config` (#11900).
- Support trainable lemmatizer in `spacy debug data` (#11419).
- Add new operators to dependency matcher for left/right immediate child/parent nodes (`>+`, `>-`, `<+`, `<-`) (#12334).
- Add `spacy.PlainTextCorpusReader.v1` for plain text input (#12122).
- Add `alignment_mode` and `span_id` to `Span.char_span()` (#12145, #12196).
- Use string formatting types in logging calls (#12215).
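The immediate-child operators for the dependency matcher can be tried on a hand-built parse, as in this sketch that assumes spaCy v3.5.1 or later. The three-word sentence, its heads and dependency labels, and the node names in the pattern are all invented for illustration.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated parse: "ate" is the root, with one child on each side.
doc = Doc(
    vocab,
    words=["She", "ate", "pizza"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

matcher = DependencyMatcher(vocab)
pattern = [
    # Anchor node: the verb.
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "ate"}},
    # ">+": a child of the verb that directly follows it.
    {"LEFT_ID": "verb", "REL_OP": ">+", "RIGHT_ID": "right_child",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
    # ">-": a child of the verb that directly precedes it.
    {"LEFT_ID": "verb", "REL_OP": ">-", "RIGHT_ID": "left_child",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("VERB_WITH_ADJACENT_CHILDREN", [pattern])

# Each match lists token indices in the order of the pattern nodes.
token_ids = [ids for _, ids in matcher(doc)]
print(token_ids)
```

The `<+` and `<-` operators work the same way in the other direction, relating a token to a parent that directly follows or precedes it.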
🔴 Bug fixes
- #12017: Improve speed for `top_k>1` in trainable lemmatizer.
- #12048: Make `test_cli_find_threshold()` test more robust.
- #12227: Fix return type of `registry.find()`.
- #12272: Fix speed regression for `Matcher` patterns with extension attributes.
- #12287: Add `grc` to languages with lexeme norms in `spacy-lookups-data`.
- #12320: Make generation of empty `KnowledgeBase` instances configurable.
- #12343: Fix error message for displacy `auto_select_port`.
- #12347: Fix length check for knowledge base in entity linker, add `InMemoryLookupKB.is_empty`.
- #12365: Fix types for `Lexeme.orth` and `Lexeme.lower`.
- #12366: Raise error for non-default vectors with `PretrainVectors`.
- #12368: Partially address pending deprecation of `pkg_resources`.
- Various improvements and fixes for the test suite (#12148, #12157, #12210, #12303, #12372).
📖 Documentation and examples
- Many website updates to improve accessibility.
- Various documentation corrections and updates.
- New projects:
  - Span labeling datasets
  - Comparing embedding layers in spaCy from the technical report Multi hash embeddings in spaCy
👥 Contributors
@adrianeboyd, @andyjessen, @danieldk, @essenmitsosse, @honnibal, @ines, @itssimon, @kadarakos, @kwhumphreys, @ljvmiranda921, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @shadeMe, @svlandeg, @tanloong, @thomashacker, @victorialslocum
v3.5.0: New CLI commands, language updates, bug fixes and much more
✨ New features and improvements
- NEW: New `apply` CLI command to annotate new documents with a trained pipeline (#11376).
- NEW: New `benchmark` CLI command to benchmark pipelines. The new `benchmark speed` subcommand measures the speed of a pipeline; the `benchmark accuracy` subcommand is a new alias for `evaluate` (#11902).
- NEW: New `find-threshold` CLI command to identify an optimal threshold for classification models (#11280).
- NEW: New `FUZZY` `Matcher` operator for fuzzy matches based on Levenshtein edit distance. In addition, the `FUZZY` and `REGEX` operators are now supported in combination with `IN`/`NOT_IN` (#11359).
- Language updates for Ancient Greek, Dutch, Russian, Slovenian and Ukrainian (#11345, #11162, #11426, #11753, #11811, #11997, more details below).
- Allow up to `typer` v0.7.x (#11720), `mypy` 0.990 (#11801) and `typing_extensions` v4.4.x (#12036).
- New `spacy.ConsoleLogger.v3` with expanded progress tracking (#11972).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` (#11696 and #11971) and `spacy.textcat_multilabel_scorer.v2` (#11820).
- Improved customizability of the knowledge base used for entity linking, with the default implementation being the new `InMemoryLookupKB` (#11268).
- Optional `before_update` callback that is invoked at the start of each training step (#11739).
- Improve performance of `SpanGroup` (#11380).
- Improve UX around `displacy.serve` when the default port is in use (#11948).
- Patch a security vulnerability in extracting tar files (#11746).
- Add equality definition for vectors (#11806).
- Allow interpolation of variables in directory names in projects (#11235).
- Update default component configs to use the latest `tok2vec` version (#11618).
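A small sketch of the `FUZZY` operator, on its own and combined with `IN`, assuming spaCy v3.5 or later. The misspelled sentence and the pattern names are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# FUZZY on its own: match tokens within a small edit distance of one string.
matcher.add("DEFINITELY", [[{"LOWER": {"FUZZY": "definitely"}}]])

# FUZZY combined with IN: match tokens close to any string in the list.
matcher.add("PRAISE", [[{"LOWER": {"FUZZY": {"IN": ["awesome", "wonderful"]}}}]])

doc = nlp("That is definately an awsome idea")
matched = {
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
}
print(matched)
```

Both misspellings are one edit away from their target strings, so each pattern matches exactly one token here.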
🔴 Bug fixes
- #11382: Fix lookup behavior for the French and Catalan lemmatizers.
- #11385: Ensure that downstream components can train properly on a frozen `tok2vec` or `transformer` layer.
- #11762: Support local file system remotes for projects.
- #11763: Raise an error when unsupported values are used for `textcat`.
- #11834: Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and `vectors`.
- #12009: Fix a few typing issues for `SpanGroup` and `Span` objects.
- #12098: Correctly handle missing annotations in the edit tree lemmatizer.
⚠️ Backwards incompatibilities and model updates
The following changes may require you to update code that is using the relevant functionality:
- An error is now raised when unsupported values are given as input to train a `textcat` or `textcat_multilabel` model. Ensure that values are 0.0 or 1.0, as explained in the docs.
- As `KnowledgeBase` is now an abstract class, you should call the constructor of the new `InMemoryLookupKB` instead when you want to use spaCy's default KB implementation. If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to implement its abstract methods, or alternatively inherit from `InMemoryLookupKB` instead.
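For code that used to construct `KnowledgeBase` directly, the switch to `InMemoryLookupKB` might look like the sketch below, assuming spaCy v3.5 or later. The entity ID, alias, frequency, probability and 3-dimensional vector are placeholder values.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")

# Before v3.5 this would have been KnowledgeBase(vocab=..., entity_vector_length=...).
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register one entity and one alias that points to it.
kb.add_entity(entity="Q42", freq=10, entity_vector=[1.0, 0.0, 0.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.9])

# Look up the candidate entities for the alias.
candidate_ids = [cand.entity_ for cand in kb.get_alias_candidates("Douglas")]
print(kb.get_entity_strings(), candidate_ids)
```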
The following changes may influence the output of your language pipeline or trained models:
- Updates to language defaults:
- Updates to model defaults:
  - Use the latest `tok2vec` defaults in all components (#11618).
  - Improve the default attributes used for the `textcat` and `textcat_multilabel` components (#11698).
  - Update the default scorer for `textcat` and `textcat_multilabel` to fix a bug related to `threshold` for `textcat` and to make it possible to score multiple `textcat`/`textcat_multilabel` components in a single pipeline with custom scorers. If no custom scorers are used, the `cat_p/r/f` scores will now only reflect the final component's labels and performance (#11696, #11820).
  - Correct the `token_acc` score to report the intended measure (`# correct tokens / # predicted tokens`, the same as in spaCy v2). The `token_acc` scores for v3.5 will be lower for the same performance because they were incorrectly inflated in v3.0-v3.4. The `token_p/r/f` scores should remain unchanged (#12073).
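The intended `token_acc` measure, correct tokens divided by predicted tokens, can be worked through by hand. The sketch below is a simplified pure-Python stand-in, not spaCy's scorer: it assumes a predicted token counts as correct when its character offsets coincide with a gold token, and the helper names `token_offsets` and `token_accuracy` are hypothetical.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]


def token_offsets(tokens: List[str]) -> List[Offsets]:
    """Character offsets of each token, ignoring whitespace between tokens."""
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token)
    return offsets


def token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Number of correct predicted tokens divided by number of predicted tokens."""
    gold_offsets = set(token_offsets(gold))
    pred_offsets = token_offsets(predicted)
    # Assumption: a predicted token is correct if a gold token has the same offsets.
    correct = sum(1 for span in pred_offsets if span in gold_offsets)
    return correct / len(pred_offsets)


gold = ["do", "n't", "stop"]
predicted = ["don't", "stop"]  # the tokenizer failed to split the contraction
accuracy = token_accuracy(predicted, gold)
print(accuracy)  # 0.5: one of the two predicted tokens matches a gold token
```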
The following functionality will be changed in the near future - so it's best to start updating your scripts now to make them more generic:
- From v4 onwards, we'll rename the `master` branch to `main`.
📦 Trained pipelines updates
- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and `morphologizer` components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the exact alignment from `tokenizers` for fast tokenizers instead of the heuristic alignment from `spacy-alignments`. For all trained pipelines except `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens may be slightly different. See the v1.2.0 release notes for more details about the `spacy-transformers` changes.
📖 Documentation and examples
- We've ported our website from Gatsby to Next 🥳
- Updated the documentation on supported languages.
- Added a note about experimental M1 GPU support to the installation quickstart.
- Included documentation for the `biluo_to_iob` and `iob_to_biluo` functions.
- Fixed model links in the v3.4 usage documentation.
- Removed "new" tags of functionality from spaCy v2.x.
- Various small additions, spelling and typo fixes.
- spaCy Universe additions:
  - greCy: Providing Ancient Greek models
  - spacy-pythainlp: Add Thai support for spaCy
- New projects:
  - Accelerate NER with Speedster (experimental)
👥 Contributors
@aaronzipp, @adrianeboyd, @albertvillanova, @ArchiDevil, @cfuerbachersparks, @damian-romero, @danieldk, @darigovresearch, @DSLituiev, @essenmitsosse, @gremur, @honnibal, @ines, @jmyerston, @JosPolfliet, @kadarakos, @koaning, @kwhumphreys, @ljvmiranda921, @MarcoGorelli, @orglce, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @ryndaniels, @shadeMe, @svlandeg, @thomashacker, @TrellixVulnTeam, @wannaphong, @zhiiw, @zrpxx
v3.0.9: Bug fixes and future NumPy compatibility
This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.
🔴 Bug fixes
- #11331, #11701: Clean up warnings in spaCy and its test suite.
- #11845: Don't raise an error in displaCy for unset spans keys.
- #11864: Add `smart_open` requirement and update deprecated options.
- #11899: Fix `spacy init config --gpu` for environments without `spacy-transformers`.
- #11933: Update for compatibility with NumPy v1.24+ integer conversions.
- #11935: Restore missing error messages for beam search.
v2.3.9: Compatibility with NumPy v1.24+
This release addresses future compatibility with NumPy v1.24+.
🔴 Bug fixes
- #11940: Update for compatibility with NumPy v1.24+ integer conversions.