This repository provides a PyTorch implementation of the paper "Filtered Semi-Markov CRF," published in the Findings of EMNLP 2023. It focuses on implementing span-based Named Entity Recognition (NER) models that feature structured training and decoding.
The repository includes several training and decoding algorithms for span-based NER.
Here are the different training algorithms that have been implemented:
- Filtered Semi-Markov CRF: Uses global span selection and adds label-dependent scoring and transition scores. It is essentially a filtered version of the original semi-CRF algorithm.
- Standard Span-Based NER with Local Objective: The baseline algorithm for training the span-based NER model. Our paper contains details about this.
- Global Span Selection: An implementation based on the model from Zaratiana et al., 2022b.
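To make the local objective concrete, here is a minimal pure-Python sketch of a span-level negative log-likelihood: every candidate span is classified independently, and the per-span cross-entropy terms are summed. The function names, the `max_width` limit, and the convention that label index 0 means "not an entity" are assumptions made for illustration; they are not taken from this repository's code, which is written in PyTorch.

```python
import math

def enumerate_spans(num_tokens, max_width):
    """All (start, end) pairs with an inclusive end and at most max_width tokens."""
    return [(i, j)
            for i in range(num_tokens)
            for j in range(i, min(i + max_width, num_tokens))]

def span_nll(logits, gold_labels):
    """Sum of independent per-span cross-entropy terms.

    logits: dict mapping span -> list of label scores (index 0 = non-entity, assumed).
    gold_labels: dict mapping gold span -> label index; unlisted spans get label 0.
    """
    total = 0.0
    for span, scores in logits.items():
        gold = gold_labels.get(span, 0)
        # Numerically stable log-sum-exp over the label scores of this span.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[gold]
    return total
```

Because each span contributes its own term, nothing in this loss prevents overlapping spans from being predicted together, which is why a separate decoding step is needed.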
The implemented decoding algorithms aim to return non-overlapping spans. The following algorithms are available:
- Greedy Decoding: Selects spans in descending order of score, keeping each span that does not overlap the spans already chosen.
- Exact Decoding: Returns the set of non-overlapping spans with the highest sum of scores.
- Exhaustive Search: Uses an arbitrary scoring function to return the set of non-overlapping spans with the maximum score.
  - This was proposed in Zaratiana et al., 2022b.
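The three decoding strategies can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's implementation: spans are represented as hypothetical `(start, end, score)` tuples with inclusive token indices, labels are omitted, and the exact decoder is written as a simple weighted interval scheduling dynamic program.

```python
from itertools import combinations

# Assumed span layout: (start, end, score) with inclusive token indices.

def overlaps(a, b):
    """True if two spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]

def greedy_decode(spans):
    """Take spans in descending score order, skipping any that overlap a chosen span."""
    chosen = []
    for span in sorted(spans, key=lambda s: s[2], reverse=True):
        if all(not overlaps(span, c) for c in chosen):
            chosen.append(span)
    return sorted(chosen)

def exact_decode(spans):
    """Maximize the sum of scores over non-overlapping spans with dynamic programming."""
    spans = sorted(spans, key=lambda s: s[1])  # sort by end index
    # best[i] = (total score, chosen spans) using only the first i spans
    best = [(0.0, [])]
    for i, span in enumerate(spans):
        # j = number of earlier spans that end before this span starts
        j = i
        while j > 0 and spans[j - 1][1] >= span[0]:
            j -= 1
        take = (best[j][0] + span[2], best[j][1] + [span])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return sorted(best[-1][1])

def exhaustive_decode(spans, score_fn):
    """Try every non-overlapping subset and return the one that maximizes score_fn."""
    best_set, best_score = [], float("-inf")
    for r in range(1, len(spans) + 1):
        for subset in combinations(spans, r):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue
            score = score_fn(subset)
            if score > best_score:
                best_set, best_score = list(subset), score
    return sorted(best_set)

spans = [(0, 2, 1.0), (0, 0, 0.6), (1, 2, 0.6)]
print(greedy_decode(spans))  # [(0, 2, 1.0)]
print(exact_decode(spans))   # [(0, 0, 0.6), (1, 2, 0.6)]
```

In this example the greedy decoder commits to the single best span, while the exact decoder finds two smaller spans whose scores sum higher. Passing a mean as `score_fn` to `exhaustive_decode` selects `[(0, 2, 1.0)]` instead, which shows how the choice of scoring function changes the result. The exhaustive search is exponential in the number of candidate spans, so it is only practical on small candidate lists.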
To configure the model and decoding algorithm, modify the configuration file (see config/conll.yaml as an example) as described below:
Named Entity Recognition as Structured Span Prediction (Zaratiana et al., UM-IoS 2022a):
model_type: "standard"
decoding: "greedy"  # or "global" or "global_mean"
Filtered Semi-Markov CRF (Zaratiana et al., EMNLP 2023):
model_type: "fsemicrf"
decoding: "global"
Global Span Selection for Named Entity Recognition (Zaratiana et al., UM-IoS 2022b):
model_type: "gss"
decoding: "global"
- Options for the decoding parameter:
  - 'global': maximize sum of span scores
  - 'global_mean': maximize average of span scores
  - 'greedy': greedy span selection
- Options for the model_type parameter:
  - 'standard': Standard Span-Based NER loss (span-level NLL)
  - 'fsemicrf': Filtered Semi-Markov CRF loss
  - 'gss': Global Span Selection loss
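The combinations listed above can be checked before launching a run. The helper below is a hypothetical sketch and not part of this repository; it assumes the configuration has already been loaded into a dictionary, and it treats the pairings shown in this README as the only valid ones, which is an assumption.

```python
# Pairings listed in this README; treating them as exhaustive is an assumption.
SUPPORTED = {
    "standard": {"greedy", "global", "global_mean"},
    "fsemicrf": {"global"},
    "gss": {"global"},
}

def check_config(config):
    """Raise ValueError if model_type/decoding is not a pairing listed above."""
    model_type = config.get("model_type")
    decoding = config.get("decoding")
    if model_type not in SUPPORTED:
        raise ValueError(f"unknown model_type: {model_type!r}")
    if decoding not in SUPPORTED[model_type]:
        raise ValueError(
            f"decoding {decoding!r} is not listed for model_type {model_type!r}"
        )
    return config

check_config({"model_type": "fsemicrf", "decoding": "global"})
```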
If you find this code useful in your research, please consider citing our papers:
@misc{zaratiana2023filtered,
title={Filtered Semi-Markov CRF},
author={Urchade Zaratiana and Nadi Tomeh and Niama El Khbir and Pierre Holat and Thierry Charnois},
year={2023},
eprint={2311.18028},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{zaratiana-etal-2022-global,
title = "Global Span Selection for Named Entity Recognition",
author = "Zaratiana, Urchade and
Elkhbir, Niama and
Holat, Pierre and
Tomeh, Nadi and
Charnois, Thierry",
booktitle = "Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.umios-1.2",
doi = "10.18653/v1/2022.umios-1.2",
pages = "11--17",
abstract = "Named Entity Recognition (NER) is an important task in Natural Language Processing with applications in many domains. In this paper, we describe a novel approach to named entity recognition, in which we output a set of spans (i.e., segmentations) by maximizing a global score. During training, we optimize our model by maximizing the probability of the gold segmentation. During inference, we use dynamic programming to select the best segmentation under a linear time complexity. We prove that our approach outperforms CRF and semi-CRF models for Named Entity Recognition. We will make our code publicly available.",
}
@inproceedings{zaratiana-etal-2022-named,
title = "Named Entity Recognition as Structured Span Prediction",
author = "Zaratiana, Urchade and
Tomeh, Nadi and
Holat, Pierre and
Charnois, Thierry",
booktitle = "Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.umios-1.1",
doi = "10.18653/v1/2022.umios-1.1",
pages = "1--10",
abstract = "Named Entity Recognition (NER) is an important task in Natural Language Processing with applications in many domains. While the dominant paradigm of NER is sequence labelling, span-based approaches have become very popular in recent times but are less well understood. In this work, we study different aspects of span-based NER, namely the span representation, learning strategy, and decoding algorithms to avoid span overlap. We also propose an exact algorithm that efficiently finds the set of non-overlapping spans that maximizes a global score, given a list of candidate spans. We performed our study on three benchmark NER datasets from different domains. We make our code publicly available at \url{https://github.com/urchade/span-structured-prediction}.",
}