
Scientific and Technical Journal of Information Technologies, Mechanics and Optics


RuLegalNER: a new dataset for Russian legal named entities recognition

https://doi.org/10.17586/2226-1494-2023-23-4-854-857

Abstract

We address the scarcity of datasets specifically tailored to legal named entity recognition (NER) in Russian and investigate how well models generalize to unseen named entities. A rule-based program developed by legal experts at Tag-Consulting Company was used to annotate legal texts automatically and create the RuLegalNER dataset. Some of the named entities occur only in the development and test splits and are absent from the training set. RuBERT served as the base architecture for the experimental evaluation, with two architectural extensions explored: RuBERT with a conditional random field (CRF) layer and RuBERT with adapters. Both architectures were used to train and evaluate NER models on RuLegalNER. The dataset can be used to train and evaluate legal NER models, to improve performance in the legal domain, and to study generalization to unseen entities. A published version of RuLegalNER is presented together with detailed statistics, and its usefulness is demonstrated by evaluating modern architectures.
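The idea of entities that appear only in the development and test splits can be illustrated with a minimal sketch. It assumes BIO-tagged sentences stored as lists of (token, tag) pairs. The abstract does not describe the actual RuLegalNER file format or label set, so the function names, the `ORG` label, and the toy sentences below are hypothetical.

```python
from typing import List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs; assumed format


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Return (surface text, entity type) pairs from one BIO-tagged sentence."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current_type = None
    # A trailing sentinel with tag "O" flushes an entity that ends the sentence.
    for token, tag in sentence + [("", "O")]:
        if tag.startswith("I-") and current_type == tag[2:]:
            tokens.append(token)  # continue the current entity
            continue
        if tokens:
            entities.append((" ".join(tokens), current_type))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Start a new entity (a stray I- tag is treated leniently as a start).
            tokens, current_type = [token], tag[2:]
        else:
            tokens, current_type = [], None
    return entities


def unseen_entities(train: List[Sentence], evaluation: List[Sentence]) -> Set[Tuple[str, str]]:
    """Entities found in the evaluation split that never occur in training."""
    seen = {e for s in train for e in extract_entities(s)}
    return {e for s in evaluation for e in extract_entities(s)} - seen


# Hypothetical toy data, not taken from RuLegalNER.
train_split = [[("Romashka", "B-ORG"), ("LLC", "I-ORG"), ("filed", "O"), ("a", "O"), ("claim", "O")]]
test_split = [[("Romashka", "B-ORG"), ("LLC", "I-ORG"), ("sued", "O"), ("Vector", "B-ORG"), ("LLC", "I-ORG")]]

print(unseen_entities(train_split, test_split))
```

Reporting scores separately for the seen and unseen subsets of an evaluation split is one way to study generalization; whether the authors computed their results this way is not stated in the abstract.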

About the Authors

Z. Shaheen
ITMO University
Russian Federation

Zein Shaheen — PhD Student

Scopus Author ID: 57209279132

Saint Petersburg, 197101



D. I. Mouromtsev
ITMO University
Russian Federation

Dmitry I. Mouromtsev — PhD, Associate Professor

Scopus Author ID: 55575780100

Saint Petersburg, 197101



I. Postny
T.A.G. Consulting
Russian Federation

Ignat Postny — Director

Moscow, 119119





For citations:


Shaheen Z., Mouromtsev D.I., Postny I. RuLegalNER: a new dataset for Russian legal named entities recognition. Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2023;23(4):854-857. https://doi.org/10.17586/2226-1494-2023-23-4-854-857



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2226-1494 (Print)
ISSN 2500-0373 (Online)