SYN2020: A New Corpus of Czech with an Innovated Annotation

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

The paper introduces the SYN2020 corpus, a newly released representative corpus of written Czech that follows the tradition of the Czech National Corpus SYN series. The design of SYN2020 incorporates several substantial new features in segmentation, lemmatization and morphological tagging, such as a new treatment of lemma variants, a new system for identifying morphological categories of verbs, and a new treatment of multiword tokens. The paper describes the annotation process, including the data and tools used, and discusses the accuracy of the resulting annotation.

Notes

1. https://www.korpus.cz.
2. Here we largely follow the list of categories recently introduced for the morphological dictionary MorfFlex (see [7, 15]), which was used in our annotation process; see Sect. 3.1.
3. Prague Dependency Treebank.
4. The notation for glosses and abbreviations adheres to the Leipzig Glossing Rules [4], http://www.eva.mpg.de/lingua/resources/glossing-rules.php.
5. https://dumps.wikimedia.org/cswiki.
6. http://hdl.handle.net/11234/1-3698.
7. For amalgamated forms (see Sect. 2.3), the values were calculated on their multiword representations, i.e. before their reamalgamation.

References

1. Bański, P., Przepiórkowski, A.: Stand-off TEI annotation: the case of the National Corpus of Polish. In: Proceedings of the Third Linguistic Annotation Workshop (LAW III), pp. 64–67 (2009)
2. Bejček, E., Panevová, J., Popelka, J., Straňák, P., Ševčíková, M., Štěpánek, J., Žabokrtský, Z.: Prague Dependency Treebank 2.5 – a revisited version of PDT 2.0. In: Proceedings of COLING 2012, pp. 231–246 (2012)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
4. Comrie, B., Haspelmath, M., Bickel, B.: The Leipzig Glossing Rules: Conventions for Interlinear Morpheme-by-Morpheme Glosses. Max Planck Institute for Evolutionary Anthropology (2008)
5. Goláňová, H., et al.: Novočeský lexikální archiv a excerpce v průběhu let 1911–2011. Slovo a slovesnost 72(4), 287–300 (2011)
6. Hajič, J.: Disambiguation of rich inflection: computational morphology of Czech. Karolinum (2004)
7. Hajič, J., et al.: Prague dependency treebank-consolidated 1.0. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 5208–5218. ELRA, Marseille, France (2020)
8. Jelínek, T.: FicTree: a manually annotated treebank of Czech fiction. In: Hlaváčová, J. (ed.) ITAT 2017 Proceedings, pp. 181–185 (2017)
9. Jelínek, T., Petkevič, V.: Systém jazykového značkování současné psané češtiny. Korpusová lingvistika Praha 2011, sv. 3: Gramatika a značkování korpusů, pp. 154–170 (2011)
10. Ling, W., Dyer, C., Black, A., Trancoso, I.: Two/too simple adaptations of word2vec for syntax problems. In: Proceedings of the 2015 Conference of the North American Chapter of ACL: Human Language Technologies. ACL (2015)
11. Martins, A., Almeida, M., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Annual Meeting of the ACL, pp. 617–622, August 2013
12. Nivre, J., et al.: Universal dependencies v2: an evergrowing multilingual treebank collection (2020)
13. Petkevič, V.: Reliable morphological disambiguation of Czech: rule-based approach is necessary. Insight into the Slovak and Czech corpus linguistics, pp. 26–44 (2006)
14. Petkevič, V., et al.: Problémy automatické morfologické disambiguace češtiny. Naše řeč 4–5, 194–207 (2014)
15. Štěpánková, B., Mikulová, M., Hajič, J.: The MorfFlex Dictionary of Czech as a Source of Linguistic Data. In: Euralex XIX Proceedings Book: Lexicography for inclusion, pp. 387–391 (2020)
16. Straka, M., Straková, J., Hajič, J.: Czech text processing with contextual embeddings: POS tagging, lemmatization, parsing and NER. In: Ekštein, K. (ed.) TSD 2019. LNCS (LNAI), vol. 11697, pp. 137–150. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27947-9_12
17. Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of 52nd Annual Meeting of ACL: System Demonstrations, pp. 13–18 (2014)
18. Votrubec, J.: Morphological tagging based on averaged perceptron. In: WDS 2006 Proceedings of Contributed Papers, pp. 191–195. Matfyzpress, Charles University, Praha, Czechia (2006)
19. Ma, X., Hu, Z., Liu, J., Peng, N., Neubig, G., Hovy, E.H.: Stack-pointer networks for dependency parsing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 1403–1414. ACL, Melbourne, Australia (2018)

Acknowledgements

This paper and the creation of the corpus SYN2020 have been supported by the Ministry of Education of the Czech Republic, through the project Czech National Corpus, no. LM2018137.

Author information

Authors and Affiliations

Faculty of Arts, Charles University, Prague, Czech Republic
Tomáš Jelínek, Jan Křivan, Vladimír Petkevič, Hana Skoumalová & Jana Šindlerová

Corresponding author

Correspondence to Tomáš Jelínek.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein, František Pártl & Miloslav Konopík

About this paper

Cite this paper

Jelínek, T., Křivan, J., Petkevič, V., Skoumalová, H., Šindlerová, J. (2021). SYN2020: A New Corpus of Czech with an Innovated Annotation. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_4

DOI: https://doi.org/10.1007/978-3-030-83527-9_4
Published: 30 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)
