Abstract
The aim of this study is to automatically extract academic phrases in Czech using data-mining techniques as a first step towards creating a dictionary of academic words and phrases for university-level students (L1 and L2). The decision to use data mining was based on its excellent results in the automatic recognition of single-word and multi-word terms [10]. This method has identified various types of academic phrases: structurally incomplete lexical bundles with specific functions in texts (e.g. na druhou stranu – on the other hand), collocations (e.g. podrobná analýza – detailed analysis) and combinations of a content word and a typical function word (e.g. zaměřený na – focused on; podobný jako – similar to). The final list of automatically identified academic phrases is quite extensive and consists of 7,300 bigrams. Manual evaluation of a sample of the output data showed that the precision of the automatic identification method is more than 72% and its recall is 81%. The list of identified academic phrases is a very good starting point for the planned dictionary because the majority of the extracted bigrams are collocations typical of academic texts. Such collocations are useful for the target audience, that is, university students interested in academic writing.
This paper has been, in part, funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures (Czech National Corpus project, LM2015044). It was also supported by the European Regional Development Fund-Project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (No. CZ.02.1.01/0.0/0.0/16_019/0000734).
Notes
- 1.
Data mining is a discipline that partially overlaps with machine learning. In this study, we choose to use data mining terminology because we are searching for useful information in vast amounts of language data. However, we recognize that this terminological preference is just a matter of point of view.
- 2.
By collocation we mean a meaningful combination of frequently co-occurring words, cf. e.g. McEnery and Hardie [13].
- 3.
As these characteristics are mostly based on frequency or distribution, it can be assumed that similar values will allow automatic identification of academic phrases in other languages, e.g. English.
- 4.
We use only lemmas because they proved more effective in previous research [10], owing to the rich morphology of the Czech language.
- 5.
We used J48 decision tree models. These were selected as the most suitable method for automatic extraction of terms and non-terms in Kováříková [10]. Different pre-processing was chosen for each model: (1) unbalanced classes (phrases and non-phrases), (2) class balancer, and (3) resampling.
- 6.
With the exception of the study that launched our interest in an academic phrase list [11].
- 7.
Currently, there is only one such list for the Czech language, Akalex, which is limited in terms of size and completeness, cf. www.korpus.cz/akalex [2].
- 8.
Precision is the fraction of relevant instances among the retrieved instances; recall is the fraction of all relevant instances that have been retrieved.
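The definitions of precision and recall in the last note can be illustrated with a minimal sketch. The function name `precision_recall` and the toy data are assumptions made for illustration only. They are not the evaluation sample or the code used in the study, and the "noise" bigrams are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of items.

    retrieved: items returned by the automatic method
    relevant:  items judged to be true academic phrases
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = retrieved & relevant
    # Precision: share of retrieved items that are relevant.
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: bigrams proposed by an extractor ...
retrieved = ["podrobná analýza", "zaměřený na", "podobný jako", "velký dům"]
# ... and bigrams a manual annotator accepted as academic phrases.
relevant = [
    "podrobná analýza",
    "zaměřený na",
    "podobný jako",
    "na základě",
    "v rámci",
]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case three of the four retrieved bigrams are relevant and three of the five relevant bigrams are retrieved, so the sketch reports a precision of 0.75 and a recall of 0.60.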
References
Ackermann, K., Chen, Y.-H.: Developing the academic collocation list (ACL) – a corpus-driven and expert-judged approach. J. Engl. Acad. Purp. 12(4), 235–247 (2013)
Akalex 2018: Lexikon akademické češtiny. Akalex 2018: A Lexicon of Academic Czech (in Czech) (2018). https://korpus.cz/akalex. Accessed 15 May 2019
Biber, D., Barbieri, F.: Lexical bundles in university spoken and written registers. Engl. Specif. Purp. 26, 263–286 (2007)
Chen, Y.-H., Baker, P.: Lexical bundles in L1 and L2 student writing. Lang. Learn. Technol. 14, 30–49 (2010)
Coxhead, A.: A new academic word list. TESOL Q. 34(2), 213–238 (2000)
Durrant, P.: Investigating the viability of a collocation list for students of English for academic purposes. Engl. Specif. Purp. 28, 157–169 (2009)
Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Morgan Kaufmann, Burlington (2016)
Granger, S.: Academic phraseology: a key ingredient in successful L2 academic literacy. Oslo Stud. Engl. 9(3), 9–27 (2017)
Hyland, K.: Bundles in academic discourse. Ann. Rev. Appl. Linguist. 32, 150–169 (2012)
Kováříková, D.: Kvantitativní charakteristiky termínů. Quantitative Characteristics of Terms (in Czech). LN, Praha (2017)
Kováříková, D., Lukešová, L.: Extracting multi-word expressions for the Czech academic phrase list (conference presentation)
Křen, M., et al.: SYN2015: Representative Corpus of Written Czech. Institute of the Czech National Corpus, FFUK, Prague (2015). http://www.korpus.cz. Accessed 15 May 2019
McEnery, T., Hardie, A.: Corpus Linguistics: Method, Theory and Practice. John Benjamins, Amsterdam (2012)
Simpson-Vlach, R., Ellis, N.: An academic formulas list: new methods in phraseology research. Appl. Linguist. 31(4), 487–512 (2010)
Vincent, B.: Investigating academic phraseology through combinations of very frequent words: a methodological exploration. J. Engl. Acad. Purp. 12, 44–56 (2013)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam (2005)
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kováříková, D., Kovářík, O. (2019). Automatic Identification of Academic Phrases for Czech. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science, vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_17
DOI: https://doi.org/10.1007/978-3-030-30135-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30134-7
Online ISBN: 978-3-030-30135-4
eBook Packages: Computer Science (R0)