Iarg-AnCora: Spanish corpus annotated with implicit arguments

  • Original Paper
  • Published:
Language Resources and Evaluation


This article presents the Spanish Iarg-AnCora corpus (400k words, 13,883 sentences) annotated with the implicit arguments of deverbal nominalizations (18,397 occurrences). We describe the methodology used to create it, focusing on the annotation scheme and the criteria adopted. The corpus was manually annotated, and an inter-annotator agreement test was conducted (81 % observed agreement) to ensure the reliability of the final resource. Annotating implicit arguments yields a substantial gain in argument and thematic role coverage (128 % on average). It is the first freely available, wide-coverage corpus annotated with implicit arguments for Spanish. The corpus can be used to train machine learning-based semantic role labeling systems and to support linguistic analysis of implicit arguments grounded in real data. Semantic analyzers are essential components of current language technology applications, which need a deeper understanding of text in order to draw higher-level inferences and achieve qualitative improvements in their results.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

  1. Iarg-AnCora is freely available at:

  2. Since implicit arguments are not annotated in AnCora-Es, the percentage of realization cannot be computed directly. The corresponding figure (0.19 implicit arguments per verb) has been estimated from the corpus by assuming that, for a given predicate, the number of arguments (explicit or not) is the same, on average, whether the predicate is realized as a verb or as a deverbal nominalization.

  3. For the sake of clarity, we underline the discourse entities acting as antecedents of the implicit arguments.

  4. In Sect. 4.1, the annotation scheme is presented in more detail.

  5. In the AnCora corpus, the tag 'S' stands for clause.







  12. The authors also provided a version of the corpus based on PropBank/NomBank annotations.

  13. Predicates annotated: ‘bid’, ‘sale’, ‘loan’, ‘cost’, ‘plan’, ‘investor’, ‘price’, ‘loss’, ‘investment’ and ‘fund’.

  14. Henceforth, we will refer to the Moor, Roth and Frank (2013) corpus as the MRF corpus.

  15. Predicates annotated: ‘give’, ‘put’, ‘leave’, ‘bring’ and ‘pay’.

  16. Note that a coreference chain may consist of only one mention, that is, a singleton.

  17. Possessive pronouns and determiners can also be discourse entities, but they do not tend to be implicit arguments of deverbal nouns, since they usually appear explicitly inside the NP headed by the nominalization. For instance, Esto permitirá al banco sanear sus cuentas, que es condición básica para continuar con su privatización, ‘This will enable the bank to consolidate its accounts, which is a basic condition for its privatization’. In this example, the possessive determiner su (‘its’) is the explicit argument, with the thematic role theme, of the deverbal noun privatización (‘privatization’).

  18. Not all the combinations of argument position and thematic roles are valid semantic tags.

  19. 200,000 words were extracted from the Spanish newspaper El Periódico and the other 200,000 words from the EFE newswire agency, spanning January to December 2000.

  20. We used the Spanish WordNet in the Multilingual Central Repository (MCR), which is linked to Princeton WordNet (Gonzalez-Agirre, Laparra and Rigau 2012).

  21. Spanish is a pro-drop language; therefore, pronominal subjects can be omitted. Object personal pronouns often appear as clitic forms and can be attached to the verb.

  22. AnCora-Verb-Es lexicon is available at:

  23. AnCora-Verb contains 3934 different senses and 5117 syntactic-semantic frames in total.



  26. AnCoraPipe is freely available, to access contact


  28. AnCoraPipe has been used for the treatment of corpora in the Amazighe, Latin and Cyrillic alphabets.

  29. For reasons of space, Fig. 3 only shows the discourse entities starting from entity12.

  30. We have split the panels in two figures in order to better visualize their content.

  31. The mean of inter-annotator agreement for the annotation of explicit arguments reached 0.75 kappa, which translated to 79.2 % observed agreement.

  32. Instances stand for the number of occurrences of argument types found in the corpus.

  33. It is worth noting that the implicit arguments of verbs are not annotated in Iarg-AnCora, so the number of occurrences and percentages for verbs only includes explicit arguments.

  34. The figures are slightly different from those reported in Sect. 4 because the comparison with G&C is performed with the subset of the 8 most frequent monosemous nominalizations.

  35. A third explanation could be the use of different criteria in the annotation of both explicit and implicit arguments in the G&C dataset and in AnCora.

  36. TCO aims to provide WordNet synsets with a neutral ontological assignment. The ontology contains 63 features organized as 1st order entities (physical things), 2nd order entities (situations) and 3rd order entities (unobservable things).

  37. Since AnCora-Es mentions are annotated with correct synsets, no Word Sense Disambiguation was needed.
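
Note 31 pairs a kappa value with an observed-agreement percentage. As a rough illustration of how the two measures relate, the sketch below computes both for two annotators. The source does not say which kappa variant was used or how the mean was taken, so Cohen's kappa for a single annotator pair is an assumption here, and the role labels are toy data invented for the example, not taken from the corpus.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each one labels independently with their own label rates.
    shared = freq_a.keys() & freq_b.keys()
    p_e = sum(freq_a[label] * freq_b[label] for label in shared) / (n * n)
    if p_e == 1.0:
        # Both annotators always used one and the same label.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations (illustrative tags only, not corpus data).
annotator_1 = ["arg0-agt", "arg1-pat", "arg1-pat", "arg0-agt", "arg2-ben"]
annotator_2 = ["arg0-agt", "arg1-pat", "arg1-tem", "arg0-agt", "arg2-ben"]

print(f"observed agreement: {observed_agreement(annotator_1, annotator_2):.3f}")
print(f"Cohen's kappa:      {cohen_kappa(annotator_1, annotator_2):.3f}")
```

Because kappa subtracts the agreement expected by chance, it is always at or below the observed agreement when chance agreement is non-zero, which is consistent with a kappa of 0.75 sitting under 79.2 % observed agreement.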


We are grateful to David Bridgewater for proofreading the English. We would also like to thank the three anonymous reviewers for their comments and suggestions, which helped to improve this article. This work was partly supported by the DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01) projects from the Spanish Ministry of Economy and Competitiveness.

Corresponding author

Correspondence to Mariona Taulé.

