CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. (English) Zbl 1234.68412

Summary: This article presents an algorithm for translating the Penn Treebank into a corpus of combinatory categorial grammar (CCG) derivations augmented with local and long-range word-word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.

MSC:

68T50 Natural language processing
68Q42 Grammars and rewriting systems
