This repo aims at unifying all extensions of context-free grammars (XCFGs), where X stands for weighted, (compound) probabilistic, and neural extensions, among others. Currently, only the data preprocessing module has been implemented.
Update (08/06/2023): Support the Brown Corpus and the English Web Treebank used in this study.
Update (06/02/2022): Parse MSCOCO and Flickr30k captions, create data splits, and encode images for VC-PCFG.
Update (03/10/2021): Parallel Chinese-English data is supported.
The repo handles WSJ, CTB, SPMRL, the Brown Corpus, and the English Web Treebank. Have a look at `treebank.py`.
If you are looking for the data used in C-PCFGs, follow the instructions in `treebank.py` and put all outputs in the same folder, say `./data.punct`. The script only removes morphology features and creates data splits. To remove punctuation, we need `clean_tb.py`; for example, I used `python clean_tb.py ./data.punct ./data.clean`. All the cleaned treebanks will reside in `./data.clean`. Then simply execute `./batchify.sh ./data.clean/`, and you will have all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in `batchify.sh` if you want to use a different batch size or vocabulary size.
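If you prefer to drive the last two steps from Python, here is a minimal sketch that simply wraps the commands above with `subprocess`; the paths are the example ones used above, and `treebank.py` still has to be run first following its own instructions.

```python
# Minimal sketch: wrap the punctuation-removal and batchification commands above.
# The paths are the example ones from this README; adjust them to your setup.
import subprocess

# Remove punctuation from the treebanks in ./data.punct.
subprocess.run(["python", "clean_tb.py", "./data.punct", "./data.clean"], check=True)

# Batchify the cleaned treebanks; batch size and vocabulary size are set in batchify.sh.
subprocess.run(["./batchify.sh", "./data.clean/"], check=True)
```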
To ease evaluation I represent a gold tree as a tuple:
TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)
If you have followed the instructions in the last section, the command `./binarize.sh ./data.clean/` will convert the gold trees into this tuple representation.
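For illustration, a gold tree in this format might look like the following. The sentence, spans, labels, and the exclusive-right-boundary convention are assumptions made for this example, not output of `binarize.sh`.

```python
# A made-up gold tree for "the cat sat on the mat" in the tuple format above.
# Assumption: a SPAN is a pair of word indices with an exclusive right boundary.
gold_tree = (
    "the cat sat on the mat",                  # sentence: STR
    [(0, 2), (4, 6), (3, 6), (2, 6), (0, 6)],  # spans: LIST[SPAN]
    ["NP", "NP", "PP", "VP", "S"],             # span_labels: LIST[STR], one per span
    ["DT", "NN", "VBD", "IN", "DT", "NN"],     # pos_tags: LIST[STR], one per word
)

sentence, spans, span_labels, pos_tags = gold_tree
assert len(spans) == len(span_labels)
assert len(sentence.split()) == len(pos_tags)
```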
Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the literature on grammar induction, partly because authors have used (slightly) different data preprocessing procedures. To encourage a truly fair comparison, I have also released a standard procedure in `baseline.py` (see the sketch after the table below). Hopefully, this will help with the situation.
Model | WSJ | CTB | Basque | German | French | Hebrew | Hungarian | Korean | Polish | Swedish |
---|---|---|---|---|---|---|---|---|---|---|
LB | 8.7 | 7.2 | 17.9 | 10.0 | 5.7 | 8.5 | 13.3 | 18.5 | 10.9 | 8.4 |
RB | 39.5 | 25.5 | 15.4 | 14.7 | 26.4 | 30.0 | 12.7 | 19.2 | 34.2 | 30.4 |
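As a reference for what these baselines do, here is a sketch of left- and right-branching span construction. It only illustrates the idea and is not necessarily identical to `baseline.py`; for instance, whether the trivial whole-sentence span is kept depends on the evaluation protocol.

```python
def left_branching_spans(n):
    """Unlabeled spans of a fully left-branching binary tree over n words,
    i.e. (((w0 w1) w2) w3) ..., with exclusive right boundaries."""
    return [(0, k) for k in range(2, n + 1)]

def right_branching_spans(n):
    """Unlabeled spans of a fully right-branching binary tree over n words,
    i.e. w0 (w1 (w2 (w3 ...)))."""
    return [(k, n) for k in range(n - 2, -1, -1)]

# Example for a 4-word sentence.
print(left_branching_spans(4))   # [(0, 2), (0, 3), (0, 4)]
print(right_branching_spans(4))  # [(2, 4), (1, 4), (0, 4)]
```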
Below is a comparison of several critical training/evaluation settings of recent unsupervised parsing models.
Model | Sent. F1 | Corpus F1 | Variance | Word repr. | Punct. rm | Length | Dataset |
---|---|---|---|---|---|---|---|
PRPN | ✓ | | | RAW | ✓ | | WSJ |
ON | ✓ | | | RAW | ✓ | | WSJ |
DIORA | ✓ | | | ELMo | | | WSJ |
URNNG | ✓ | | | RAW | ✗ | | WSJ |
N-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
C-PCFG | ✓ | | | RAW | ✓ | | WSJ / CTB |
VG-NSL | ✓ | | ✓ | RAW / FastText | ✗ | | MSCOCO |
LN-PCFG | ✓ | | | RAW | | | WSJ |
CT | ✓ | | | RoBERTa | | | WSJ |
S-DIORA | ✓ | | | ELMo | | | WSJ |
VC-PCFG | ✓ | ✓ | ✓ | RAW | ✓ | | MSCOCO |
C-PCFG (Zhao 2020) | ✓ | ✓ | ✓ | RAW | ✓ | | WSJ / CTB / SPMRL |
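Since the Sent. F1 / Corpus F1 distinction above is a recurring source of incomparable numbers, here is a sketch of the two metrics over unlabeled spans; details such as discarding trivial spans or handling empty predictions vary from paper to paper.

```python
def _f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def sentence_f1(pred_spans, gold_spans):
    """Sentence-level F1: compute F1 per sentence, then average over sentences."""
    scores = []
    for pred, gold in zip(pred_spans, gold_spans):
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        scores.append(_f1(precision, recall))
    return sum(scores) / len(scores)

def corpus_f1(pred_spans, gold_spans):
    """Corpus-level F1: pool span counts over the whole corpus, then compute F1 once."""
    tp = sum(len(pred & gold) for pred, gold in zip(pred_spans, gold_spans))
    n_pred = sum(len(pred) for pred in pred_spans)
    n_gold = sum(len(gold) for gold in gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return _f1(precision, recall)

# One set of (left, right) spans per sentence.
preds = [{(0, 2), (0, 3)}, {(1, 3)}]
golds = [{(0, 2), (1, 3)}, {(1, 3)}]
print(sentence_f1(preds, golds))  # 0.75
print(corpus_f1(preds, golds))    # ~0.667
```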
If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entries.
@inproceedings{zhao-titov-2023-transferability,
title = "On the Transferability of Visually Grounded {PCFGs}",
author = "Zhao, Yanpeng and Titov, Ivan",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}
@inproceedings{zhao-titov-2021-empirical,
title = "An Empirical Study of Compound {PCFG}s",
author = "Zhao, Yanpeng and Titov, Ivan",
booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
month = apr,
year = "2021",
address = "Kyiv, Ukraine",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.adaptnlp-1.17",
pages = "166--171",
}
`batchify.py` is borrowed from C-PCFGs.
The code is licensed under the MIT license.