# XCFGs

XCFGs aims at unifying extensions of context-free grammars: the X stands for weighted, (compound) probabilistic, neural, and other extensions. Currently, only the data preprocessing module has been implemented.

Update (08/06/2023): Added support for the Brown Corpus and the English Web Treebank, which are used in this study.

Update (06/02/2022): Added support for parsing MSCOCO and Flickr30k captions, creating data splits, and encoding images for VC-PCFG.

Update (03/10/2021): Parallel Chinese-English data is supported.

## Data

The repo handles WSJ, CTB, SPMRL, the Brown Corpus, and the English Web Treebank. Have a look at `treebank.py`.

If you are looking for the data used in C-PCFGs, follow the instructions in `treebank.py` and put all outputs in the same folder, say `./data.punct`. That script only removes morphology features and creates data splits. To remove punctuation you will need `clean_tb.py`, for example, `python clean_tb.py ./data.punct ./data.clean`. All the cleaned treebanks will reside in `./data.clean`. Then execute `./batchify.sh ./data.clean/` and you will have all the data needed to reproduce the results in C-PCFGs. Feel free to change the parameters in `batchify.sh` if you want a different batch size or vocabulary size.
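For intuition only, here is a minimal sketch of what removing punctuation from a bracketed (PTB-style) tree involves. This is not the repo's `clean_tb.py`; the punctuation tag set and the use of `nltk` are assumptions made purely for illustration.

```python
# Illustrative sketch only -- clean_tb.py is the repo's actual cleaning procedure.
# Assumes PTB-style bracketed trees and a common punctuation POS-tag set.
from nltk.tree import Tree

PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}  # assumed tag set

def strip_punct(tree):
    """Return a copy of `tree` with punctuation preterminals removed."""
    if isinstance(tree, str):                       # bare leaf (token)
        return tree
    if len(tree) == 1 and isinstance(tree[0], str):
        # Preterminal node: keep it unless its POS tag is punctuation.
        return None if tree.label() in PUNCT_TAGS else tree
    children = [c for c in (strip_punct(child) for child in tree) if c is not None]
    return Tree(tree.label(), children) if children else None

tree = Tree.fromstring("(S (NP (DT The) (NN dog)) (VP (VBZ barks)) (. .))")
print(strip_punct(tree))  # (S (NP (DT The) (NN dog)) (VP (VBZ barks)))
```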

## Evaluation

To ease evaluation I represent a gold tree as a tuple:

```
TREE: TUPLE(sentence: STR, spans: LIST[SPAN], span_labels: LIST[STR], pos_tags: LIST[STR])
SPAN: TUPLE(left_boundary: INT, right_boundary: INT)
```

If you have followed the instructions in the previous section, the command `./binarize.sh ./data.clean/` can convert the gold trees into this tuple representation.
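For concreteness, a gold tree in this representation might look like the sketch below. The values, and in particular the span-indexing convention (0-based, exclusive right boundary), are illustrative assumptions rather than the exact output of `binarize.sh`.

```python
# A hypothetical gold tree for "The dog barks" in the tuple format above.
# The indexing convention (0-based, exclusive right boundary) is an assumption.
gold_tree = (
    "The dog barks",        # sentence: STR
    [(0, 2), (0, 3)],       # spans: LIST[SPAN], e.g. the NP and the whole clause
    ["NP", "S"],            # span_labels: LIST[STR], aligned with spans
    ["DT", "NN", "VBZ"],    # pos_tags: LIST[STR], one per word
)
```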

## Trivial baselines

Even for trivial baselines, e.g., left- and right-branching trees, you may find different F1 numbers in the grammar-induction literature, partly because authors have used (slightly) different data-preprocessing procedures. To encourage truly fair comparisons, I have also released a standard procedure, `baseline.py`; hopefully this helps with the situation. A minimal sketch of these baselines and of span F1 follows the table below.

| Model | WSJ | CTB | Basque | German | French | Hebrew | Hungarian | Korean | Polish | Swedish |
|-------|-----|-----|--------|--------|--------|--------|-----------|--------|--------|---------|
| LB    | 8.7 | 7.2 | 17.9   | 10.0   | 5.7    | 8.5    | 13.3      | 18.5   | 10.9   | 8.4     |
| RB    | 39.5| 25.5| 15.4   | 14.7   | 26.4   | 30.0   | 12.7      | 19.2   | 34.2   | 30.4    |
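The sketch below shows how left-/right-branching spans and unlabeled, corpus-level F1 can be computed. It is not the repo's `baseline.py`; the conventions (exclusive right boundaries, dropping the trivial whole-sentence span) are assumptions made for illustration.

```python
# Minimal, self-contained sketch of the LB/RB baselines and corpus-level span F1.
# Not the repo's baseline.py; span conventions here are illustrative assumptions.

def left_branching_spans(n):
    """Internal spans of a left-branching binary tree over n words."""
    return {(0, k) for k in range(2, n)}          # excludes the trivial (0, n) span

def right_branching_spans(n):
    """Internal spans of a right-branching binary tree over n words."""
    return {(k, n) for k in range(1, n - 1)}      # excludes the trivial (0, n) span

def corpus_f1(pred_trees, gold_trees):
    """Micro-averaged (corpus-level) unlabeled F1 over predicted vs. gold spans."""
    tp = n_pred = n_gold = 0
    for pred, gold in zip(pred_trees, gold_trees):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp / max(n_pred, 1)
    recall = tp / max(n_gold, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Example: gold spans for a 4-word sentence vs. the right-branching baseline.
gold = [{(0, 2), (2, 4)}]
pred = [right_branching_spans(4)]                  # {(1, 4), (2, 4)}
print(corpus_f1(pred, gold))                       # 0.5
```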

## An evaluation checklist for phrase-structure grammar induction

Below is a comparison of several critical training / evaluation settings of recent unsupervised parsing models.

| Model | Sent. F1 | Corpus F1 | Variance | Word repr. | Punct. rm | Length | Dataset |
|-------|----------|-----------|----------|------------|-----------|--------|---------|
| PRPN |  |  |  | RAW |  |  | WSJ |
| ON |  |  |  | RAW |  |  | WSJ |
| DIORA |  |  |  | ELMo |  |  | WSJ |
| URNNG |  |  |  | RAW |  |  | WSJ |
| N-PCFG |  |  |  | RAW |  |  | WSJ / CTB |
| C-PCFG |  |  |  | RAW |  |  | WSJ / CTB |
| VG-NSL |  |  |  | RAW / FastText |  |  | MSCOCO |
| LN-PCFG |  |  |  | RAW |  |  | WSJ |
| CT |  |  |  | RoBERTa |  |  | WSJ |
| S-DIORA |  |  |  | ELMo |  |  | WSJ |
| VC-PCFG |  |  |  | RAW |  |  | MSCOCO |
| C-PCFG (Zhao 2020) |  |  |  | RAW |  |  | WSJ / CTB / SPMRL |
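The sentence-level vs. corpus-level F1 distinction in the checklist matters: sentence F1 averages a per-sentence F1 over the corpus, while corpus F1 pools span counts before computing a single score, and the two can differ noticeably. Below is a minimal sketch of sentence-level F1, under the same illustrative span conventions assumed in the baseline sketch above.

```python
# Sentence-level F1: average a per-sentence unlabeled span F1 over the corpus.
# Contrast with the corpus-level (micro-averaged) F1 sketched earlier.

def span_f1(pred, gold):
    """Unlabeled F1 between one sentence's predicted and gold span sets."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def sentence_f1(pred_trees, gold_trees):
    """Macro-average of per-sentence span F1 over the whole corpus."""
    scores = [span_f1(p, g) for p, g in zip(pred_trees, gold_trees)]
    return sum(scores) / max(len(scores), 1)
```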

## Citing XCFGs

If you use XCFGs in your research or wish to refer to the results in C-PCFGs, please use the following BibTeX entries.

```
@inproceedings{zhao-titov-2023-transferability,
    title = "On the Transferability of Visually Grounded {PCFGs}",
    author = "Zhao, Yanpeng and Titov, Ivan",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}
@inproceedings{zhao-titov-2021-empirical,
    title = "An Empirical Study of Compound {PCFG}s",
    author = "Zhao, Yanpeng and Titov, Ivan",
    booktitle = "Proceedings of the Second Workshop on Domain Adaptation for NLP",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.adaptnlp-1.17",
    pages = "166--171",
}
```

## Acknowledgements

`batchify.py` is borrowed from C-PCFGs.

## License

MIT