The code repository for AMR guided joint information extraction model (NAACL-HLT 2021).
We use the transformer-based AMR parser (Astudillo et al. 2020) in our model. We train the AMR parser on LDC AMR 3.0 data, and the code and the pretrained model are publicly available. Please follow the README file in the code directory to install and use it.
The code of the AMR parser is derived from the original Github Repo. We made some slight modifications to fit it in higher versions of PyTorch.
The core python dependencies for this code are as follows:
torch==1.6.0
, torch-scatter==2.0.6
, transformers==4.3.0
, dgl==0.6.0
.
For other dependencies, you can simply use pip install
to install the required packages that are indicated by the error messages.
The ACE-2005 and ERE datasets are only available at the LDC website. Please use the following steps for preprocessing and AMR parsing.
- Use the scripts in OneIE to obtain the data files
train.oneie.json
,dev.oneie.json
, andtest.oneie.json
for the OneIE model. - Use
data/transform_for_amrie.py
to transform the OneIE formatted data files to fit in our model. For example:python ./data/transform_for_amrie.py -i [INPUT_DATA_DIR] -o [OUTPUT_DATA_DIR]
. - Use
./process_amr.py
to conduct AMR parsing and generate the.pkl
files for training the model.
After preprocessing, each dataset split should contain a .json
file (containing the IE annotations) and a .pkl
file (containing AMR information). For example, for the training set of ACE-2005, we have train.oneie.json
and train_graphs.pkl
. You can also refer to the publicly available GENIA datasets for more detailed dataset formats in our model.
We release the preprocessed data for GENIA 2011 and 2013 along with the AMR graphs for each sentence at this link. Please unzip the file and put all the folders into ./data/
before training the models.
To train a model, you can run python train.py -c [CONFIG_FILE] -g [GPU_ID] -n [NAME_OF_THIS_RUN]
. The example config files are provided in ./config/
. For example, to train a model on GENIA 2011 joint IE dataset, you can run:
python train.py -c config/genia_2011.json -g 0 -n your_preferred_name
The average training time for a single model would be ~15 hours on Tesla V100 GPUs, with ~10GB usage of GPU memory.
To make it easier to reproduce our results, we released our trained model checkpoints on ACE-05E dataset here. Please download the model and run the following command to for testing:
python test.py -c [YOUR_CONFIG_FILE] -g [GPU_DEVICE_ID] --model_path [MODEL_CHECKPOINT_PATH]
We also share our config files for ACE-05E under the config
folder. If the pretrained model is stored in ./models/best.role.mdl
, then an example for running the test script is:
python test.py -c config/ace05.json -g 0 --model_path models/best.role.mdl
This code is largely derived from OneIE. Our great thanks to Lin et al. and Astudillo et al. for publicizing their codes for the OneIE model and the AMR parser!
Please contact [email protected] if you have any questions. If you use this code as part of your research, please cite the following paper:
@inproceedings{amrie2021,
author = {Zixuan Zhang and Heng Ji},
title = {Abstract Meaning Representation Guided Graph Encoding and Decodingfor Joint Information Extraction},
booktitle = {Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)},
year = {2021},
pages = {39--49},
url = {https://www.aclweb.org/anthology/2021.naacl-main.4/}
}