BERT ParsCit

PyTorch Lightning Config: Hydra Template

Description

This is the repository for BERT ParsCit, which is under active development at the National University of Singapore (NUS), Singapore. The project is built on a template by ashleve. BERT ParsCit is a BERT-based version of Neural ParsCit, built by researchers at WING@NUS.

Installation

# clone project
git clone https://github.com/ljhgabe/BERT-ParsCit
cd BERT-ParsCit

# [OPTIONAL] create conda environment
conda create -n myenv python=3.8
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

Example usage

from bert_parscit import predict_for_text

result = predict_for_text("Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")
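
The return format of predict_for_text is not documented here, so the following is only a sketch of running it over several reference strings. The helper name `predict_many` is hypothetical and not part of bert_parscit; the predictor is passed in as an argument so the helper makes no assumption about what each result looks like.

```python
from typing import Any, Callable, Iterable, List


def predict_many(references: Iterable[str], predict_fn: Callable[[str], Any]) -> List[Any]:
    """Apply a predictor to each non-empty reference string, preserving order.

    Hypothetical helper for illustration; not part of bert_parscit.
    """
    results = []
    for ref in references:
        ref = ref.strip()
        if ref:  # skip blank entries
            results.append(predict_fn(ref))
    return results
```

With the package installed, this would presumably be called as `predict_many(refs, predict_for_text)`.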

How to train

Train model with default configuration

# train on CPU
python train.py trainer.gpus=0

# train on GPU
python train.py trainer.gpus=1

Train model with chosen experiment configuration from configs/experiment/

python train.py experiment=experiment_name.yaml

You can override any parameter from the command line:

python train.py trainer.max_epochs=20 datamodule.batch_size=64

To show the full stack trace for errors that occur during training or testing:

HYDRA_FULL_ERROR=1 python train.py
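
If training is launched from another Python script, the same overrides and environment variable can be assembled programmatically. This is a minimal sketch: the helper names `build_train_command` and `build_train_env` are hypothetical, and it assumes the script is run from the repository root, where train.py lives.

```python
import os
import sys
from typing import Dict, List, Mapping


def build_train_command(overrides: Mapping[str, object], script: str = "train.py") -> List[str]:
    """Turn {"trainer.max_epochs": 20} into [python, "train.py", "trainer.max_epochs=20"]."""
    return [sys.executable, script] + [f"{key}={value}" for key, value in overrides.items()]


def build_train_env(full_error: bool = True) -> Dict[str, str]:
    """Copy the current environment, optionally enabling full Hydra stack traces."""
    env = dict(os.environ)
    if full_error:
        env["HYDRA_FULL_ERROR"] = "1"
    return env


if __name__ == "__main__":
    cmd = build_train_command({"trainer.max_epochs": 20, "datamodule.batch_size": 64})
    print(" ".join(cmd))
    # To actually launch training (assumes the current directory is the repo root):
    # import subprocess
    # subprocess.run(cmd, env=build_train_env(), check=True)
```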

How to Parse Reference Strings from a PDF

Setup Doc2Json

First, set up the environment:

cd ./tools
python setup.py develop

The current grobid2json tool uses Grobid to first process each PDF into XML, then extracts paper components from the XML.

Install Grobid

You will need to have Java installed on your machine. Then, you can install your own version of Grobid and get it running, or you can run the following script:

bash tools/scripts/setup_grobid.sh

This will set up Grobid, currently hard-coded as version 0.6.1. Then run:

bash tools/scripts/run_grobid.sh

to start the Grobid server. Don't worry if it gets stuck at 87%; this is normal and means Grobid is ready to process PDFs.
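
Because the progress indicator stops at 87%, it can help to confirm readiness by querying the server directly. The sketch below uses Grobid's `/api/isalive` endpoint; the helper name `grobid_is_alive` is hypothetical, and the address `http://localhost:8070` is Grobid's usual default, assumed here rather than taken from run_grobid.sh.

```python
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 2.0) -> bool:
    """Return True if a Grobid server answers its /api/isalive endpoint with "true"."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip().lower() == b"true"
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (URLError is an OSError).
        return False


if __name__ == "__main__":
    print("Grobid ready" if grobid_is_alive() else "Grobid not reachable")
```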

Extract Reference Strings from a PDF File

You can extract the strings you need with the pdf2text.py script. For example, to get reference strings, try:

python pdf2text.py --input_file tools/tests/pdf/2020.acl-main.207.pdf --reference \
  --output_dir output/ --temp_dir temp/

With --reference, this generates a text file of reference strings in the specified output_dir, and the JSON version of the original PDF is saved in the specified temp_dir. By default, output_dir is output/ and temp_dir is temp/, both under your current path.

Parse Reference Strings from a Text File

To predict the reference string tags, try:

from bert_parscit import predict_for_file
res = predict_for_file("output/N18-3011_ref.txt", output_dir="result")

The prediction result is saved in output_dir. If unspecified, the file is written to the result/ directory under your current path.