BERT ParsCit


Description

This is the repository for BERT ParsCit, which is under active development at the National University of Singapore (NUS). The project is built on a template by ashleve. BERT ParsCit is a BERT-based version of Neural ParsCit, built by researchers at WING@NUS.

Installation

# clone project
git clone https://github.com/ljhgabe/BERT-ParsCit
cd BERT-ParsCit

# [OPTIONAL] create conda environment
conda create -n myenv python=3.8
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

Set up PDF parsing engine s2orc-doc2json

The doc2json tool converts PDFs to JSON. It first uses Grobid to process each PDF into XML, then extracts paper components from the XML. To set up Doc2Json, run:

sh bin/doc2json/scripts/run.sh

This sets up Doc2Json and Grobid. After installation, the script starts the Grobid server in the background by default.
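Before running the PDF pipeline, it can be useful to confirm that the Grobid server is reachable. The sketch below assumes Grobid's usual defaults (port 8070 and the `/api/isalive` health endpoint); this README states neither, so adjust the URL if `run.sh` configures the server differently. `grobid_is_alive` is a hypothetical helper and not part of this repository.

```python
import http.client
import urllib.error
import urllib.request


def grobid_is_alive(url: str = "http://localhost:8070/api/isalive",
                    timeout: float = 3.0) -> bool:
    """Return True if a GET request to the health endpoint answers with HTTP 200.

    The default URL is an assumption based on Grobid's standard configuration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or a broken HTTP reply:
        # in all of these cases treat the server as not running.
        return False
```

Calling `grobid_is_alive()` before `predict_for_pdf` gives a clearer failure message than an error raised deep inside the PDF conversion step.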

Example usage

from src.pipelines.bert_parscit import predict_for_string, predict_for_text, predict_for_pdf

str_result = predict_for_string(
    "Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64.")
text_result = predict_for_text("test.txt")
pdf_result = predict_for_pdf("test.pdf")
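This README does not document what the three functions return. Purely as an illustration, and assuming a prediction can be viewed as parallel lists of tokens and per-token labels, the sketch below merges consecutive tokens that share a label into fields. The helper `group_labeled_tokens` and the label names in the usage note are hypothetical; inspect the actual return values before relying on this shape.

```python
from itertools import groupby
from typing import List, Tuple


def group_labeled_tokens(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of adjacent tokens that carry the same label.

    Assumes (hypothetically) one label per token, in reading order.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    fields = []
    # groupby only merges *adjacent* items, so a label that reappears later
    # in the string starts a new field instead of being merged backwards.
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        fields.append((label, " ".join(token for token, _ in run)))
    return fields
```

For example, tokens `["Calzolari,", "N.", "(1982)"]` with the made-up labels `["author", "author", "date"]` would yield `[("author", "Calzolari, N."), ("date", "(1982)")]`.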

How to train

Train model with default configuration

# train on CPU
python train.py trainer=cpu

# train on GPU
python train.py trainer=gpu

Train model with chosen experiment configuration from configs/experiment/

python train.py experiment=experiment_name.yaml

You can override any parameter from the command line, like this:

python train.py trainer.max_epochs=20 datamodule.batch_size=64
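To make the dotted override syntax concrete, here is a minimal pure-Python sketch of how `key.subkey=value` pairs map onto a nested configuration. It illustrates the idea only and is not how Hydra implements overrides; the function `apply_overrides` and the default values in the usage note are hypothetical.

```python
import copy
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` with dotted `a.b=value` overrides applied."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nesting, creating missing levels on the way.
            node = node.setdefault(name, {})
        # Keep whole numbers as int; everything else stays a string.
        node[leaf] = int(raw_value) if raw_value.lstrip("-").isdigit() else raw_value
    return result
```

With made-up defaults such as `{"trainer": {"max_epochs": 10}, "datamodule": {"batch_size": 32}}`, applying `["trainer.max_epochs=20", "datamodule.batch_size=64"]` changes only those two leaves and leaves the rest of the configuration as it was.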

To show the full stack trace for errors that occur during training or testing:

HYDRA_FULL_ERROR=1 python train.py
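When training is launched from another Python script rather than a shell, the same environment variable can be passed through `subprocess`. This is a sketch under the assumption that `train.py` sits in the current working directory; `build_train_invocation` and `run_training` are hypothetical helpers, not part of this repository.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def build_train_invocation(overrides: List[str],
                           full_error: bool = True) -> Tuple[List[str], Dict[str, str]]:
    """Build the command line and environment for one training run."""
    command = [sys.executable, "train.py", *overrides]
    env = dict(os.environ)  # copy, so the parent process environment is unchanged
    if full_error:
        env["HYDRA_FULL_ERROR"] = "1"  # ask Hydra for the complete stack trace
    return command, env


def run_training(overrides: List[str]) -> int:
    """Run train.py as a child process and return its exit code."""
    command, env = build_train_invocation(overrides)
    return subprocess.run(command, env=env, check=False).returncode
```

For instance, `run_training(["trainer=gpu", "trainer.max_epochs=20"])` would be the scripted counterpart of combining the shell commands above.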