-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject.yml
222 lines (193 loc) · 9.58 KB
/
project.yml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
title: Affiliation Extraction and Parsing from OpenALEX Preprints
description: |
## Design
This project uses spaCy to extract and parse affiliations from preprints in PDF format. The pipeline consists of several components:
- **PDF-to-text conversion**: Extracts plain text blocks from PDF files using [pyMuPDF](https://pymupdf.readthedocs.io/en/latest/).
- **Text categorization**: Predicts text blocks containing affiliations using a binary text categorization model so they can be extracted separately.
- **Named entity recognition**: Parses the extracted text blocks to identify named entities like organizations, people, and locations.
- **Relation extraction**: Builds a graph of affiliations and their relationships to authors and institutions.
For a full overview of the project, including possible next steps, see [the report](https://docs.google.com/document/d/1kdqEFBh0IolQUq-xybDajOPejoLRKy1cF4b7LycTv60/edit?usp=sharing).
## Setup
To use the workflow in this project, you need to start by installing spaCy, ideally in a new virtual environment:
```sh
pip install spacy
```
Then, before running any of the workflows, make sure to install the dependencies by running the following command:
```sh
weasel run dependencies:install
```
This will ensure you have the latest copies of pretrained models as well as the `thinc-apple-ops` package if you are running on an M-series Mac, which significantly speeds up training times.
## Data
Preprint data was sourced from [sul-dlss-labs/preprints-evaluation-dataset](https://github.com/sul-dlss-labs/preprints-evaluation-dataset). You can pull down a copy of the spreadsheet listing all preprints with:
```sh
weasel assets
```
This command will check the CSV file's checksum to ensure you are using the same copy of the data as the project was developed with. To fetch the actual PDF files and their metadata and convert them to plaintext, run:
```sh
weasel run preprints:download
weasel run preprints:clean
```
Note that **you need to be on Stanford VPN** to fetch files from SDR.
## Annotating
Annotated training datasets are stored in the [datasets/](datasets/) directory. They have also been pre-exported in the binary format used by spaCy in the [corpus/](corpus/) directory and are already configured as training data in the `config.cfg` files in the `configs/[ner|textcat]` directory.
If you have a local copy of [prodigy](https://prodi.gy/), you can use it to annotate more data for training. To recreate the dataset for text categorization, run:
```sh
weasel run dataset:textcat:create
```
Then, you can annotate the data using:
```sh
weasel run annotate:textcat
```
To recreate the dataset for named entity recognition, run:
```sh
weasel run dataset:ner:create
```
And then annotate the data using:
```sh
weasel run annotate:ner
```
The annotation tasks will open in your browser, and you can use the Prodigy UI to annotate more data.
## Training
You can train the text categorization model using:
```sh
weasel run train_textcat
```
And the named entity recognition model using:
```sh
weasel run train_ner
```
These commands are parameterized by the `embedding` and `transformer_model_name` variables in the project.yml file. You can choose to use spaCy's built-in token-to-vector embedding layer (`tok2vec`) or a transformer-based model (`transformer`). If you choose to use a transformer-based model, you can specify the model name (`transformer_model_name`) from the Hugging Face model hub.
If you change these parameters and re-run the training command, spaCy saves the results in the `training/` directory. The best-scoring run and the last run are always saved to `training/[ner|textcat]/model-best` and `training/[ner|textcat]/model-last`, respectively.
You can further customize the training configuration by editing the `config_[embedding].cfg` file in the `configs/[ner|textcat]` directory. If you do this, it's usually worth it to run:
```sh
spacy debug config configs/[ner|textcat]/config_[embedding].cfg
```
This will check the configuration file for errors and print out a summary of the settings.
## Visualizing
There are several interfaces built with [Streamlit](https://streamlit.io/) to help debug the various parts of the pipeline. You can view these with:
```sh
streamlit run scripts/visualize.py
```
This will open a browser window with the Streamlit interface, allowing you to preview different data and model parameters.
## API
There is an example API built with [FastAPI](https://github.com/fastapi) that processes a PDF file sent as form data and returns predicted affiliations as JSON. You can start a development server with:
```sh
fastapi dev scripts/api.py
```
Then, POST a PDF file using:
```sh
curl -F file=@assets/preprints/pdf/W2901173781.pdf "http://localhost:8000/analyze"
```
vars:
embedding: "tok2vec" # tok2vec, transformer
config_file: "config_${embedding}.cfg"
transformer_model_name: "roberta-base" # any huggingface model name
gpu_id: 0 # -1 for CPU, 0 for GPU
span_types: "caption,footnote,list_item,page_footer,page_header,section_header,text,title,document_index,key_value_region" # list of DocLayNet spans to keep
directories:
- assets
- training
- results
- metrics
assets:
- dest: "assets/preprints.csv"
git:
repo: https://github.com/sul-dlss-labs/preprints-evaluation-dataset.git
branch: main
path: records-100.csv
checksum: d88ef1844e56552f90199654cf0fd3ed # md5
description: List of preprint documents in the training dataset
workflows:
prepare:
- preprints:download
- preprints:clean
- dataset:textcat:create
commands:
- name: dependencies:install
help: Install dependencies
script:
- pip install -r requirements.txt
- python -m spacy download en_core_web_lg # for NER and to compare with -trf
- python -m spacy download en_core_web_trf # for NER and to compare with -lg
- /bin/bash -c "if [[ $(arch) == 'arm64' ]]; then pip install thinc-apple-ops; fi" # for M-series macs
- name: preprints:download
help: Download preprint PDFs and metadata (**Needs Stanford VPN**)
outputs:
- assets/preprints/pdf
script:
- python scripts/download_preprints.py assets/preprints.csv assets/preprints/pdf
- python scripts/download_metadata.py assets/preprints.csv assets/preprints/json
- name: preprints:clean
help: Extract plain text from preprint PDFs
deps:
- assets/preprints/pdf
outputs:
- assets/preprints/txt
script:
- python scripts/clean_preprints.py assets/preprints/pdf assets/preprints/txt
- name: dataset:textcat:create
help: Create a dataset for annotating text categorization training data
deps:
- assets/preprints/txt
outputs:
- assets/preprints_textcat.jsonl
script:
- python scripts/create_textcat_dataset.py assets/preprints_textcat.jsonl
- name: annotate:textcat
help: Annotate binary training data for text categorization
deps:
- assets/preprints_textcat.jsonl
script:
- python -m prodigy textcat.manual preprints_textcat assets/preprints_textcat.jsonl --label AFFILIATION,AUTHOR,CITATION
- name: dataset:ner:create
help: Create a dataset for annotating NER data
deps:
- assets/preprints/txt
outputs:
- assets/preprints.jsonl
script:
- python scripts/create_ner_dataset.py assets/preprints.jsonl
- name: annotate:ner
help: Annotate training data for NER by correcting and updating an existing model
deps:
- assets/preprints.jsonl
script:
- python -m prodigy ner.correct preprints_ner en_core_web_trf assets/preprints.jsonl --label PERSON,ORG,GPE --update
- name: dataset:spancat:create
help: Create a dataset for annotating span categorization training data
deps:
- assets/preprints/pdf
outputs:
- assets/preprints_spancat.jsonl
script:
- python -m prodigy pdf.layout.fetch assets/preprints_spancat.jsonl blank:en assets/preprints/pdf --focus ${vars.span_types}
- name: annotate:spancat
help: Annotate training data for span categorization directly on PDFs
deps:
- assets/preprints_spancat.jsonl
script:
- python -m prodigy pdf.spans.manual preprints_spancat blank:en assets/preprints_spancat.jsonl --label AFFILIATION,AUTHOR,CITATION --focus text,footnote,list_item
- name: train_textcat
help: Train spaCy text categorization pipeline for affiliation extraction
deps:
- "configs/textcat_multilabel/${vars.config_file}"
- corpus/textcat_multilabel/train.spacy
- corpus/textcat_multilabel/dev.spacy
script:
- "python -m spacy train configs/textcat_multilabel/${vars.config_file} --output training/textcat_multilabel --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"
- name: train_ner
help: Train spaCy NER pipeline for affiliation parsing
deps:
- "configs/ner/${vars.config_file}"
- corpus/ner/train.spacy
- corpus/ner/dev.spacy
script:
- "python -m spacy train configs/ner/${vars.config_file} --output training/ner --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"
- name: train_spancat
help: Train spaCy span categorization pipeline for affiliation extraction
deps:
- "configs/spancat/${vars.config_file}"
- corpus/spancat/train.spacy
- corpus/spancat/dev.spacy
script:
- "python -m spacy train configs/spancat/${vars.config_file} --code scripts/utils.py --output training/spancat --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"