update readme. doc msd. don't compute similarities if files score exist
ofilangi committed Oct 23, 2024
1 parent aff58d9 commit 932427b
Showing 5 changed files with 86,524 additions and 76 deletions.
64 changes: 64 additions & 0 deletions .gitignore
@@ -0,0 +1,64 @@
# Python bytecode files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
dist/
build/
*.egg-info/
*.egg

# Virtual environments
venv/
env/
.venv/
.env/

# IDEs and editors
.vscode/
.idea/
*.swp
*.swo

# Log files
*.log

# Cache files
.cache/
.pytest_cache/

# Local configuration files
*.ini
*.cfg
config.py

# Database files
*.db
*.sqlite3

# System files
.DS_Store
Thumbs.db

# Test coverage files
.coverage
htmlcov/

# Generated documentation files
docs/_build/

# Dependency files
pip-log.txt
pip-delete-this-directory.txt

# Jupyter Notebook
.ipynb_checkpoints

# pyenv environments
.python-version

# Backup files
*.bak

*_workdir/
187 changes: 112 additions & 75 deletions README.md
@@ -20,86 +20,123 @@ LLMSemanticAnnotator employs Semantic Textual Similarity (STS) to annotate scien
This approach aims to significantly enrich the metadata of scientific articles, thereby facilitating experimental reproducibility, comparative analysis of studies, and large-scale knowledge extraction in the field of plant biology.



## Installation
## Run

```bash
exec.sh <json_config_file>
./exec.sh -h
Usage: ./exec.sh <config_file> <int_commande>

Commands:
1. Pseudo workflow [2,4,5,6,7]
2. Populate OWL tag embeddings
3. Populate NCBI Taxon tag embeddings
4. Populate abstract embeddings
5. Compute similarities between tags and abstract chunks
6. Display similarities information
7. Build turtle knowledge graph
8. Build dataset abstracts annotations CSV file
9. Evaluate encoder with MeSH descriptors (experimental)

Details:
2: Compute TAG embeddings for all ontologies defined in the populate_owl_tag_embeddings section
3: Compute TAG embeddings for NCBI Taxon
4: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
5: Compute similarities between TAGS and ABSTRACTS
6: Display similarities information on the console
7: Generate turtle file with information {score, tag} for each DOI
8: Generate CSV file with [doi, tag, pmid, reference_id]

```
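
The usage text lists command 1 as a pseudo workflow covering steps `[2,4,5,6,7]`. As a minimal sketch, assuming that this is equivalent to invoking `./exec.sh <config_file> <step>` once per step in that order (the README does not confirm this), the individual command lines could be built like this. The helper name `workflow_commands` is hypothetical and not part of the project.

```python
import shlex

# Step numbers taken from the "Pseudo workflow [2,4,5,6,7]" line of the usage text.
PSEUDO_WORKFLOW_STEPS = [2, 4, 5, 6, 7]


def workflow_commands(config_file, steps=PSEUDO_WORKFLOW_STEPS):
    """Build one argument list per step (hypothetical helper, not part of the project)."""
    return [["./exec.sh", config_file, str(step)] for step in steps]


# Print the commands instead of running them, so the sketch has no side effects.
for cmd in workflow_commands("config/planteome-demo.json"):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute the steps one after another and stop at the first failure.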
## Configuration file (JSON)
Examples can be found in:
- [planteome](./config/planteome-demo.json)
- [mesh](./config/mesh-demo.json)
- [foodon](./config/foodon-demo.json)
## Configuration main keys
### General config
```json
"encodeur" : "sentence-transformers/all-MiniLM-L6-v2",
"threshold_similarity_tag_chunk" : 0.65,
"threshold_similarity_tag" : 0.80,
"batch_size" : 32,
```
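
To illustrate what `threshold_similarity_tag_chunk` controls, here is a minimal sketch that scores one abstract chunk against a set of tags with cosine similarity and keeps the tags that reach the threshold. It uses toy 3-dimensional vectors instead of real encoder output. The function names and the `>=` comparison are assumptions, and the role of `threshold_similarity_tag` is not documented in this section, so it is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


def tags_above_threshold(chunk_vec, tag_vecs, threshold):
    """Return {tag: score} for every tag whose score reaches the threshold."""
    scores = {tag: cosine_similarity(chunk_vec, vec) for tag, vec in tag_vecs.items()}
    return {tag: score for tag, score in scores.items() if score >= threshold}


# Toy vectors standing in for real sentence embeddings (assumption).
config = {"threshold_similarity_tag_chunk": 0.65}
tag_vecs = {"leaf": [1.0, 0.0, 0.0], "root": [0.0, 1.0, 0.0]}
chunk_vec = [0.9, 0.1, 0.0]

kept = tags_above_threshold(chunk_vec, tag_vecs, config["threshold_similarity_tag_chunk"])
print(kept)  # Only "leaf" passes; "root" scores about 0.11.
```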
## Configuration

See the examples in the [config](./config) directory.

<table style="font-size: 10px;">
<tr>
<th>Action</th>
<th>Action description</th>
<th>Parameter</th>
<th>Parameter description</th>
</tr>
<tr>
<td>populate_owl_tag_embeddings</td>
<td>Populate the corpus of tags</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>debug_nb_terms_by_ontology</td>
<td>Number of terms per ontology for debugging.</td>
</tr>
<tr>
<td>populate_ncbi_abstract_embeddings</td>
<td>Populate the database with scientific articles</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>debug_nb_ncbi_request</td>
<td>Number of NCBI requests for debugging.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>debug_nb_abstracts_by_search</td>
<td>Number of abstracts per search for debugging.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>retmax</td>
<td>Maximum number of articles per search term.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>selected_term</td>
<td>Selected term to find relevant scientific articles.</td>
</tr>
<tr>
<td>compute_tag_chunk_similarities</td>
<td>Compute the similarity using cosine score</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>threshold_similarity_tag_chunk</td>
<td>Similarity threshold for tagging chunks.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>debug_nb_similarity_compute</td>
<td>Maximum number of similarities computed.</td>
</tr>
</table>
#### encoder
- sentence-transformers/all-MiniLM-L6-v2
### Managing ontologies: populate_owl_tag_embeddings
```json
"populate_owl_tag_embeddings" : {
"ontologies": {
"<name_ontology_groupe>" : {
"<ontology>": {
"url": <string>,
"prefix": <string>,
"format": <string>,
"label" :<string>,
"properties": [<string>,..]
},
...
}
}
}
```
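
The schema above nests each ontology under a group name. The following sketch walks that structure and reports entries that lack one of the five keys shown. Treating all five keys as mandatory is an assumption, the helper names are hypothetical, and the values in the sample section are placeholders rather than settings from the demo configs.

```python
# Keys shown in the schema above; whether all are mandatory is an assumption.
REQUIRED_KEYS = {"url", "prefix", "format", "label", "properties"}


def iter_ontologies(section):
    """Yield (group, name, entry) for every ontology in the section."""
    for group, members in section["ontologies"].items():
        for name, entry in members.items():
            yield group, name, entry


def missing_keys(section):
    """Map 'group/name' to the sorted list of keys absent from that entry."""
    problems = {}
    for group, name, entry in iter_ontologies(section):
        missing = sorted(REQUIRED_KEYS - entry.keys())
        if missing:
            problems[f"{group}/{name}"] = missing
    return problems


# Placeholder values only; not taken from the project's demo configs.
section = {
    "ontologies": {
        "demo_group": {
            "demo_onto": {
                "url": "https://example.org/demo.owl",
                "prefix": "https://example.org/demo#",
                "format": "xml",
                "label": "label-property",
                "properties": [],
            },
            "broken_onto": {"url": "https://example.org/broken.owl"},
        }
    }
}

print(missing_keys(section))
```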
### Specific management of NCBI Taxon: populate_ncbi_taxon_tag_embeddings
```json
"populate_ncbi_taxon_tag_embeddings" : {
"regex" : "rassica.*" ,
"tags_per_file" : 2000
},
```
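
As a rough illustration of the two settings above, the sketch below filters taxon names with the `regex` value and splits the matches into batches of `tags_per_file`. It assumes the pattern is applied with `re.search` (a match anywhere in the name) and that `tags_per_file` is a plain batch size; neither is confirmed by the README. Under that assumption, dropping the first letter in `rassica.*` lets the pattern match both `Brassica` and `brassica`.

```python
import re


def select_and_batch(names, pattern, tags_per_file):
    """Keep names matching the pattern, then split them into fixed-size batches."""
    regex = re.compile(pattern)
    selected = [name for name in names if regex.search(name)]
    return [selected[i:i + tags_per_file] for i in range(0, len(selected), tags_per_file)]


# Sample names chosen for illustration; the batch size is reduced to 2 for readability.
names = ["Brassica napus", "Brassica rapa", "Arabidopsis thaliana", "brassica oleracea"]
batches = select_and_batch(names, "rassica.*", tags_per_file=2)
print(batches)
```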
### Managing abstracts: populate_abstract_embeddings
```json
"populate_abstract_embeddings" : {
"abstracts_per_file" : 50,
"from_file" : <from_file>
}
```
#### <from_file>
```json
"from_file" : {
"json_files" : [
"data/abstracts/abstracts_1.json",
"data/abstracts/abstracts_2.json"
]
}
```
```json
"from_file" :{
"json_dir" : "some_directory"
}
```
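
The two `from_file` variants above can be read as "an explicit list of files" versus "every JSON file in a directory". A minimal sketch of resolving either form into a list of paths follows. The helper name, the precedence of `json_files` over `json_dir`, and the non-recursive `*.json` glob are all assumptions.

```python
from pathlib import Path


def resolve_abstract_files(from_file):
    """Return the list of abstract files described by a from_file section."""
    if "json_files" in from_file:
        return [Path(p) for p in from_file["json_files"]]
    if "json_dir" in from_file:
        # Non-recursive listing, sorted for a stable order (assumption).
        return sorted(Path(from_file["json_dir"]).glob("*.json"))
    raise ValueError("from_file needs either 'json_files' or 'json_dir'")


files = resolve_abstract_files(
    {"json_files": ["data/abstracts/abstracts_1.json", "data/abstracts/abstracts_2.json"]}
)
print([str(f) for f in files])
```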
```json
"from_ncbi_api" : {
"ncbi_api_chunk_size" : 200,
"debug_nb_ncbi_request" : -1,
"retmax" : 2000,
"selected_term" : [
"Crops%2C+Agricultural%2Fmetabolism%5BMeSH%5D"
]
}
```
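
The `selected_term` value above is URL-encoded. Decoding it with the Python standard library shows the underlying search term, and re-encoding the readable form gives back the same string, which is one way to prepare new terms for this list. Whether the tool expects terms to be pre-encoded in exactly this way is inferred from the example only.

```python
from urllib.parse import quote_plus, unquote_plus

encoded = "Crops%2C+Agricultural%2Fmetabolism%5BMeSH%5D"

# '+' becomes a space and %XX sequences become the characters they encode.
decoded = unquote_plus(encoded)
print(decoded)  # Crops, Agricultural/metabolism[MeSH]

# Encoding the readable term reproduces the configured value.
print(quote_plus(decoded) == encoded)  # True
```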
### Tests Execution