Update README. Doc msd. Don't compute similarities if score files exist

p2m2 · Oct 23, 2024 · commit 932427b
1 parent aff58d9

Showing 5 changed files with 86,524 additions and 76 deletions.
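
The "don't compute similarities if score files exist" part of this commit amounts to a cache check in front of the expensive similarity step. The code that implements it is not shown in this section, so the following is only a minimal sketch of the pattern. The helper name `compute_if_missing`, the JSON score file, and its path are hypothetical and are not taken from the repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def compute_if_missing(score_path: Path, compute: Callable[[], dict]) -> tuple[dict, bool]:
    """Return (scores, computed_now).

    Hypothetical helper illustrating the skip-if-exists pattern;
    it is not the repository's actual API.
    """
    if score_path.exists():
        # A score file is already on disk: reuse it, skip the expensive step.
        return json.loads(score_path.read_text(encoding="utf-8")), False

    # No score file yet: compute, then persist for the next run.
    scores = compute()
    score_path.parent.mkdir(parents=True, exist_ok=True)
    score_path.write_text(json.dumps(scores), encoding="utf-8")
    return scores, True


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "scores" / "chunk_0.json"  # assumed layout
    expensive = lambda: {"some_tag": 0.71}  # stand-in for the similarity step
    print(compute_if_missing(target, expensive))  # first call computes and writes
    print(compute_if_missing(target, expensive))  # second call reads the file
```

One consequence of this design is that stale score files are never refreshed automatically. Under this sketch, deleting the file is the way to force a recomputation.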
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,64 @@
+# Python bytecode files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+dist/
+build/
+*.egg-info/
+*.egg
+
+# Virtual environments
+venv/
+env/
+.venv/
+.env/
+
+# IDEs and editors
+.vscode/
+.idea/
+*.swp
+*.swo
+
+# Log files
+*.log
+
+# Cache files
+.cache/
+.pytest_cache/
+
+# Local configuration files
+*.ini
+*.cfg
+config.py
+
+# Database files
+*.db
+*.sqlite3
+
+# System files
+.DS_Store
+Thumbs.db
+
+# Test coverage files
+.coverage
+htmlcov/
+
+# Generated documentation files
+docs/_build/
+
+# Dependency files
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# pyenv environments
+.python-version
+
+# Backup files
+*.bak
+
+*_workdir/
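
The README diff that follows documents `threshold_similarity_tag_chunk` (0.65 in the sample config) alongside command 5, "Compute similarities between tags and abstract chunks". The table it removes describes that step as using a cosine score. As a rough illustration of what such a threshold does, here is a sketch in plain Python with made-up two-dimensional vectors in place of real encoder embeddings. The function names, and the choice of `>=` over `>`, are assumptions rather than the project's code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def tags_above_threshold(chunk_vec, tag_vecs, threshold=0.65):
    """Return {tag: score} for tags whose cosine score with the chunk
    reaches the threshold. Illustrative only: whether the project uses
    >= or > is an assumption."""
    scores = {tag: cosine(chunk_vec, vec) for tag, vec in tag_vecs.items()}
    return {tag: score for tag, score in scores.items() if score >= threshold}


if __name__ == "__main__":
    chunk = [1.0, 0.0]  # toy "abstract chunk" embedding
    tags = {
        "tag_a": [1.0, 0.0],  # same direction as the chunk
        "tag_b": [0.0, 1.0],  # orthogonal to the chunk
        "tag_c": [1.0, 1.0],  # 45 degrees from the chunk
    }
    print(tags_above_threshold(chunk, tags))
```

With these toy vectors, `tag_b` is filtered out, while `tag_a` and `tag_c` (about 0.707) pass the 0.65 threshold.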
diff --git a/README.md b/README.md
@@ -20,86 +20,123 @@ LLMSemanticAnnotator employs Semantic Textual Similarity (STS) to annotate scien
 This approach aims to significantly enrich the metadata of scientific articles, thereby facilitating experimental reproducibility, comparative analysis of studies, and large-scale knowledge extraction in the field of plant biology.
 
 
-
-## Installation
+## Run
 
 ```bash
-exec.sh <json_config_file>
+./exec.sh -h
+Usage: ./exec.sh <config_file> <int_commande>
+
+Commands:
+  1. Pseudo workflow [2,4,5,6,7]
+  2. Populate OWL tag embeddings
+  3. Populate NCBI Taxon tag embeddings
+  4. Populate abstract embeddings
+  5. Compute similarities between tags and abstract chunks
+  6. Display similarities information
+  7. Build turtle knowledge graph
+  8. Build dataset abstracts annotations CSV file
+  9. Evaluate encoder with MeSH descriptors (experimental)
+
+Details:
+  2: Compute TAG embeddings for all ontologies defined in the populate_owl_tag_embeddings section
+  3: Compute TAG embeddings for NCBI Taxon
+  4: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
+  5: Compute similarities between TAGS and ABSTRACTS
+  6: Display similarities information on the console
+  7: Generate turtle file with information {score, tag} for each DOI
+  8: Generate CSV file with [doi, tag, pmid, reference_id]
+
+```
+
+## Configuration file (JSON)
+
+Example configuration files are available:
+
+- [planteome](./config/planteome-demo.json)
+- [mesh](./config/mesh-demo.json)
+- [foodon](./config/foodon-demo.json)
+
+
+## Main configuration keys
+
+### General config
+
+```json
+    "encodeur" : "sentence-transformers/all-MiniLM-L6-v2",
+    "threshold_similarity_tag_chunk" : 0.65,
+    "threshold_similarity_tag" : 0.80,
+    "batch_size" : 32,
 ```
 
-## Configuration
-
-check exemple on [config](./config) directory
-
-<table style="font-size: 10px;">
-    <tr>
-        <th>Action</th>
-        <th>Description</th>
-        <th>Parameter</th>
-        <th>Description</th>
-    </tr>
-    <tr>
-        <td>populate_owl_tag_embeddings</td>
-        <td>Populate the corpus of tags</td>
-        <td></td>
-        <td></td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>debug_nb_terms_by_ontology</td>
-        <td>Number of terms per ontology for debugging.</td>
-    </tr>
-    <tr>
-        <td>populate_ncbi_abstract_embeddings</td>
-        <td>Populate the database with scientific articles</td>
-        <td></td>
-        <td></td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>debug_nb_ncbi_request</td>
-        <td>Number of NCBI requests for debugging.</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>debug_nb_abstracts_by_search</td>
-        <td>Number of abstracts per search for debugging.</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>retmax</td>
-        <td>Maximum number of articles per search term.</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>selected_term</td>
-        <td>Selected term to find relevant scientific articles.</td>
-    </tr>
-    <tr>
-        <td>compute_tag_chunk_similarities</td>
-        <td>Compute the similarity using cosine score</td>
-        <td></td>
-        <td></td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>threshold_similarity_tag_chunk</td>
-        <td>Similarity threshold for tagging chunks.</td>
-    </tr>
-    <tr>
-        <td></td>
-        <td></td>
-        <td>debug_nb_similarity_compute</td>
-        <td>Maximum number of similarities computed.</td>
-    </tr>
-</table>
+#### encoder
+
+- sentence-transformers/all-MiniLM-L6-v2
+
+### Managing ontologies: populate_owl_tag_embeddings
+
+```json
+"populate_owl_tag_embeddings" : {
+        "ontologies": {
+            "<name_ontology_groupe>" : {
+                 "<ontology>": {
+                    "url": <string>,
+                    "prefix": <string>,
+                    "format": <string>,
+                    "label" :<string>,
+                    "properties": [<string>,..]
+                },
+                ...
+            }
+        }
+}
+```
+
+
+### NCBI Taxon-specific handling: populate_ncbi_taxon_tag_embeddings
 
+```json
+ "populate_ncbi_taxon_tag_embeddings" : {
+        "regex" : "rassica.*" ,
+        "tags_per_file" : 2000
+    },
+```
+
+### Managing abstracts: populate_abstract_embeddings
+
+```json
+"populate_abstract_embeddings" : {
+        "abstracts_per_file" : 50,
+        "from_file" : <from_file>
+    }
+```
+
+#### `<from_file>`
+
+```json
+"from_file" : {
+        "json_files" : [
+            "data/abstracts/abstracts_1.json",
+            "data/abstracts/abstracts_2.json"
+        ]
+    }
+```
+
+```json
+"from_file" :{
+        "json_dir" : "some_directory"
+    }
+```
+
+```json
+"from_ncbi_api" : {
+            "ncbi_api_chunk_size" : 200,
+            "debug_nb_ncbi_request" : -1,
+            "retmax" : 2000,
+            "selected_term" : [
+                "Crops%2C+Agricultural%2Fmetabolism%5BMeSH%5D"
+            ]
+        }
+```
 
 ### Tests Execution