Update Google Colab script and README help
ofilangi committed Oct 28, 2024
1 parent 782a325 commit a8c1e69
Showing 3 changed files with 103 additions and 68 deletions.
25 changes: 10 additions & 15 deletions README.md
@@ -23,29 +23,24 @@ This approach aims to significantly enrich the metadata of scientific articles,
## Run

```bash
./exec.sh -h
Usage: ./exec.sh <config_file> <int_commande>

Commands:
1. Pseudo workflow [2,4,5,6,7]
2. Populate OWL tag embeddings
3. Populate NCBI Taxon tag embeddings
4. Populate abstract embeddings
5. Compute similarities between tags and abstract chunks
6. Display similarities information
7. Build turtle knowledge graph
8. Build dataset abstracts annotations CSV file
9. Evaluate encoder with MeSH descriptors (experimental)
3. Populate abstract embeddings
4. Compute similarities between tags and abstract chunks
5. Display similarities information
6. Build turtle knowledge graph
7. Build dataset abstracts annotations CSV file

Details:
2: Compute TAG embeddings for all ontologies defined in the populate_owl_tag_embeddings section
3: Compute TAG embeddings for NCBI Taxon
4: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
5: Compute similarities between TAGS and ABSTRACTS
6: Display similarities information on the console
7: Generate turtle file with information {score, tag} for each DOI
8: Generate CSV file with [doi, tag, pmid, reference_id]

3: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
4: Compute similarities between TAGS and ABSTRACTS
5: Display similarities information on the console
6: Generate turtle file with information {score, tag} for each DOI
7: Generate CSV file with [doi, tag, pmid, reference_id]
```
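The numbered commands listed above can be chained from a small driver script. The following is a minimal sketch, assuming `exec.sh` takes the config path and an integer command exactly as the usage line shows; the `build_commands` and `run_steps` helpers and the default step selection are hypothetical and not part of the repository.

```python
import subprocess

# Step numbers copied from the renumbered command list in the README diff.
STEPS = {
    2: "Populate OWL tag embeddings",
    3: "Populate abstract embeddings",
    4: "Compute similarities between tags and abstract chunks",
    5: "Display similarities information",
    6: "Build turtle knowledge graph",
    7: "Build dataset abstracts annotations CSV file",
}


def build_commands(config_file, steps=(2, 3, 4, 6), script="./exec.sh"):
    """Return one argv list per requested step (hypothetical helper)."""
    unknown = [s for s in steps if s not in STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {unknown}")
    return [[script, config_file, str(s)] for s in steps]


def run_steps(config_file, steps=(2, 3, 4, 6)):
    """Run each step in order and stop at the first failing command."""
    for argv in build_commands(config_file, steps):
        print("running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_steps("config/planteome-demo.json")` would invoke the script once per step; `check=True` makes a non-zero exit code raise `CalledProcessError`, so later steps never run on stale data.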
## Configuration file (json)
22 changes: 13 additions & 9 deletions config/planteome-demo.json
@@ -5,35 +5,40 @@
"batch_size" : 32,

"populate_owl_tag_embeddings" : {
"prefix" : {
"rdf" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"rdfs" : "http://www.w3.org/2000/01/rdf-schema#",
"owl" : "http://www.w3.org/2002/07/owl#"
},
"ontologies": {
"planteome_link" : {
"peco": {
"url": "http://purl.obolibrary.org/obo/peco.owl",
"prefix": "http://purl.obolibrary.org/obo/PECO_",
"format": "xml",
"label" : "<http://www.w3.org/2000/01/rdf-schema#label>",
"properties": ["<http://purl.obolibrary.org/obo/IAO_0000115>"]
"label" : "rdfs:label",
"properties": ["obo:IAO_0000115","rdfs:comment","owl:annotatedTarget"]
},
"po": {
"url": "http://purl.obolibrary.org/obo/po.owl",
"prefix": "http://purl.obolibrary.org/obo/PO_",
"format": "xml",
"label" : "<http://www.w3.org/2000/01/rdf-schema#label>",
"properties": ["<http://purl.obolibrary.org/obo/IAO_0000115>"]
"label" : "rdfs:label",
"properties": ["obo:IAO_0000115","rdfs:comment","owl:annotatedTarget"]
},
"pso": {
"url": "http://purl.obolibrary.org/obo/pso.owl",
"prefix": "http://purl.obolibrary.org/obo/PSO_",
"format": "xml",
"label" : "<http://www.w3.org/2000/01/rdf-schema#label>",
"properties": ["<http://purl.obolibrary.org/obo/IAO_0000115>"]
"label" : "rdfs:label",
"properties": ["obo:IAO_0000115","rdfs:comment","owl:annotatedTarget"]
},
"to": {
"url": "http://purl.obolibrary.org/obo/to.owl",
"prefix": "http://purl.obolibrary.org/obo/TO_",
"format": "xml",
"label" : "<http://www.w3.org/2000/01/rdf-schema#label>",
"properties": ["<http://purl.obolibrary.org/obo/IAO_0000115>"]
"label" : "rdfs:label",
"properties": ["obo:IAO_0000115","rdfs:comment","owl:annotatedTarget"]
}
}
}
@@ -48,6 +53,5 @@
"Crops%2C+Agricultural%2Fmetabolism%5BMeSH%5D"
]
}

}
}
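The config change above replaces full IRIs such as `<http://www.w3.org/2000/01/rdf-schema#label>` with prefixed names such as `rdfs:label`, resolved through the new `prefix` map. How the tool performs that resolution is not shown in this diff; the sketch below is one plausible way to do it. Note that the diff declares `rdf`, `rdfs`, and `owl` but not `obo`, so the `obo` entry here is an assumption (the usual OBO namespace), and `expand` is a hypothetical helper.

```python
# Prefix map as it appears in the config diff, plus an ASSUMED "obo" entry.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "obo": "http://purl.obolibrary.org/obo/",  # assumption: not declared in the diff
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name like 'rdfs:label' into a bracketed full IRI."""
    # Already a full IRI in angle brackets: return it unchanged.
    if name.startswith("<") and name.endswith(">"):
        return name
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"undeclared prefix in {name!r}")
    return f"<{prefixes[prefix]}{local}>"


if __name__ == "__main__":
    for prop in ["rdfs:label", "obo:IAO_0000115", "rdfs:comment", "owl:annotatedTarget"]:
        print(prop, "->", expand(prop))
```

Under these assumptions, `rdfs:label` and `obo:IAO_0000115` expand to the same IRIs that the previous version of the config spelled out in full.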
124 changes: 80 additions & 44 deletions llm-semantic-annotator.ipynb
@@ -48,66 +48,74 @@
"\n",
"data = {\n",
" \"encodeur\" : \"sentence-transformers/all-MiniLM-L6-v2\",\n",
" \"threshold_similarity_tag_chunk\" : 0.67,\n",
" \"threshold_similarity_tag\" : 0.75,\n",
" \"threshold_similarity_tag_chunk\" : 0.70,\n",
" \"threshold_similarity_tag\" : 0.80,\n",
" \"batch_size\" : 32,\n",
"\n",
" \"populate_owl_tag_embeddings\" : {\n",
" \"prefix\" : {\n",
" \"rdf\" : \"http://www.w3.org/1999/02/22-rdf-syntax-ns#\",\n",
" \"rdfs\" : \"http://www.w3.org/2000/01/rdf-schema#\",\n",
" \"owl\" : \"http://www.w3.org/2002/07/owl#\"\n",
" },\n",
" \"ontologies\": {\n",
" \"planteome_link\" : {\n",
" \"peco\": {\n",
" \"url\": \"http://purl.obolibrary.org/obo/peco.owl\",\n",
" \"prefix\": \"http://purl.obolibrary.org/obo/PECO_\",\n",
" \"format\": \"xml\",\n",
" \"label\" : \"<http://www.w3.org/2000/01/rdf-schema#label>\",\n",
" \"properties\": [\"<http://purl.obolibrary.org/obo/IAO_0000115>\"]\n",
" \"format\": \"xml\", \n",
" \"label\" : \"rdfs:label\",\n",
" \"properties\": [\"obo:IAO_0000115\",\"rdfs:comment\",\"owl:annotatedTarget\"]\n",
" },\n",
" \"po\": {\n",
" \"url\": \"http://purl.obolibrary.org/obo/po.owl\",\n",
" \"prefix\": \"http://purl.obolibrary.org/obo/PO_\",\n",
" \"format\": \"xml\",\n",
" \"label\" : \"<http://www.w3.org/2000/01/rdf-schema#label>\",\n",
" \"properties\": [\"<http://purl.obolibrary.org/obo/IAO_0000115>\"]\n",
" \"label\" : \"rdfs:label\",\n",
" \"properties\": [\"obo:IAO_0000115\",\"rdfs:comment\",\"owl:annotatedTarget\"]\n",
" },\n",
" \"pso\": {\n",
" \"url\": \"http://purl.obolibrary.org/obo/pso.owl\",\n",
" \"prefix\": \"http://purl.obolibrary.org/obo/PSO_\",\n",
" \"format\": \"xml\",\n",
" \"label\" : \"<http://www.w3.org/2000/01/rdf-schema#label>\",\n",
" \"properties\": [\"<http://purl.obolibrary.org/obo/IAO_0000115>\"]\n",
" \"label\" : \"rdfs:label\",\n",
" \"properties\": [\"obo:IAO_0000115\",\"rdfs:comment\",\"owl:annotatedTarget\"]\n",
" },\n",
" \"to\": {\n",
" \"url\": \"http://purl.obolibrary.org/obo/to.owl\",\n",
" \"prefix\": \"http://purl.obolibrary.org/obo/TO_\",\n",
" \"format\": \"xml\",\n",
" \"label\" : \"<http://www.w3.org/2000/01/rdf-schema#label>\",\n",
" \"properties\": [\"<http://purl.obolibrary.org/obo/IAO_0000115>\"]\n",
" \"label\" : \"rdfs:label\",\n",
" \"properties\": [\"obo:IAO_0000115\",\"rdfs:comment\",\"owl:annotatedTarget\"]\n",
" }\n",
" }\n",
" },\n",
" \"debug_nb_terms_by_ontology\" : -1\n",
" },\n",
" \"populate_ncbi_taxon_tag_embeddings\" : {\n",
" \"tags_per_file\" : 40000\n",
" }\n",
" },\n",
" \"populate_abstract_embeddings\" : {\n",
" \"abstracts_per_file\" : 500,\n",
" \"from_ncbi_api\" : {\n",
" \"ncbi_api_chunk_size\" : 200,\n",
" \"debug_nb_ncbi_request\" : -1,\n",
" \"debug_nb_abstracts_by_search\" : -1,\n",
" \"retmax\" : 500,\n",
" \"retmax\" : 2000,\n",
" \"selected_term\" : [\n",
" \"human+AND+genetics\",\n",
" \"abiotic+AND+mass+AND+spectrometry+AND+glucosinolate+AND+brassicaceae\",\n",
" \"plant+AND+development+AND+spectrometry\",\n",
" \"Crops%2C+Agricultural%2Fmetabolism%5BMeSH%5D\"\n",
" ]\n",
" }\n",
" }\n",
"}\n",
"\n",
"\n",
"with open('config.json', 'w') as fichier:\n",
" json.dump(data, fichier, indent=4)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Populate OWL tag embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -120,7 +128,14 @@
},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json populate_owl_tag_embeddings"
"!llm-semantic-annotator config.json 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Populate abstract embeddings"
]
},
{
@@ -135,19 +150,17 @@
},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json populate_ncbi_taxon_tag_embeddings"
"!llm-semantic-annotator config.json 3"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "j9DsD4l48FLe"
"id": "yfgMXvYb8Fjh"
},
"source": [
"\n",
"## Abstract\n",
"---\n",
"\n"
"## Compute similarities between tags and abstract chunks\n",
"------"
]
},
{
@@ -157,37 +170,60 @@
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DrfFNtbS8FZU",
"outputId": "69284d62-bce2-49d8-8348-b006006a48d1"
"id": "HFGu6xtd8e6J",
"outputId": "62021bb9-51b0-4442-956d-8e3efc18a284"
},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json populate_abstract_embeddings"
"!llm-semantic-annotator config.json 4"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yfgMXvYb8Fjh"
},
"metadata": {},
"source": [
"## Compute similarities between tags and abstracts\n",
"------"
"## Display similarities information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HFGu6xtd8e6J",
"outputId": "62021bb9-51b0-4442-956d-8e3efc18a284"
},
"metadata": {},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build turtle knowledge graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json 6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build dataset abstracts annotations CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!llm-semantic-annotator config.json compute_tag_chunk_similarities"
"!llm-semantic-annotator config.json 7"
]
},
{
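The notebook config in this commit raises `threshold_similarity_tag_chunk` from 0.67 to 0.70. The diff does not show how the tool scores a tag against an abstract chunk, so the sketch below only illustrates the general idea of threshold filtering. It assumes cosine similarity and an inclusive cutoff; both are assumptions, and `cosine` and `tags_above_threshold` are hypothetical helpers working on plain Python lists.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tags_above_threshold(chunk_vec, tag_vecs, threshold=0.70):
    """Return (tag, score) pairs scoring at least `threshold`, best first.

    The inclusive comparison is an assumption; the real tool may differ.
    """
    scored = ((tag, cosine(chunk_vec, vec)) for tag, vec in tag_vecs.items())
    kept = [(tag, score) for tag, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real sentence embeddings.
    tags = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(tags_above_threshold([1.0, 0.0], tags, threshold=0.70))
```

With these toy vectors, tag `c` scores about 0.707 against the chunk, so it passes a 0.70 cutoff but would be dropped at 0.80: raising a threshold trades recall for precision.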
