
Add prediction workflow based on BERTMap #170

Closed
wants to merge 13 commits into from

Conversation

buzgalbraith

Add code for generating mappings using BERTMap, and updated predictions.tsv

Member

@cthoyt cthoyt left a comment


hi @buzgalbraith, thanks for making a submission. This is a really interesting idea, but it needs some refactoring before it can be considered. Here's some feedback:

High Level

A few specific high-level things to start with:

  1. Make sure you pass CI. Run tox locally and address all issues. During the contribution process, I will probably add some other CI checks to further automate pre-review of the code
  2. Please put the code that is reusable in the library itself under src/biomappings/bertmap.py - this should include everything except the CLI entrypoint. I already made and pushed a stub file for you.
  3. Please make sure there are no places where you're using one-letter variable names. They might have been a nice shorthand when you were writing, but they're difficult to understand later
  4. Make sure there are 100% type hints everywhere and that all variables are documented fully in docstrings. The CI will check this for you, or you can run tox locally to check anytime

If all of these things can be addressed, I can give a proper code review.

Specifics

  1. It appears you created your own code for downloading ontologies. Since Biomappings is built on top of PyOBO, which already implements that, it would make more sense to reuse PyOBO's code. I can give more specific suggestions later, once refactoring makes what you've written a bit more approachable
  2. There's a reference to an externally defined model on HuggingFace. If code in Biomappings is to use it, then we need to have documentation on how it was generated
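A minimal sketch of the first point, reusing PyOBO instead of a custom ontology downloader. This assumes PyOBO's get_id_name_mapping function, which handles downloading and caching; the filter helper and the length threshold are illustrative.

```python
# Hypothetical sketch: reusing PyOBO to load ontology labels instead of
# writing a custom downloader. Names and thresholds are illustrative.
from collections.abc import Mapping


def get_label_mapping(prefix: str) -> Mapping[str, str]:
    """Return a mapping from local identifiers to names for an ontology.

    PyOBO handles downloading and caching the ontology for us.
    """
    import pyobo  # deferred import so the sketch is importable without PyOBO installed

    return pyobo.get_id_name_mapping(prefix)


def filter_short_labels(id_to_name: Mapping[str, str], min_length: int = 3) -> dict[str, str]:
    """Keep only entries whose names are long enough to match reliably."""
    return {luid: name for luid, name in id_to_name.items() if len(name) >= min_length}
```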

Disclaimer

This is your first PR to one of my repos - I am very proactive about making the changes I want to see, so keep a heads up and make sure you pull before you start making changes, in case I added or changed something

@cthoyt cthoyt changed the title Bert map Add prediction workflow based on BERTMap Nov 6, 2024

codecov bot commented Nov 6, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Member

cthoyt commented Nov 6, 2024

After some refactoring, I was able to get it to run locally from beginning to end. It doesn't appear to make any new predictions on the second run.

The research question to go with this method is: is it worth using more complicated methods? How often does a BERT-based method find something we couldn't using lexical matching?

I suggest going through and exhaustively curating all of the new predictions that are novel. It looks like there are fewer than 100. Then, I would suggest relaxing the confidence cutoff and curating additional predictions as well. A question for you @buzgalbraith - how knowledgeable are you about chemistry? There is a lot of nuance in the mappings between ChEBI and MeSH. Maybe it's best to focus on the other ontologies if you don't have that domain knowledge.

Then, it makes sense to run a benchmark, since I'm guessing that many of the predictions being produced are already covered by predictions made with simpler lexical matching, or by mappings that have already been curated.

Summary questions:

  • What percentage of the predictions produced by this method are correct/incorrect?
  • What percentage of the predictions produced by this method are also produced by lexical matching?

Another question that comes to mind: is there a benefit to fine-tuning vs. using the base BERT model? It's not clear what the code is doing, so it's not obvious to me what fine-tuning even accomplishes. This would be important to document.
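The two summary questions above come down to set arithmetic once both prediction sets and the curated judgments are loaded. A minimal sketch, with placeholder mapping tuples rather than real data:

```python
# Minimal sketch of the two summary statistics: precision of the BERTMap
# predictions against curated judgments, and overlap with lexical matching.
# The tuples below are placeholders, not real mappings.

bertmap_predictions = {("chebi:1", "mesh:A"), ("chebi:2", "mesh:B"), ("chebi:3", "mesh:C")}
lexical_predictions = {("chebi:1", "mesh:A"), ("chebi:4", "mesh:D")}
curated_correct = {("chebi:1", "mesh:A"), ("chebi:3", "mesh:C")}

# What fraction of BERTMap predictions are correct?
precision = len(bertmap_predictions & curated_correct) / len(bertmap_predictions)

# What fraction of BERTMap predictions are also produced by lexical matching?
overlap = len(bertmap_predictions & lexical_predictions) / len(bertmap_predictions)

print(f"precision: {precision:.0%}, overlap with lexical matching: {overlap:.0%}")
```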

x["source prefix"].lower() == source_ontology_train.lower()
)

def iri_map(x):
Member


these functions don't make sense - you could accomplish the same thing below in a list comprehension that would be much more readable and understandable
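A hypothetical sketch of what this suggestion could look like: filtering the rows directly with a list comprehension instead of defining small helper functions. The row dictionaries and the "source prefix" key are illustrative.

```python
# Hypothetical sketch of the reviewer's suggestion: replace small one-off
# helper functions with a direct list comprehension. Data is illustrative.

rows = [
    {"source prefix": "CHEBI", "source identifier": "16236"},
    {"source prefix": "MESH", "source identifier": "D000431"},
]
source_ontology_train = "chebi"

# Keep only the rows whose source prefix matches the training ontology
training_rows = [
    row for row in rows
    if row["source prefix"].lower() == source_ontology_train.lower()
]
```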

source_identifier = get_luid(source_prefix, src_class_iri)
target_identifier = get_luid(target_prefix, tgt_class_iri)
# check if in provided map
target_known_maps = [
Member

@cthoyt cthoyt Nov 6, 2024


please refactor to reduce code that is duplicated, and also add comments to explain what the point of this is
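One hedged sketch of the kind of deduplication being asked for: resolving both class IRIs through a single documented helper instead of two inline calls. The get_luid here is a stub standing in for the PR's function, which appears to extract the local unique identifier from a class IRI.

```python
# Hypothetical sketch of reducing duplicated source/target handling into one
# helper. get_luid is a stub for the PR's function of the same name.


def get_luid(prefix: str, class_iri: str) -> str:
    """Extract the local unique identifier from a class IRI (stub)."""
    return class_iri.rsplit("_", 1)[-1]


def resolve_pair(
    source_prefix: str, src_class_iri: str, target_prefix: str, tgt_class_iri: str
) -> tuple[str, str]:
    """Resolve both class IRIs to local identifiers in one place, not twice inline."""
    return get_luid(source_prefix, src_class_iri), get_luid(target_prefix, tgt_class_iri)
```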

Member


it's a bit unclear what's going on here, are you assuming that any annotation including the other prefix is a mapping?

config.global_matching.enabled = False
if not os.path.isdir("bertmap"):
print("downloading model from hugging face")
snapshot_download(repo_id="buzgalbraith/BERTMAP-BioMappings", local_dir="./")
Member


where does this come from? is it the result of some of the code that's run here? if so, how was it uploaded to huggingface?

Contributor

bgyori commented Nov 6, 2024

Hi @cthoyt, @buzgalbraith works with me on this project, where we are evaluating language-model-based and graph-based predictions to create mappings. Part of the plan is to curate new predictions made with various approaches and compare them against predictions we've made in the past using Gilda. The principled way to do these curations is to add them to Biomappings' predictions and then curate them through the UI, hence the structure of this PR. I am happy to discuss research ideas, but beyond that, I want to make clear that decisions about how and what to do are not at your sole discretion.

Contributor

bgyori commented Nov 6, 2024

Upon further reflection, we will not pursue contributing this code into this repo so closing the PR.

@bgyori bgyori closed this Nov 6, 2024