
Add prediction workflow based on BERTMap #170

Closed
wants to merge 13 commits into from

Conversation

buzgalbraith

Add code for generating mappings using BERTMap, and updated predictions.tsv

Member

@cthoyt cthoyt left a comment


hi @buzgalbraith, thanks for making a submission. This is a really interesting idea, but it needs some refactoring before it can be considered. Here's some feedback:

High Level

A few specific high-level things to start with:

  1. Make sure you pass CI. Run tox locally and address all issues. During the contribution process, I will probably add some other CI checks to further automate pre-review of the code
  2. Please put the code that is reusable in the library itself under src/biomappings/bertmap.py - this should include everything except the CLI entrypoint. I already made and pushed a stub file for you.
  3. Please make sure there are no places where you're using one-letter variable names. They might have been a nice shorthand when you were writing, but they're difficult to understand later
  4. Make sure there are 100% type hints everywhere and that all variables are documented fully in docstrings. The CI will check this for you, or you can run tox locally to check anytime

If all of these things can be addressed, I can give a proper code review.

Specifics

  1. It appears you created your own code for downloading ontologies. Since Biomappings is built on top of PyOBO, which already implements that, it would make more sense to reuse PyOBO's code. I can give more specific suggestions later, once refactoring makes what you've written a bit more approachable
  2. There's a reference to an externally defined model on HuggingFace. If code in Biomappings is to use it, then we need to have documentation on how it was generated
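A minimal sketch of the first point, reusing PyOBO instead of a custom ontology downloader. This assumes PyOBO's get_id_name_mapping function, which handles downloading and caching; the filter helper and the length threshold are illustrative.

```python
# Hypothetical sketch: reusing PyOBO to load ontology labels instead of
# writing a custom downloader. Names and thresholds are illustrative.
from collections.abc import Mapping


def get_label_mapping(prefix: str) -> Mapping[str, str]:
    """Return a mapping from local identifiers to names for an ontology.

    PyOBO handles downloading and caching the ontology for us.
    """
    import pyobo  # deferred import so the sketch is importable without PyOBO installed

    return pyobo.get_id_name_mapping(prefix)


def filter_short_labels(id_to_name: Mapping[str, str], min_length: int = 3) -> dict[str, str]:
    """Keep only entries whose names are long enough to match reliably."""
    return {luid: name for luid, name in id_to_name.items() if len(name) >= min_length}
```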

Disclaimer

This is your first PR to one of my repos - I am very proactive about making the changes I want to see, so keep a heads up and make sure you pull before you start making changes, in case I added or changed something

@cthoyt cthoyt changed the title Bert map Add prediction workflow based on BERTMap Nov 6, 2024

codecov bot commented Nov 6, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Member

cthoyt commented Nov 6, 2024

After some refactoring, I was able to get it to run locally from beginning to end. It doesn't appear to make any new predictions on the second run.

The research question to go with this method is: is it worth using more complicated methods? How often does a BERT-based method find something we couldn't using lexical matching?

I suggest going through and exhaustively curating all of the new predictions that are novel. It looks like there are fewer than 100. Then, I would suggest relaxing the confidence cutoff and curating additional predictions as well. A question for you @buzgalbraith - how knowledgeable are you about chemistry? There is a lot of nuance in the mappings between ChEBI and MeSH. Maybe it's best to focus on the other ontologies if you don't have that domain knowledge.

Then, it makes sense to run a benchmark, since I'm guessing that many of the predictions being produced are already covered by predictions made with simpler lexical matching, or by mappings that have already been curated.

Summary questions:

  • What percentage of the predictions produced by this method are correct/incorrect?
  • What percentage of the predictions produced by this method are also produced by lexical matching?

Another question that comes to mind: is there a benefit to fine-tuning vs. using the base BERT model? It's not clear what the code is doing, so it's not obvious to me what fine-tuning even accomplishes. This would be important to document.
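The two summary questions above come down to set arithmetic once both prediction sets and the curated judgments are loaded. A minimal sketch, with placeholder mapping tuples rather than real data:

```python
# Minimal sketch of the two summary statistics: precision of the BERTMap
# predictions against curated judgments, and overlap with lexical matching.
# The tuples below are placeholders, not real mappings.

bertmap_predictions = {("chebi:1", "mesh:A"), ("chebi:2", "mesh:B"), ("chebi:3", "mesh:C")}
lexical_predictions = {("chebi:1", "mesh:A"), ("chebi:4", "mesh:D")}
curated_correct = {("chebi:1", "mesh:A"), ("chebi:3", "mesh:C")}

# What fraction of BERTMap predictions are correct?
precision = len(bertmap_predictions & curated_correct) / len(bertmap_predictions)

# What fraction of BERTMap predictions are also produced by lexical matching?
overlap = len(bertmap_predictions & lexical_predictions) / len(bertmap_predictions)

print(f"precision: {precision:.0%}, overlap with lexical matching: {overlap:.0%}")
```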

x["source prefix"].lower() == source_ontology_train.lower()
)

def iri_map(x):
Member


these functions don't make sense - you could accomplish the same thing below in a list comprehension that would be much more readable and understandable
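A hypothetical sketch of what this suggestion could look like: filtering the rows directly with a list comprehension instead of defining small helper functions. The row dictionaries and the "source prefix" key are illustrative.

```python
# Hypothetical sketch of the reviewer's suggestion: replace small one-off
# helper functions with a direct list comprehension. Data is illustrative.

rows = [
    {"source prefix": "CHEBI", "source identifier": "16236"},
    {"source prefix": "MESH", "source identifier": "D000431"},
]
source_ontology_train = "chebi"

# Keep only the rows whose source prefix matches the training ontology
training_rows = [
    row for row in rows
    if row["source prefix"].lower() == source_ontology_train.lower()
]
```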

source_identifier = get_luid(source_prefix, src_class_iri)
target_identifier = get_luid(target_prefix, tgt_class_iri)
# check if in provided map
target_known_maps = [
Member

@cthoyt cthoyt Nov 6, 2024


please refactor to reduce code that is duplicated, and also add comments to explain what the point of this is
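One hedged sketch of the kind of deduplication being asked for: resolving both class IRIs through a single documented helper instead of two inline calls. The get_luid here is a stub standing in for the PR's function, which appears to extract the local unique identifier from a class IRI.

```python
# Hypothetical sketch of reducing duplicated source/target handling into one
# helper. get_luid is a stub for the PR's function of the same name.


def get_luid(prefix: str, class_iri: str) -> str:
    """Extract the local unique identifier from a class IRI (stub)."""
    return class_iri.rsplit("_", 1)[-1]


def resolve_pair(
    source_prefix: str, src_class_iri: str, target_prefix: str, tgt_class_iri: str
) -> tuple[str, str]:
    """Resolve both class IRIs to local identifiers in one place, not twice inline."""
    return get_luid(source_prefix, src_class_iri), get_luid(target_prefix, tgt_class_iri)
```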

Member


it's a bit unclear what's going on here, are you assuming that any annotation including the other prefix is a mapping?

config.global_matching.enabled = False
if not os.path.isdir("bertmap"):
print("downloading model from hugging face")
snapshot_download(repo_id="buzgalbraith/BERTMAP-BioMappings", local_dir="./")
Member


where does this come from? is it the result of some of the code that's run here? if so, how was it uploaded to huggingface?

Contributor

bgyori commented Nov 6, 2024

Hi @cthoyt, @buzgalbraith works with me on this project, where we are evaluating language-model-based and graph-based predictions to create mappings. Part of the plan is to curate new predictions made with various approaches and compare them against predictions we've made in the past using Gilda. The principled way to do these curations is to add them to Biomappings' predictions and then curate them through the UI, hence the structure of this PR. I am happy to discuss research ideas, but beyond that, I want to make clear that decisions about how and what to do are not at your sole discretion.

Contributor

bgyori commented Nov 6, 2024

Upon further reflection, we will not pursue contributing this code into this repo so closing the PR.

@bgyori bgyori closed this Nov 6, 2024