-
-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prediction workflow based on BERTMap #170
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @buzgalbraith, thanks for making a submission. This is a really interesting idea, but this needs some refectoring before it can be considered. Here's some feedback:
High Level
A few specific high-level things to start with:
- Make sure you pass CI. Run
tox
locally and address all issues. During the process of contribution, I will probably add some other CI checks too to further automated giving code pre-review - Please put the code that is reusable in the library itself under
src/biomappings/bertmap.py
- this should include everything except the CLI entrypoint. I already made and pushed a stub file for you. - Please make sure there are no places you're using 1-letter variable names. This might have been a nice shorthand when you were writing, but they're difficult to understand later
- Make sure there are 100% type hints everywhere and that all variables are documented fully in docstrings. The CI will check this for you, or you can run
tox
locally to check anytime
If all of these things can be addressed, I can give a proper code review.
Specifics
- It appears you created your own code for downloading ontologies. Since Biomappings is built on top of PyOBO, which already implements that, then it would make more sense to reuse this code. I can give more specific suggestions later when refactoring makes what you've written a bit more approachable
- There's a reference to an externally defined model on HuggingFace. If code in Biomappings is to use it, then we need to have documentation on how it was generated
Disclaimer
This is your first PR to one of my repos - I am very proactive with making changes I want to see so keep a heads up and make sure you pull your repo before you start making changes in case I added/changed something
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
After some refactoring, I was able to get it to run locally from beginning to end. It doesn't appear to make any new predictions on the second run. The research question to go with this method is: is it worth using more complicated methods? How often does a BERT-based method find something we couldn't using lexical matching? I suggest going through and exhaustively curating all of the new predictions that have been made that are novel. It looks like there are less than 100. Then, I would suggest relaxing the confidence cutoff and curating additional predictions as well. A question for you @buzgalbraith - how knowledgable are you about chemistry? There is a lot of nuance in the mappings for ChEBI and MeSH. Maybe it's best to focus on the other ontologies if you don't have some domain knowledge. Then, it makes sense to run a benchmarking, since I'm guessing that a lot of the predictions that are being produced are already covered by the predictions that have been made with simpler lexical matching / ones that have already been curated. Summary questions:
Another question that comes to mind is: is there a benefit of finetuning vs. using the base BERT model? It's not clear what the code is doing so it's not obvious to me what finetuning even accomplishes. This would be important to document |
x["source prefix"].lower() == source_ontology_train.lower() | ||
) | ||
|
||
def iri_map(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions don't make sense - you could accomplish the same thing below in a list comprehension that would be much more readable and understandable
source_identifier = get_luid(source_prefix, src_class_iri) | ||
target_identifier = get_luid(target_prefix, tgt_class_iri) | ||
# check if in provided map | ||
target_known_maps = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please refactor to reduce code that is duplicated, and also add comments to explain what the point of this is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a bit unclear what's going on here, are you assuming that any annotation including the other prefix is a mapping?
config.global_matching.enabled = False | ||
if not os.path.isdir("bertmap"): | ||
print("downloading model from hugging face") | ||
snapshot_download(repo_id="buzgalbraith/BERTMAP-BioMappings", local_dir="./") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does this come from? is it the result of some of the code that's run here? if so, how was it uploaded to huggingface?
Hi @cthoyt, @buzgalbraith works with me on this project where we are evaluating language-model-based and graph-based predictions to create mappings. Part of the plan is to curate new predictions made with various approaches and compare against predictions we've made in the past using Gilda. But the principled way to do these curations is to add them to Biomapping's predictions and then curate them through the UI, hence the structure of this PR. I am happy to discuss about research ideas but beyond that, want to make clear that decisions about how and what to do are not at your sole discretion. |
Upon further reflection, we will not pursue contributing this code into this repo so closing the PR. |
Add code for generating mappings using BERTMap, and updated predictions.tsv