
Load user-defined cpd aliases from XCA (compounds_auto.csv) #1262

Open
phraenquex opened this issue Jan 12, 2024 · 30 comments

@phraenquex
Collaborator

phraenquex commented Jan 12, 2024

Need a mechanism to upload compound aliases. Probably a CSV, indexed by SMILES string.

Need a way to see aliases in LHS. Probably a tooltip.

Need a way to find by alias. Likely, the search button should search through all aliases.

Need a way to switch which alias is displayed in LHS. E.g. a "switch alias" modal from the top of Hit Navigator.

@phraenquex phraenquex changed the title from "Allow upload of aliases (csv?)" to "Handle compound aliases" Jan 12, 2024
@mwinokan mwinokan added 2024-03-13 green Data dissemination and removed 2023-11-02 yellow Too big for V2 labels Apr 4, 2024
@mwinokan
Collaborator

mwinokan commented Apr 4, 2024

My conversations with Jenke about including ASAP IDs in the download make this ticket relevant again.

@mwinokan
Collaborator

In the legacy/v1 implementation, aliases could be defined in the metadata.csv.

The ASAP IDs are in SoakDB, so maybe we don't need to place the onus for those on the curators. Can they just be extracted from SoakDB?

We still need to think about how the aliases are shown in the frontend (see #1322)

@mwinokan
Collaborator

mwinokan commented Apr 16, 2024

@phraenquex says the simplest solution may be for XCA to put the compound aliases (crystal_name, zinc_id, asap_id) into a new CSV file in extra_files. This will mean that the ASAP IDs will be in the download.

After XCA has run, the file exists in extra_files and the curator can add or change the aliases before upload.

The f/e work is separate, see #1412.

@tdudgeon to get the ASAP IDs from SoakDB (@Waztom to provide an example) and create the CSV in the extra_files output.

@mwinokan
Collaborator

mwinokan commented Apr 18, 2024

@tdudgeon was unable to find the ASAP IDs. @Waztom please help Tim with this.

@tdudgeon are you able to proceed with generating the CSV using the existing zinc/compound IDs while you wait for the ASAP IDs from SoakDB? Can the XCA-generated sites be included in the CSV too?

@phraenquex please provide @tdudgeon with the required properties

@mwinokan mwinokan changed the title from "Handle compound aliases" to "XCA summary file including aliases and sites" Apr 18, 2024
@phraenquex
Collaborator Author

phraenquex commented Apr 18, 2024

Columns to include:

  • observation (long code, not the one from the loader)
  • dataset
  • compound ID
  • compound alias 1
  • observation (SC)
  • canonical site (short code)
  • canonical site (long code)
  • conformation site (SC)
  • conformation site (LC)
  • crystalform site (SC)
  • crystalform site (LC)
  • assembly (SC)
  • assembly (LC)
  • crystalform (SC)
  • crystalform (LC)

XCA generates columns 1-4; Loader will append the remaining columns.
For upload 1, leave column 4 blank.
For upload 2+, carry over any alias columns from previous uploads.

@kaliif
Collaborator

kaliif commented Apr 19, 2024

@phraenquex SC vs. LC for conformation site, crystalform site, assembly, and crystalform - there's currently no separate short and long code for these. Do you mean the short name generated for a tag?

@phraenquex
Collaborator Author

I meant the short name, the thing that goes before the dash.

Come to think of it, just put in a single column for each of those; it's only for info anyway.

@mwinokan
Collaborator

@tdudgeon has implemented the following:

  • XCA generated compounds_auto.csv
  • The user can specify their own identifiers by copying to compounds_manual.csv

@kaliif columns 5 onwards from Frank's spec are no longer needed in the loader (as they are covered by the metadata.csv).

@mwinokan
Collaborator

@kaliif please verify that any extra columns added to the summary csv will be loaded by the target loader, and also included in the metadata.csv generated for the download.

@mwinokan
Collaborator

mwinokan commented May 2, 2024

@kaliif is still working on this

@kaliif
Collaborator

kaliif commented May 2, 2024

@mwinokan what needs to happen on data load? The way it's implemented now is on download, the extra columns are added to the file.

@phraenquex
Collaborator Author

@kaliif yes that will be sufficient.

@mwinokan
Collaborator

mwinokan commented May 9, 2024

@kaliif says this is in staging but testing has only been local so far.

One of Ryan's targets will likely be the first to test these manual alias files. Once they've been uploaded we should test the download as well.

@phraenquex phraenquex added 2024-09-17 olive data curation big items (too big for mint) and removed 2024-06-14 mint Data dissemination 2 labels Sep 26, 2024
@kaliif kaliif moved this from olive to In Progress (DEV) in Fragalysis Oct 10, 2024
@mwinokan mwinokan added 2024-06-14 mint Data dissemination 2 and removed spec needed labels Oct 10, 2024
@mwinokan
Collaborator

mwinokan commented Oct 10, 2024

The spec has been updated above, and there is a spin-out ticket for the f/e and API changes: #1540.

@tdudgeon adds that we will need to support combi-soaks for the compounds_auto.csv.

@tdudgeon says we could add a ligand_name column to the generated CSV:

xtal              ligand_name  compound_code
CHIKV_MacB-x0270  LIG          Z100643660
CHIKV_MacB-x0270  LG1          Z100643662
CHIKV_MacB-x0270  LG2          Z100643663
CHIKV_MacB-x0281  LIG          Z1041785508
CHIKV_MacB-x0289  LIG          Z104492884
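
A minimal sketch of how a file with this layout could be written, assuming it is comma-separated and uses the three headers from the table above. The function name `write_compounds_auto` and the in-memory record layout are hypothetical and are not taken from XCA's code.

```python
import csv
import io

# Headers as shown in the table above; the comma delimiter is an assumption.
FIELDNAMES = ["xtal", "ligand_name", "compound_code"]


def write_compounds_auto(records, handle):
    """Write one row per (xtal, ligand_name) pair.

    A combi-soak crystal therefore contributes several rows, one per ligand.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Example rows taken from the table above (x0270 is the combi-soak).
records = [
    {"xtal": "CHIKV_MacB-x0270", "ligand_name": "LIG", "compound_code": "Z100643660"},
    {"xtal": "CHIKV_MacB-x0270", "ligand_name": "LG1", "compound_code": "Z100643662"},
    {"xtal": "CHIKV_MacB-x0281", "ligand_name": "LIG", "compound_code": "Z1041785508"},
]

buffer = io.StringIO()
write_compounds_auto(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keying each row on the pair (xtal, ligand_name) rather than on xtal alone is what lets several ligands share one crystal.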

The user can then duplicate this file as compounds_manual.csv and add columns, e.g.:

xtal              ligand_name  compound_code  enamine_id  ASAP_id           CDD_id
CHIKV_MacB-x0270  LIG          Z100643660     EN-121      ASAP-12129384293  CDD-23123313
CHIKV_MacB-x0270  LG1          Z100643662     EN-122      ASAP-12129384294  CDD-23123314
CHIKV_MacB-x0270  LG2          Z100643663     EN-123      ASAP-12129384295  CDD-23123315
CHIKV_MacB-x0281  LIG          Z1041785508
CHIKV_MacB-x0289  LIG          Z104492884

  • For every new column, create a new alias type named after that column's header, and populate the DB with the associated values.
  • Also create a special case, compound_code_update, that supersedes what comes from SoakDB and updates the default compound code alias (@kaliif please shout if it is particularly difficult to implement).
  • N.B. no changes to the compound_code column in compounds_manual.csv should be permitted; @kaliif please throw an error if it has been changed.
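
The first bullet could be sketched roughly as follows. This is not the loader's actual code: the function name `extract_aliases`, the comma delimiter, and the choice to skip blank cells are all assumptions made for illustration.

```python
import csv
import io

# Columns written by XCA; every other header is treated as a user-defined alias type.
BASE_COLUMNS = {"xtal", "ligand_name", "compound_code"}
# Special-case column from the spec above; it is not stored as an ordinary alias.
UPDATE_COLUMN = "compound_code_update"


def extract_aliases(csv_text):
    """Return {(xtal, ligand_name): {alias_type: value}} for non-empty cells."""
    aliases = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["xtal"], row["ligand_name"])
        aliases[key] = {
            column: value.strip()
            for column, value in row.items()
            if column not in BASE_COLUMNS
            and column != UPDATE_COLUMN
            and value  # skip cells the user left blank (assumed behaviour)
            and value.strip()
        }
    return aliases


# Two rows from the example table above, written out as comma-separated text.
example = """xtal,ligand_name,compound_code,enamine_id,ASAP_id,CDD_id
CHIKV_MacB-x0270,LIG,Z100643660,EN-121,ASAP-12129384293,CDD-23123313
CHIKV_MacB-x0281,LIG,Z1041785508,,,
"""

aliases = extract_aliases(example)
print(aliases[("CHIKV_MacB-x0270", "LIG")])
```

Because the alias type is taken from the header, adding a new identifier scheme needs only a new column and no code change.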

@phraenquex phraenquex removed the 2024-09-17 olive data curation big items (too big for mint) label Oct 10, 2024
@mwinokan
Collaborator

mwinokan commented Oct 15, 2024

@tdudgeon has added the ligand_name column to the compounds_auto.csv and all the data is in the metadata.yaml

@tdudgeon says that the compounds_auto.csv is not used by the loader, but it is included in the final target download and serves as a template for compounds_manual.csv.

@phraenquex clarifies that there is no explicit format specification for the new aliases

@mwinokan
Collaborator

@kaliif is starting to work on this ticket now

@kaliif
Collaborator

kaliif commented Oct 18, 2024

> Also create a special case compound_code_update that supersedes what comes from SoakDB and updates the default compound code alias (@kaliif please shout if it particularly difficult to implement)

@mwinokan I don't understand this point, I'm afraid. I don't know what comes from SoakDB or what needs to be updated.

@mwinokan
Collaborator

mwinokan commented Oct 22, 2024

@tdudgeon has added the ligand_name column.

@kaliif Regarding the compound_code_update column:

  • Assume that in the compounds_manual.csv the compound_code column is unchanged and corresponds exactly to what is in SoakDB
  • If the user provides the compound_code_update column, these values should be used to supersede the default alias.

So a user might provide (for a given dataset/lig name):

  • compound_code: comes from SoakDB and is what is currently shown in the f/e
  • compound_code_update: column for the user to override the SoakDB value; used as the default alias and served to the f/e
  • Any other column: please store it as a separate alias in the DB; these are currently not used by the f/e

e.g. in this compounds_manual.csv example the compound_code for the first ligand is overridden by the compound_code_update:

xtal              ligand_name  compound_code  compound_code_update  enamine_id  ASAP_id           CDD_id
CHIKV_MacB-x0270  LIG          Z100643660     Z100643661            EN-121      ASAP-12129384293  CDD-23123313
CHIKV_MacB-x0270  LG1          Z100643662                           EN-122      ASAP-12129384294  CDD-23123314
CHIKV_MacB-x0270  LG2          Z100643663                           EN-123      ASAP-12129384295  CDD-23123315
CHIKV_MacB-x0281  LIG          Z1041785508
CHIKV_MacB-x0289  LIG          Z104492884
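
To make the override rule concrete, here is a small sketch of how the default alias could be chosen per ligand. The helper name `default_alias` and the comma-separated input are assumptions for illustration, not the loader's real implementation.

```python
import csv
import io


def default_alias(row):
    """Prefer compound_code_update when it is filled in, else fall back to compound_code."""
    update = (row.get("compound_code_update") or "").strip()
    return update or row["compound_code"]


# First two rows of the example above; only the first has an override.
example = """xtal,ligand_name,compound_code,compound_code_update
CHIKV_MacB-x0270,LIG,Z100643660,Z100643661
CHIKV_MacB-x0270,LG1,Z100643662,
"""

defaults = {
    (row["xtal"], row["ligand_name"]): default_alias(row)
    for row in csv.DictReader(io.StringIO(example))
}
print(defaults)
```

A missing compound_code_update column and a blank cell are treated the same way here, so files without the column keep the SoakDB value.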

@tdudgeon says that the lookup can be done using the xtal and ligand_name columns alone; the compound_code column is there to match against what is in the YAML. The loader should throw an error if the compound_code column does not match what is expected. The error should be useful: it should say exactly what doesn't match and that the new value should go into the compound_code_update column instead.
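
A rough sketch of that check, assuming the expected compound codes have already been read from the YAML into a dictionary keyed by (xtal, ligand_name). The names `CompoundCodeMismatch` and `check_compound_codes` are hypothetical, and flagging rows whose key is unknown is an extra assumption that is not part of the spec.

```python
class CompoundCodeMismatch(ValueError):
    """Raised when compounds_manual.csv disagrees with the expected compound codes."""


def check_compound_codes(rows, expected):
    """Validate rows against expected = {(xtal, ligand_name): compound_code}.

    All problems are collected first so the user sees every mismatch at once.
    """
    problems = []
    for row in rows:
        xtal, ligand = row["xtal"], row["ligand_name"]
        key = (xtal, ligand)
        if key not in expected:
            # Assumption: an unknown crystal/ligand pair is also reported.
            problems.append(f"{xtal}/{ligand}: not present in the expected data")
        elif row["compound_code"] != expected[key]:
            problems.append(
                f"{xtal}/{ligand}: compound_code is '{row['compound_code']}' "
                f"but '{expected[key]}' was expected; put the new value in "
                "compound_code_update instead"
            )
    if problems:
        raise CompoundCodeMismatch("\n".join(problems))


expected = {("CHIKV_MacB-x0270", "LIG"): "Z100643660"}
edited_row = {
    "xtal": "CHIKV_MacB-x0270",
    "ligand_name": "LIG",
    "compound_code": "Z100643661",  # user edited the wrong column
}

try:
    check_compound_codes([edited_row], expected)
except CompoundCodeMismatch as error:
    print(error)
```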

@kaliif
Collaborator

kaliif commented Oct 23, 2024

Dev done, needs testing, not merged to staging yet

@phraenquex phraenquex moved this from In Progress (DEV) to Review Done - deploy to dev env (DEV) in Fragalysis Oct 24, 2024
@phraenquex
Collaborator Author

@mwinokan please test once the clusters are back

@phraenquex phraenquex assigned mwinokan and unassigned tdudgeon and kaliif Oct 29, 2024
@mwinokan mwinokan moved this from Review Done - deploy to dev env (DEV) to In staging - assess function vs spec in Fragalysis Nov 7, 2024
@mwinokan
Collaborator

b/e PR692 for staging fix implemented. @mwinokan to test

@mwinokan
Collaborator

@kaliif As described in #1540 (comment) the upload A71EV2A_xca_staging_20241104_fake_aliases.tar.gz fails on both Matej's stack and staging

@mwinokan mwinokan moved this from In staging - assess function vs spec to In Progress (DEV) in Fragalysis Nov 19, 2024
@mwinokan
Collaborator

@kaliif's latest backend has fixed the tarball issue (see comment in 1540).

@mwinokan mwinokan moved this from In Progress (DEV) to QA done in dev.env - move to staging in Fragalysis Nov 27, 2024
@mwinokan mwinokan moved this from QA done in dev.env - move to staging to In staging - assess function vs spec in Fragalysis Dec 3, 2024
@mwinokan
Collaborator

mwinokan commented Dec 3, 2024

@kaliif says the b/e part is in staging

@mwinokan
Collaborator

mwinokan commented Dec 6, 2024

@matej-vavrek is there a b/e difference between your stack and staging currently that affects this ticket?

@mwinokan mwinokan moved this from In staging - assess function vs spec to In production (Done) in Fragalysis Dec 12, 2024
Status: In production (Done)