Assemblies should include sample name as part of the file name #26

Open
apetkau opened this issue May 5, 2022 · 7 comments

@apetkau
Member

apetkau commented May 5, 2022

Assemblies imported into Galaxy are currently named contigs.fasta.

However, in IRIDA the assembly has the sample name included as part of the file name:

This is likely due to this particular line of code:

name = resource['fileName']

This reads the fileName value from the REST API response instead of the label. The label should include the sample name: https://phac-nml.github.io/irida-documentation/developer/rest/#assemblies

This is called from this bit of code:

sample.add_file(self.get_sample_file(assembly))

Since this same function is used for getting sequence files from a sample, a bit of additional logic will likely need to be added to decide if the file is an assembly file.
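
A minimal sketch of that additional logic, assuming the REST resource is a dict with a fileName key and, for assemblies, a label key. The helper name choose_display_name and the is_assembly flag are hypothetical and not part of the importer:

```python
def choose_display_name(resource, is_assembly=False):
    """Pick the name to show in Galaxy for a file resource (hypothetical helper)."""
    if is_assembly:
        # Assumption: assembly resources carry a 'label' that includes the sample name.
        label = resource.get('label')
        if label:
            return label
    # Sequence files, and assemblies without a label, keep the plain file name.
    return resource['fileName']
```

Falling back to fileName keeps the existing behaviour for sequence files unchanged.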

@ericenns
Member

ericenns commented May 5, 2022

One thing to note is that the files are named contigs.fasta because of the Assembly and Annotation pipeline. Should we also consider updating that to prefix the file with the sample name?

@apetkau
Member Author

apetkau commented May 5, 2022

Thanks @ericenns. This is actually what the label should already be doing. At least I think that's where it happens; I can't find the exact code right now. But we did add a feature to prefix output files with the sample name and store this in the database.

It's not the file name saved on the filesystem, but the prefix (or the full label) should be used when naming files returned to a user.

@apetkau
Member Author

apetkau commented May 5, 2022

@ericenns
Member

ericenns commented May 5, 2022

Yeah, I saw that for analysis output files, but assembly files actually show up as contigs.fasta, whereas uploaded assembly files use the name they were uploaded with, e.g. myuploadedfile.fasta. I guess the question is: do we want to relabel those files? We are not doing this for sequence data (i.e. fastq files).

@apetkau
Member Author

apetkau commented May 5, 2022

I may be misunderstanding, but isn't this what we already do?

For genomes that are uploaded, the getLabel() returns the filename: https://github.com/phac-nml/irida/blob/development/src/main/java/ca/corefacility/bioinformatics/irida/model/assembly/GenomeAssembly.java#L66-L69

For genomes that are assembled in IRIDA, the getLabel() forwards the request to the AnalysisOutputFile.getLabel() which adds the prefix so that contigs.fasta will be returned as sample-contigs.fasta: https://github.com/phac-nml/irida/blob/development/src/main/java/ca/corefacility/bioinformatics/irida/model/assembly/GenomeAssemblyFromAnalysis.java#L76-L78
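
As a rough illustration of the two cases above (written in Python rather than IRIDA's Java, and not the actual getLabel() implementation), the labelling behaviour can be sketched as follows. The function name assembly_label and the sample_prefix parameter are hypothetical:

```python
def assembly_label(file_name, sample_prefix=None):
    """Sketch of the described labelling: prefix only when one is stored."""
    if sample_prefix:
        # Assembled in IRIDA: the stored prefix is joined to the file name with '-'.
        return sample_prefix + '-' + file_name
    # Uploaded assembly: the label is just the uploaded file name.
    return file_name
```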

@ericenns
Member

ericenns commented May 6, 2022

My mistake, I was under the wrong impression that they showed up as contigs.fasta in IRIDA, but I now see that they show up as SAMPLENAME-contigs.fasta.

@apetkau
Member Author

apetkau commented May 10, 2022

I believe the other issue is with this line of code:

galaxy_sample_file_name = sample_folder_path + '/' + sample_file.name

It names the data in Galaxy using the physical file name from the filesystem path, rather than the name derived from the JSON representation.
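
One possible shape for a fix, as a sketch only: build the Galaxy name from the label when one is available and fall back to the base name of the filesystem path otherwise. The function galaxy_dataset_name and its parameters are hypothetical, not existing importer code:

```python
import posixpath


def galaxy_dataset_name(sample_folder_path, file_path, label=None):
    """Build the Galaxy library path for a file (hypothetical helper)."""
    # Prefer the label from the JSON representation; otherwise use the
    # physical file name taken from the filesystem path.
    name = label if label else posixpath.basename(file_path)
    return posixpath.join(sample_folder_path, name)
```

For example, galaxy_dataset_name('/lib/sample1', '/data/irida/123/contigs.fasta', 'sample1-contigs.fasta') yields '/lib/sample1/sample1-contigs.fasta'.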
