Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question on multisampling #421

Open
sparthib opened this issue Apr 4, 2024 · 8 comments
Open

question on multisampling #421

sparthib opened this issue Apr 4, 2024 · 8 comments

Comments

@sparthib
Copy link

sparthib commented Apr 4, 2024

I am trying the multisample mode, and running into out of memory issues which I expected to occur, but I can only assign a certain amount of RAM on my cluster. Is there a work-around for this.

So for example, I am able to run bambu with Quant = TRUE which still outputs some novel isoforms for each of my samples separetely, how would I go about comparing which novel transcripts were common between samples? Thanks!

@andredsim
Copy link
Collaborator

Hi,

Could you share me the script you are using to run Bambu, then I can suggest changes based on what you have already tried.

@sparthib
Copy link
Author

sparthib commented Apr 5, 2024

library(bambu)
library(BiocFileCache)
library(readr)
library(sessioninfo)
library(dplyr)

# gtf.file <- "path_to_gtf/gencode_annotation.gtf"
# bambuAnnotations <- prepareAnnotations(gtf.file)
# saveRDS(bambuAnnotations, "path_to_annotation/annotations.rds")

task_id <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))

annotation <- readRDS("path_to_annotation/annotations.rds")


samples <- vector_of_sample_names

se_output_dir <- paste0("output_path", samples[task_id], "/")


se_quantOnly <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
                                  annotations = annotation,
                                  genome = fa.file,
                                  quant = TRUE)

writeBambuOutput(se_quantOnly, 
                 path = se_output_dir,
                 prefix = samples[task_id])

sessioninfo::session_info()

This is what I have so far!

Thanks,
Sowmya

@andredsim
Copy link
Collaborator

Hi Sowmya,

Here are a few options to do multiple sample analysis when memory is limiting.

  1. You can try running all samples together but set lowMemory = TRUE
  2. You can produce the read classes separately first, then generate the annotations together, then do quantification separately (or today)
    se_readClass1 <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"), annotations = annotation, genome = fa.file, discovery = FALSE, quant = FALSE, rcOutDir = "output_path")
    This will produce .rds files in the output directory (rcOutDir) when can be used in future bambu runs to skip the first (memory intensive step). Alternatively you can save the se_readClass variable using saveRDS.
    extendedAnnotations <- bambu(reads = **c(se_readClass1, se_readClass2, se_readClass3),** #this can also be a vector of paths to the .rds files annotations = annotation, genome = fa.file, discovery = TRUE, quant = **FALSE**)
    The below step you can do 1 file at a time if you want to distribute it instead of the example where I have provided multiple files at once. As long as you use extendedAnnotations, the novel transcripts will be comparable between the outputs.
    extendedAnnotations <- bambu(reads = **c(se_readClass1, se_readClass2, se_readClass3),** #this can also be a vector of paths to the .rds files annotations = **extendedAnnotations**, genome = fa.file, discovery = **FALSE**, quant = **TRUE**)

Just a side point, I noticed you named your variable se_quantOnly. If you only want to do quantification and have no novel isoforms, then you need to set discovery = FALSE.

I hope this helps,
Kind Regards,
Andre Sim

@sparthib
Copy link
Author

sparthib commented Apr 8, 2024

thank you, Andre!

@sparthib sparthib closed this as completed Apr 8, 2024
@sparthib sparthib reopened this Apr 9, 2024
@sparthib
Copy link
Author

sparthib commented Apr 9, 2024

Suppose I split the samples by chromosome, and generated read classes for these separately, can I put these chromosome_wise_ read class files together before generating the extended annotation?

@andredsim
Copy link
Collaborator

Yes this would be possible. This is how low_memory mode is meant to work, albeit it still needs to load the whole bam into memory.
There are 2 caveats to splitting the bam file in advance by chromosome.

  1. There will be a lot less read classes for bambu to train on for each run, so this will impact performance of transcript discovery and to a lesser extent the junction correction. You could minimize this impact by turning off fitting and using the default model (this could either improve or decrease performance depending on read depth, so you would need to test this)
  2. If you fuse the read class files together manually in R (not with extend annotations) you will need to take care you rename all the read classes as they will by default have overlapping names which may cause problems.

Kind Regards,
Andre Sim

@sparthib
Copy link
Author

sparthib commented Apr 11, 2024

I was able to save my extended annotation to a gtf for quant = FALSE and discovery = TRUE. This is what it looks like when I load it.

extendedAnnotations <- bambu(reads = rds_files,
                             annotations = annotation, 
                             genome = fa.file, 
                             discovery = TRUE, 
                             quant = FALSE)
  writeToGTF(extendedAnnotations, "../multisample_output.gtf")   

I loaded this GTF file using the readGFF function from rtracklayer.

This is what the first row looks like.

seqid source       type start    end score strand phase.        gene_id     transcript_id exon_number
1 GL000009.2  Bambu transcript 56140  58376    NA      -    NA.      ENSG00000278704.1 ENST00000618686.1        <NA>

Does this result make sense? How do you move onto visualization from here?

Thanks,
Sowmya

@andredsim
Copy link
Collaborator

Hi Sowmya,

Yes that looks fine. With regards to visualization it depends on what you are wanting to see, and there are lots of packages available for plotting the granges object stored in extendAnnotations. Bambu's inbuilt plotting functions require you to also perform quantification as we plot the expression of each model as well.
Kind Regards,
Andre Sim

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants