
Scalability issues: Timeout error for large data set snakemake pipeline #529

Open
jolvhull opened this issue Jan 6, 2025 · 7 comments

@jolvhull commented Jan 6, 2025

Dear SCENIC+ team,

I was wondering what options there are to speed up or improve the scalability of the SCENIC+ snakemake pipeline.
I have a data set of around 117,000 cells and 340,000 peaks. I was able to run the pycisTopic pipeline by using the updated code from the polars_1xx branch, which resolved out-of-memory issues. I am now trying to run the snakemake pipeline, but the job does not finish within 72 hours, which is the maximum wall time I can get on our HPC infrastructure.

I provided 48 cores, 480 GB RAM and 72:00:00 wall time. In issue #453 I saw fewer resources were recommended for more cells, but for me the process gets killed at the localrule region_to_gene step at only 51% because of a Timeout (> 72:00:00) error. I have tried running the snakemake step using both the development and main branches of scenicplus, but the time issue remains the same.

It is possible that the speed issues are partly HPC-related, but I was wondering whether there are ways to split the pipeline into multiple steps or to resume it where it stopped, in order to work around the 72-hour time limit.

Would running on a GPU cluster speed up the pipeline, or are there other ways to speed up the snakemake pipeline?

As an alternative, I also tried to continue the pipeline locally (as I don't have a time limit here) by copying the ACC_GEX.h5mu file that had already been generated in the first steps of the pipeline. Unfortunately, because it has a file size of >160 GB, I get an 'ArrayMemoryError: Unable to allocate 150 GiB for an array with shape (117000x340000) and data type int32' error, since I only have 128 GB RAM available. Is there a way to, for example, read this file in chunks to prevent running out of memory?
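
For example, would opening the file in backed mode be an option, so that the matrices stay on disk and only the metadata is loaded? Something along these lines, assuming mudata's read_h5mu supports the backed argument in this version and that the accessibility modality is called "scATAC":

import mudata as md

# Open the .h5mu file without loading the count matrices into RAM;
# in backed mode X stays on disk and only the metadata is read eagerly.
mdata = md.read_h5mu("ACC_GEX.h5mu", backed="r")
print(mdata)

# If supported by the installed anndata version, pull only a subset of cells
# of one modality into memory, e.g. the first 10000 cells.
atac_subset = mdata["scATAC"][:10000].to_memory()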

Thank you in advance!

Version information
SCENIC+: 1.0a1

@ChenWeiyan

Dear SCENIC+ team,

I encountered a similar situation. I have a data set of ~300k cells and ~570k peaks, and I was also able to run the pycisTopic pipeline using the updated code from the polars_1xx branch.

When it came to the SCENIC+ pipeline, I provided 768 GB RAM, 36 cores and 48 hours, but it seems the memory was exceeded already at the peak imputation step.

Is there any way to modify this step, e.g. chunked reading, other than requesting more RAM? 768 GB is all I can get.
Or, if I am going to reduce the number of peaks, is there a proper threshold to apply, e.g. based on the overall standard deviation/variance?
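
For example, would a generic approach like the sketch below make sense, i.e. computing the per-peak variance directly on the sparse matrix and keeping the top N peaks? (This is just my own idea, not part of the SCENIC+ API, and it assumes a cells x peaks scipy sparse matrix.)

import numpy as np
from scipy import sparse

def top_variable_peaks(counts, n_keep):
    # Return column indices of the n_keep peaks with the highest variance,
    # assuming a cells x peaks scipy sparse count/fragment matrix.
    counts = counts.tocsc()
    mean = np.asarray(counts.mean(axis=0)).ravel()
    mean_sq = np.asarray(counts.multiply(counts).mean(axis=0)).ravel()
    var = mean_sq - mean ** 2  # per-peak variance: E[x^2] - E[x]^2
    return np.argsort(var)[::-1][:n_keep]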

Thanks in advance!

Best,
Weiyan

@SeppeDeWinter
Collaborator

Hi @ChenWeiyan

Are you also using the updated version of pycisTopic in the environment to run SCENIC+?

All the best,

Seppe

@ChenWeiyan

Hi Seppe,

Thanks for your reply!

I am not using the updated pycisTopic version in my SCENIC+ environment.
When I tried to update it before, there were some library conflicts between pycisTopic and SCENIC+, so I just created a separate env for pycisTopic and kept the SCENIC+ env intact.

If this is the issue, I will try to fix the update.

Best,
Weiyan

@ChenWeiyan

Hi Seppe,

I managed to run peak imputation after I scaled my peak number down to ~390K. But when it comes to saving the .h5mu data, the RAM usage nearly doubles.

I am wondering if there is any way to split the .h5mu file into smaller ones?
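
For example, would it be fine to simply write each modality to its own .h5ad file instead of one big .h5mu, something like the lines below? (The "scRNA" key is just my guess; "scATAC" is the key I saw in the pipeline code.)

# mdata is the MuData object that the pipeline would otherwise save as a whole;
# write each modality to its own file instead of one large .h5mu.
mdata["scRNA"].write_h5ad("GEX.h5ad")
mdata["scATAC"].write_h5ad("ACC.h5ad")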

Best,
Weiyan

@SeppeDeWinter
Collaborator

Hi @ChenWeiyan

Which library conflicts were you having?
How much memory is the process using at the moment of writing the h5mu file?

All the best,

Seppe

@ChenWeiyan

Hi @SeppeDeWinter,

I did several things after I replied last time.

  1. After modifying the requirements.txt file, I managed to run SCENIC+ with pycisTopic from the polars_1xx branch, and it indeed accelerated the region imputation steps.
  2. But when it came to writing the .h5mu file, the memory scaled up to ~1100 GB, while I only have ~700 GB, so it crashed.
  3. Then I inspected the Python function process_multiome_data in the script adata_cistopic_wrangling.py. It turned out the memory-consuming step is "scATAC": AnnData(X=imputed_acc_obj.mtx.T, obs=ACC_cell_metadata.infer_objects(), var=ACC_region_metadata.infer_objects(), obsm=ACC_dr_cell).

It seems to be imputed_acc_obj.mtx.T that requires a lot of memory, although I do not know why, because the original matrix is in sparse format. However, I made the following change to adata_cistopic_wrangling.py by adding the following lines before saving the .h5mu data:

import dask.array as da
from anndata import AnnData

# Wrap the transposed matrix in a lazy dask array so AnnData does not materialise a second full copy of it in memory.
dask_matrix = da.from_array(imputed_acc_obj.mtx.T, chunks=(100000, 50000))
adata_ACC = AnnData(X=dask_matrix, obs=ACC_cell_metadata.infer_objects(), var=ACC_region_metadata.infer_objects(), obsm=ACC_dr_cell)

This way I reduced the memory usage to ~550 GB.

  4. Then I proceeded further, all the way to the snakemake pipeline rule tf_to_gene. All the intermediate steps finished, but it again failed on the final write to tf_to_gene_adj.tsv. I haven't checked yet, but it seems to be a memory issue too. I will come back to you later.

Hope this info helps, and thanks for your reply!

Best,
Weiyan

@ghuls
Member

ghuls commented Jan 23, 2025

It might be an interesting test to see if more recent versions of AnnData fix this memory issue (as there seem to be some sparse matrix fixes).

What type does the following print?

print(type(imputed_acc_obj.mtx.T))
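
If it is a scipy sparse matrix, it would also be interesting to see how dense it effectively is after imputation, for example:

from scipy import sparse

m = imputed_acc_obj.mtx
if sparse.issparse(m):
    # Fraction of explicitly stored values; imputation can make a nominally sparse
    # matrix almost fully dense, so a sparse container no longer saves memory.
    print("density:", m.nnz / (m.shape[0] * m.shape[1]))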
