Scalability issues: Timeout error for large data set snakemake pipeline #529
Comments
Dear SCENIC+ team, I encountered a similar situation. I have a data set of ~300k cells and ~570k peaks, and I was also able to run the pycisTopic pipeline using the updated code from the polars_1xx branch. When it came to the SCENIC+ pipeline, I provided 768GB RAM, 36 cores and 48 hours, but it seems the memory was exceeded during the peak imputation step. Is there any way to modify that step, e.g. chunked reading, other than requesting more RAM? 768GB is all I can get. Thanks in advance! Best,
Hi @ChenWeiyan Are you also using the updated version of pycisTopic in the environment to run SCENIC+? All the best, Seppe
Hi Seppe, Thanks for your reply! I am not using the updated pycisTopic version in my SCENIC+ environment. If that is the issue, I will update it. Best,
Hi Seppe, I managed to run peak imputation after scaling my peak number down to ~390K. But when it comes to saving the .h5mu data, the RAM usage nearly doubles. I am wondering if there is any way to split the .h5mu file into smaller ones? Best,
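One way to scope that question out, since .h5mu files are plain HDF5 containers: the internal layout and per-matrix sizes can be listed with h5py without loading any data into RAM, which shows what a split would actually have to carve up. A minimal sketch (the file name is a placeholder, not from this thread):

```python
import h5py

# Walk the HDF5 tree of the .h5mu container and print each object's name
# and shape; no matrix data is read into memory by this.
with h5py.File("path/to/data.h5mu", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```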
Hi @ChenWeiyan Which library conflicts were you having? All the best, Seppe
Hi @SeppeDeWinter, I did several things after I replied last time.
It seems to be the
This way I reduced the memory down to ~550GB.
Hope this info helps, and thanks for your reply! Best,
It might be an interesting test to see if more recent versions of
What type does the following print?

```python
print(type(imputed_acc_obj.mtx.T))
```
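For context, a minimal sketch of what that diagnostic distinguishes, assuming `imputed_acc_obj` is the object returned by pycisTopic's peak-imputation step (as used earlier in this thread): if `.mtx` or its transpose is a dense `numpy.ndarray` rather than a `scipy.sparse` matrix, memory use at this step explodes.

```python
import scipy.sparse as sp

# If transposing silently densifies the matrix, a ~300k x ~400k dense array
# cannot fit in RAM; a sparse type keeps memory proportional to the number
# of non-zero entries.
m = imputed_acc_obj.mtx  # assumed attribute, taken from the snippet above
for label, x in (("mtx", m), ("mtx.T", m.T)):
    print(label, type(x), "sparse:", sp.issparse(x))
```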
Dear SCENIC+ team,
I was wondering what options there are to speed up or improve the scalability of the SCENIC+ Snakemake pipeline.
I have a data set of around 117,000 cells and 340,000 peaks. I was able to run the pycisTopic pipeline using the updated code from the polars_1xx branch, which resolved out-of-memory issues. I am now trying to run the Snakemake pipeline, but the job does not finish within 72 hours, which is the maximum wall time I can get on our HPC infrastructure.
I provided 48 cores, 480 GB RAM and a 72:00:00 wall time. In issue #453 I saw fewer resources recommended for more cells, but for me the process gets killed at the localrule region_to_gene step at only 51% because of a Timeout (> 72:00:00) error. I have tried running the Snakemake step with both the development and main branch scenicplus versions, but the time issue remains the same.
It is possible that the speed issues are partly HPC-related, but I was wondering whether there are ways to split the pipeline into multiple steps, or to resume the pipeline where it stopped, in order to circumvent the 72-hour time limit?
Or would running on a GPU cluster help? Or are there other ways to speed up the Snakemake pipeline?
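On splitting and resuming: the sketch below relies only on generic Snakemake behaviour, not on anything SCENIC+-specific. Snakemake decides what to run from the output files already on disk, so re-submitting the same command resumes after the last completed rule; `--until` caps a submission at a named rule (`region_to_gene` is the rule name from the error above), and `--rerun-incomplete` re-runs jobs that a wall-time kill left half-written.

```bash
# Job 1: run the pipeline up to and including the region-to-gene step.
snakemake --cores 48 --rerun-incomplete --until region_to_gene

# Job 2 (a fresh HPC submission): finish the remaining rules; rules whose
# outputs already exist are skipped, so this resumes rather than restarts.
snakemake --cores 48 --rerun-incomplete
```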
As an alternative, I also tried to continue the pipeline locally (as I have no time limit there) by copying the ACC_GEX.h5mu file that had already been generated in the first steps of the pipeline. Unfortunately, because it has a file size of >160 GB, I get an 'ArrayMemoryError: Unable to allocate 150 GiB for an array with shape (117000x340000) and data type int32' error, since I only have 128 GB RAM available. Is there a way to, for example, read this file in chunks to prevent running out of memory?
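On chunked reading: a minimal h5py sketch, under the assumption that the matrix is stored as a dense 2-D HDF5 dataset inside the .h5mu container. The dataset path below is hypothetical; since .h5mu is plain HDF5, the real path can be found by walking the file (as in the inspection sketch earlier in the thread). If the matrix is instead stored sparse (a group of `data`/`indices`/`indptr` arrays, as AnnData-style containers do), row slicing has to go through `indptr` instead.

```python
import h5py

CHUNK = 10_000  # rows per slice; tune to the RAM actually available

with h5py.File("ACC_GEX.h5mu", "r") as f:
    dset = f["mod/scRNA/X"]  # hypothetical path; locate the real dataset first
    for start in range(0, dset.shape[0], CHUNK):
        block = dset[start:start + CHUNK]  # only this slice is loaded into RAM
        # ... process `block` here ...
```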
Thank you in advance!
Version information
SCENIC+: 1.0a1