
Optimisation of resources for the workflow #455

Open · ypriverol opened this issue Dec 5, 2024 · 13 comments
Labels: enhancement (New feature or request)

@ypriverol (Member) commented Dec 5, 2024

Description of feature

Currently, quantms has seven major resource categories (labels) for its processes:

```groovy
    withLabel:process_single {
        cpus   = { 1                   }
        memory = { 6.GB * task.attempt }
        time   = { 4.h  * task.attempt }
    }
    withLabel:process_low {
        cpus   = { 4     * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 6.h   * task.attempt }
    }
    withLabel:process_very_low {
        cpus   = { 2     * task.attempt }
        memory = { 4.GB  * task.attempt }
        time   = { 3.h   * task.attempt }
    }
    withLabel:process_medium {
        cpus   = { 8     * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h   * task.attempt }
    }
    withLabel:process_high {
        cpus   = { 12    * task.attempt }
        memory = { 72.GB * task.attempt }
        time   = { 16.h  * task.attempt }
    }
    withLabel:process_long {
        time   = { 20.h  * task.attempt }
    }
    withLabel:process_high_memory {
        memory = { 200.GB * task.attempt }
    }
```

However, some of my recent analyses show that resource usage, for example in DIA analysis, could be optimized much further at the process level. See some results from my analyses below.

### Dataset: PXD030304

CPU Usage: (screenshot)

Memory Usage: (screenshot)

IO Usage: (screenshot)

Most of the processes use less than 50% of their allocated memory and CPU, which looks like a waste of resources.

@ypriverol added the enhancement label Dec 5, 2024
@jpfeuffer (Collaborator) commented Dec 5, 2024

My plan was always to use the results of your thousands of runs to learn a simple regression model for each step, based on file size and/or number of spectra. But I am not sure if you ever saved the execution logs.

@ypriverol (Member, Author)

I did it for most of the runs. However, you don't really need a huge amount of data to learn simple things. Some easy conclusions:

  • samplesheet_check and sdrf_parsing are way over their memory requirements; they can easily go down to 1 GB of memory, while we currently give them 6 GB.

@jpfeuffer (Collaborator)

Well, yes, but I wasn't talking about those easy things. Of course you can add smaller labels for those.
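
A minimal sketch of what such a smaller label could look like, following the same pattern as the config above. The label name process_single_low, the 1 GB / 1 h values, and the withName selector are assumptions for illustration, not current quantms settings:

```groovy
    // hypothetical extra tier below process_single for trivial bookkeeping steps
    withLabel:process_single_low {
        cpus   = { 1                   }
        memory = { 1.GB * task.attempt }
        time   = { 1.h  * task.attempt }
    }
    // or override the two processes mentioned above directly (selector pattern assumed)
    withName:'SAMPLESHEET_CHECK|SDRF_PARSING' {
        memory = { 1.GB * task.attempt }
    }
```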

@ypriverol (Member, Author)

I think the other ones depend heavily on the mzML size, the number of MS and MS/MS spectra I guess, and even the type of instrument or the file size.

@jpfeuffer (Collaborator)

That's why I said learning from your results.

@jpfeuffer (Collaborator)

All this information is available when starting a run.

@jpfeuffer (Collaborator)

This would be a unique and potentially publishable feature of the pipeline.
There is still the retry functionality if the resources are not enough, but I assume there should be some very informative features that allow for a very accurate prediction of resource usage.
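
The retry functionality referred to here is the errorStrategy/maxRetries pairing that works together with the task.attempt multipliers in the labels above. A sketch of that safety net, with the exit-code range and retry count taken from the common nf-core base config rather than quantms' exact values:

```groovy
process {
    // resubmit with scaled-up resources (via task.attempt) on typical
    // out-of-memory / walltime exit codes; values here are illustrative
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'
}
```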

@ypriverol (Member, Author)

Yes, the idea is to optimize the pipeline per process for 80% of the runs; if the other 20% fail, they can go to the next retry. Before doing the research, we have to think about whether the model needs information from inside the files (MS and MS/MS counts), because if it does, we would need to block all processes until mzml_statistics finishes.

@jpfeuffer (Collaborator)

Yes that is true.

@ypriverol (Member, Author)

I would argue that in the first iteration we look at simple variables: file size, instrument, experiment type (DIA/DDA), and search parameters (database, modifications, etc.). That could be the first iteration.
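
A rough sketch of how that first iteration could look at the process level, deriving memory from the input file size while keeping the retry escalation. The process name, the input name ms_file, and the size thresholds are made up for illustration and are not quantms' actual values:

```groovy
process EXAMPLE_SEARCH {
    // bucket the memory request by mzML size; task.attempt keeps the retry escalation
    // (ms_file is the input declared below; directive closures are evaluated per task)
    memory {
        def gb = ms_file.size() / (1024 ** 3)
        (gb < 2 ? 8.GB : gb < 10 ? 24.GB : 64.GB) * task.attempt
    }
    cpus 4

    input:
    path ms_file                     // hypothetical mzML input

    script:
    """
    echo "searching ${ms_file}"
    """
}
```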

@ypriverol (Member, Author)

This is the DDA-TMT dataset PXD010557:

Memory Usage: (screenshot)

CPU Usage: (screenshot)

IO Usage: (screenshot)

@jpfeuffer (Collaborator)

I think we cannot predict CPU usage. We need to know from the implementation whether it benefits from multiple cores.
Depending on the implementation, multiple cores also mean a bit more RAM, because more data is loaded at the same time or copies are made for thread safety.

@timosachsenberg

You can also subsample and average statistics from some files to get a much better idea.
