
Optimisation of resources for the workflow #455

Open · ypriverol opened this issue Dec 5, 2024 · 13 comments
Labels: enhancement (New feature or request)

@ypriverol (Member) commented Dec 5, 2024

Description of feature

Currently, quantms has seven major resource categories (labels) for its processes:

```groovy
    withLabel:process_single {
        cpus   = { 1                   }
        memory = { 6.GB * task.attempt }
        time   = { 4.h  * task.attempt }
    }
    withLabel:process_low {
        cpus   = { 4     * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 6.h   * task.attempt }
    }
    withLabel:process_very_low {
        cpus   = { 2     * task.attempt }
        memory = { 4.GB  * task.attempt }
        time   = { 3.h   * task.attempt }
    }
    withLabel:process_medium {
        cpus   = { 8     * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h   * task.attempt }
    }
    withLabel:process_high {
        cpus   = { 12    * task.attempt }
        memory = { 72.GB * task.attempt }
        time   = { 16.h  * task.attempt }
    }
    withLabel:process_long {
        time   = { 20.h  * task.attempt }
    }
    withLabel:process_high_memory {
        memory = { 200.GB * task.attempt }
    }
```

However, some of my recent analyses show that resource usage, for example in DIA analysis, could be optimized much further at the process level. See some results from my analyses below.

### Dataset: PXD030304

CPU Usage: (screenshot)

Memory Usage: (screenshot)

IO Usage: (screenshot)

Most of the processes use less than 50% of their allocated memory and CPU, which looks like a waste of resources.

@ypriverol added the enhancement label Dec 5, 2024
@jpfeuffer (Collaborator) commented Dec 5, 2024

My plan was always to use the results of your thousands of runs to learn a simple regression model for each step, based on file size and/or number of spectra. But I am not sure if you ever saved the execution logs.

@ypriverol (Member, Author)

I did it for most of the runs. However, you don't really need a huge amount of data to learn simple things. Some easy conclusions:

  • samplesheet_check and sdrf_parsing are way over their memory requirements; they can easily go down to 1 GB of memory, while we currently give them 6 GB.

@jpfeuffer (Collaborator)

Well, yes, but I wasn't talking about those easy things. Of course you can add smaller labels for those.
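
A minimal sketch of what such a smaller label could look like, following the same pattern as the config above. The label name process_single_low, the 1 GB / 1 h values, and the withName selector are assumptions for illustration, not current quantms settings:

```groovy
    // hypothetical extra tier below process_single for trivial bookkeeping steps
    withLabel:process_single_low {
        cpus   = { 1                   }
        memory = { 1.GB * task.attempt }
        time   = { 1.h  * task.attempt }
    }
    // or override the two processes mentioned above directly (selector pattern assumed)
    withName:'SAMPLESHEET_CHECK|SDRF_PARSING' {
        memory = { 1.GB * task.attempt }
    }
```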

@ypriverol (Member, Author)

I think the other ones depend heavily on the mzML size, the number of MS and MS/MS spectra I guess, and even the type of instrument or the file size.

@jpfeuffer (Collaborator)

That's why I said learning from your results.

@jpfeuffer (Collaborator)

All this information is available when starting a run.

@jpfeuffer (Collaborator)

This would be a unique and potentially publishable feature of the pipeline.
There is still the retry functionality if the resources are not enough, but I assume there should be some very informative features that allow for a very accurate prediction of resource usage.
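
The retry functionality referred to here is the errorStrategy/maxRetries pairing that works together with the task.attempt multipliers in the labels above. A sketch of that safety net, with the exit-code range and retry count taken from the common nf-core base config rather than quantms' exact values:

```groovy
process {
    // resubmit with scaled-up resources (via task.attempt) on typical
    // out-of-memory / walltime exit codes; values here are illustrative
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'
}
```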

@ypriverol (Member, Author)

Yes, the idea is to optimize the pipeline per process for 80% of the runs; if the other 20% fail, they can go to the next retry. Before doing the research, we have to think about whether the model needs information from inside the files (MS and MS/MS counts), because if it does, we would need to block all processes until mzml_statistics finishes.

@jpfeuffer (Collaborator)

Yes that is true.

@ypriverol (Member, Author)

I would argue that in the first iteration we look at simple variables: file size, instrument, experiment type (DIA/DDA), and search parameters (database, modifications, etc.). That could be the first iteration.
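
A rough sketch of how that first iteration could look at the process level, deriving memory from the input file size while keeping the retry escalation. The process name, the input name ms_file, and the size thresholds are made up for illustration and are not quantms' actual values:

```groovy
process EXAMPLE_SEARCH {
    // bucket the memory request by mzML size; task.attempt keeps the retry escalation
    // (ms_file is the input declared below; directive closures are evaluated per task)
    memory {
        def gb = ms_file.size() / (1024 ** 3)
        (gb < 2 ? 8.GB : gb < 10 ? 24.GB : 64.GB) * task.attempt
    }
    cpus 4

    input:
    path ms_file                     // hypothetical mzML input

    script:
    """
    echo "searching ${ms_file}"
    """
}
```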

@ypriverol (Member, Author)

This is the DDA-TMT dataset PXD010557:

Memory Usage: (screenshot)

CPU Usage: (screenshot)

IO Usage: (screenshot)

@jpfeuffer (Collaborator)

I think we cannot predict CPU usage. We need to know from the implementation whether it benefits from multiple cores.
Depending on the implementation, multiple cores also mean a bit more RAM, because more data is loaded at the same time or copies are made for thread safety.

@timosachsenberg

You can also subsample and average statistics from some files to get a much better idea.
