[DNM] Add kubernetes backend #589

Draft: lilyminium wants to merge 17 commits into base: main

Conversation

@lilyminium (Contributor) commented Dec 4, 2024:

Add a trial Dask Kubernetes backend. Do not merge -- this is just a demo PR for easy diffing.

cc @mattwthompson -- thanks so much for having a look at this 🙏

Comment on lines +246 to +260
if len(args) >= 2:
# schema is the second argument
# awful temporary terribad hack
schema_json = args[1]
if (
'".allow_gpu_platforms": true' in schema_json
or "energy_minimisation" in schema_json
):
resources["GPU"] = 0.5
resources["notGPU"] = 0
else:
resources["GPU"] = 0
resources["notGPU"] = 1
kwargs["resources"] = resources
logger.info(f"Annotating resources: {resources}")
@lilyminium (Contributor, Author) commented:

This is a fairly hacky hard-coding that I did to pick up OpenMM minimization and simulation jobs. It's probably not the best solution, for a number of reasons.

One alternative for identifying which protocols to divert to GPU/CPU workers is to modify somewhere upstream with direct access to the protocol (I think specifically https://github.com/openforcefield/openff-evaluator/blob/main/openff/evaluator/workflow/protocols.py#L1036-L1045) to either: a) pick up the Protocol by type, which could be hard-coded in a list, or b) add the allow_gpu_platforms attribute to the EnergyMinimisation protocol and look for that. A sketch of option (a) follows below.

An alternative for choosing which workers to divert the protocol to is keeping the actual worker IDs in a dict somewhere and specifying them via the workers kwarg of Client.submit (https://distributed.dask.org/en/stable/api.html#distributed.Client.submit). This avoids hard-coding the very quickly named custom resources GPU and notGPU.

However, one advantage of the custom-resources solution is that it's fairly isolated to the DaskKubernetesBackend and shouldn't modify the existing behaviour with HPC workers.
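
As a rough sketch of option (a), assuming the OpenMM protocol classes live in openff.evaluator.protocols.openmm (the helper name and the hard-coded tuple here are hypothetical, not part of this PR):

```python
from openff.evaluator.protocols.openmm import (
    OpenMMEnergyMinimisation,
    OpenMMSimulation,
)

# Hypothetical hard-coded list of protocol types that should run on GPU workers.
GPU_PROTOCOL_TYPES = (OpenMMEnergyMinimisation, OpenMMSimulation)

def resources_for(protocol) -> dict[str, float]:
    """Map a protocol to Dask custom-resource annotations (sketch only)."""
    if isinstance(protocol, GPU_PROTOCOL_TYPES):
        return {"GPU": 0.5, "notGPU": 0}
    return {"GPU": 0, "notGPU": 1}
```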

@lilyminium (Contributor, Author) commented:

These are the resources being used per worker -- they can be arbitrarily named, but I think the values need to be numerical. I was setting GPU=0.5 on the protocols and GPU=1 on the worker to try to get two protocols to execute at once on the same GPU worker, but I'm not sure it was more efficient.
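
For reference, Dask's custom resources work roughly like this outside the PR (the scheduler address and task function are placeholders):

```python
from dask.distributed import Client

def run_protocol(x):
    # Placeholder for a GPU-bound protocol execution.
    return x

# Assumes a worker was started with, e.g.:
#   dask worker tcp://scheduler:8786 --resources "GPU=1"
client = Client("tcp://scheduler:8786")  # placeholder address

# Each task reserves 0.5 of the worker's GPU "slots", so two such tasks
# can run concurrently on a worker advertising GPU=1.
future = client.submit(run_protocol, 42, resources={"GPU": 0.5})
print(future.result())
```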

Comment on lines +445 to +447
self._cluster = KubeCluster(
namespace=self._namespace, custom_cluster_spec=spec, **self._cluster_kwargs
)
@lilyminium (Contributor, Author) commented:

To do: a PVC needs to already exist for the connected Dask cluster to work. It can be created via the Kubernetes API, although I currently do that separately from the DaskKubernetesBackend. Should it be migrated into this class so the PVC is autostarted?
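
For illustration, creating such a PVC with the official Kubernetes Python client might look like this (the claim name, namespace, access mode, and sizing are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="evaluator-storage"),  # placeholder name
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # shared across scheduler and workers
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
api.create_namespaced_persistent_volume_claim(namespace="openff", body=pvc)
```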

self._namespace = namespace
self._annotate_resources = annotate_resources
self._image = image
self._other_resources = other_resources
@lilyminium (Contributor, Author) commented:

Since this is currently only GPU/CPU, we could also just skip this and set self._cpu_resources_per_worker = ..., etc.
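
That flatter layout might look like this (a sketch only; the parameter names are hypothetical):

```python
class DaskKubernetesBackend:  # heavily simplified sketch of the alternative
    def __init__(self, cpu_resources_per_worker=None, gpu_resources_per_worker=None):
        # Dedicated per-kind attributes instead of a generic `other_resources` dict.
        self._cpu_resources_per_worker = cpu_resources_per_worker
        self._gpu_resources_per_worker = gpu_resources_per_worker
```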

Comment on lines +65 to +70
secret = KubernetesSecret(
name="openeye-license",
secret_name="oe-license-feb-2024",
mount_path="/secrets/oe_license.txt",
sub_path="oe_license.txt",
)
@lilyminium (Contributor, Author) commented Dec 4, 2024:

This is a very narrow view of a secret, which can be configured in a few ways, but a file is easiest for an OpenEye license... this takes a previously configured k8s secret called oe-license-feb-2024 and mounts it at the /secrets/oe_license.txt path to make it available.
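
For context, the referenced secret would have been created ahead of time, e.g. with the Kubernetes Python client (the license path and namespace are placeholders; `kubectl create secret generic oe-license-feb-2024 --from-file=oe_license.txt=...` is the CLI equivalent):

```python
from pathlib import Path
from kubernetes import client, config

config.load_kube_config()
api = client.CoreV1Api()

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="oe-license-feb-2024"),
    string_data={
        # The key becomes the filename under the mount path.
        "oe_license.txt": Path("/path/to/oe_license.txt").read_text(),  # placeholder path
    },
)
api.create_namespaced_secret(namespace="openff", body=secret)
```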


codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 71.48438% with 73 lines in your changes missing coverage. Please review.

Project coverage is 85.90%. Comparing base (e406ec9) to head (3509457).
Report is 8 commits behind head on main.


@mattwthompson (Member) left a comment:

It'd be nice to have a couple of paragraphs of prose (don't care where, could be in docs or just a comment here) going through the basics of the design and how to use it on a K8s cluster. Otherwise LGTM & please let me know when you consider it complete enough to merge!
