LIU-420: Initial documentation additions.
myxie committed Nov 14, 2024
1 parent 1a32409 commit ba9620e
Showing 3 changed files with 152 additions and 10 deletions.
28 changes: 19 additions & 9 deletions docs/deployment.rst → docs/deployment/overview.rst
@@ -9,11 +9,11 @@ As mentioned above, |daliuge| has been developed to enable processing of data fr

.. _dataflow.fig.funcs:

.. figure:: images/dfms_func_as_graphs.jpg
.. figure:: ../images/dfms_func_as_graphs.jpg

Graph-based Functions of the |daliuge| Prototype

The :doc:`architecture/graphs` section describes the implementation details for each function.
The :doc:`../architecture/graphs` section describes the implementation details for each function.
Here we briefly discuss how they work together to fulfill the SKA requirements.

* First of all, the *Logical Graph Template* (top left in
@@ -39,8 +39,8 @@ Here we briefly discuss how they work together to fulfill the SKA requirements.

* Before an observation starts, the |daliuge| engine de-serializes a physical graph JSON string and turns all the nodes into Drop objects and then deploys all the Drops onto the allocated resources as per the
location information stated in the physical graph. The deployment process is
facilitated through :doc:`architecture/managers`, which are daemon processes managing the deployment of Drops
onto the designated resources. Note that the :doc:`architecture/managers` do *not* control the Drops or the execution, but they do monitor their state during execution.
facilitated through :doc:`../architecture/managers`, which are daemon processes managing the deployment of Drops
onto the designated resources. Note that the :doc:`../architecture/managers` do *not* control the Drops or the execution, but they do monitor their state during execution.

* Once an observation starts, the graph :ref:`graph.execution` cascades down the graph edges through either data Drops that trigger their next consumers or application Drops
that produce their next outputs. When all Drops are in the **COMPLETED** state, some data Drops
@@ -61,8 +61,8 @@ The translator is able to determine which of the following options is available
Deployment in HPC Centers
~~~~~~~~~~~~~~~~~~~~~~~~~

For current deployment in HPC systems that do not support OOD, please refer to :ref:`slurm_deployment`.

When deploying |daliuge| inside an HPC centre, the basic concept described above does not apply, since in general it is not possible to run the managers on nodes in a daemon-like way. Typically a user has to submit a job into a batch queue system like SLURM or Torque, and that is pretty much all a normal user can do. To address this use case, the |daliuge| code base contains example code (daliuge-engine/dlg/deploy/pawsey/start_dfms_cluster.py) which allows submitting not just the workflow, but also the |daliuge| engine itself as a job. That job first starts the managers and then submits the graph. It can also start a proxy server, which provides access to the managers' web interfaces from an external machine so that the running graph can be monitored. The best way to get access to the |daliuge| code base is to ask the support team to create a load module specifically for |daliuge|. If that is not possible, users can load an appropriate Python version (3.7 or 3.8) and install |daliuge| locally. In many cases it is not possible to run docker containers on HPC infrastructure.
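
As a rough sketch of such a local installation (the module name is centre-specific, and this assumes the packages are published on PyPI as daliuge-engine and daliuge-translator)::

    module load python/3.8                          # Python module name varies by centre
    python3 -m venv $HOME/dlg_env                   # keep the installation in the user's home directory
    source $HOME/dlg_env/bin/activate
    pip install daliuge-engine daliuge-translator   # assumed PyPI package names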

.. toctree::
:maxdepth: 1

slurm_deployment

Deployment with OpenOnDemand
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -73,7 +80,7 @@ Importantly, the physical graph deployment is triggered by the user's browser di

.. _deployment.fig.ood:

.. figure:: images/deploy_ood.jpeg
.. figure:: ../images/deploy_ood.jpeg

Sequence diagram of graph deployment in OOD environment.

@@ -92,7 +99,7 @@ The server deployment option assumes the machine hosting the translator can comm

.. _deployment.fig.server:

.. figure:: images/deploy_server.jpeg
.. figure:: ../images/deploy_server.jpeg

Sequence diagram of direct graph deployment.

@@ -109,7 +116,7 @@ locally, make sure that your host descriptions in EAGLE and the translator are '

.. _deployment.fig.browser:

.. figure:: images/deploy_browser.jpeg
.. figure:: ../images/deploy_browser.jpeg

Sequence diagram of RESTful graph deployment.

@@ -128,10 +135,12 @@ The user will need to monitor the k8s environment directly.

.. _deployment.fig.helm:

.. figure:: images/deploy_helm.jpeg
.. figure:: ../images/deploy_helm.jpeg

Sequence diagram of graph deployment in a Helm environment.



Component Deployment
====================

@@ -159,4 +168,5 @@ In order to be able to use Python components, it must be possible for the engine
docker exec -ti daliuge-engine bash -c "pip install --prefix=\$DLG_ROOT/code dlg_example_cmpts"
Please note that the '\' character is required for this to work correctly. When running |daliuge| in docker containers, $DLG_ROOT is mounted from the host, and thus the code subdirectory is also visible directly on the host. In a typical HPC deployment scenario that directory will be in the user's home directory, or on a shared volume, visible to all compute nodes.
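
As a quick sanity check that an installed component package actually landed in the shared location, the directory can simply be listed — a minimal sketch, assuming the container name used above::

    # on the host, which provides the mounted $DLG_ROOT directory
    ls $DLG_ROOT/code
    # or from inside the engine container
    docker exec -ti daliuge-engine bash -c "ls \$DLG_ROOT/code"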

132 changes: 132 additions & 0 deletions docs/deployment/slurm_deployment.rst
@@ -0,0 +1,132 @@
.. _slurm_deployment:

Slurm Deployment
=====================================

Usage and options
-----------------

- Facilities without OOD support require the use of the create_dlg_job.py script.

The script supports two configuration approaches:

- Command-line interface (CLI)
- Configuration files:

  - Environment INI [Experimental]
  - Slurm template [Experimental]

Command-line Interface (CLI)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CLI allows the user to submit a remote SLURM job from their local machine, which will
spin up the requested number of DALiuGE Island and Node Managers and run the graph.

The minimal requirements for submitting a job via the command-line are:

- The facility (e.g. Setonix, Hyades, Galaxy)
- The graph (either logical or physical, but not both)
- Whether the submission is remote or local
- The remote user account

All other options have defaults provided. Thus the most basic job submission will look like::

python create_dlg_job.py -a 1 -f setonix -L /path/to/graph/ArrayLoop.graph -U user_name

However, the defaults for job submissions will lead to limited use of the available resources (i.e. the number of nodes provisioned) and won't account for specific job durations. DALiuGE Translator options are also available, so it is possible to specify which partitioning algorithm is preferred. A more complete job submission that takes advantage of the SLURM and environment options will look something like::

python create_dlg_job.py -a 1 -n 32 -s 1 -t 60 -A pso -u -f setonix -L /path/to/graph/ArrayLoop.graph -v 4 --remote --submit -U user_name

This performs the following:

- Submits and runs a remote job (--remote --submit) on Pawsey's Setonix machine (-f setonix)
- Uses 1 Data Island Manager (-s 1) and requests 32 nodes (-n 32) for a job duration of 60 minutes (-t 60)
- Translates the Logical Graph (-L) using the PSO algorithm (-A pso)
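
Because --submit controls whether the generated SLURM script is actually queued (see the option listing below), a useful pattern is to first generate the script without submitting it and inspect it before queueing — a sketch using the same hypothetical paths and user name as above::

    # generate the submission script only; add --submit once it looks right
    python create_dlg_job.py -a 1 -n 32 -s 1 -t 60 -A pso -u -f setonix -L /path/to/graph/ArrayLoop.graph -v 4 --remote -U user_name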

Environment INI
~~~~~~~~~~~~~~~~~~~~~
TBC

SLURM Template
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TBC

Complete command-line options
-----------------------------

Help output::

create_dlg_job.py -a [1|2] -f <facility> [options]

create_dlg_job.py -h for further help

Options:
-h, --help show this help message and exit
-a ACTION, --action=ACTION
1 - create/submit job, 2 - analyse log
-l LOG_ROOT, --log-root=LOG_ROOT
The root directory of the log file
-d LOG_DIR, --log-dir=LOG_DIR
The directory of the log file for parsing
-L LOGICAL_GRAPH, --logical-graph=LOGICAL_GRAPH
The filename of the logical graph to deploy
-A ALGORITHM, --algorithm=ALGORITHM
The algorithm to be used for the translation
-O ALGORITHM_PARAMS, --algorithm-parameters=ALGORITHM_PARAMS
Parameters for the translation algorithm
-P PHYSICAL_GRAPH, --physical-graph=PHYSICAL_GRAPH
The filename of the physical graph (template) to
deploy
-t JOB_DUR, --job-dur=JOB_DUR
job duration in minutes
-n NUM_NODES, --num_nodes=NUM_NODES
number of compute nodes requested
-i, --visualise_graph
Whether to visualise graph (poll status)
-p, --run_proxy Whether to attach proxy server for real-time
monitoring
-m MON_HOST, --monitor_host=MON_HOST
Monitor host IP (optional)
-o MON_PORT, --monitor_port=MON_PORT
The port to bind DALiuGE monitor
-v VERBOSE_LEVEL, --verbose-level=VERBOSE_LEVEL
Verbosity level (1-3) of the DIM/NM logging
-c CSV_OUTPUT, --csvoutput=CSV_OUTPUT
CSV output file to keep the log analysis result
-z, --zerorun Generate a physical graph that takes no time to run
-y, --sleepncopy Whether include COPY in the default Component drop
-T MAX_THREADS, --max-threads=MAX_THREADS
Max thread pool size used for executing drops. 0
(default) means no pool.
-s NUM_ISLANDS, --num_islands=NUM_ISLANDS
The number of Data Islands
-u, --all_nics Listen on all NICs for a node manager
-S, --check_with_session
Check for node managers' availability by
creating/destroy a session
-f FACILITY, --facility=FACILITY
The facility for which to create a submission job
Valid options: ['galaxy_mwa', 'galaxy_askap',
'magnus', 'galaxy', 'setonix', 'shao', 'hyades',
'ood', 'ood_cloud']
--submit If set to False, the job is not submitted, but the
script is generated
--remote If set to True, the job is submitted/created for a
remote submission
-D DLG_ROOT, --dlg_root=DLG_ROOT
Overwrite the DLG_ROOT directory provided by the
config
-C, --configs Display the available configurations and exit
-U USERNAME, --username=USERNAME
Remote username, if different from local

Experimental Options:
Caution: These are not properly tested and likely to be rough around
the edges.

--config_file=CONFIG_FILE
Use INI configuration file.
--slurm_template=SLURM_TEMPLATE
Use SLURM template file for job submission. WARNING:
Using this command will over-write other job-
parameters passed here.

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -26,7 +26,7 @@ and is performed by the `DIA team <http://www.icrar.org/our-research/data-intens
running
basics
architecture/index
deployment
deployment/overview
graph_development
development/dev_index
usage/index
