diff --git a/README.md b/README.md
index 0cfd0f8..129088e 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ and the Rubin observatory.
 The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
+**December 26th 8:00am PST: ALL S3DF services are currently DOWN/unavailable. We are investigating and will provide an update later today.**
 
 ## Quick Reference
diff --git a/accounts-and-access.md b/accounts-and-access.md
index df7b5fc..7ccd4c4 100644
--- a/accounts-and-access.md
+++ b/accounts-and-access.md
@@ -3,37 +3,41 @@ ## How to get an account :id=access
 
 If you are a SLAC employee, affiliated researcher, or experimental
-facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC Unix account. The legacy SDF 1.0 environment requires a SLAC Active Directory account. They are not the same password system.***
-
-
-1. If you don't already have a SLAC UNIX account (that allowed logins to the rhel6-64 and centos7 clusters), you'll need to get one by following these instructions. **If you already have one, skip to step 2**:
-   * Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration)
-   * Take Cyber 100 training via the [SLAC training portal](http://training.slac.stanford.edu/web-training.asp)
-   * Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID, and your preferred account name (and second choice).
-2. Enable the SLAC UNIX account into S3DF:
-   * Log into [coact](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account and follow the instructions to enable your account in S3DF. If the account creation process fails for any reason, we'll let you know. Otherwise, you can assume your account will be enabled within 1 hour.
-
-?> In some cases, e.g. for Rubin and LCLS, you may want to ask your
+facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC UNIX account. The legacy SDF 1.0 environment required a SLAC Active Directory (Windows) account. These are not the same authentication system.***
+
+
+1. If you don't already have a SLAC UNIX account (the credentials used to log in to SLAC UNIX clusters such as `rhel6-64` and `centos7`), you will need to acquire one by following these instructions. **If you already have an active SLAC UNIX account, skip to step 2**:
+    * Affiliated users/experimental facility users: Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) form (SLAC employees should already have a SLAC ID number).
+    * Take the appropriate SLAC cybersecurity training course via the [SLAC training portal](https://slactraining.slac.stanford.edu/how-access-the-web-training-portal):
+        * All lab users and non-SLAC/Stanford employees: "CS100: Cyber Security for Laboratory Users Training".
+        * All SLAC/Stanford employees or term employees of SLAC or the University: "CS200: Cyber Security Training for Employees".
+        * Depending on your role, you may be required to take additional cybersecurity training. Consult with your supervisor or SLAC Point of Contact (POC) for more details.
+    * Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID and your preferred account name (include a second choice in case your preferred username is unavailable).
+2. Register your SLAC UNIX account in S3DF:
+    * Log into the [Coact S3DF User Portal](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account via the "Log in with S3DF (unix)" option.
+    * Click on "Repos" in the menu bar.
+    * Click the "Request Access to Facility" button and select a facility from the dropdown.
+    * Include your affiliation and other contextual information for your request in the "Notes" field, then submit.
+    * A czar for the S3DF facility you requested access to will review your request. **Once approved by a facility czar**, the registration process should be completed in about 1 hour.
+
+?> To access files and folders in facilities such as Rubin and LCLS, you will need to ask your
 SLAC POC to add your username to the [POSIX
-group](contact-us.md#facpoc) that manages access to your facility's
-storage space. This is needed because S3DF is not the source of truth
-for SLAC POSIX groups. S3DF is working with SLAC IT to deploy a
-centralized database that will grant S3DF the ability to modify group
-membership.
+group](contact-us.md#facpoc) that manages access to that facility's
+storage space. In the future, access to facility storage will be part of the S3DF registration process in Coact.
 
-?> SLAC is currently working on providing federated access to SLAC
-resources so that you will be able to authenticate with your home
-institution's account as opposed to your SLAC account. We expect
-federated authentication to be available in late 2024.
+?> SLAC IT is currently working on providing federated access to SLAC
+resources, which will enable authentication to SLAC computing systems
+with a user's home institution account rather than a SLAC account.
+Federated authentication is expected to be available in late 2024.
 
 ## Managing your UNIX account password
 
-You can change your password yourself via [this password update site](https://unix-password.slac.stanford.edu/)
+You can change your password via [the SLAC UNIX self-service password update site](https://unix-password.slac.stanford.edu/).
 
-If you've forgotten your password and you want to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support)
+If you have forgotten your password and need to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support).
 
-Make sure you comply with SLAC training and cyber requirements to avoid getting your account disabled. You will be notified of these requirements via email.
+Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email.
 
 ## How to connect
 
@@ -68,6 +72,6 @@
 use applications like Jupyter, you can also launch a web-based terminal
 using OnDemand:\
 [`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand).\
 You can find more information about using OnDemand in the [OnDemand
-reference](reference.md#ondemand).
+reference](interactive-compute.md#ondemand).
 
 ![S3DF users access](assets/S3DF_users_access.png)
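+
+As a convenience, you can let SSH chain the two hops for you. The snippet below is only an illustrative sketch (the `~/.ssh/config` entries and the `iana` pool are examples, not an official requirement); substitute your own username and the interactive pool you normally use:
+
+```
+# ~/.ssh/config -- example only; adjust the user and pool name to your own
+Host s3dflogin
+    HostName s3dflogin.slac.stanford.edu
+    User your-unix-username
+
+Host iana
+    HostName iana.sdf.slac.stanford.edu
+    User your-unix-username
+    ProxyJump s3dflogin
+```
+
+With an entry like this in place, `ssh iana` from your own machine goes through the bastion automatically; you will still be asked to authenticate on each hop.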
diff --git a/batch-compute.md b/batch-compute.md
index 7d68086..9aee351 100644
--- a/batch-compute.md
+++ b/batch-compute.md
@@ -8,7 +8,7 @@
 that the compute resources available in S3DF are fairly and efficiently shared and distributed for all users. This page describes S3DF specific Slurm information.
 If you haven't used Slurm before, you can find general information on using this workflow manager in our
-[Slurm reference FAQ](reference.md#slurm-daq).
+[Slurm reference FAQ](reference.md#slurm-faq).
 
 ## Clusters & Repos
 
@@ -84,7 +84,7 @@ cluster/partition.
 | milano | Milan 7713 | 120 | 480 GB | - | - | 300 GB | 136 |
 | ampere | Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 |
 | turing | Intel Xeon Gold 5118 | 40 (hyperthreaded) | 160 GB | NVIDIA GeForce 2080Ti | 10 | 300 GB | 27 |
-| ada | AMD EPYC 9454 | 72(hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
+| ada | AMD EPYC 9454 | 72 (hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
 
 ### Banking
 
diff --git a/changelog.md b/changelog.md
index 49ce815..4976cd1 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,5 +1,44 @@
 # Status & Outages
 
+## Support during Winter Shutdown
+
+S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue as detailed below.
+
+**Contacting S3DF staff for issues:**
+Users should email s3df-help@slac.stanford.edu for ALL issues (critical and non-critical), providing full details of the problem (including what resources were being used, the impact, and other information that may be useful in resolving the issue).
+We will update the #comp-sdf Slack channel for critical issues as they are being worked on with status updates.
+[This S3DF status web page](https://s3df.slac.stanford.edu/#/changelog) will also have any updates on current issues.
+If critical issues are not responded to within 2 hours of reporting the issue, please contact your [Facility Czar](https://s3df.slac.stanford.edu/#/contact-us) for escalation.
+
+**Critical issues** will be responded to as we become aware of them, except for the periods of Dec 24-25 and Dec 31-Jan 1, which will be handled as soon as possible depending on staff availability.
+* Critical issues are defined as full (system-wide) outages that impact:
+  * Access to S3DF resources, including
+    * All SSH logins
+    * All IANA interactive resources
+    * B50 compute resources(*)
+    * Bullet Cluster
+  * Access to all of the S3DF storage
+    * Home directories
+    * Group, Data and Scratch filesystems
+    * B50 Lustre, GPFS and NFS storage(*)
+  * Batch system access to S3DF Compute resources
+  * S3DF Kubernetes vClusters
+  * VMware clusters
+    * S3DF virtual machines
+    * B50 virtual machines(*)
+* Critical issues for other SCS-managed systems and services for Experimental system support will be managed in conjunction with the experiment as appropriate. This includes:
+  * LCLS workflows
+  * Rubin USDF resources
+  * CryoEM workflows
+  * Fermi workflows
+
+(*) B50 resources are also dependent on SLAC-IT resources being available.
+
+**Non-critical issues** will be responded to in the order they were received in the ticketing system when normal operations resume after the Winter Shutdown. Non-critical issues include:
+  * Individual node outages in the compute or interactive pool
+  * Variable or unexpected performance issues for compute, storage or networking resources.
+  * Batch job errors (that do not impact overall batch system scheduling)
+  * Tape restores and data transfer issues
+
 ## Outages
 
 ### Current
 
@@ -10,6 +49,9 @@
 |When |Duration | What |
 | --- | --- | --- |
+|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
+| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the Slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.|
+|Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
 |Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions.
 |Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins
 |Jul 10 2024 |4 days (planned)| Urgent electrical maintenance is required in SRCF datacenter
diff --git a/contact-us.md b/contact-us.md
index 6467774..2594e15 100644
--- a/contact-us.md
+++ b/contact-us.md
@@ -21,21 +21,30 @@
 S3DF and you don't see your facility in this table.
 
 |Facility | PoC | Primary POSIX group|
 |--- |--- |--- |
-|Rubin | Richard Dubois | rubin_users |
+|Rubin | James Chiang, Adam Bolton | rubin_users |
 |SuperCDMS | Concetta Cartaro | cdms |
 |LCLS | pcds-datamgt-l@slac.stanford.edu | ps-users |
-|MLI| Daniel Ratner | - |
-|Neutrino| Kazuhiro Terao | - |
-|AD | Greg White | - |
+|MLI| Daniel Ratner | mli |
+|Neutrino| Kazuhiro Terao | nu |
+|AD | Greg White | cd |
 |SUNCAT | Johannes Voss| suncat-norm |
-|Fermi | Richard Dubois| glast-pipeline |
+|Fermi | Seth Digel, Nicola Omodei| glast-pipeline |
 |EPPTheory | Tom Rizzo | theorygrp |
-|FACET | Nathan Majernik | - |
-|DESC | Tom Glanzman | desc |
-|KIPAC | Stuart Marshall | ki |
+|FACET | Nathan Majernik | facet |
+|DESC | Heather Kelly | desc |
+|KIPAC | Marcelo Alvarez | ki |
 |RFAR | David Bizzozero | rfar |
-|SIMES | Tom Devereaux | - |
-|CryoEM | Yee Ting Li | - |
-|SSRL | Riti Sarangi | - |
-|LDMX | Omar Moreno | - |
-|HPS | Omar Moreno | - |
+|SIMES | Tom Devereaux, Brian Moritz | simes |
+|CryoEM | Patrick Pascual | cryo-data |
+|SSRL | Riti Sarangi | ssrl |
+|LDMX | Omar Moreno | ldmx |
+|HPS | Mathew Graham | hps |
+|EXO | Brian Mong | exo |
+|ATLAS | Wei Yang, Michael Kagan | atlas |
+|CDS | Ernest Williams | cds |
+|SRS | Tony Johnson | srs |
+|FADERS | Ryan Herbst | faders |
+|TOPAS | Joseph Perl | topas |
+|RP | Thomas Frosio | esh-rp |
+|Projects | Yemi Adesanya, Ryan Herbst | - |
+|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - |
diff --git a/interactive-compute.md b/interactive-compute.md
index b8b135a..1e1de5e 100644
--- a/interactive-compute.md
+++ b/interactive-compute.md
@@ -1,14 +1,14 @@
 # Interactive Compute
 
-## Terminal
+## Using A Terminal
 
-### Interactive Pools
+### Interactive Pools :id=interactive-pools
 
-Once you land on the login nodes, either via a terminal or via a NoMachine desktop, you will need to ssh to one of the interactive pools to access the data, build/debug your code, run simple analyses, or submit jobs to the [batch system](batch-compute.md). If your organization has acquired dedicated resources for the interactive pools, use them; otherwise, connect to the S3DF shared interactive pool.
+In order to access compute and storage resources in S3DF, you will need to log onto our interactive nodes. After logging in to our bastion hosts via an [SSH terminal session or NoMachine](accounts-and-access.md#how-to-connect), you will then need to ssh to one of the interactive pools to access the data, build/debug your code, run simple analyses, or submit jobs to the [batch system](batch-compute.md). If your organization has acquired dedicated resources for the interactive pools, use them; otherwise, you can connect to the S3DF shared interactive pool.
 
-?> Note: After log in into our bastion hosts with `ssh s3dflogin.slac.stanford.edu`, you will need to then also log into our interactive nodes to access batch compute and data. You can do this via `ssh <pool>` within your ssh session (same terminal) to get into the bastion hosts.
+?> Note: After logging in to our bastion hosts with `ssh s3dflogin.slac.stanford.edu`, you will then need to log into our interactive nodes to access batch compute and data. You can do this by running `ssh <pool>` (using one of the pool names listed below) from the same terminal session on the bastion host.
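+
+For example, a typical two-step login (shown here with the `iana` pool purely as an illustration; pick the pool appropriate for your facility from the table below) looks like:
+
+```
+# from your own machine: first hop to the bastion
+ssh your-unix-username@s3dflogin.slac.stanford.edu
+# then, from the bastion prompt, on to an interactive pool
+ssh iana
+```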
 
-The currently available pools are shown in the table below. (The facility can be any organization, program, project, or group that interfaces with S3DF to acquire resources.)
+The currently available pools are shown in the table below (the facility can be any organization, program, project, or group that interfaces with S3DF to acquire resources).
 
 |Pool name | Facility | Resources |
 | --- | --- | --- |
@@ -25,7 +25,7 @@
 |neutrino | Neutrino | (points to iana) |
 |mli | MLI (ML Initiative) | (points to iana) |
 
-### Interactive session using Slurm
+### Interactive Compute Session Using Slurm
 
@@ -33,30 +33,34 @@
 Under some circumstances, for example if you need more, or different, resources than available in the interactive pools, you may want to run an interactive session using resources from the batch system. This can be achieved through the Slurm command srun:
 
 ```
 srun --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /bin/bash
 ```
 
-This will execute `/bin/bash` on a (scheduled) server in the Slurm partition `<partition>` (see [partition names](batch-compute.md#partitions-amp-accounts)), allocating a single CPU for one hour, charging the time to account `<account>` (you'll have to get this from whoever gave you access to S3DF), and launching a pseudo terminal (pty) where bash will run. See [batch banking](batch-compute.md#banking) to understand how your organization is charged (computing time, not money) when you use the batch system.
+This will execute `/bin/bash` on a (scheduled) server in the Slurm partition `<partition>` (see [partition names](batch-compute.md#partitions-amp-accounts)), allocating a single CPU for one hour, charging the time to account `<account>` (you'll have to get this information from whoever gave you access to S3DF), and launching a pseudo terminal (pty) where `/bin/bash` will run (hence giving you an interactive terminal). See [batch banking](batch-compute.md#banking) to understand how your organization is charged (computing time, not money) when you use the batch system.
 
 Note that when you 'exit' the interactive session, it will relinquish the resources for someone else to use. This also means that if your terminal is disconnected (you turn your laptop off, loose network etc), then the job will also terminate (similar to ssh).
 
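+If you also need a GPU for interactive work, the same pattern applies. As a sketch (assuming your account has access to a GPU partition such as `ampere`), you can request a single GPU with Slurm's `--gpus` option:
+
+```
+srun --partition ampere --account <account> -n 1 --gpus 1 --time=01:00:00 --pty /bin/bash
+```
+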
-To support X11, add the "--x11" option:
+To support tunnelling X11 back to your computer, add the "--x11" option:
 
 ```
-srun --x11 --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /bin/bash
+srun --x11 --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /usr/bin/xterm
 ```
 
-## Browser
+## Using A Browser and OnDemand :id=ondemand
 
-Users can also access S3DF through Open [OnDemand](https://s3df.slac.stanford.edu/ondemand) via any (modern) browser. This solution is recommended for users who want to run Jupyter notebooks, or don't want to learn SLURM, or don't want to download a terminal or the NoMachine remote desktop on their system. After login, you can select which Jupyter image to run and which hardware resources to use (partition name and number of hours/cpu-cores/memory/gpu-cores). The partition can be the name of an interactive pool or the name of a SLURM partition. You can choose an interactive pool as partition if you want a long-running session requiring sporadic resources; otherwise slect a SLURM partition. Note that no GPUs are currently available on the interactive pools.
+Users can also access S3DF through [Open OnDemand](https://s3df.slac.stanford.edu/ondemand) via any (modern) browser. This solution is recommended for users who want to run Jupyter notebooks, or don't want to learn SLURM, or don't want to download a terminal or the NoMachine remote desktop on their system. After login, you can select which Jupyter image to run and which hardware resources to use (partition name and number of hours/cpu-cores/memory/gpu-cores). The partition can be the name of an interactive pool or the name of a SLURM partition. You can choose an interactive pool as partition if you want a long-running session requiring sporadic resources; otherwise select a SLURM partition. Note that no GPUs are currently available on the interactive pools.
 
-### Shell
+### Web Shell
 
-After login onto OnDemand, select the Clusters tab and then select the
-interactive pool from the pull down menu. This will allow you to
+After logging in to [OnDemand](https://s3df.slac.stanford.edu/ondemand), select the Clusters tab and then select the
+desired [interactive pool](#interactive-pools) from the pull down menu. This will allow you to
 obtain a shell on the interactive pools without using a terminal.
 
+You can also obtain direct access to the [Interactive Analysis Web Shell](https://s3df.slac.stanford.edu/pun/sys/shell/ssh/iana.sdf.slac.stanford.edu) without needing to go through a separate bastion.
+
+
 ### Jupyter :id=jupyter
 
 We provide automatic tunnels through our [ondemand](https://openondemand.org/) proxy of [Jupyter](https://jupyter.org/) instances. This means that in order to run Jupyter kernels on S3DF, you do not need to setup a chain of SSH tunnels in order to show the Jupyter web instance.
 
+You can [launch a new Jupyter session via the provided web form](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sys/slac-ood-jupyter/session_contexts/new). You may choose to run your Jupyter instance on either the Batch nodes or the Interactive nodes via the **Run on cluster type** dropdown. Note that with the former, you will need to select the appropriate CPU and memory resources in advance to run your notebook. However, for the latter, you will most likely be contending for resources with other users who are also logged on to the interactive node.
 
 ### 'bring-your-own-Jupyter'
 
@@ -102,5 +106,6 @@ Fill the rest of the form as you would for any provided Jupyter Instance and cli
 
 #### Debugging your interactive session :id=debugging
 
-If you get an error while using your Jupyter instance, go to the [My Interactive sessions page](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sessions), identify the session you want to debu and click on the **Session ID** link. You can then *View* the `output.log` file to troubleshoot.
+If you get an error while using your Jupyter instance, go to the [My Interactive sessions page](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sessions), identify the session you want to debug and click on the **Session ID** link. You can then *View* the `output.log` file to troubleshoot.
+
diff --git a/reference.md b/reference.md
index cd62e8d..46d64af 100644
--- a/reference.md
+++ b/reference.md
@@ -11,17 +11,6 @@ In order to access S3DF, host is set to s3dfnx.slac.stanford.edu, port to 22, an
 
 ![NX-connection](assets/nx-connection.png)
 ![NX-session](assets/nx-session.png)
 
-### OnDemand :ondemand
-
-[Open OnDemand](https://openondemand.org/) is a web-based terminal. As long as you keep your web browser open, or are not using your browsers private browsing feature, you should only need to authenticate again about once a day.
-
-?> __TODO__ more about module avail etc.
-
-We also provide common compilation tools...
-
-?> __TODO__ describe compilation tools etc.
-
-
 
 ## FAQ :faq
 
@@ -40,10 +29,7 @@
 The SLAC-wide legacy file systems AFS, GPFS, and SDF Lustre will be mounted read-only, and only on the interactive pools, to enable the migration of legacy data to S3DF storage:
 
-- AFS: `/fs/afs` The current plan is to use the afsnfs translator
-  since AFS ACLs do not map to POSIX anyway. Some experimentation is
-  underway to see what issues might exist in any potential transfer to
-  S3DF.
+- AFS: For now, AFS is available read-only at the standard `/afs` path. Once AFS is retired, the current plan is to make a portion of it available read-only at the `/fs/afs` path.
 
 - GPFS: `/fs/gpfs` The current plan is to use NFS as the access method. Affected systems: ACD, ATLAS, CryoEM, DES + DarkSky, Fermi,
 
@@ -96,7 +82,7 @@ The documentation uses [docsify.js](https://docsify.js.org/) to render [markdown
 
-## Slurm FAQ :SlurmFAQ
+## Slurm FAQ :id=slurm-faq
 
 The official Slurm documentation can be found at the [SchedMD site](https://slurm.schedmd.com/documentation.html).
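+
+As a quick orientation, a minimal batch script looks like the sketch below (the partition and account are placeholders; use the values given to you by your facility and adjust the resource requests to your job):
+
+```
+#!/bin/bash
+#SBATCH --partition=milano
+#SBATCH --account=<account>
+#SBATCH --job-name=hello
+#SBATCH --output=hello-%j.out
+#SBATCH --ntasks=1
+#SBATCH --time=00:10:00
+
+echo "Hello from $(hostname)"
+```
+
+Save it as, say, `hello.sh`, submit it with `sbatch hello.sh`, and check its progress with `squeue -u $USER`.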