From afa69f14ef0b3c2b26dd8ad13e1b5944f02a142d Mon Sep 17 00:00:00 2001 From: pjpascual Date: Fri, 1 Nov 2024 15:05:13 -0700 Subject: [PATCH 01/31] Update accounts-and-access.md Clarified steps for S3DF user registration and expected turnaround time. --- accounts-and-access.md | 50 +++++++++++++++++++++++------------------- 1 file changed, 27 insertions(+), 23 deletions(-) diff --git a/accounts-and-access.md b/accounts-and-access.md index df7b5fc..8b4f507 100644 --- a/accounts-and-access.md +++ b/accounts-and-access.md @@ -3,37 +3,41 @@ ## How to get an account :id=access If you are a SLAC employee, affiliated researcher, or experimental -facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC Unix account. The legacy SDF 1.0 environment requires a SLAC Active Directory account. They are not the same password system.*** - - -1. If you don't already have a SLAC UNIX account (that allowed logins to the rhel6-64 and centos7 clusters), you'll need to get one by following these instructions. **If you already have one, skip to step 2**: - * Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) - * Take Cyber 100 training via the [SLAC training portal](http://training.slac.stanford.edu/web-training.asp) - * Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID, and your preferred account name (and second choice). -2. Enable the SLAC UNIX account into S3DF: - * Log into [coact](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account and follow the instructions to enable your account in S3DF. If the account creation process fails for any reason, we'll let you know. Otherwise, you can assume your account will be enabled within 1 hour. - -?> In some cases, e.g. for Rubin and LCLS, you may want to ask your +facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC UNIX account. The legacy SDF 1.0 environment required a SLAC Active Directory (Windows) account. These are not the same authentication system.*** + + +1. If you don't already have a SLAC UNIX account (the credentials used to log in to SLAC UNIX clusters such as `rhel6-64` and `centos7`), you will need to acquire one by following these instructions. **If you already have an active SLAC UNIX account, skip to step 2**: + * Affiliated users/experimental facility users: Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) form (SLAC employees should already have a SLAC ID number). + * Take the appropriate cybersecurity SLAC training course via the [SLAC training portal](https://slactraining.slac.stanford.edu/how-access-the-web-training-portal): + * All lab users and non-SLAC/Stanford employees: "CS100: Cyber Security for Laboratory Users Training". + * All SLAC/Stanford employees or term employees of SLAC or the University: "CS200: Cyber Security Training for Employees". + * Depending on role, you may be required to take additional cybersecurity training. Consult with your supervisor or SLAC Point of Contact (POC) for more details. + * Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. 
In your request indicate your SLAC ID and your preferred account name (include a second choice in case your preferred username is unavailable). +2. Register your SLAC UNIX account in S3DF: + * Log into the [Coact S3DF User Portal](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account via the "Log in with S3DF (unix)" option. + * Click on "Repos" in the menu bar. + * Click the "Request Access to Facility" button and select a facility from the dropdown. + * Include your affiliation and other contextual information for your request in the "Notes" field, then submit. + * A czar for the S3DF facility you requested access to will review your request. **Once approved by a facility czar**, the registration process should be completed in about 1 hour. + +?> To access files and folders in facilities such as Rubin and LCLS, you will need to ask your SLAC POC to add your username to the [POSIX -group](contact-us.md#facpoc) that manages access to your facility's -storage space. This is needed because S3DF is not the source of truth -for SLAC POSIX groups. S3DF is working with SLAC IT to deploy a -centralized database that will grant S3DF the ability to modify group -membership. +group](contact-us.md#facpoc) that manages access to that facility's +storage space. In the future, access to facility storage will be part of the S3DF registration process in Coact. -?> SLAC is currently working on providing federated access to SLAC -resources so that you will be able to authenticate with your home -institution's account as opposed to your SLAC account. We expect -federated authentication to be available in late 2024. +?> SLAC IT is currently working on providing federated access to SLAC +resources, which will enable authentication to SLAC computing systems +with a user's home institution account rather than a SLAC account. +Federated authentication is expected to be available in late 2024. ## Managing your UNIX account password -You can change your password yourself via [this password update site](https://unix-password.slac.stanford.edu/) +You can change your password via [the SLAC UNIX self-service password update site](https://unix-password.slac.stanford.edu/). -If you've forgotten your password and you want to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support) +If you have forgotten your password and need to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support). -Make sure you comply with SLAC training and cyber requirements to avoid getting your account disabled. You will be notified of these requirements via email. +Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email. ## How to connect From b5abb3303f58fa605832dad9771d2982813a3047 Mon Sep 17 00:00:00 2001 From: lnakata Date: Sun, 10 Nov 2024 23:04:19 -0800 Subject: [PATCH 02/31] Update reference.md Correct /fs/afs path to /afs since that's how it's currently mounted. --- reference.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/reference.md b/reference.md index cd62e8d..48f2d49 100644 --- a/reference.md +++ b/reference.md @@ -40,10 +40,7 @@ The SLAC-wide legacy file systems AFS, GPFS, and SDF Lustre will be mounted read-only, and only on the interactive pools, to enable the migration of legacy data to S3DF storage: -- AFS: `/fs/afs` The current plan is to use the afsnfs translator - since AFS ACLs do not map to POSIX anyway. 
Some experimentation is - underway to see what issues might exist in any potential transfer to - S3DF. +- AFS: For now, AFS is available read-only at the standard /afs path. Once AFS is retired, the current plan is to make a portion of it available read-only at the /fs/afs path. - GPFS: `/fs/gpfs` The current plan is to use NFS as the access method. Affected systems: ACD, ATLAS, CryoEM, DES + DarkSky, Fermi, From d93f2eb9eaef28025c1882882c0d20a1ec30101f Mon Sep 17 00:00:00 2001 From: YemBot Date: Thu, 14 Nov 2024 09:37:37 -0800 Subject: [PATCH 03/31] Update README.md --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 0cfd0f8..f8e3d1b 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,10 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. +****Thursday 11/14/24 09:35am: We are experiencing a SLAC IT Networking outage. It is impacting inbound internet connectivity to S3DF services. We are working with SLAC IT and we'll provide status updates**** + + + ## Quick Reference From 59d288a6d20c89f42d3ed95d578679bf19b5ece2 Mon Sep 17 00:00:00 2001 From: YemBot Date: Fri, 15 Nov 2024 08:38:50 -0800 Subject: [PATCH 04/31] Update README.md --- README.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/README.md b/README.md index f8e3d1b..190ae23 100644 --- a/README.md +++ b/README.md @@ -6,11 +6,6 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -****Thursday 11/14/24 09:35am: We are experiencing a SLAC IT Networking outage. It is impacting inbound internet connectivity to S3DF services. We are working with SLAC IT and we'll provide status updates**** - - - - ## Quick Reference | Access | Address | From b7c1cb92b66ed1cdf4f818fcb1cce82475c83a0a Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 13:53:26 -0800 Subject: [PATCH 05/31] Update README.md Scheduled maintenance for Slurm 2024-12-03 --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index 190ae23..60a45ba 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,12 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. +****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release. +Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu. 
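As a quick sketch of how to plan around the window (standard Slurm client commands; the window times below simply mirror the notice above), you can snapshot your queue beforehand and confirm accounting afterwards:

```bash
# Before 13:00 on Dec 3: list your own pending and running jobs
squeue -u $USER
# After the upgrade: verify that jobs which completed during the window were accounted for
sacct -S 2024-12-03T13:00 -E 2024-12-03T14:00 -u $USER --format=JobID,JobName,State,Elapsed
```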
+ + ## Quick Reference | Access | Address | From e32912db8d0730e899ab98443a575c3c123e4816 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 14:40:13 -0800 Subject: [PATCH 06/31] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 60a45ba..817db0f 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ concurrency storage systems. ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release. Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. -We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu. +We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** ## Quick Reference From 1a68260ea5d4c6e14e77ee58ff1b7cc37da38009 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 14:48:22 -0800 Subject: [PATCH 07/31] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 817db0f..9ff5747 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,7 @@ concurrency storage systems. ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release. + Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** From 2250a3fa3c39c8e4487b9af08ae55a8c5756a6e4 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 14:56:52 -0800 Subject: [PATCH 08/31] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9ff5747..c659396 100644 --- a/README.md +++ b/README.md @@ -7,9 +7,9 @@ data analytics and is characterized by large, massive throughput, high concurrency storage systems. 
****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd -On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release. +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** -Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** From bce04119c82e42e3690889cfc95475ba50fea583 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 15:08:06 -0800 Subject: [PATCH 09/31] Update README.md --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index 190ae23..c659396 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,13 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. +****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** + +****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +We recommend planning your work accordingly to minimize disruption. 
If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** + + ## Quick Reference | Access | Address | From 2a675dd9121693616278697e916cdf48b84dacd9 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 17:15:33 -0800 Subject: [PATCH 10/31] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index c659396..2995d37 100644 --- a/README.md +++ b/README.md @@ -6,11 +6,11 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd +> ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** ****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. -We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** +We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| ## Quick Reference From 9b711ad8f468960bc95da5b02cea7c06e4f2bcb9 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 17:15:52 -0800 Subject: [PATCH 11/31] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2995d37..03a8598 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ concurrency storage systems. > ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** -****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +> ****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. 
If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| From bbb3791c884e7c2703a0b1b5049c0f49b5a01881 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 17:17:21 -0800 Subject: [PATCH 12/31] Update README.md --- README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/README.md b/README.md index 03a8598..ed2fcff 100644 --- a/README.md +++ b/README.md @@ -7,9 +7,7 @@ data analytics and is characterized by large, massive throughput, high concurrency storage systems. > ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd -On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** - -> ****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| From b33947e9b1b7c474b993732f5460bf9e4e361189 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Thu, 21 Nov 2024 17:18:39 -0800 Subject: [PATCH 13/31] Update README.md --- README.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index c659396..9607d76 100644 --- a/README.md +++ b/README.md @@ -6,10 +6,8 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd -On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.**** - -****Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +> ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.**** From 0df81f1b67872d1521a0d436ffe8d051eeeff1c1 Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Fri, 22 Nov 2024 09:54:14 -0800 Subject: [PATCH 14/31] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ed2fcff..25cc4dc 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ data analytics and is characterized by large, massive throughput, high concurrency storage systems. > ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd -On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. +On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.
After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| From 3fa0e2cec44db4ea6e0db82771e191080e05070b Mon Sep 17 00:00:00 2001 From: pav511 <38131208+pav511@users.noreply.github.com> Date: Fri, 22 Nov 2024 13:11:17 -0800 Subject: [PATCH 15/31] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 25cc4dc..33b0109 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -> ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd +> ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd

On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.
After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| From 2a542dc7b642dcdb3a2313a59bed4015bc8acfc6 Mon Sep 17 00:00:00 2001 From: YemBot Date: Fri, 22 Nov 2024 13:50:25 -0800 Subject: [PATCH 16/31] Update changelog.md --- changelog.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/changelog.md b/changelog.md index 49ce815..f71144f 100644 --- a/changelog.md +++ b/changelog.md @@ -5,6 +5,9 @@ ### Current ### Upcoming +|When |Duration | What | +| --- | --- | --- | +| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) ***during*** the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. | ### Past From ffbad6441a3b573df014ebb4e0f0304cf27363fa Mon Sep 17 00:00:00 2001 From: YemBot Date: Fri, 22 Nov 2024 13:54:35 -0800 Subject: [PATCH 17/31] Update changelog.md --- changelog.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/changelog.md b/changelog.md index f71144f..73af274 100644 --- a/changelog.md +++ b/changelog.md @@ -7,7 +7,7 @@ ### Upcoming |When |Duration | What | | --- | --- | --- | -| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) ***during*** the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. | +| Dec 3 2024 | 1 hr (planned) | On 12/3/24 between between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) ***during*** the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. 
| ### Past From dc6ef34e636f1a301a6f79edbac811820f22624b Mon Sep 17 00:00:00 2001 From: slac-jonl Date: Mon, 25 Nov 2024 15:28:59 -0800 Subject: [PATCH 18/31] Update changelog.md updated current outages to include staas fs1 --- changelog.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/changelog.md b/changelog.md index 73af274..170a291 100644 --- a/changelog.md +++ b/changelog.md @@ -3,6 +3,9 @@ ## Outages ### Current +|When |Duration | What | +| --- | --- | --- | +| Nov 25 2024 | open | We are experiencing a partial outage of the /gpfs/slac/staas/fs1 file system as one disk array has gone offline. Work is ongoing. ### Upcoming |When |Duration | What | From 6398923d1d68deb0f0095fe9918b195c43c18c12 Mon Sep 17 00:00:00 2001 From: slac-jonl Date: Tue, 26 Nov 2024 15:31:27 -0800 Subject: [PATCH 19/31] Update changelog.md removed staas outage from Current heading --- changelog.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/changelog.md b/changelog.md index 170a291..73af274 100644 --- a/changelog.md +++ b/changelog.md @@ -3,9 +3,6 @@ ## Outages ### Current -|When |Duration | What | -| --- | --- | --- | -| Nov 25 2024 | open | We are experiencing a partial outage of the /gpfs/slac/staas/fs1 file system as one disk array has gone offline. Work is ongoing. ### Upcoming |When |Duration | What | From e56534b0ac5396e321e84415338ad3b813849462 Mon Sep 17 00:00:00 2001 From: YemBot Date: Wed, 4 Dec 2024 10:42:14 -0800 Subject: [PATCH 20/31] Update README.md --- README.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/README.md b/README.md index 33b0109..0cfd0f8 100644 --- a/README.md +++ b/README.md @@ -6,10 +6,6 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -> ****Scheduled Maintenance Notice: Slurm Upgrade on December 3rd

-On Tuesday December 3rd between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.
After the upgrade, all slurm commands should continue to run as expected. We shall be introducing support for pmix with this release, but generally this should be considered a patch release.

Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) /during/ the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. -We recommend planning your work accordingly to minimize disruption. If you have any questions or concerns, please contact s3df-help@slac.stanford.edu.****| - ## Quick Reference From 25141554a4c9b8ba02759b388c1bb05f5db4e65b Mon Sep 17 00:00:00 2001 From: YemBot Date: Wed, 4 Dec 2024 11:00:21 -0800 Subject: [PATCH 21/31] Update changelog.md --- changelog.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/changelog.md b/changelog.md index 73af274..a5d17c7 100644 --- a/changelog.md +++ b/changelog.md @@ -5,14 +5,12 @@ ### Current ### Upcoming -|When |Duration | What | -| --- | --- | --- | -| Dec 3 2024 | 1 hr (planned) | On 12/3/24 between between 13:00 and 14:00 PDT, we will perform a mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. Impact on Users: All job submissions (sbatch/srun/squeue) and slurm database queries (sacct) ***during*** the upgrade window will fail. Any jobs that are running before the upgrade window will continue to run. Completed jobs during the window will be accounted for after the system is back up. We recommend planning your work accordingly to minimize disruption. | ### Past |When |Duration | What | | --- | --- | --- | +| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes. |Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions. |Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins |Jul 10 2024 |4 days (planned)| Urgent electrical maintenance is required in SRCF datacenter From 9b33c17f43964dc753a746fb51c103e791978ed6 Mon Sep 17 00:00:00 2001 From: YemBot Date: Mon, 9 Dec 2024 15:47:57 -0800 Subject: [PATCH 22/31] Update contact-us.md --- contact-us.md | 35 ++++++++++++++++++++++------------- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/contact-us.md b/contact-us.md index 6467774..2594e15 100644 --- a/contact-us.md +++ b/contact-us.md @@ -21,21 +21,30 @@ S3DF and you don't see your facility in this table. 

 |Facility | PoC | Primary POSIX group|
 |--- |--- |--- |
-|Rubin | Richard Dubois | rubin_users |
+|Rubin | James Chiang, Adam Bolton | rubin_users |
 |SuperCDMS | Concetta Cartaro | cdms |
 |LCLS | pcds-datamgt-l@slac.stanford.edu | ps-users |
-|MLI| Daniel Ratner | - |
-|Neutrino| Kazuhiro Terao | - |
-|AD | Greg White | - |
+|MLI| Daniel Ratner | mli |
+|Neutrino| Kazuhiro Terao | nu |
+|AD | Greg White | cd |
 |SUNCAT | Johannes Voss| suncat-norm |
-|Fermi | Richard Dubois| glast-pipeline |
+|Fermi | Seth Digel, Nicola Omodei| glast-pipeline |
 |EPPTheory | Tom Rizzo | theorygrp |
-|FACET | Nathan Majernik | - |
-|DESC | Tom Glanzman | desc |
-|KIPAC | Stuart Marshall | ki |
+|FACET | Nathan Majernik | facet |
+|DESC | Heather Kelly | desc |
+|KIPAC | Marcelo Alvarez | ki |
 |RFAR | David Bizzozero | rfar |
-|SIMES | Tom Devereaux | - |
-|CryoEM | Yee Ting Li | - |
-|SSRL | Riti Sarangi | - |
-|LDMX | Omar Moreno | - |
-|HPS | Omar Moreno | - |
+|SIMES | Tom Devereaux, Brian Moritz | simes |
+|CryoEM | Patrick Pascual | cryo-data |
+|SSRL | Riti Sarangi | ssrl |
+|LDMX | Omar Moreno | ldmx |
+|HPS | Mathew Graham | hps |
+|EXO | Brian Mong | exo |
+|ATLAS | Wei Yang, Michael Kagan | atlas |
+|CDS | Ernest Williams | cds |
+|SRS | Tony Johnson | srs |
+|FADERS | Ryan Herbst | faders |
+|TOPAS | Joseph Perl | topas |
+|RP | Thomas Frosio | esh-rp |
+|Projects | Yemi Adesanya, Ryan Herbst | - |
+|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - |

From c670e5653d2fad68904665e6c8c52405e72f35a9 Mon Sep 17 00:00:00 2001
From: yee379
Date: Tue, 10 Dec 2024 09:53:30 -0800
Subject: [PATCH 23/31] chore: rewording of text

---
 interactive-compute.md | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/interactive-compute.md b/interactive-compute.md
index b8b135a..1355f3e 100644
--- a/interactive-compute.md
+++ b/interactive-compute.md
@@ -1,14 +1,14 @@
 # Interactive Compute
 
-## Terminal
+## Using A Terminal
 
-### Interactive Pools
+### Interactive Pools :id=interactive-pools
 
-Once you land on the login nodes, either via a terminal or via a NoMachine desktop, you will need to ssh to one of the interactive pools to access the data, build/debug your code, run simple analyses, or submit jobs to the [batch system](batch-compute.md). If your organization has acquired dedicated resources for the interactive pools, use them; otherwise, connect to the S3DF shared interactive pool.
+In order to access compute and storage resources in S3DF, you will need to log onto our interactive nodes. After logging in to our bastion hosts via an [ssh terminal session or NoMachine](accounts-and-access.md#how-to-connect), you will then need to ssh to one of the interactive pools to access the data, build/debug your code, run simple analyses, or submit jobs to the [batch system](batch-compute.md). If your organization has acquired dedicated resources for the interactive pools, use them; otherwise, you can connect to the S3DF shared interactive pool.
 
-?> Note: After log in into our bastion hosts with `ssh s3dflogin.slac.stanford.edu`, you will need to then also log into our interactive nodes to access batch compute and data. You can do this via `ssh <pool>` within your ssh session (same terminal) to get into the bastion hosts.
+?> Note: After logging in to our bastion hosts with `ssh s3dflogin.slac.stanford.edu`, you will then need to log into our interactive nodes to access batch compute and data. 
You can do this by running `ssh <pool>` from within the same terminal session once you are on the bastion host.
 
-The currently available pools are shown in the table below. (The facility can be any organization, program, project, or group that interfaces with S3DF to acquire resources.)
+The currently available pools are shown in the table below (the facility can be any organization, program, project, or group that interfaces with S3DF to acquire resources).
 
 |Pool name | Facility | Resources |
 | --- | --- | --- |
@@ -25,7 +25,7 @@ The currently available pools are shown in the table
 |neutrino | Neutrino | (points to iana) |
 |mli | MLI (ML Initiative) | (points to iana) |
 
-### Interactive session using Slurm
+### Interactive Compute Session Using Slurm
 
 Under some circumstances, for example if you need more, or different, resources than available in the interactive pools, you may want to run an interactive session using resources from the batch system. This can be achieved through the Slurm command srun:
 
 ```
 srun --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /bin/bash
 ```
 
-This will execute `/bin/bash` on a (scheduled) server in the Slurm partition `<partition>` (see [partition names](batch-compute.md#partitions-amp-accounts)), allocating a single CPU for one hour, charging the time to account `<account>` (you'll have to get this from whoever gave you access to S3DF), and launching a pseudo terminal (pty) where bash will run. See [batch banking](batch-compute.md#banking) to understand how your organization is charged (computing time, not money) when you use the batch system.
+This will execute `/bin/bash` on a (scheduled) server in the Slurm partition `<partition>` (see [partition names](batch-compute.md#partitions-amp-accounts)), allocating a single CPU for one hour, charging the time to account `<account>` (you'll have to get this information from whoever gave you access to S3DF), and launching a pseudo terminal (pty) where `/bin/bash` will run (hence giving you an interactive terminal). See [batch banking](batch-compute.md#banking) to understand how your organization is charged (computing time, not money) when you use the batch system.
 
 Note that when you 'exit' the interactive session, it will relinquish the resources for someone else to use. This also means that if your terminal is disconnected (you turn your laptop off, lose network connectivity, etc.), then the job will also terminate (similar to ssh).
 
-To support X11, add the "--x11" option:
+To support tunnelling X11 back to your computer, add the "--x11" option:
 ```
-srun --x11 --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /bin/bash
+srun --x11 --partition <partition> --account <account> -n 1 --time=01:00:00 --pty /usr/bin/xterm
 ```
 
-## Browser
+## Using A Browser
 
-Users can also access S3DF through Open [OnDemand](https://s3df.slac.stanford.edu/ondemand) via any (modern) browser. This solution is recommended for users who want to run Jupyter notebooks, or don't want to learn SLURM, or don't want to download a terminal or the NoMachine remote desktop on their system. After login, you can select which Jupyter image to run and which hardware resources to use (partition name and number of hours/cpu-cores/memory/gpu-cores). The partition can be the name of an interactive pool or the name of a SLURM partition. You can choose an interactive pool as partition if you want a long-running session requiring sporadic resources; otherwise slect a SLURM partition. 
Note that no GPUs are currently available on the interactive pools.
+Users can also access S3DF through [Open OnDemand](https://s3df.slac.stanford.edu/ondemand) via any (modern) browser. This solution is recommended for users who want to run Jupyter notebooks, or don't want to learn SLURM, or don't want to download a terminal or the NoMachine remote desktop on their system. After login, you can select which Jupyter image to run and which hardware resources to use (partition name and number of hours/cpu-cores/memory/gpu-cores). The partition can be the name of an interactive pool or the name of a SLURM partition. You can choose an interactive pool as partition if you want a long-running session requiring sporadic resources; otherwise select a SLURM partition. Note that no GPUs are currently available on the interactive pools.
 
-### Shell
+### Web Shell
 
-After login onto OnDemand, select the Clusters tab and then select the
-interactive pool from the pull down menu. This will allow you to
+After logging in to [OnDemand](https://s3df.slac.stanford.edu/ondemand), select the Clusters tab and then select the
+desired [interactive pool](#interactive-pools) from the pull-down menu. This will allow you to
 obtain a shell on the interactive pools without using a terminal.
+You can also obtain direct access to the [Interactive Analysis Web Shell](https://s3df.slac.stanford.edu/pun/sys/shell/ssh/iana.sdf.slac.stanford.edu) without needing to go through a separate bastion.
 
 ### Jupyter :id=jupyter
 
 We provide automatic tunnels through our [ondemand](https://openondemand.org/) proxy of [Jupyter](https://jupyter.org/) instances. This means that in order to run Jupyter kernels on S3DF, you do not need to setup a chain of SSH tunnels in order to show the Jupyter web instance.
 
+You can [launch a new juptyer session](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sys/slac-ood-jupyter/session_contexts/new) via the provided web form. You can choose to run your jupyter instance on either Batch nodes or the Interactive nodes via the 'Run on cluster type' dropdown. Note that with the former, you will need to select the appropriate cpu and memory resources in advance to run your notebook. Hoewver, for the latter, you will most likely be contending your jupyter resources against others who are also logged on to the interactive node.
 
 ### 'bring-your-own-Jupyter'
 
@@ -102,5 +106,6 @@ Fill the rest of the form as you would for any provided Jupyter Instance and cli
 
 #### Debugging your interactive session :id=debugging
 
-If you get an error while using your Jupyter instance, go to the [My Interactive sessions page](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sessions), identify the session you want to debu and click on the **Session ID** link. You can then *View* the `output.log` file to troubleshoot.
+If you get an error while using your Jupyter instance, go to the [My Interactive sessions page](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sessions), identify the session you want to debug and click on the **Session ID** link. You can then *View* the `output.log` file to troubleshoot. 
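+For example, a minimal sketch of following the same log from a terminal (the path below assumes the standard Open OnDemand layout for the `slac-ood-jupyter` app named in the URLs above; your actual Session ID will differ):
+
+```bash
+# OnDemand keeps batch-connect session data under ~/ondemand by default
+cd ~/ondemand/data/sys/dashboard/batch_connect/sys/slac-ood-jupyter/output/<session-id>
+tail -f output.log
+```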
+ From 77d578210694478f82dae7d40c3e4700cddc916c Mon Sep 17 00:00:00 2001 From: yee379 Date: Tue, 10 Dec 2024 10:02:22 -0800 Subject: [PATCH 24/31] chore: remove old ondemand reference --- accounts-and-access.md | 2 +- reference.md | 11 ----------- 2 files changed, 1 insertion(+), 12 deletions(-) diff --git a/accounts-and-access.md b/accounts-and-access.md index 8b4f507..7ccd4c4 100644 --- a/accounts-and-access.md +++ b/accounts-and-access.md @@ -72,6 +72,6 @@ use applications like Jupyter, you can also launch a web-based terminal using OnDemand:\ [`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand).\ You can find more information about using OnDemand in the [OnDemand -reference](reference.md#ondemand). +reference](interactive-compute.md#ondemand). ![S3DF users access](assets/S3DF_users_access.png) diff --git a/reference.md b/reference.md index 48f2d49..30410bb 100644 --- a/reference.md +++ b/reference.md @@ -11,17 +11,6 @@ In order to access S3DF, host is set to s3dfnx.slac.stanford.edu, port to 22, an ![NX-connection](assets/nx-connection.png) ![NX-session](assets/nx-session.png) -### OnDemand :ondemand - -[Open OnDemand](https://openondemand.org/) is a web-based terminal. As long as you keep your web browser open, or are not using your browsers private browsing feature, you should only need to authenticate again about once a day. - -?> __TODO__ more about module avail etc. - -We also provide common compilation tools... - -?> __TODO__ describe compilation tools etc. - - ## FAQ :faq From c77b56777fc57811465e45c33dad905ad2635505 Mon Sep 17 00:00:00 2001 From: yee379 Date: Tue, 10 Dec 2024 10:02:46 -0800 Subject: [PATCH 25/31] chore: make ondemand more obvious --- interactive-compute.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/interactive-compute.md b/interactive-compute.md index 1355f3e..1e1de5e 100644 --- a/interactive-compute.md +++ b/interactive-compute.md @@ -43,7 +43,7 @@ To support tunnelling X11 back to your computer, add the "--x11" option: srun --x11 --partition --account -n 1 --time=01:00:00 --pty /usr/bin/xterm ``` -## Using A Browser +## Using A Browser and OnDemand :id=ondemand Users can also access S3DF through [Open OnDemand](https://s3df.slac.stanford.edu/ondemand) via any (modern) browser. This solution is recommended for users who want to run Jupyter notebooks, or don't want to learn SLURM, or don't want to download a terminal or the NoMachine remote desktop on their system. After login, you can select which Jupyter image to run and which hardware resources to use (partition name and number of hours/cpu-cores/memory/gpu-cores). The partition can be the name of an interactive pool or the name of a SLURM partition. You can choose an interactive pool as partition if you want a long-running session requiring sporadic resources; otherwise slect a SLURM partition. Note that no GPUs are currently available on the interactive pools. @@ -60,7 +60,7 @@ You can also obtain direct access to the [Interactive Analysis Web Shell](https: We provide automatic tunnels through our [ondemand](https://openondemand.org/) proxy of [Jupyter](https://jupyter.org/) instances. This means that in order to run Jupyter kernels on S3DF, you do not need to setup a chain of SSH tunnels in order to show the Jupyter web instance. -You can [launch a new juptyer session](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sys/slac-ood-jupyter/session_contexts/new) via the provided web form. 
You can choose to run your jupyter instance on either Batch nodes or the Interactive nodes via the 'Run on cluster type' dropdown. Note that with the former, you will need to select the appropriate cpu and memory resources in advance to run your notebook. Hoewver, for the latter, you will most likely be contending your jupyter resources against others who are also logged on to the interactive node.
+You can [launch a new jupyter session via the provided web form](https://s3df.slac.stanford.edu/pun/sys/dashboard/batch_connect/sys/slac-ood-jupyter/session_contexts/new). You may choose to run your jupyter instance on either Batch nodes or the Interactive nodes via the **Run on cluster type** dropdown. Note that with the former, you will need to select the appropriate cpu and memory resources in advance to run your notebook. However, for the latter, you will most likely be contending for jupyter resources with others who are also logged on to the interactive node.
 
 ### 'bring-your-own-Jupyter'
 
From ec93e6ebee285e8cd9a3ab7487c7f7e58875451a Mon Sep 17 00:00:00 2001
From: yee379
Date: Tue, 10 Dec 2024 10:25:36 -0800
Subject: [PATCH 26/31] chore: fix slurm faq link

---
 batch-compute.md | 2 +-
 reference.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/batch-compute.md b/batch-compute.md
index 7d68086..f964de8 100644
--- a/batch-compute.md
+++ b/batch-compute.md
@@ -8,7 +8,7 @@ that the compute resources available in S3DF are fairly and efficiently
 shared and distributed for all users. This page describes S3DF
 specific Slurm information. If you haven't used Slurm before, you can
 find general information on using this workflow manager in our
-[Slurm reference FAQ](reference.md#slurm-daq).
+[Slurm reference FAQ](reference.md#slurm-faq).
 
 ## Clusters & Repos
 
diff --git a/reference.md b/reference.md
index 30410bb..46d64af 100644
--- a/reference.md
+++ b/reference.md
@@ -82,7 +82,7 @@ The documentation uses [docsify.js](https://docsify.js.org/) to render [markdown
 
 
 
-## Slurm FAQ :SlurmFAQ
+## Slurm FAQ :id=slurm-faq
 
 The official Slurm documentation can be found at the
 [SchedMD site](https://slurm.schedmd.com/documentation.html).
 
From 4a52eef8aea1d20de01464c65b7e36c133490790 Mon Sep 17 00:00:00 2001
From: yee379
Date: Tue, 10 Dec 2024 10:31:13 -0800
Subject: [PATCH 27/31] chore: add space

---
 batch-compute.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/batch-compute.md b/batch-compute.md
index f964de8..9aee351 100644
--- a/batch-compute.md
+++ b/batch-compute.md
@@ -84,7 +84,7 @@ cluster/partition. 
| milano | Milan 7713 | 120 | 480 GB | - | - | 300 GB | 136 |
| ampere | Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 |
| turing | Intel Xeon Gold 5118 | 40 (hyperthreaded) | 160 GB | NVIDIA GeForce 2080Ti | 10 | 300 GB | 27 |
-| ada | AMD EPYC 9454 | 72(hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
+| ada | AMD EPYC 9454 | 72 (hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |

 ### Banking

From f42e03684009ec01c643a90aece0bc04f6e7bef3 Mon Sep 17 00:00:00 2001
From: lnakata
Date: Tue, 10 Dec 2024 18:01:11 -0800
Subject: [PATCH 28/31] Update changelog.md

Added StaaS GPFS disk array outages
---
 changelog.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index a5d17c7..f9c6256 100644
--- a/changelog.md
+++ b/changelog.md
@@ -10,7 +10,9 @@
 
 |When |Duration | What |
 | --- | --- | --- |
+|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
 | Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.
+|Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
 |Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions.
 |Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins
 |Jul 10 2024 |4 days (planned)| Urgent electrical maintenance is required in SRCF datacenter

From 79404143cfa926ca022c539c8fcdc1ac1d19face Mon Sep 17 00:00:00 2001
From: YemBot
Date: Thu, 12 Dec 2024 10:24:52 -0800
Subject: [PATCH 29/31] Update changelog.md

---
 changelog.md | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/changelog.md b/changelog.md
index f9c6256..4976cd1 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,5 +1,44 @@
 # Status & Outages

## Support during Winter Shutdown

S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue as detailed below.

**Contacting S3DF staff for issues:**
Users should email s3df-help@slac.stanford.edu for ALL issues (critical and non-critical), providing full details of the problem (including what resources were being used, the impact, and any other information that may be useful in resolving the issue).
We will post status updates to the #comp-sdf Slack channel as critical issues are being worked on.
[This S3DF status web-page](https://s3df.slac.stanford.edu/#/changelog) will also carry updates on current issues.
If critical issues are not responded to within 2 hours of being reported, please contact your [Facility Czar](https://s3df.slac.stanford.edu/#/contact-us) for escalation.

**Critical issues** will be responded to as we become aware of them, except for the period of Dec 24-25 and Dec 31-Jan 1, which will be handled as soon as possible depending on staff availability. 
+* Critical issues are defined as full (a system-wide) outages that impact: + * Access to S3DF resources including + * All SSH logins + * All IANA interactive resources + * B50 compute resources(*) + * Bullet Cluster + * Access to all of the S3DF storage + * Home directories + * Group, Data and Scratch filesystems + * B50 Lustre, GPFS and NFS storage(*) + * Batch system access to S3DF Compute resources + * S3DF Kubernetes vClusters + * VMware clusters + * S3DF virtual machines + * B50 virtual machines(*) +* Critical issues for other SCS-managed systems and services for Experimental system support will be managed in conjunction with the experiment as appropriate. This includes + * LCLS workflows + * Rubin USDF resources + * CryoEM workflows + * Fermi workflows +(*) B50 resources are also dependent on SLAC-IT resources being available. + +**Non-critical issues** will be responded to in the order they were received in the ticketing system when normal operations resume after the Winter Shutdown. Non-critical issues include: + * Individual node-outages in the compute or interactive pool + * Variable or unexpected performance issues for compute, storage or networking resources. + * Batch job errors (that do not impact overall batch system scheduling) + * Tape restores and data transfer issues + ## Outages ### Current From b86bf64762c1fa1fc9db003d9d922212febea232 Mon Sep 17 00:00:00 2001 From: YemBot Date: Thu, 12 Dec 2024 10:35:38 -0800 Subject: [PATCH 30/31] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 0cfd0f8..cae0cb5 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. +**S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue. [Full details are here](https://s3df.slac.stanford.edu/#/changelog).** ## Quick Reference From 94352780a9077db03abf73541e5f7c6f39ce335c Mon Sep 17 00:00:00 2001 From: YemBot Date: Thu, 26 Dec 2024 09:52:23 -0800 Subject: [PATCH 31/31] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index cae0cb5..129088e 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems. -**S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue. [Full details are here](https://s3df.slac.stanford.edu/#/changelog).** +**December 26th 8:00am PST: ALL S3DF services are currently DOWN/unavailable. We are investigating and will provide an update later today.** ## Quick Reference
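As a minimal connectivity check while an outage is being investigated (a sketch based on the bastion and interactive hosts documented above; replace `<user>`, and note that pool hostnames vary by facility):

```bash
# Jump through the bastion to the shared interactive pool in one step
ssh -J <user>@s3dflogin.slac.stanford.edu <user>@iana.sdf.slac.stanford.edu
```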