push from prod #98

Merged: 52 commits, Dec 27, 2024

Commits
7b3c15a
Merge pull request #74 from slaclab/main
YemBot Oct 17, 2024
4506629
Merge pull request #75 from slaclab/main
YemBot Oct 17, 2024
0032245
Merge branch 'main' into prod
ac6y Oct 25, 2024
82b0e48
Merge branch 'main' into prod
ac6y Oct 25, 2024
afa69f1
Update accounts-and-access.md
pjpascual Nov 1, 2024
aa6b7b8
Merge pull request #79 from slaclab/accounts-and-access-updates
yee379 Nov 1, 2024
4e1d8df
Merge pull request #80 from slaclab/main
yee379 Nov 1, 2024
b5abb33
Update reference.md
lnakata Nov 11, 2024
d93f2eb
Update README.md
YemBot Nov 14, 2024
0ecefc7
Merge pull request #82 from slaclab/main
YemBot Nov 14, 2024
59d288a
Update README.md
YemBot Nov 15, 2024
01b6fd5
Merge pull request #83 from slaclab/main
YemBot Nov 15, 2024
15e5b14
Merge pull request #81 from lnakata/patch-28
pjpascual Nov 15, 2024
b7c1cb9
Update README.md
pav511 Nov 21, 2024
e32912d
Update README.md
pav511 Nov 21, 2024
1a68260
Update README.md
pav511 Nov 21, 2024
2250a3f
Update README.md
pav511 Nov 21, 2024
bce0411
Update README.md
pav511 Nov 21, 2024
2a675dd
Update README.md
pav511 Nov 22, 2024
9b711ad
Update README.md
pav511 Nov 22, 2024
bbb3791
Update README.md
pav511 Nov 22, 2024
b33947e
Update README.md
pav511 Nov 22, 2024
0df81f1
Update README.md
pav511 Nov 22, 2024
3fa0e2c
Update README.md
pav511 Nov 22, 2024
2a542dc
Update changelog.md
YemBot Nov 22, 2024
ffbad64
Update changelog.md
YemBot Nov 22, 2024
558bf20
maintenance notification
pav511 Nov 22, 2024
dc6ef34
Update changelog.md
slac-jonl Nov 25, 2024
d2534d6
Merge pull request #86 from slac-jonl/patch-1
pjpascual Nov 25, 2024
55a83e1
Merge pull request #87 from slaclab/main
pjpascual Nov 25, 2024
6398923
Update changelog.md
slac-jonl Nov 26, 2024
baf4409
Merge pull request #88 from slac-jonl/patch-2
pjpascual Nov 29, 2024
1a9bfd8
Merge pull request #89 from slaclab/main
pjpascual Nov 29, 2024
e56534b
Update README.md
YemBot Dec 4, 2024
2514155
Update changelog.md
YemBot Dec 4, 2024
ca210e2
Merge pull request #90 from slaclab/main
YemBot Dec 5, 2024
9b33c17
Update contact-us.md
YemBot Dec 9, 2024
79ad720
Merge pull request #91 from slaclab/main
YemBot Dec 10, 2024
c670e56
chore: rewording of text
yee379 Dec 10, 2024
77d5782
chore: remove old ondemand reference
yee379 Dec 10, 2024
c77b567
chore: make ondemand more obvious
yee379 Dec 10, 2024
ec93e6e
chore: fix slurm faq link
yee379 Dec 10, 2024
4a52eef
chore: add space
yee379 Dec 10, 2024
f42e036
Update changelog.md
lnakata Dec 11, 2024
6a390f9
Merge pull request #92 from lnakata/patch-29
yee379 Dec 11, 2024
17c68e0
Merge pull request #93 from slaclab/prod
ac6y Dec 11, 2024
c7f37bc
Merge pull request #94 from slaclab/main
yee379 Dec 11, 2024
7940414
Update changelog.md
YemBot Dec 12, 2024
b86bf64
Update README.md
YemBot Dec 12, 2024
11308d9
Merge pull request #95 from slaclab/main
YemBot Dec 12, 2024
9435278
Update README.md
YemBot Dec 26, 2024
0e5fbca
Merge pull request #97 from slaclab/main
YemBot Dec 26, 2024
README.md (1 change: 1 addition, 0 deletions)
@@ -6,6 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for
data analytics and is characterized by large, massive throughput, high
concurrency storage systems.

**December 26th 8:00am PST: ALL S3DF services are currently DOWN/unavailable. We are investigating and will provide an update later today.**

## Quick Reference

accounts-and-access.md (52 changes: 28 additions, 24 deletions)
@@ -3,37 +3,41 @@
## How to get an account :id=access

If you are a SLAC employee, affiliated researcher, or experimental
facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC Unix account. The legacy SDF 1.0 environment requires a SLAC Active Directory account. They are not the same password system.***


1. If you don't already have a SLAC UNIX account (that allowed logins to the rhel6-64 and centos7 clusters), you'll need to get one by following these instructions. **If you already have one, skip to step 2**:
* Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration)
* Take Cyber 100 training via the [SLAC training portal](http://training.slac.stanford.edu/web-training.asp)
* Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID, and your preferred account name (and second choice).
2. Enable the SLAC UNIX account into S3DF:
* Log into [coact](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account and follow the instructions to enable your account in S3DF. If the account creation process fails for any reason, we'll let you know. Otherwise, you can assume your account will be enabled within 1 hour.

?> In some cases, e.g. for Rubin and LCLS, you may want to ask your
facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC UNIX account. The legacy SDF 1.0 environment required a SLAC Active Directory (Windows) account. These are not the same authentication system.***


1. If you don't already have a SLAC UNIX account (the credentials used to log in to SLAC UNIX clusters such as `rhel6-64` and `centos7`), you will need to acquire one by following these instructions. **If you already have an active SLAC UNIX account, skip to step 2**:
* Affiliated users/experimental facility users: Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) form (SLAC employees should already have a SLAC ID number).
* Take the appropriate cybersecurity SLAC training course via the [SLAC training portal](https://slactraining.slac.stanford.edu/how-access-the-web-training-portal):
* All lab users and non-SLAC/Stanford employees: "CS100: Cyber Security for Laboratory Users Training".
* All SLAC/Stanford employees or term employees of SLAC or the University: "CS200: Cyber Security Training for Employees".
* Depending on role, you may be required to take additional cybersecurity training. Consult with your supervisor or SLAC Point of Contact (POC) for more details.
* Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID and your preferred account name (include a second choice in case your preferred username is unavailable).
2. Register your SLAC UNIX account in S3DF:
* Log into the [Coact S3DF User Portal](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account via the "Log in with S3DF (unix)" option.
* Click on "Repos" in the menu bar.
* Click the "Request Access to Facility" button and select a facility from the dropdown.
* Include your affiliation and other contextual information for your request in the "Notes" field, then submit.
* A czar for the S3DF facility you requested access to will review your request. **Once approved by a facility czar**, the registration process should be completed in about 1 hour.

?> To access files and folders in facilities such as Rubin and LCLS, you will need to ask your
SLAC POC to add your username to the [POSIX
group](contact-us.md#facpoc) that manages access to your facility's
storage space. This is needed because S3DF is not the source of truth
for SLAC POSIX groups. S3DF is working with SLAC IT to deploy a
centralized database that will grant S3DF the ability to modify group
membership.
group](contact-us.md#facpoc) that manages access to that facility's
storage space. In the future, access to facility storage will be part of the S3DF registration process in Coact.
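
As a quick way to confirm that a POSIX group change has propagated (a minimal sketch; `rubin_users` is only an example taken from the facility table in contact-us.md, so substitute your facility's group), you can list your memberships from any S3DF shell:

```bash
# List every POSIX group the current account belongs to
id -nG "$USER"

# Check one specific facility group, e.g. rubin_users (example name from
# the contact-us.md table -- replace it with your facility's group)
if id -nG "$USER" | tr ' ' '\n' | grep -qx 'rubin_users'; then
    echo "group membership is active"
else
    echo "not a member yet -- ask your SLAC POC to add you"
fi
```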


?> SLAC is currently working on providing federated access to SLAC
resources so that you will be able to authenticate with your home
institution's account as opposed to your SLAC account. We expect
federated authentication to be available in late 2024.
?> SLAC IT is currently working on providing federated access to SLAC
resources, which will enable authentication to SLAC computing systems
with a user's home institution account rather than a SLAC account.
Federated authentication is expected to be available in late 2024.

## Managing your UNIX account password

You can change your password yourself via [this password update site](https://unix-password.slac.stanford.edu/)
You can change your password via [the SLAC UNIX self-service password update site](https://unix-password.slac.stanford.edu/).

If you've forgotten your password and you want to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support)
If you have forgotten your password and need to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support).

Make sure you comply with SLAC training and cyber requirements to avoid getting your account disabled. You will be notified of these requirements via email.
Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email.


## How to connect
@@ -68,6 +72,6 @@ use applications like Jupyter, you can also launch a web-based
terminal using OnDemand:\
[`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand).\
You can find more information about using OnDemand in the [OnDemand
reference](reference.md#ondemand).
reference](interactive-compute.md#ondemand).
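
For a plain terminal session outside OnDemand, the usual pattern is to SSH to an S3DF login bastion and then to an interactive node. This is only a sketch: the `s3dflogin.slac.stanford.edu` hostname is an assumption not confirmed by this page, and `iana` is the interactive pool named in the changelog, so check both against the connection instructions above before use:

```bash
# Hop through the S3DF login bastion, then onto an interactive node.
# Both hostnames are assumptions -- confirm them against the
# "How to connect" instructions before relying on this.
ssh your-unix-username@s3dflogin.slac.stanford.edu
ssh iana   # interactive pool
```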

![S3DF users access](assets/S3DF_users_access.png)
batch-compute.md (4 changes: 2 additions, 2 deletions)
@@ -8,7 +8,7 @@ that the compute resources available in S3DF are fairly and
efficiently shared and distributed for all users. This page describes
S3DF specific Slurm information. If you haven't used Slurm before, you
can find general information on using this workflow manager in our
[Slurm reference FAQ](reference.md#slurm-daq).
[Slurm reference FAQ](reference.md#slurm-faq).

## Clusters & Repos

@@ -84,7 +84,7 @@ cluster/partition.
| milano | Milan 7713 | 120 | 480 GB | - | - | 300 GB | 136 |
| ampere | Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 |
| turing | Intel Xeon Gold 5118 | 40 (hyperthreaded) | 160 GB | NVIDIA GeForce 2080Ti | 10 | 300 GB | 27 |
| ada | AMD EPYC 9454 | 72(hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
| ada | AMD EPYC 9454 | 72 (hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
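
As an illustration of targeting one of the partitions above, a minimal Slurm batch script might look like the sketch below. The partition name comes from the table, the `--account` value is a placeholder for the repo granted to you during registration, and the resource requests are arbitrary examples:

```bash
#!/bin/bash
#SBATCH --partition=milano        # any partition from the table above
#SBATCH --account=your_repo_here  # placeholder: the repo granted at registration
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

# Report where the job landed; replace this with your real workload
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

Submit the script with `sbatch example.sh` and monitor it with `squeue -u $USER`.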

### Banking

changelog.md (42 changes: 42 additions, 0 deletions)
@@ -1,5 +1,44 @@
# Status & Outages

## Support during Winter Shutdown

S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue as detailed below.

**Contacting S3DF staff for issues:**
Users should email [email protected] for ALL issues (critical and non-critical), providing full details of the problem (including what resources were being used, the impact, and any other information that may be useful in resolving the issue).
We will post status updates to the #comp-sdf Slack channel as critical issues are being worked on.
[This S3DF status web-page](https://s3df.slac.stanford.edu/#/changelog) will also have any updates on current issues.
If critical issues are not responded to within 2 hours of being reported, please contact your [Facility Czar](https://s3df.slac.stanford.edu/#/contact-us) for escalation.

**Critical issues** will be responded to as we become aware of them, except during Dec 24-25 and Dec 31-Jan 1, when they will be handled as soon as possible depending on staff availability.
* Critical issues are defined as full (a system-wide) outages that impact:
* Access to S3DF resources including
* All SSH logins
* All IANA interactive resources
* B50 compute resources(*)
* Bullet Cluster
* Access to all of the S3DF storage
* Home directories
* Group, Data and Scratch filesystems
* B50 Lustre, GPFS and NFS storage(*)
* Batch system access to S3DF Compute resources
* S3DF Kubernetes vClusters
* VMware clusters
* S3DF virtual machines
* B50 virtual machines(*)
* Critical issues for other SCS-managed systems and services that support experiments will be managed in conjunction with the experiment as appropriate. This includes:
* LCLS workflows
* Rubin USDF resources
* CryoEM workflows
* Fermi workflows
(*) B50 resources are also dependent on SLAC-IT resources being available.

**Non-critical issues** will be responded to in the order they were received in the ticketing system when normal operations resume after the Winter Shutdown. Non-critical issues include:
* Individual node-outages in the compute or interactive pool
* Variable or unexpected performance issues for compute, storage or networking resources.
* Batch job errors (that do not impact overall batch system scheduling)
* Tape restores and data transfer issues

## Outages

### Current
@@ -10,6 +49,9 @@

|When |Duration | What |
| --- | --- | --- |
|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the Slurm controller, the database, and the client components on all batch nodes, Kubernetes nodes, and interactive nodes. |
|Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
|Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions.
|Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins
|Jul 10 2024 |4 days (planned)| Urgent electrical maintenance is required in SRCF datacenter
contact-us.md (35 changes: 22 additions, 13 deletions)
@@ -21,21 +21,30 @@ S3DF and you don't see your facility in this table.

|Facility | PoC | Primary POSIX group|
|--- |--- |--- |
|Rubin | Richard Dubois | rubin_users |
|Rubin | James Chiang, Adam Bolton | rubin_users |
|SuperCDMS | Concetta Cartaro | cdms |
|LCLS | [email protected] | ps-users |
|MLI| Daniel Ratner | - |
|Neutrino| Kazuhiro Terao | - |
|AD | Greg White | - |
|MLI| Daniel Ratner | mli |
|Neutrino| Kazuhiro Terao | nu |
|AD | Greg White | cd |
|SUNCAT | Johannes Voss| suncat-norm |
|Fermi | Richard Dubois| glast-pipeline |
|Fermi | Seth Digel, Nicola Omodei| glast-pipeline |
|EPPTheory | Tom Rizzo | theorygrp |
|FACET | Nathan Majernik | - |
|DESC | Tom Glanzman | desc |
|KIPAC | Stuart Marshall | ki |
|FACET | Nathan Majernik | facet |
|DESC | Heather Kelly | desc |
|KIPAC | Marcelo Alvarez | ki |
|RFAR | David Bizzozero | rfar |
|SIMES | Tom Devereaux | - |
|CryoEM | Yee Ting Li | - |
|SSRL | Riti Sarangi | - |
|LDMX | Omar Moreno | - |
|HPS | Omar Moreno | - |
|SIMES | Tom Devereaux, Brian Moritz | simes |
|CryoEM | Patrick Pascual | cryo-data |
|SSRL | Riti Sarangi | ssrl |
|LDMX | Omar Moreno | ldmx |
|HPS | Mathew Graham | hps |
|EXO | Brian Mong | exo |
|ATLAS | Wei Yang, Michael Kagan | atlas |
|CDS | Ernest Williams | cds |
|SRS | Tony Johnson | srs |
|FADERS | Ryan Herbst | faders |
|TOPAS | Joseph Perl | topas |
|RP | Thomas Frosio | esh-rp |
|Projects | Yemi Adesanya, Ryan Herbst | - |
|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - |