This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov.
The sheer quantity of resources available to learn AWS can quickly become overwhelming. NIH Cloud Lab’s goal is to make cloud easy and accessible for you, so that you can spend less time on administrative tasks and focus more on your research.
Use this repository to learn about how to use AWS by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with this jumpstart section. If you already have foundational knowledge of AWS and cloud, feel free to skip ahead to the tutorials section for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis.
- Getting Started
- Overview
- IAM
- Command Line Tools
- Amazon Marketplace
- Ingest and Store Data
- Virtual Machines in EC2
- Disk Images and Elastic File Storage
- SageMaker Notebooks and SageMaker Studio
- Creating a Conda Environment
- Managing Containers and Code Repositories
- Clusters (Batch, Kubernetes, SLURM)
- Billing and Benchmarking
- Cost Optimization
- Getting Support
- Additional Training
You can learn a lot of what is possible on AWS in the AWS Getting Started Tutorials Page and we recommend you go there and explore some of the tutorials on offer. Nonetheless, it can be hard to know where to start if you are new to the cloud. To help you, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore cloud-based research. For an alternative perspective, you can also check out Lynn Langit's AWS for Bioinformatics repo.
There are three primary ways you can run analyses using AWS: Virtual Machines, Jupyter Notebook instances, and Serverless services. We give a brief overview of each here and go into more detail in the sections below. Virtual machines are like desktop computers, but you access them through the cloud console and you get to pick the operating system and the specs, such as CPU and memory. In AWS, these virtual machines are called Elastic Compute Cloud, or EC2 for short. Jupyter Notebook instances are virtual machines preconfigured with Jupyter Lab. On AWS these are run through SageMaker, which is also AWS's ML/AI platform. You decide what kind of virtual machine you want to 'spin up' and then you can run Jupyter notebooks on that virtual machine. Finally, Serverless services let you run something (an analysis, an app, a website) without managing your own servers (VMs). There are still servers running somewhere; you just don't have to manage them. All you have to do is call a command that runs your analysis in the background, and then collect the outputs, usually from a storage bucket.
Identity and Access Management (IAM) is the service that controls your roles and access to all of AWS. Check out the AWS Getting Started Page for more details. In Cloud Lab you do not have full access to IAM, but you can create Roles and you can attach Permissions to those Roles. For example, you may need to grant your SageMaker Role extra permissions to access some AWS Service. To do this, go to IAM, then Roles. Search for `SageMaker` and select `AmazonSageMaker-ExecutionRoleXYZ`, where XYZ is your Role's unique identifier. Next, go to `Add Permissions` and attach policies as needed. See an example here.
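If you prefer the command line, the same kind of permission change can be made with the AWS CLI. Below is a minimal sketch; the role name is a placeholder (substitute the execution role shown in your IAM console) and the AmazonS3ReadOnlyAccess managed policy is just an example.

```bash
# Hypothetical example: attach the S3 read-only managed policy to a SageMaker execution role.
# The role name below is a placeholder -- substitute the ExecutionRole shown in your IAM console.
aws iam attach-role-policy \
    --role-name AmazonSageMaker-ExecutionRole-XYZ \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```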
Most tasks in AWS can be done without the command line, but the command line tools will generally make your life easier in the long run. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within a graphical user interface (GUI). The primary tool you will need is the AWS CLI, which will allow you to interact with instances or S3 buckets (see below) from your local terminal. Instructions for the CLI can be found here. If you are unable to install locally, you can use all the CLI commands from within EC2 and SageMaker instances, or from the Cloud Shell.
To configure the CLI, you will need to authenticate using access keys, which are unique strings that tell AWS that you are allowed to interact with the account. Within Cloud Lab, you will need to use Short Term Access Keys (STAKs). If you are an NIH user, the instructions for accessing these are found here. Short Term Access Keys differ from Long Term Access Keys in that they only work for a short period of time; once your time limit expires, you have to request new keys and authenticate again. If you do not work at the NIH but have a Cloud Lab account, you will not have access to STAKs and will need to use the AWS CLI within AWS (such as within a SageMaker Notebook or EC2 instance). If you have issues with a tutorial in this repository, email us at [email protected].
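As a sketch of one common pattern, the short-term key values (which come from your key page, not from this repo) can be exported as environment variables and then verified before you run anything else:

```bash
# Sketch of configuring the AWS CLI with short-term access keys -- values are placeholders.
# Paste the values provided with your short-term keys; they expire after the stated time limit.
export AWS_ACCESS_KEY_ID="<YOUR_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<YOUR_SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<YOUR_SESSION_TOKEN>"

# Confirm the credentials work before running other commands.
aws sts get-caller-identity
```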
If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using Amazon HealthOmics, which is a service for genome-aware storage, serverless workflow execution (using WDL, Nextflow, or CWL), and variant and annotation queries using Amazon Athena. Learn more by completing this AWS tutorial. For those who want to use other workflow managers, you can instead try the AWS Genomics CLI, which is a wrapper for genomics workflow managers and AWS Batch (a serverless computing cluster). See our docs on how to set up the Genomics CLI for Cloud Lab. Supported workflow engines include Toil, Cromwell, miniwdl, Nextflow, and Snakemake.
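As a rough sketch of the serverless pattern (the workflow ID, role ARN, and bucket below are placeholders, and your account must already have a HealthOmics workflow and a suitable service role; check `aws omics start-run help` for the full set of options):

```bash
# Hypothetical HealthOmics run -- IDs, ARNs, and bucket names are placeholders.
aws omics start-run \
    --workflow-id 1234567 \
    --role-arn arn:aws:iam::123456789012:role/MyOmicsServiceRole \
    --output-uri s3://my-results-bucket/runs/ \
    --name demo-run
```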
The AWS Marketplace is a platform similar to Amazon.com where you can search for and launch pre-configured solutions such as Machine Images. Examples of images you may launch include those with enhanced security (see the EC2 section) or ones optimized for various tasks like machine learning, platform-specific genomics, or accelerated genomics.
Amazon CodeWhisperer is an AI coding companion that helps accelerate development by providing code suggestions in real time, and it integrates with your integrated development environment (IDE). The tool is free for individual use; click the "Use CodeWhisperer for free" tab in the link provided for setup instructions. CodeWhisperer can be used in Visual Studio Code (VS Code), Amazon SageMaker Studio, JupyterLab, AWS Glue Studio, AWS Lambda, and AWS Cloud9.
Data can be stored in two places on the cloud: either in a cloud storage bucket, which on AWS is called Amazon Simple Storage Service (S3), or on an instance, which usually has Elastic Block Storage. Block storage is storage with a finite size (e.g., 200 GB) that is located on your virtual machine. S3 is object storage, meaning that you can put any type of object in S3, and it is scalable, so there is no upper limit on storage size. There is a 5 TB limit on individual items that you upload, so if you needed to upload a larger file, you would need to break it into smaller pieces.
In general, you want to keep your compute and storage separate, so you should aim to store data in S3 for access, copy only the data you need to a particular instance to run an analysis, then copy the results back to S3. In addition, the data on an instance is only available when the instance is running, whereas the data in S3 is always available and serves as a longer-term storage solution. Here is a great tutorial on how to use S3, and it is worth going through to learn how it all works. If you have files that you will use over and over, such as reference genomes or protein databases, consider attaching a disk to your instance, which allows you to keep your instance size smaller and pay less for EBS storage (see the next section).
We also wanted to give you a few other tips that may be helpful when it comes to moving and storing data. If your end goal is to move data to an S3 bucket, you can do that using the UI and clicking the `Upload` button, or you can use the CLI by typing `aws s3 cp <FILE> <s3://BUCKET>`. If you want to move a whole folder, then use the --recursive flag: `aws s3 cp <DIR> <s3://BUCKET> --recursive`. The same applies whether moving data from your local directory or from an EC2 instance. Likewise, you can move data from S3 back to your local machine or your EC2 instance with `aws s3 cp <s3://BUCKET/FILE> <DESTINATION/PATH>`. Finally, you can move data to an instance using scp; just make sure the instance is running. You can use a command like `scp -i 'key.pem' <FILE> <USERNAME>@<EC2_PUBLIC_DNS>:~/<PATH>`. SCP is an SSH tool for copying local data to a remote server. Once the data is on the VM, it is a good idea to use `aws s3 cp` to move it to S3. If you are trying to move data from the Sequence Read Archive (SRA) to an instance, or to S3, you can use the SRA Toolkit. Follow our SRA Toolkit tutorial for best practices.
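For example, a minimal sketch of that SRA-to-S3 pattern might look like the following; the accession and bucket name are placeholders, and this assumes the SRA Toolkit is already installed on the instance.

```bash
# Pull a run from SRA onto the instance, convert it to FASTQ, then push to S3.
# The accession (SRR000001) and bucket name are placeholders.
prefetch SRR000001
fasterq-dump SRR000001 --outdir fastq/
aws s3 cp fastq/ s3://my-bucket/fastq/ --recursive
```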
There is some strategy to managing storage costs as well. Once you have spun up a VM, you have already paid for its storage, since you are charged for the size of the disk, whereas S3 storage is charged based on how much data you put in your buckets. This is something to think about when copying results files back to S3, for example. If they are not files you will need later, leave them on the VM's block storage and save your S3 spending for more important data. Just make sure you delete the VM when you are finished with it.
Virtual machines (VMs) on AWS are called Amazon Elastic Compute Cloud (EC2) and are like virtual computers that you access via SSH and which start as (nearly) completely blank slates. You have complete control over the VM configuration beginning with the operating system. You can choose a variety of Linux flavors, as well as macOS and Windows. Virtual Machines are organized into machine families with different functions, such as General Purpose, Compute Optimized, Accelerated Computing etc. You can also select machines with graphics processing units (GPUs), which run very quickly for some use cases, but also can cost more than most of the CPU machines. Billing occurs on a per second basis, and larger and faster machine types cost more per second. This is why it is important to stop or delete machines when not in use to minimize costs, and consider always using an idle shutdown script.
Many great resources exist on how to spin up, connect to, and work on a VM in AWS. The first place to direct you is the tutorial created by the NIH Common Data Fund. This tutorial expects that you will launch an instance and work with it interactively. Here is the Amazon documentation for the different ways to connect to an EC2 instance. NIH staff will be able to connect from their local terminal via SSH or in the browser via Session Manager. If you are an NIH-affiliated researcher, you will only be able to use the Session Manager. We wrote a guide with screenshots that walks through the SSH options.
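If you prefer the CLI over the console, a minimal launch-and-connect sketch looks like the following; the AMI ID, key pair, security group, and DNS name are placeholders, and NIH-affiliated researchers connecting through Session Manager can skip the SSH step.

```bash
# Launch a small instance -- the AMI ID, key pair, and security group are placeholders.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.medium \
    --key-name my-keypair \
    --security-group-ids sg-0123456789abcdef0

# Once the instance is running, connect over SSH (the user and DNS name are placeholders).
ssh -i my-keypair.pem ec2-user@<EC2_PUBLIC_DNS>
```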
If you want to launch a Windows VM, check out this tutorial.
From a security perspective, we recommend that you use Center for Internet Security (CIS) Hardened VMs. These have security controls that meet the CIS benchmark for enhanced cloud security. To use these VMs, go to the AWS Marketplace > Discover Products, then search for `CIS Hardened` and choose the OS that meets your needs. Click `Continue to Subscribe` in the top right, then `Continue to Configuration` and set your configuration parameters. Finally, click `Continue to Launch`. Here you decide how to launch the Marketplace solution; we recommend `Launch from EC2`, although you are welcome to experiment with the other options. Now click `Launch` and walk through the usual EC2 launch parameters. Click `Launch` again, and then you can view the status of your VM on the EC2 Instances page.
If you need to scale your VM up or down (see Cost Optimization below), you can always change the machine type by clicking on the instance ID, then going to `Actions > Instance Settings > Change instance type`. The VM must be stopped to change the instance type.
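The same change can be scripted; here is a sketch with a placeholder instance ID and target type.

```bash
# Resize an instance from the CLI -- the instance ID and target type are placeholders.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"m5.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```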
Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to recycle data or installed programs for your next VM instead of having to reinvent the wheel. One solution to this issue is using disk (or machine) images, where you copy your existing virtual disk to an Amazon Machine Image, which can serve as a backup or can be used to launch a new instance with the programs and data from a previous instance. AWS also takes snapshots of your instances, and you can convert these to machine images from which you can launch a new instance that has the same configuration as the snapshot.
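For instance, here is a sketch of creating an image from a running instance; the instance ID, name, and description are placeholders.

```bash
# Create an AMI from an existing instance so it can be relaunched later.
# The instance ID, image name, and description are placeholders.
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "my-analysis-environment-backup" \
    --description "Snapshot of configured analysis VM"
```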
For some use cases, you will have large files that you use over and over, such as reference genomes or protein databases (such as for AlphaFold or ESMFold). It doesn't make sense to keep these stored on a VM or an AMI if that means paying for EBS storage; you will learn quickly that keeping EBS volumes around adds up costs. A better solution is to use elastic file systems that you can attach to VMs (in EC2 or Batch), allowing you to maintain much smaller root EBS storage (and save costs). The two best services for this solution are Amazon Elastic File System and Amazon FSx.
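As a sketch, an EFS file system can be mounted on an EC2 instance over NFS; the file system ID, region, and mount point below are placeholders, and the instance's security group must allow NFS traffic to the file system.

```bash
# Mount an EFS file system on an instance -- the file system ID and region are placeholders.
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 \
    fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs
```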
SageMaker is the AWS ML/AI development platform, as well as the hosted/managed Jupyter notebook platform. Notebooks are ideal for certain problems, particularly when doing a tutorial because you can mix code with instructions. They are also great for exploring your data or workflow one portion at a time, since the code gets broken up into little chunks that you can run one by one, which lends itself very well to most ML/AI problems. However, you can also open a terminal within Jupyter Lab, so you can switch between a traditional terminal and notebook interface. The notebook we are going to test here is inside this repo, but we are going to launch a SageMaker instance and then copy the notebook into AWS programmatically.
Follow our SageMaker Notebook guide to learn how to spin up an instance and walk through an example notebook focused on genome-wide association studies.
Amazon recently launched a new IDE environment called SageMaker Studio, which we recommend for Cloud Lab users. For a comprehensive workshop on SageMaker Studio, go to this on-demand workshop, which walks you through all the important elements of using SageMaker Studio. To launch Studio, you will need to first set up a Domain, which you can read more about here. Once launched, you can use the normal SageMaker notebook features, except that you can resize your VM on the fly. You can also execute a whole ML/AI pipeline, including training, deploying, and monitoring, and you have ready access to SageMaker JumpStart models for easy-to-deploy large language models. If you do try deploying one of these models and run into a quota limit, follow these instructions. You can also launch Foundation Models directly from a notebook via the main SageMaker menu on the left: `Jumpstart > Foundation Models > View Model > Open Notebook in Studio`. You do have to have a domain and user already created (see above).
Virtual environments allow you to manage package versions without having package conflicts. For example, if you needed Python 3 for one analysis, but Python 2.7 for another, you could create separate environments to use the two versions of Python. One of the most popular package managers used for creating virtual environments is the conda package manager.
Mamba is a reimplementation of conda written in C++ and runs much faster than legacy conda. Follow our guide to create a conda environment using Mamba in an EC2 or SageMaker instance.
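For example, a minimal sketch of creating and activating an environment (the environment name and packages are arbitrary examples):

```bash
# Create an environment with mamba -- the name and packages are just examples.
mamba create -n bioinfo -c conda-forge -c bioconda python=3.10 samtools bwa

# Activate it (use 'mamba activate' instead if your installation supports it).
conda activate bioinfo
```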
You can host containers within Amazon Elastic Container Registry. We outline how to build a container, push to Elastic Container Registry, and pull to a compute environment in our docs.
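A condensed sketch of that flow is shown below; the account ID, region, repository name, and image tag are all placeholders.

```bash
# Build a container, create an ECR repository, authenticate, tag, and push.
# The account ID (123456789012), region, and repository name are placeholders.
docker build -t my-tool:latest .
aws ecr create-repository --repository-name my-tool
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag my-tool:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tool:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-tool:latest
```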
Further, you can manage your git repositories within your AWS account using AWS CodeCommit. Here we outline how to create a repository, authenticate to it, then push and pull files using standard git commands.
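As a sketch (the repository name and region are placeholders, and this assumes you authenticate over HTTPS with the CodeCommit credential helper as described in the docs):

```bash
# Create a CodeCommit repository and clone it -- the name and region are placeholders.
aws codecommit create-repository --repository-name my-analysis-repo

# Configure git to use the CodeCommit credential helper for HTTPS access.
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-analysis-repo
```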
One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional cluster, you specify up front how many CPUs and how much memory you want to give to your job, and you may over- or under-utilize these resources. With managed resources like serverless services and clusters, you can leverage a feature called autoscaling, where the compute resources scale up or down with demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (think of workshop participants all submitting jobs to a cluster at the same time). For most users of Cloud Lab, the best way to leverage scaling is to use AWS Batch, but in some cases, perhaps for a whole lab group or a large project, it may make sense to spin up a Kubernetes cluster. Note that if you spin up resources in Batch, you will need to deactivate the compute environment (in Batch) and delete the autoscaling groups (in EC2) to avoid further charges. You can also spin up SLURM clusters using AWS ParallelCluster by following this guide, and you can automate SLURM environment provisioning using CloudFormation. This recipe library contains a variety of recipes that you can explore to automate cluster creation. View the GitHub link directly here.
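Once a compute environment, job queue, and job definition exist, submitting work to Batch is a single call; here is a sketch with placeholder names.

```bash
# Submit a job to an existing AWS Batch queue -- the job, queue, and definition names are placeholders.
aws batch submit-job \
    --job-name variant-calling-sample01 \
    --job-queue my-batch-queue \
    --job-definition my-job-definition
```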
Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of time and cost required for a larger scale project. Follow our Cost Management Guide to see how to tag specific resources for workflow benchmarking. You should also review the AWS Documentation on Billing and Cost Management.
In terms of cost, the best way to estimate costs is to start with the AWS Pricing Calculator here, a tool that forecasts costs based on the products and usage you specify. Then you can run some benchmarks and double-check that everything is behaving as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage to run the workflow, then you can extrapolate how much everything may cost using the calculator (e.g., EC2 + S3). You can also watch this helpful video from the AnVIL project to learn more about cloud costs.
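As a purely illustrative back-of-the-envelope extrapolation of that single-sample benchmark to 100 samples (the hourly and per-GB rates below are placeholders; always confirm current prices in the calculator):

```bash
# Hypothetical extrapolation for 100 samples (rates are placeholders -- confirm in the calculator):
# compute: 100 samples x 4 hours x $0.68/hr; storage: 100 x 30 GB x $0.023/GB-month
echo "100 * 4 * 0.68" | bc      # ~272  (dollars of instance time)
echo "100 * 30 * 0.023" | bc    # ~69   (dollars per month in S3)
```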
Follow our Cost Management Guide for details on how to monitor costs, set up budget alerts, and cost-benchmark specific analyses using resource tagging. In addition, here are a few tips to help you stay on budget.
- Configure auto-shutdown on your EC2 instances following this guide. This will prevent you from accidentally leaving instances running.
- Make sure you shut down other resources after you use them, and periodically 'clean up' your account. This can include S3 buckets, virtual machines/notebooks, Batch environments and Cloud Formation scripts. For Batch environments, you will also need to go to EC2 and delete the autoscaling groups (far bottom left option on the EC2 page).
- Use elastic file systems instead of paying for unnecessary EBS storage. Take a look at Amazon Elastic File System and Amazon FSx.
- Ensure that you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can see if they are all being utilized using CloudWatch. If you are only really using 8 CPUs for example, then just change your machine size to fit the analysis. You can also view our CPU optimization guide here.
- Explore using Spot Instances or Reserved Instances for running workflows; see the sketch after this list.
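For example, a spot request can be made directly in the launch call; here is a sketch with placeholder values (the AMI ID and key pair are not real, and spot capacity can be reclaimed by AWS, so it is best suited to fault-tolerant workflows).

```bash
# Launch an instance at spot pricing -- the AMI ID and key pair are placeholders.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type c5.2xlarge \
    --key-name my-keypair \
    --instance-market-options 'MarketType=spot'
```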
As part of your participation in Cloud Lab, you will be added to the Cloud Lab Teams channel, where you can chat with other Cloud Lab users and get support from the Cloud Lab team. NIH Intramural users can submit a support ticket to Service Now. For issues related to the cloud environment, feel free to request AWS Enterprise Support. For issues related to scientific use cases, such as how best to run an RNA-seq pipeline in AWS, email us at [email protected].
If you have a question about Quota Limits, visit our documentation on how to request a limit increase.
This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at the NIH, visit our website or contact the NIH STRIDES team at [email protected].