CARAML

Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark for assessing mainstream Computer Vision and Natural Language Processing workloads on novel accelerators. It is developed and tested on systems of the Jülich Supercomputing Centre (JSC).

The CARAML benchmark is automated and kept compact with the help of JUBE, a scripting-based framework for creating benchmark sets, running them on different computer systems and evaluating the results. The benchmarks are additionally supplemented with power/energy measurement using jpwr.

By using JUBE, CARAML provides an easy and reproducible way to benchmark different systems and model configurations with minimal effort.

Tested Accelerators:

CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI and WEST-AI nodes. These systems include the following accelerators:

  • AMD MI200 node with 4 $\times$ MI250 GPUs (tag: MI250)
  • Graphcore IPU-POD4 M2000 with 4 $\times$ GC200 IPUs (tag: GC200)
  • NVIDIA Ampere node (SXM) with 4 $\times$ A100 GPUs (tag: A100)
  • NVIDIA Hopper node (PCIe) with 4 $\times$ H100 GPUs (tag: H100)
  • NVIDIA Hopper node (NVLink) with 4 $\times$ H100 GPUs (tag: WAIH100)
  • NVIDIA Grace-Hopper chip with 1 $\times$ GH200 GPU (tag: GH200)
  • NVIDIA Grace-Hopper node with 4 $\times$ GH200 GPUs (tag: JEDI)

Benchmark

CARAML currently offers two benchmarks written in Python: ResNet50 and LLM-Training. Both are described under Implementation.

Requirements

To run the benchmark, JUBE must be installed; refer to the JUBE Installation Documentation. The containers are deployed using Apptainer images and SLURM on the accelerators.
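Before launching a run, it can help to confirm that the required tools are reachable. The following is a minimal sketch, assuming the executables are called `jube`, `apptainer` and `sbatch` on the target system; these names and the helper `find_missing_tools` are assumptions for illustration and may differ per site.

```python
import shutil


def find_missing_tools(tools=("jube", "apptainer", "sbatch")):
    """Return the names in `tools` that cannot be found on PATH.

    The default tool names are assumptions; adjust them to the local setup.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

On systems that provide software through environment modules, the relevant modules probably need to be loaded before this check succeeds.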

Dataset

For ResNet50, either download the ImageNet LSVRC 2012 dataset from the source or via Kaggle (disk space required: 144 GB), or pass the tag synthetic to JUBE to run the benchmark with synthetic data.

For LLM training, a subset (790 samples, 10 MB) of the small version of the OSCAR dataset, pre-processed using GPT-2 tokenizers, is provided in llm_data.

Implementation

ResNet50

The JUBE file resnet50_benchmark.xml sets up the environment by

Performance is measured in images/sec and energy in Wh.
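To make the energy unit concrete: 1 Wh equals 3600 J, so energy in Wh is power in watts integrated over time in seconds, divided by 3600. The sketch below integrates timestamped power samples with the trapezoidal rule. It only illustrates the unit conversion and is not a description of how jpwr computes energy; the function name `energy_wh` and the sample layout are hypothetical.

```python
def energy_wh(timestamps_s, power_w):
    """Integrate power samples (W) over time (s) and return energy in Wh.

    Uses the trapezoidal rule between consecutive samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


if __name__ == "__main__":
    # Constant 300 W over 60 s, sampled every 30 s (made-up values).
    print(energy_wh([0, 30, 60], [300, 300, 300]))
```

For the made-up samples in the example, a constant 300 W draw over 60 s corresponds to 18000 J, which is 5 Wh.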

LLM-Training

The JUBE files llm_benchmark_nvidia_amd.yaml and llm_benchmark_ipu.yaml set up the environment by

Performance is measured in tokens/sec and energy in Wh.
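Throughput in tokens/sec can be related to the batch size and the time per iteration. The following is a minimal sketch, assuming every sample in the global batch contributes a fixed number of tokens. The parameter `sequence_length` is hypothetical, its value is not given in this README, and the benchmark may compute the reported figure differently.

```python
def tokens_per_second(global_batch_size, sequence_length, time_per_iteration_s):
    """Estimate tokens processed per second for one training iteration.

    Assumes each of the `global_batch_size` samples holds exactly
    `sequence_length` tokens (an illustrative assumption).
    """
    if time_per_iteration_s <= 0:
        raise ValueError("time per iteration must be positive")
    return global_batch_size * sequence_length / time_per_iteration_s


if __name__ == "__main__":
    # Made-up values: 32 samples of 2048 tokens, 0.5 s per iteration.
    print(tokens_per_second(32, 2048, 0.5))
```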

Execution

Clone this repository and cd into it:

git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML

ResNet50

Set the required system and model parameters and the path to the downloaded ImageNet data in resnet50_benchmark.xml.

  • To pull the required container, use the container tag:

    • NVIDIA A100 and H100 GPUs
    jube run resnet50/resnet50_benchmark.xml --tag container H100
    • NVIDIA GH200 and JEDI GPUs
    jube run resnet50/resnet50_benchmark.xml --tag container GH200
    • AMD MI250
    jube run resnet50/resnet50_benchmark.xml --tag container MI250
    • Graphcore GC200
    jube run resnet50/resnet50_benchmark.xml --tag container GC200
  • To run the benchmark with the defined configurations, run

    jube run resnet50/resnet50_benchmark.xml --tag A100

    OR with synthetic data

    jube run resnet50/resnet50_benchmark.xml --tag A100 synthetic

    A100 can be replaced with H100, WAIH100, GH200, JEDI, MI250 and GC200 for the respective systems.

  • After the benchmark has been executed, use jube continue to postprocess results

    jube continue resnet50/resnet50_benchmark_run -i last
  • After postprocessing, print the result with

    jube result resnet50/resnet50_benchmark_run -i last
  • Example result

    JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
    13077565,MI250,2024.01,dc-mi200,54.71,resnet50_v2,ImageNet,1,8,8,4,64,8,2107.00,CARAML/resnet50/resnet50_benchmark_run/000004/000002_combine_energy/work/combined_energy.csv
    
    JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
    13082568,GC200,2024.01,dc-ipu,1.0,resnet50_mlperf_pod4_bs20,ImageNet,1,4,1,12,32,8,3556.18,CARAML/resnet50/resnet50_benchmark_run/000000/000000_execute/work/GC200_power.0.energy.csv
    
    JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
    13080521,H100,2024.01,dc-h100,89.67,resnet50_v2,ImageNet,1,4,4,4,32,8,1994.69,CARAML/resnet50/resnet50_benchmark_run/000000/000001_combine_energy/work/combined_energy.csv
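The result output shown above is comma-separated, so it can be post-processed with a few lines of Python. The sketch below normalizes throughput by device count, assuming the column names `System`, `Devices` and `Images/second` exactly as they appear in the example. The helper `throughput_per_device` is hypothetical, and the two rows are copied from the example result.

```python
import csv
import io

# Two rows copied from the example result, under a single header line.
EXAMPLE_RESULT = """\
JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
13077565,MI250,2024.01,dc-mi200,54.71,resnet50_v2,ImageNet,1,8,8,4,64,8,2107.00,CARAML/resnet50/resnet50_benchmark_run/000004/000002_combine_energy/work/combined_energy.csv
13080521,H100,2024.01,dc-h100,89.67,resnet50_v2,ImageNet,1,4,4,4,32,8,1994.69,CARAML/resnet50/resnet50_benchmark_run/000000/000001_combine_energy/work/combined_energy.csv
"""


def throughput_per_device(csv_text):
    """Map each system name to its images/second divided by the device count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["System"]: float(row["Images/second"]) / int(row["Devices"])
        for row in reader
    }


if __name__ == "__main__":
    for system, value in throughput_per_device(EXAMPLE_RESULT).items():
        print(f"{system}: {value:.2f} images/second per device")
```

Dividing by the `Devices` column is only one possible normalization; the runs in the example use different batch sizes, so the per-device numbers are not a like-for-like comparison.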

LLM-Training

Set the required system and model parameters in llm_benchmark_nvidia_amd.yaml for NVIDIA and AMD devices and in llm_benchmark_ipu.yaml for Graphcore.

  • To pull the required container and build packages, use the container tag:

    • NVIDIA A100 and H100 GPUs
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container H100
    • NVIDIA GH200 and JEDI GPUs
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container GH200
    • AMD MI250
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container MI250
    • Graphcore GC200
    jube run llm_training/llm_benchmark_ipu.yaml --tag container
  • To run the benchmark with defined configurations for 800M GPT model with OSCAR data do:

    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100

    A100 can be replaced with H100, WAIH100, GH200, JEDI and MI250 for the respective systems, and 800M can be replaced with 13B or 175B on systems with more node resources, such as JEDI, H100, A100 and MI250.

  • To run the benchmark with defined configurations for 117M GPT model on Graphcore with synthetic data do

    jube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic

    If the synthetic tag is not given, the benchmark uses OSCAR data.

  • After the benchmark has been executed, use jube continue to postprocess results

    jube continue llm_training/llm_benchmark_nvidia_amd_run -i last

    OR

    jube continue llm_training/llm_benchmark_ipu_run -i last
  • After postprocessing, print the result with

    jube result llm_training/llm_benchmark_nvidia_amd_run -i last

    OR

    jube result llm_training/llm_benchmark_ipu_run -i last
  • Example result

    JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
    13077019,MI250,2024.01,dc-mi200,00:15:00,10,GPT,800M,OSCAR,1,8,32,1,1,8,750,0.74,88620.76,69.35,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000006/000002_combine_energy/work/combined_energy.csv

    JobID,System,Version,Queue,JobTime,Model,ModelSize,Dataset,Nodes,Devices,DataParallel,IPU/replica,GlobalBatchSize,Time/iteration(s),Tokens/second,EnergyFile
    13011841,GC200,2024.01,dc-ipu,00:40:00,GPT,117M,Synthetic,1,4,1,4,2048,11.17,183.37,CARAML/llm_training/llm_benchmark_ipu_run/000003/000002_combine_energy/work/combined_energy.csv

    JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
    3914,JEDI,2024.01,all,00:34:00,30,GPT,800M,OSCAR,1,4,2048,1,1,4,25,26.52,158152.80,321.65,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000025/000002_combine_energy/work/combined_energy.csv
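The run, continue and result steps follow the same pattern for both benchmarks, so the three command lines can be generated from the benchmark file and the tags. The sketch below only builds the argument lists and does not execute anything. It assumes the run directory is named after the benchmark file with its extension replaced by `_run`, as in the commands above; the helper `jube_workflow` is hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def jube_workflow(benchmark_file, tags):
    """Return the jube run, continue and result commands as argument lists."""
    path = PurePosixPath(benchmark_file)
    # Assumption: the run directory is "<file stem>_run" next to the file,
    # matching the paths used in this README.
    run_dir = str(path.with_name(path.stem + "_run"))
    return [
        ["jube", "run", str(path), "--tag", *tags],
        ["jube", "continue", run_dir, "-i", "last"],
        ["jube", "result", run_dir, "-i", "last"],
    ]


if __name__ == "__main__":
    commands = jube_workflow(
        "llm_training/llm_benchmark_nvidia_amd.yaml", ["800M", "A100"]
    )
    for command in commands:
        print(shlex.join(command))
```

The continue step is only meaningful once the submitted job has finished, so a wrapper that actually executes these commands would need to wait for the job in between.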

JSC Specific Fixes

To use the PyTorch torchrun API on JSC systems, the fixed_torch_run.py fix is required. It solves the issue described here.

Additionally, an i is appended to the hostname to allow communication over InfiniBand, as described here.
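The hostname adjustment can be sketched as a small helper. This is an illustration under assumptions, not the code used by the benchmark: the function `infiniband_hostname` and the node names are hypothetical, and keeping a domain suffix after the appended i is a design choice made here for the example.

```python
def infiniband_hostname(hostname):
    """Append 'i' to the short host name, keeping any domain suffix unchanged."""
    short, dot, domain = hostname.partition(".")
    return f"{short}i{dot}{domain}"


if __name__ == "__main__":
    # Hypothetical node names, used only for illustration.
    print(infiniband_hostname("node0001"))
    print(infiniband_hostname("node0001.cluster"))
```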