Performance analysis of PyTorch
The PyTorch port to ROCm is under active development, especially with regard to performance. We are focusing our efforts on server-grade accelerators (MI25/MI60/...), but the following applies to all supported AMD hardware.
We supply a small microbenchmarking script for PyTorch training on ROCm. To use it, download micro_benchmarking_pytorch.py, fp16util.py, shufflenet.py, and shufflenet_v2.py into the same directory.
To execute:
python micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size>] [--iterations <number of iterations>] [--fp16 <0 or 1>] [--dataparallel|--distributed_dataparallel] [--device_ids <comma-separated list (no spaces) of 0-indexed GPU indices to run the dataparallel/distributed_dataparallel API on>]
Possible network names are: alexnet, densenet121, inception_v3, resnet50, resnet101, SqueezeNet, vgg16, etc.
The defaults are 10 training iterations, fp16 off (i.e., 0), and a batch size of 64.
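To script several runs, the command line above can be assembled programmatically. The following is a minimal sketch, not part of the benchmark itself: the helper name build_benchmark_command is hypothetical, and it assumes the script sits in the current working directory and accepts exactly the flags listed above, with the stated defaults.

```python
def build_benchmark_command(network, batch_size=64, iterations=10, fp16=0,
                            parallel_mode=None, device_ids=None):
    """Return the argv list for one benchmark run (hypothetical helper).

    Defaults mirror the documented ones: batch size 64, 10 iterations, fp16 off.
    """
    if fp16 not in (0, 1):
        raise ValueError("fp16 must be 0 or 1")
    cmd = [
        "python", "micro_benchmarking_pytorch.py",
        "--network", network,
        "--batch-size", str(batch_size),
        "--iterations", str(iterations),
        "--fp16", str(fp16),
    ]
    # The two parallel modes are mutually exclusive flags on the command line.
    if parallel_mode is not None:
        if parallel_mode not in ("dataparallel", "distributed_dataparallel"):
            raise ValueError("unknown parallel mode: " + parallel_mode)
        cmd.append("--" + parallel_mode)
    # --device_ids takes a comma-separated list with no spaces.
    if device_ids:
        cmd += ["--device_ids", ",".join(str(i) for i in device_ids)]
    return cmd


# Example: the command for a default resnet50 run.
example_cmd = build_benchmark_command("resnet50")
print(" ".join(example_cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the benchmark.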
--distributed_dataparallel spawns multiple sub-processes and adjusts world_size and rank accordingly. This option is supported on Python 3.6 only.
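To illustrate what adjusting world_size and rank means, the sketch below maps a --device_ids string to one worker description per GPU. It assumes one sub-process per listed device, with world_size equal to the number of devices and ranks assigned in list order; the function plan_workers is hypothetical and the benchmark script's actual logic may differ.

```python
def plan_workers(device_ids_arg):
    """Map a --device_ids string such as "0,1,3" to per-process settings.

    Assumption for illustration: one sub-process per GPU, ranks numbered
    in the order the devices are listed.
    """
    device_ids = [int(token) for token in device_ids_arg.split(",")]
    world_size = len(device_ids)
    return [
        {"rank": rank, "world_size": world_size, "device_id": device_id}
        for rank, device_id in enumerate(device_ids)
    ]


for worker in plan_workers("0,1,3"):
    print(worker)
```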
If performance on a specific card and/or model is found to be lacking, some gains can typically be made by tuning MIOpen. To do so, export MIOPEN_FIND_ENFORCE=3 prior to running the model. If untuned configurations are encountered, the run will take some time while MIOpen tunes them and writes the results to a local performance database. More information on this can be found in the MIOpen documentation.
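When the benchmark is launched from a Python driver rather than a shell, the same variable can be set in the child process environment. This is a minimal sketch: tuning_environment and run_tuned are hypothetical helper names, and only the variable name and value come from the text above.

```python
import os
import subprocess


def tuning_environment(base_env=None):
    """Return a copy of the environment with MIOpen tuning enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MIOPEN_FIND_ENFORCE"] = "3"  # value taken from the instructions above
    return env


def run_tuned(cmd):
    """Run a command (argv list) with MIOPEN_FIND_ENFORCE=3 set for the child only."""
    return subprocess.run(cmd, env=tuning_environment(), check=True)


# Example (not executed here): a tuned resnet50 run.
# run_tuned(["python", "micro_benchmarking_pytorch.py", "--network", "resnet50"])
```

Setting the variable on a copy keeps the parent process environment unchanged, so later untuned runs from the same driver are not affected.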