JAX vs. TF MLPerf Benchmark #4488

pranavsubramani · 2020-10-07T23:36:31Z


I recently came across these results: https://cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer and I was wondering why the runtimes are different between JAX and TensorFlow if they both use XLA under the hood.

I tried searching for documentation detailing the differences in the way they use XLA but came up short and was wondering if there was a fundamental difference between how JAX JIT-compiles programs to XLA versus how TensorFlow does it. In addition to this, I looked at certain XLA dumps of the same methods in JAX (with XLA) vs. TF (with XLA) and they appear to be fundamentally different.

I was hoping to get some more insight into this.

Answered by jekbradbury


hawkinsp · 2020-10-08T00:24:05Z

Maintainer

@jekbradbury

0 replies

jekbradbury · 2020-10-08T00:54:43Z


While JAX and TensorFlow both use XLA as their compiler on TPUs, there are many reasons why similar models implemented in JAX and TensorFlow might not end up with exactly the same XLA HLO: differences between the TF-XLA bridge and JAX-XLA translations, differences between layer implementations in TensorFlow/Keras and JAX neural network libraries like Flax, etc.

But in the MLPerf submissions, we made an effort to produce very similar XLA HLO from both TensorFlow and JAX, so that XLA optimizations could apply equally to both submissions. Instead, end-to-end timing differences between TF and JAX tended to come from a few different sources:

- differences in startup overhead (while MLPerf doesn't count things like model compilation time, it does count any startup that takes place after the program first touches data on disk)
- different evaluation approaches (e.g., the TF/Lingvo Transformer submission brings all sampled sentences to one machine and computes the BLEU score there, while the JAX Transformer submission computes the same BLEU score in a distributed way, saving a few hundred ms of evaluation time)
- differences in hyperparameters (MLPerf often leaves submitters with some freedom in hyperparameter choices; while we were mostly able to use the same hyperparameters as the TF submissions and achieve the same number of epochs to convergence, we ended up with fewer epochs needed for Transformer and slightly more needed for ResNet, possibly owing to tiny numerical differences between the model implementations)
- remaining device-side (XLA HLO) differences (e.g., we left about 1% of BERT performance on the table by not implementing a model-level optimization the TensorFlow submission included)
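
To illustrate the evaluation point: corpus-level BLEU depends only on additive counts (clipped n-gram matches, n-gram totals, and hypothesis/reference lengths), so each host can count locally and only the small count vectors need to be summed. The thread does not say how the JAX submission implemented this, so the following is a minimal pure-Python sketch of the general idea; the function names, the toy corpus, and the two-way sharding are all hypothetical.

```python
import math
from collections import Counter

MAX_N = 4  # standard BLEU uses 1-grams through 4-grams


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_stats(pairs):
    """Additive statistics for corpus BLEU over (hypothesis, reference) pairs."""
    stats = Counter()
    for hyp, ref in pairs:
        stats["hyp_len"] += len(hyp)
        stats["ref_len"] += len(ref)
        for n in range(1, MAX_N + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            # Counter & Counter is the elementwise minimum, i.e. clipped matches.
            stats[f"match_{n}"] += sum((hyp_ngrams & ref_ngrams).values())
            stats[f"total_{n}"] += sum(hyp_ngrams.values())
    return stats


def bleu_from_stats(stats):
    """Turn summed statistics into a BLEU score in [0, 1]."""
    if any(stats[f"match_{n}"] == 0 for n in range(1, MAX_N + 1)):
        return 0.0
    log_precision = sum(
        math.log(stats[f"match_{n}"] / stats[f"total_{n}"])
        for n in range(1, MAX_N + 1)
    ) / MAX_N
    # Brevity penalty in log space: 0 when the hypotheses are long enough.
    log_brevity = min(0.0, 1.0 - stats["ref_len"] / stats["hyp_len"])
    return math.exp(log_brevity + log_precision)


# Toy corpus (hypothetical data), split across two pretend hosts.
corpus = [
    ("the cat sat on the mat".split(), "the cat sat on the mat".split()),
    ("a quick brown fox jumps".split(), "the quick brown fox jumps high".split()),
    ("hello world again my friend".split(), "hello world again my dear friend".split()),
    ("it is raining outside today".split(), "it is raining outside today".split()),
]
shards = [corpus[:2], corpus[2:]]

# Centralized: gather every sentence pair in one place, then score.
central = bleu_from_stats(bleu_stats(corpus))

# Distributed: each shard counts locally; only the counts are summed.
distributed = bleu_from_stats(sum((bleu_stats(s) for s in shards), Counter()))

print(central == distributed)
```

On real accelerators the final sum would presumably be a cross-device reduction (for example `jax.lax.psum`) rather than a Python `sum`. Note that averaging per-shard BLEU scores would not give the same number; the counts have to be summed before the score is computed.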

4 replies

n2cholas Oct 8, 2020

Is there any documentation or code you could point to that would provide more details about how the TF-XLA and JAX-XLA translations work (and differ)? In particular, I'd be interested in understanding what tradeoffs both approaches take and in what circumstances one could expect TF to perform better than JAX and vice versa.

pranavsubramani Oct 8, 2020
Author

Just to add to @n2cholas's question, here's a small example in Colab: https://colab.research.google.com/drive/1VeM9161Cnzve15g2ih1NtUCGDhTuPcEN?usp=sharing. The only computation is a matrix-vector product with the same output, yet the XLA dumps appear to be different. Any insight into this would be highly appreciated.

jekbradbury Oct 11, 2020

Can you post the before_optimizations text file for both of them? I expect that they're different in a not-very-meaningful way.
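
For readers who want to produce these files themselves, here is a minimal sketch. It assumes XLA's `--xla_dump_to` and `--xla_dump_hlo_as_text` flags, passed through the `XLA_FLAGS` environment variable, and assumes the dumped file names contain `before_optimizations`. The helper names and the dump directory are hypothetical.

```python
from pathlib import Path


def xla_dump_flags(dump_dir):
    """Build an XLA_FLAGS value asking XLA to dump HLO as text into dump_dir."""
    return f"--xla_dump_to={dump_dir} --xla_dump_hlo_as_text"


def find_before_optimizations(dump_dir):
    """Return the pre-optimization HLO text files in dump_dir, sorted by name."""
    return sorted(Path(dump_dir).glob("*before_optimizations*.txt"))


# Typical use (hypothetical path). The variable must be set before JAX or
# TensorFlow initializes XLA, so do this at the very top of the script:
#
#   import os
#   os.environ["XLA_FLAGS"] = xla_dump_flags("/tmp/xla_dump")
#   ...import jax or tensorflow, run the compiled function once...
#   for path in find_before_optimizations("/tmp/xla_dump"):
#       print(path)
```

Running the JAX and TensorFlow versions with separate dump directories keeps the two sets of files apart for comparison.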

pranavsubramani Oct 14, 2020
Author

Here is the before_optimizations for JAX:

HloModule jit_loss.44

%jit_jvp_fn1_.7 (parameter.8: f32[5,5], parameter.9: f32[5], parameter.10: f32[5], parameter.11: pred[]) -> (f32[5], pred[], f32[5]) {
  %parameter.11 = pred[] parameter(3)
  %constant.12 = pred[] constant(false)
  %parameter.8 = f32[5,5]{1,0} parameter(0)
  %parameter.9 = f32[5]{0} parameter(1)
  %dot.13 = f32[5]{0} dot(f32[5,5]{1,0} %parameter.8, f32[5]{0} %parameter.9), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_type="dot_general" op_name="jit(loss)/jit(jvp(fn1))/dot_general[ dimension_numbers=(((1,), (0,)), ((), ()))\n                                     precision=None ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=3}
  %parameter.10 = f32[5]{0} parameter(2)
  %add.14 = f32[5]{0} add(f32[5]{0} %dot.13, f32[5]{0} %parameter.10), metadata={op_type="add" op_name="jit(loss)/jit(jvp(fn1))/add" source_file="<ipython-input-2-8e7c515249a1>" source_line=3}
  %constant.15 = pred[] constant(false)
  ROOT %tuple.16 = (f32[5]{0}, pred[], f32[5]{0}) tuple(f32[5]{0} %add.14, pred[] %constant.15, f32[5]{0} %parameter.9)
}

%primitive_computation_add.27 (parameter.28: f32[], parameter.29: f32[]) -> f32[] {
  %parameter.28 = f32[] parameter(0), metadata={op_type="add" op_name="add"}
  %parameter.29 = f32[] parameter(1), metadata={op_type="add" op_name="add"}
  ROOT %add.30 = f32[] add(f32[] %parameter.28, f32[] %parameter.29), metadata={op_type="add" op_name="add"}
}

%jit_transpose_jvp_fn1__.35 (parameter.36: f32[5], parameter.37: f32[5]) -> (f32[5,5]) {
  %constant.38 = pred[] constant(false)
  %parameter.37 = f32[5]{0} parameter(1)
  %parameter.36 = f32[5]{0} parameter(0)
  %dot.39 = f32[5,5]{1,0} dot(f32[5]{0} %parameter.37, f32[5]{0} %parameter.36), lhs_contracting_dims={}, rhs_contracting_dims={}, metadata={op_type="dot_general" op_name="jit(loss)/jit(transpose(jvp(fn1)))/dot_general[ dimension_numbers=(((), ()), ((), ()))\n                                                precision=None ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=3}
  ROOT %tuple.40 = (f32[5,5]{1,0}) tuple(f32[5,5]{1,0} %dot.39)
}

ENTRY %jit_loss.44 (parameter.1: f32[5,5], parameter.2: f32[5], parameter.3: f32[5], parameter.4: f32[5]) -> (f32[5,5]) {
  %constant.5 = pred[] constant(false)
  %parameter.1 = f32[5,5]{1,0} parameter(0)
  %parameter.2 = f32[5]{0} parameter(1)
  %parameter.3 = f32[5]{0} parameter(2)
  %constant.6 = pred[] constant(false), metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False, False, False)\n                    name=jvp(fn1) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %call.17 = (f32[5]{0}, pred[], f32[5]{0}) call(f32[5,5]{1,0} %parameter.1, f32[5]{0} %parameter.2, f32[5]{0} %parameter.3, pred[] %constant.6), to_apply=%jit_jvp_fn1_.7, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False, False, False)\n                    name=jvp(fn1) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %get-tuple-element.19 = pred[] get-tuple-element((f32[5]{0}, pred[], f32[5]{0}) %call.17), index=1, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False, False, False)\n                    name=jvp(fn1) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %get-tuple-element.18 = f32[5]{0} get-tuple-element((f32[5]{0}, pred[], f32[5]{0}) %call.17), index=0, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False, False, False)\n                    name=jvp(fn1) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %parameter.4 = f32[5]{0} parameter(3)
  %subtract.21 = f32[5]{0} subtract(f32[5]{0} %get-tuple-element.18, f32[5]{0} %parameter.4), metadata={op_type="sub" op_name="jit(loss)/sub" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %multiply.22 = f32[5]{0} multiply(f32[5]{0} %subtract.21, f32[5]{0} %subtract.21), metadata={op_type="integer_pow" op_name="jit(loss)/integer_pow[ y=2 ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %constant.26 = f32[] constant(0), metadata={op_type="reduce_sum" op_name="jit(loss)/reduce_sum[ axes=(0,) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %reduce.31 = f32[] reduce(f32[5]{0} %multiply.22, f32[] %constant.26), dimensions={0}, to_apply=%primitive_computation_add.27, metadata={op_type="reduce_sum" op_name="jit(loss)/reduce_sum[ axes=(0,) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %get-tuple-element.20 = f32[5]{0} get-tuple-element((f32[5]{0}, pred[], f32[5]{0}) %call.17), index=2, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False, False, False)\n                    name=jvp(fn1) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %constant.32 = f32[] constant(1), metadata={op_type="broadcast_in_dim" op_name="jit(loss)/broadcast_in_dim[ broadcast_dimensions=(  )\n                            shape=(5,) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %broadcast.33 = f32[5]{0} broadcast(f32[] %constant.32), dimensions={}, metadata={op_type="broadcast_in_dim" op_name="jit(loss)/broadcast_in_dim[ broadcast_dimensions=(  )\n                            shape=(5,) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %constant.23 = f32[] constant(2), metadata={op_type="mul" op_name="jit(loss)/mul" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %broadcast.24 = f32[5]{0} broadcast(f32[] %constant.23), dimensions={}, metadata={op_type="mul" op_name="jit(loss)/mul" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %multiply.25 = f32[5]{0} multiply(f32[5]{0} %broadcast.24, f32[5]{0} %subtract.21), metadata={op_type="mul" op_name="jit(loss)/mul" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %multiply.34 = f32[5]{0} multiply(f32[5]{0} %broadcast.33, f32[5]{0} %multiply.25), metadata={op_type="mul" op_name="jit(loss)/mul" source_file="<ipython-input-2-8e7c515249a1>" source_line=7}
  %call.41 = (f32[5,5]{1,0}) call(f32[5]{0} %get-tuple-element.20, f32[5]{0} %multiply.34), to_apply=%jit_transpose_jvp_fn1__.35, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False)\n                    name=transpose(jvp(fn1)) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  %get-tuple-element.42 = f32[5,5]{1,0} get-tuple-element((f32[5,5]{1,0}) %call.41), index=0, metadata={op_type="xla_call" op_name="jit(loss)/xla_call[ backend=None\n                    device=None\n                    donated_invars=(False, False)\n                    name=transpose(jvp(fn1)) ]" source_file="<ipython-input-2-8e7c515249a1>" source_line=6}
  ROOT %tuple.43 = (f32[5,5]{1,0}) tuple(f32[5,5]{1,0} %get-tuple-element.42)
}

Here is the before_optimizations for TensorFlow:

HloModule a_inference_grad_loss_85__XlaMustCompile_true_config_proto___n_007_n_003CPU_020_001_n_007_n_003GPU_020_0012_005__0010J_0008_001_202_001_000__executor_type____.27

ENTRY %a_inference_grad_loss_85__XlaMustCompile_true_config_proto___n_007_n_003CPU_020_001_n_007_n_003GPU_020_0012_005__0010J_0008_001_202_001_000__executor_type____.27 (arg0.1: f32[5,5], arg1.2: f32[5], arg2.3: f32[5], arg3.4: f32[5]) -> f32[5,5] {
  %constant.18 = f32[] constant(2), metadata={op_type="Mul" op_name="PartitionedCall_1/gradients/pow_grad/mul_1"}
  %broadcast.19 = f32[5]{0} broadcast(f32[] %constant.18), dimensions={}, metadata={op_type="Mul" op_name="PartitionedCall_1/gradients/pow_grad/mul_1"}
  %arg0.1 = f32[5,5]{1,0} parameter(0), parameter_replication={false}, metadata={op_name="XLA_Args"}
  %reshape.5 = f32[5,5]{1,0} reshape(f32[5,5]{1,0} %arg0.1)
  %arg1.2 = f32[5]{0} parameter(1), parameter_replication={false}, metadata={op_name="XLA_Args"}
  %reshape.6 = f32[5]{0} reshape(f32[5]{0} %arg1.2)
  %reshape.9 = f32[5,1]{1,0} reshape(f32[5]{0} %reshape.6), metadata={op_type="ExpandDims" op_name="PartitionedCall/MatVec/ExpandDims"}
  %dot.10 = f32[5,1]{1,0} dot(f32[5,5]{1,0} %reshape.5, f32[5,1]{1,0} %reshape.9), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_type="MatMul" op_name="PartitionedCall/MatVec/MatMul"}
  %transpose.11 = f32[5,1]{1,0} transpose(f32[5,1]{1,0} %dot.10), dimensions={0,1}, metadata={op_type="MatMul" op_name="PartitionedCall/MatVec/MatMul"}
  %reshape.12 = f32[5]{0} reshape(f32[5,1]{1,0} %transpose.11), metadata={op_type="Squeeze" op_name="PartitionedCall/MatVec/Squeeze"}
  %arg2.3 = f32[5]{0} parameter(2), parameter_replication={false}, metadata={op_name="XLA_Args"}
  %reshape.7 = f32[5]{0} reshape(f32[5]{0} %arg2.3)
  %add.13 = f32[5]{0} add(f32[5]{0} %reshape.12, f32[5]{0} %reshape.7), metadata={op_type="AddV2" op_name="PartitionedCall/add"}
  %arg3.4 = f32[5]{0} parameter(3), parameter_replication={false}, metadata={op_name="XLA_Args"}
  %reshape.8 = f32[5]{0} reshape(f32[5]{0} %arg3.4)
  %subtract.14 = f32[5]{0} subtract(f32[5]{0} %add.13, f32[5]{0} %reshape.8), metadata={op_type="Sub" op_name="PartitionedCall/sub_0"}
  %constant.15 = f32[] constant(1), metadata={op_type="Pow" op_name="PartitionedCall_1/gradients/pow_grad/Pow"}
  %broadcast.16 = f32[5]{0} broadcast(f32[] %constant.15), dimensions={}, metadata={op_type="Pow" op_name="PartitionedCall_1/gradients/pow_grad/Pow"}
  %power.17 = f32[5]{0} power(f32[5]{0} %subtract.14, f32[5]{0} %broadcast.16), metadata={op_type="Pow" op_name="PartitionedCall_1/gradients/pow_grad/Pow"}
  %multiply.20 = f32[5]{0} multiply(f32[5]{0} %broadcast.19, f32[5]{0} %power.17), metadata={op_type="Mul" op_name="PartitionedCall_1/gradients/pow_grad/mul_1"}
  %reshape.21 = f32[5,1]{1,0} reshape(f32[5]{0} %multiply.20), metadata={op_type="Reshape" op_name="PartitionedCall_1/gradients/MatVec/Squeeze_grad/Reshape"}
  %dot.22 = f32[5,5]{1,0} dot(f32[5,1]{1,0} %reshape.21, f32[5,1]{1,0} %reshape.9), lhs_contracting_dims={1}, rhs_contracting_dims={1}, metadata={op_type="MatMul" op_name="PartitionedCall_1/gradients/MatVec/MatMul_grad/MatMul"}
  %transpose.23 = f32[5,5]{1,0} transpose(f32[5,5]{1,0} %dot.22), dimensions={0,1}, metadata={op_type="MatMul" op_name="PartitionedCall_1/gradients/MatVec/MatMul_grad/MatMul"}
  %reshape.24 = f32[5,5]{1,0} reshape(f32[5,5]{1,0} %transpose.23), metadata={op_name="XLA_Retvals"}
  %tuple.25 = (f32[5,5]{1,0}) tuple(f32[5,5]{1,0} %reshape.24), metadata={op_name="XLA_Retvals"}
  ROOT %get-tuple-element.26 = f32[5,5]{1,0} get-tuple-element((f32[5,5]{1,0}) %tuple.25), index=0, metadata={op_name="XLA_Retvals"}
}
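
One rough way to check whether two dumps like these differ in a meaningful way is to compare their opcode counts after ignoring pure data-movement instructions. The sketch below does this for the forward matrix-vector product and add, using lines excerpted from the two dumps above with metadata trimmed. The set of ignored opcodes is a hypothetical choice made for this example, and matching histograms are only a heuristic, not a proof that the two programs are equivalent.

```python
import re
from collections import Counter

# Hypothetical choice for this example: ops treated as data movement only.
IGNORED = {"reshape", "transpose"}

OPCODE = re.compile(r"^([a-z][a-z0-9-]*)\(")


def opcode_of(line):
    """Return the opcode of one HLO instruction line, or None if there is none."""
    if " = " not in line:
        return None  # module header, computation signature, or closing brace
    rhs = line.split(" = ", 1)[1].lstrip()
    if rhs.startswith("("):
        # Tuple-shaped result: skip past the matching close paren.
        depth = 0
        for i, ch in enumerate(rhs):
            depth += (ch == "(") - (ch == ")")
            if depth == 0:
                rhs = rhs[i + 1:].lstrip()
                break
    else:
        # Array-shaped result such as f32[5,5]{1,0}: the shape has no spaces.
        rhs = rhs.split(" ", 1)[-1]
    match = OPCODE.match(rhs)
    return match.group(1) if match else None


def core_ops(hlo_text):
    """Histogram of opcodes in hlo_text, skipping the IGNORED ones."""
    ops = (opcode_of(line.strip()) for line in hlo_text.splitlines())
    return Counter(op for op in ops if op and op not in IGNORED)


JAX_SNIPPET = """
%dot.13 = f32[5]{0} dot(f32[5,5]{1,0} %parameter.8, f32[5]{0} %parameter.9), lhs_contracting_dims={1}, rhs_contracting_dims={0}
%add.14 = f32[5]{0} add(f32[5]{0} %dot.13, f32[5]{0} %parameter.10)
"""

TF_SNIPPET = """
%reshape.9 = f32[5,1]{1,0} reshape(f32[5]{0} %reshape.6)
%dot.10 = f32[5,1]{1,0} dot(f32[5,5]{1,0} %reshape.5, f32[5,1]{1,0} %reshape.9), lhs_contracting_dims={1}, rhs_contracting_dims={0}
%transpose.11 = f32[5,1]{1,0} transpose(f32[5,1]{1,0} %dot.10), dimensions={0,1}
%reshape.12 = f32[5]{0} reshape(f32[5,1]{1,0} %transpose.11)
%add.13 = f32[5]{0} add(f32[5]{0} %reshape.12, f32[5]{0} %reshape.7)
"""

print(core_ops(JAX_SNIPPET) == core_ops(TF_SNIPPET))
```

In this excerpt the TensorFlow version wraps the same `dot` and `add` in reshapes and an identity transpose (`dimensions={0,1}`), which is the kind of difference such a comparison is meant to filter out.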
