docs: simplify getting started docs
avik-pal committed Sep 18, 2024
1 parent 8722ba1 commit bdda313
Showing 4 changed files with 60 additions and 62 deletions.
14 changes: 2 additions & 12 deletions .buildkite/documentation.yml
@@ -8,11 +8,6 @@ steps:
plugins:
- JuliaCI/julia#v1:
version: "1"
- JuliaCI/julia-coverage#v1:
codecov: true
dirs:
- src
- ext
command: julia --code-coverage=user --color=yes --project=docs --threads=auto docs/tutorials.jl
env:
TUTORIAL_BACKEND_GROUP: "CUDA"
@@ -25,7 +20,7 @@ steps:
- "docs/src/tutorials/advanced/**/*"
- "tutorial_deps/*"
- "**/*.cov"
timeout_in_minutes: 60
timeout_in_minutes: 120

- label: "Tutorial Build [%N/%t] CPU Runners"
if: build.message !~ /\[skip docs\]/ && !build.pull_request.draft
@@ -34,11 +29,6 @@ steps:
plugins:
- JuliaCI/julia#v1:
version: "1"
# - JuliaCI/julia-coverage#v1:
# codecov: true
# dirs:
# - src
# - ext
command: julia --code-coverage=user --color=yes --project=docs --threads=auto docs/tutorials.jl
env:
TUTORIAL_BACKEND_GROUP: "CPU"
@@ -52,7 +42,7 @@ steps:
- "docs/src/tutorials/advanced/**/*"
- "tutorial_deps/*"
- "**/*.cov"
timeout_in_minutes: 60
timeout_in_minutes: 120

- label: "Final Documentation Build"
depends_on:
18 changes: 0 additions & 18 deletions docs/src/components/AsideTrustees.vue
@@ -10,24 +10,6 @@
<p class="extra-info">Machine Learning</p>
</span>
</a>

<a class="enjoyer" href="https://juliagni.github.io/GeometricMachineLearning.jl/latest/" target="_blank">
<img width="32" height="32" src="https://juliagni.github.io/GeometricMachineLearning.jl/latest/assets/logo-dark.png" />
<span>
<p class="extra-info">Structure Preserving</p>
<p class="heading">GeometricML.jl</p>
<p class="extra-info">Machine Learning</p>
</span>
</a>

<a class="enjoyer" href="https://una-auxme.github.io/MeshGraphNets.jl/dev/" target="_blank">
<img width="32" height="32" src="https://raw.githubusercontent.com/una-auxme/MeshGraphNets.jl/main/logo/meshgraphnetsjl_logo.png" />
<span>
<p class="extra-info">Physical Systems</p>
<p class="heading">MeshGraphNets.jl</p>
<p class="extra-info">Graph Neural Nets</p>
</span>
</a>
</template>

<style>
88 changes: 57 additions & 31 deletions docs/src/introduction/index.md
@@ -25,7 +25,8 @@ Pkg.add("Lux")

```@example quickstart
using Lux, Random, Optimisers, Zygote
# using LuxCUDA, AMDGPU, Metal, oneAPI # Optional packages for GPU support
using LuxCUDA # For CUDA support
# using AMDGPU, Metal, oneAPI # Other optional packages for GPU support
```

We take randomness very seriously
@@ -43,15 +44,16 @@ Build the model
model = Chain(Dense(128, 256, tanh), Chain(Dense(256, 1, tanh), Dense(1, 10)))
```

Models don't hold parameters and states so initialize them. From there on, we just use our
standard AD and Optimisers API.
Models don't hold parameters and states so initialize them. From there on, we can just use
our standard AD and Optimisers API. However, here we will show how to use Lux's Training
API, which provides a uniform interface over all supported AD systems.

```@example quickstart
# Get the device determined by Lux
dev = gpu_device()
# Parameter and State Variables
ps, st = Lux.setup(rng, model) .|> dev
ps, st = Lux.setup(rng, model) |> dev
# Dummy Input
x = rand(rng, Float32, 128, 2) |> dev
@@ -60,21 +62,30 @@ x = rand(rng, Float32, 128, 2) |> dev
y, st = Lux.apply(model, x, ps, st)
# Gradients
## Pullback API to capture change in state
(l, st_), pb = pullback(Lux.apply, model, x, ps, st)
gs = pb((one.(l), nothing))[3]
## First construct a TrainState
train_state = Lux.Training.TrainState(model, ps, st, Adam(0.0001f0))
# Optimization
st_opt = Optimisers.setup(Adam(0.0001f0), ps)
st_opt, ps = Optimisers.update(st_opt, ps, gs) # or Optimisers.update!(st_opt, ps, gs)
## We can compute the gradients using Training.compute_gradients
gs, loss, stats, train_state = Lux.Training.compute_gradients(AutoZygote(), MSELoss(),
(x, dev(rand(rng, Float32, 10, 2))), train_state)
## Optimization
train_state = Training.apply_gradients!(train_state, gs) # or Training.apply_gradients (no `!` at the end)
# Both these steps can be combined into a single call
gs, loss, stats, train_state = Training.single_train_step!(AutoZygote(), MSELoss(),
(x, dev(rand(rng, Float32, 10, 2))), train_state)
```

## Defining Custom Layers

```@example custom_compact
using Lux, Random, Optimisers, Zygote
# using LuxCUDA, AMDGPU, Metal, oneAPI # Optional packages for GPU support
using Printf # For pretty printing
using LuxCUDA # For CUDA support
# using AMDGPU, Metal, oneAPI # Other optional packages for GPU support
using Printf # For pretty printing
dev = gpu_device()
```

We will define a custom MLP using the `@compact` macro. The macro takes in a list of
@@ -86,9 +97,9 @@ n_in = 1
n_out = 1
nlayers = 3
model = @compact(w1=Dense(n_in, 128),
w2=[Dense(128, 128) for i in 1:nlayers],
w3=Dense(128, n_out),
model = @compact(w1=Dense(n_in => 32),
w2=[Dense(32 => 32) for i in 1:nlayers],
w3=Dense(32 => n_out),
act=relu) do x
embed = act(w1(x))
for w in w2
@@ -102,26 +113,41 @@ end
We can initialize the model and train it with the same code as before!

```@example custom_compact
ps, st = Lux.setup(Xoshiro(0), model)
rng = Random.default_rng()
Random.seed!(rng, 0)
ps, st = Lux.setup(Xoshiro(0), model) |> dev
model(randn(n_in, 32), ps, st) # 1×32 Matrix as output.
x = rand(rng, Float32, n_in, 32) |> dev
x_data = collect(-2.0f0:0.1f0:2.0f0)'
model(x, ps, st) # 1×32 Matrix and updated state as output.
x_data = reshape(collect(-2.0f0:0.1f0:2.0f0), 1, :) |> dev
y_data = 2 .* x_data .- x_data .^ 3
st_opt = Optimisers.setup(Adam(), ps)
for epoch in 1:1000
global st # Put this in a function in real use-cases
(loss, st), pb = Zygote.pullback(ps) do p
y, st_ = model(x_data, p, st)
return sum(abs2, y .- y_data), st_
function train_model!(model, ps, st, x_data, y_data)
train_state = Lux.Training.TrainState(model, ps, st, Adam(0.001f0))
for iter in 1:1000
_, loss, _, train_state = Lux.Training.single_train_step!(AutoZygote(), MSELoss(),
(x_data, y_data), train_state)
if iter % 100 == 1 || iter == 1000
@printf "Iteration: %04d \t Loss: %10.9g\n" iter loss
end
end
gs = only(pb((one(loss), nothing)))
epoch % 100 == 1 && @printf "Epoch: %04d \t Loss: %10.9g\n" epoch loss
Optimisers.update!(st_opt, ps, gs)
return model, ps, st
end
train_model!(model, ps, st, x_data, y_data)
nothing #hide
```

!!! tip "Training with Optimization.jl"

If you are coming from the SciML ecosystem and want to use Optimization.jl, please
refer to the [Optimization.jl Tutorial](@ref Optimization-Lux-Tutorial).

## Additional Packages

`LuxDL` hosts various packages that provide additional functionality for Lux.jl. All
@@ -133,7 +159,7 @@ You can install all those packages via `import Pkg; Pkg.add(<package name>)`.

GPU Support for Lux.jl requires loading additional packages:

* [`LuxCUDA.jl`](https://github.com/LuxDL/LuxCUDA.jl) for CUDA support.
* [`AMDGPU.jl`](https://github.com/JuliaGPU/AMDGPU.jl) for AMDGPU support.
* [`Metal.jl`](https://github.com/JuliaGPU/Metal.jl) for Apple Metal support.
* [`oneAPI.jl`](https://github.com/JuliaGPU/oneAPI.jl) for oneAPI support.
- [`LuxCUDA.jl`](https://github.com/LuxDL/LuxCUDA.jl) for CUDA support.
- [`AMDGPU.jl`](https://github.com/JuliaGPU/AMDGPU.jl) for AMDGPU support.
- [`Metal.jl`](https://github.com/JuliaGPU/Metal.jl) for Apple Metal support.
- [`oneAPI.jl`](https://github.com/JuliaGPU/oneAPI.jl) for oneAPI support.
2 changes: 1 addition & 1 deletion examples/OptimizationIntegration/main.jl
@@ -1,4 +1,4 @@
# # Training Lux Models using Optimization.jl
# # [Training Lux Models using Optimization.jl](@id Optimization-Lux-Tutorial)

# Lux's native [Training.TrainState](@ref) is a great API for gradient-based learning of
# neural networks, however, it is geared towards using `Optimisers.jl` as the backend.
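
The `Training.TrainState` workflow that the quickstart changes in this commit adopt can be sketched as follows. This is a minimal CPU-only illustration, not code from the commit: the model sizes, the random data, and the helper name `run_steps` are assumptions chosen for the example, while the `TrainState`, `single_train_step!`, `AutoZygote`, and `MSELoss` calls mirror the ones shown in the diff above.

```julia
using Lux, Random, Optimisers, Zygote

# Illustrative setup (assumed, not from the commit): a tiny MLP and random CPU data.
rng = Random.default_rng()
Random.seed!(rng, 0)

model = Chain(Dense(4 => 8, tanh), Dense(8 => 2))
ps, st = Lux.setup(rng, model)

x = rand(rng, Float32, 4, 16)   # 4 features, batch of 16
y = rand(rng, Float32, 2, 16)   # 2 targets per sample

# Bundle model, parameters, states, and optimizer into a single TrainState.
train_state = Lux.Training.TrainState(model, ps, st, Adam(0.001f0))

# `run_steps` is a hypothetical helper: each iteration computes gradients with
# Zygote and applies one optimizer update through `single_train_step!`.
function run_steps(train_state, data; nsteps=10)
    loss = NaN32
    for _ in 1:nsteps
        _, loss, _, train_state = Lux.Training.single_train_step!(
            AutoZygote(), MSELoss(), data, train_state)
    end
    return loss, train_state
end

final_loss, train_state = run_steps(train_state, (x, y))
```

Keeping the loop inside a function avoids the `global` annotations that the removed pullback-based loop needed, and the updated parameters and states travel with the returned `train_state`.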

3 comments on commit bdda313

@avik-pal
Member Author


@JuliaRegistrator


Registration pull request created: JuliaRegistries/General/115386

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v1.0.4 -m "<description of version>" bdda313c8a0325c6ccda45d8999ff16cf46bce17
git push origin v1.0.4

@github-actions
Contributor


Lux Benchmarks

Benchmark suite Current: bdda313 Previous: 1d064ec Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 415000 ns 412937.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 243792 ns 322667 ns 0.76
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 244000 ns 323104.5 ns 0.76
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739583 ns 739750 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43539.5 ns 43577 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1274291 ns 1320395.5 ns 0.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 1231896 ns 2436708 ns 0.51
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 16356854.5 ns 13630167 ns 1.20
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2269041 ns 2195250 ns 1.03
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 203819.5 ns 203168.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1367999.5 ns 1394666 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 1283084 ns 2614271 ns 0.49
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 16339834 ns 13809542 ns 1.18
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2229354.5 ns 2256125 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1774104 ns 1655084 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1088209 ns 1103916 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1549542 ns 1549791 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3006521.5 ns 2999729.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 206533 ns 207221 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12172854.5 ns 12143833.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8832479 ns 8785708 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9239625.5 ns 9239167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18613312.5 ns 18591208 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1489717 ns 1485675 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17302083.5 ns 17317333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14002833 ns 13967208 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14486750 ns 14514354 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21734667 ns 21818416 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250085250 ns 250042270.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148902834 ns 148555750 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116494728.5 ns 115889000 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447939292 ns 447187584 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5485971 ns 5452362 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1224661000 ns 1224923291 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 934365500 ns 928030208 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 833255520.5 ns 825911895.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1629736875 ns 1633435667 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31255272 ns 31214910.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1147235708 ns 1134846000 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1001812750 ns 982157791.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1307261375 ns 1328335541.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1734762104 ns 1734630541.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1129937.5 ns 1097854 ns 1.03
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1633541.5 ns 1625083.5 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3618417 ns 3841334 ns 0.94
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 781771 ns 778042 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 261687.5 ns 263538 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2989729.5 ns 2979417 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4144709 ns 4119104.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 9495937 ns 11207896 ns 0.85
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3189125 ns 3132750 ns 1.02
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1088544 ns 1091322.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2337041.5 ns 2334729 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1336875 ns 1437000 ns 0.93
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1558583 ns 1665458.5 ns 0.94
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4218770.5 ns 4198334 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 207631 ns 207913 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19455000 ns 19383125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16101833.5 ns 16092916.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17388416 ns 17269063 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25913875 ns 25856687.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1590005 ns 1585334 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34246750 ns 34322667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 31092604 ns 30864666.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31249500 ns 31132250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36490792 ns 36963875 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4538334 ns 4524500 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2564249.5 ns 2779000 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2673688 ns 2902854 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8398125 ns 8387541.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 422799 ns 420101 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 39119187 ns 38904229 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32288959 ns 32105979 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32519209 ns 32346959 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51939791 ns 51945541 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2619605 ns 2624775 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89409416 ns 88746333.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 115467459 ns 114006959 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 221528542 ns 224259542 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74507479 ns 74608375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 268539958 ns 267333208 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 156524479 ns 159214292 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 123385041.5 ns 126745542 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 486000125 ns 487494166 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 6937063 ns 7012704 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1476618521 ns 1472344083.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1175367083 ns 1138687375 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1067186396 ns 1071038854 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2004426208.5 ns 2002947479.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34818111 ns 34854968.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1724168958 ns 1712616292 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1536515396 ns 1536070562.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1869553708 ns 1863636167 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2206120083 ns 2213962958 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2081499.5 ns 2080042 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3049542 ns 2936917 ns 1.04
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 8519417 ns 8042334 ns 1.06
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2503499.5 ns 2431104 ns 1.03
lenet(28, 28, 1, 128)/forward/GPU/CUDA 264802 ns 278095 ns 0.95
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9682063 ns 9677209 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12056041 ns 12036500 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 24527250 ns 24751583.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11752062.5 ns 11606292 ns 1.01
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1163895 ns 1201527 ns 0.97
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 384081437.5 ns 379827708 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 312823416.5 ns 286677271 ns 1.09
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 256685166 ns 240261834 ns 1.07
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 452363520.5 ns 451256520.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4828920.5 ns 4858918 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1156564042 ns 1157780125 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 943789584 ns 905233917 ns 1.04
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 959368416 ns 987524666 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1405462958 ns 1579543625 ns 0.89
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 19179831 ns 17849892 ns 1.07
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1041166.5 ns 1058583.5 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 1660937.5 ns 1671187.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 6773750 ns 5011708 ns 1.35
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1302396 ns 1300437.5 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 265345 ns 274747.5 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6515979.5 ns 6254041 ns 1.04
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 13193417 ns 13149791.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 19482917 ns 18860833 ns 1.03
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6040709 ns 5852333 ns 1.03
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1201565 ns 1238555 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70558729.5 ns 70498000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43775917 ns 43638250 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39575708 ns 39557666 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132776896 ns 132574187 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1873567 ns 1944256 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 356452895.5 ns 356301646 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 271260020.5 ns 269549208 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 254204208 ns 253732875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534878271 ns 534920187.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12315083 ns 12320196.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 399335750 ns 395172666 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 382828229 ns 377158375 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 696801291.5 ns 657754625 ns 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 711815250 ns 709829333 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1196786667 ns 1189792833 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 840316417 ns 691561166.5 ns 1.22
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 646428312 ns 626986833 ns 1.03
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1770755354.5 ns 1860884792 ns 0.95
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12304606 ns 12309151 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3647916604 ns 3633655916 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2830461833 ns 2828990458 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2740115583 ns 2702591209 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5008028167 ns 5056811500 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49712678.5 ns 49201169.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3430083 ns 3425625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2069729.5 ns 2072958.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2510750 ns 2525541 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6046916.5 ns 6028666.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 339536 ns 322034 ns 1.05
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 26138042 ns 25910625 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18987625 ns 18853584 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19921353.5 ns 19458875 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39376375 ns 39298645.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2472544 ns 2474706.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54782125 ns 54292500 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82825584 ns 81331292 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 176027917 ns 170565562 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45641958 ns 45567333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1792417 ns 1782916 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1107270.5 ns 1103709 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1575042 ns 1548917 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3035584 ns 3027375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 210834 ns 210691.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12551417 ns 12525854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9225583 ns 9206541.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9692084 ns 9628792 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19021208.5 ns 19005604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1545625 ns 1537547 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17641833.5 ns 17655854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14341584 ns 14331645.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14582062.5 ns 14600583 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22215521 ns 22163250 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70599125 ns 70499459 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43507917 ns 43573833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39557208 ns 39479542 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132632458.5 ns 132481104.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1903889.5 ns 1867593 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 360661853.5 ns 360531229 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 349138333 ns 345233354 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 306947875 ns 303345083 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 723089875 ns 722647875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13383173 ns 13388759.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 424456604.5 ns 418893124.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 427863792 ns 418550083 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 785018416.5 ns 733622021 ns 1.07
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 715835208 ns 714074250 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1596583.5 ns 1662791 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1159458 ns 1326395.5 ns 0.87
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1148708 ns 1266458.5 ns 0.91
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2311250 ns 2293875 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 580091 ns 584223 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8847333 ns 8911021 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13545750 ns 12871250 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 33507812.5 ns 31057917 ns 1.08
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9857583 ns 9825729.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1429244 ns 1434469 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 16580916.5 ns 16503167 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 23509292 ns 20919875 ns 1.12
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 49483041 ns 44942437.5 ns 1.10
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 13148812.5 ns 13103167 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 838271 ns 789458 ns 1.06
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 628167 ns 538437.5 ns 1.17
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1073104 ns 1024041.5 ns 1.05
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 724833.5 ns 725041 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47723 ns 47144.5 ns 1.01
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1518875 ns 1463416 ns 1.04
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1003959 ns 1040312 ns 0.97
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1878958 ns 1411187.5 ns 1.33
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2254542 ns 2257916 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 235472.5 ns 234270.5 ns 1.01
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1574750 ns 1530583 ns 1.03
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1064562 ns 1024209 ns 1.04
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 2010625 ns 1524333 ns 1.32
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2260979.5 ns 2201771 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3405459 ns 3406354 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2065167 ns 2052854.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2509625 ns 2507959 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6033354.5 ns 6001333 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 282898 ns 287105.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24076541 ns 24055375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17229771 ns 17211646 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17134667 ns 17114333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37637875 ns 37572333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2400944 ns 2401548.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52973250 ns 52614646 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 83207042 ns 82221646 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 170182021 ns 169582250 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44655333 ns 44570125 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250702791 ns 250290791.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148728229.5 ns 148276667 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116178542 ns 115710770.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447727458.5 ns 447663770.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5450463 ns 5443484 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1105083958 ns 1105632500 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 856949208.5 ns 854893979 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 828942146 ns 827018271 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1750357541 ns 1767047166 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 28967174 ns 28898282.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1027883729.5 ns 1021345729.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 982128875 ns 974787791 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1326112792 ns 1329964041.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1722634604 ns 1731428604.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1231729.5 ns 1243062.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 782291 ns 955375 ns 0.82
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 778459 ns 906875 ns 0.86
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2020000 ns 2048500 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 562638 ns 563206.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5893271 ns 5919958 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 9134459 ns 6419604 ns 1.42
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 26527458 ns 23873812 ns 1.11
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7115333 ns 7097854.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1395487 ns 1364575 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 9708334 ns 9591542 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 16006479.5 ns 13052166.5 ns 1.23
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 34574625 ns 31360875 ns 1.10
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7611479.5 ns 7260167 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 529958.5 ns 481625 ns 1.10
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 478041 ns 443500 ns 1.08
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 2960583.5 ns 1999500 ns 1.48
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 88458 ns 87833 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27957 ns 27760 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 381541 ns 377333 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 438792 ns 439500 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4910417 ns 4505250 ns 1.09
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 258791 ns 258291 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 223833.5 ns 219213 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 413542 ns 408166.5 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 469459 ns 470125 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 4892312.5 ns 4495000 ns 1.09
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 271125 ns 271000 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 476333 ns 427792 ns 1.11
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 419750 ns 376896 ns 1.11
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 773166 ns 733562.5 ns 1.05
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54583 ns 52417 ns 1.04
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28075 ns 28102 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 341146 ns 336500 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 333854 ns 333854 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 490125 ns 419229.5 ns 1.17
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151750 ns 151562.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 210240.5 ns 204281.5 ns 1.03
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 355771 ns 351375 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 347438 ns 348375 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 497291 ns 899625 ns 0.55
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 150959 ns 150667 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 605471333 ns 603094958 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 434427292 ns 428615062.5 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 396215292 ns 384740459 ns 1.03
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 871673167 ns 873854208.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7029497 ns 7027277 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2010340292 ns 2002711146 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1626784792 ns 1606403375 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1578107124.5 ns 1551092146 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2633522708 ns 2631708917 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26046548 ns 26123824 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 524125 ns 521396 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 393083 ns 431875 ns 0.91
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 2705354.5 ns 1926416 ns 1.40
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 880750 ns 866417 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47587 ns 47024 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1872979 ns 1855270.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1766354.5 ns 2793583 ns 0.63
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 16551583.5 ns 14609250 ns 1.13
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2708895.5 ns 2648521 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 251708.5 ns 246347 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 1951396 ns 1974875 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 1841000.5 ns 5038917 ns 0.37
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 16533458 ns 15177854.5 ns 1.09
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2782374.5 ns 2744270.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1493542 ns 1512729 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 932833 ns 1178292 ns 0.79
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1063125 ns 1180084 ns 0.90
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2333792 ns 2300375 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 588210.5 ns 589242.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5938333.5 ns 5245791 ns 1.13
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 8498791.5 ns 4733604 ns 1.80
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 26209458 ns 24184833 ns 1.08
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7349083.5 ns 7316583 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1387138.5 ns 1392514 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 11708833 ns 11607209 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 17834979 ns 16305271 ns 1.09
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 38988875 ns 35977250 ns 1.08
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9549499.5 ns 9550875 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2209 ns 2333 ns 0.95
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2250 ns 2542 ns 0.89
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3584 ns 3083 ns 1.16
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2458.5 ns 2458 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 25040 ns 25059 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7250 ns 7417 ns 0.98
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7020.5 ns 7042 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7375 ns 7209 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7000 ns 7333 ns 0.95
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 215926 ns 214253 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8542 ns 8333 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8209 ns 8083 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8417 ns 8437.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5958 ns 6125 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10958 ns 10667 ns 1.03
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 12959 ns 13791 ns 0.94
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10395.5 ns 11208 ns 0.93
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7333 ns 7375 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25049.5 ns 25157 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 20083 ns 20062.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 19833 ns 19833 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 20250 ns 20041 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 19979.5 ns 20000 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 235054 ns 235669 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 23541 ns 23562.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 23500 ns 23417 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 23792 ns 23625 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 21375 ns 21458 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28833.5 ns 28584 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 29084 ns 28916 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28375 ns 29417 ns 0.96
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46208 ns 46000 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26563 ns 26406 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 229666.5 ns 221667 ns 1.04
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 272334 ns 278604.5 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4450292 ns 4081750 ns 1.09
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 146042 ns 145833 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 207954 ns 208494.5 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 247416 ns 237333 ns 1.04
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 290292 ns 295625 ns 0.98
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 4062333.5 ns 4027625 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145604 ns 145875 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1875 ns 2083.5 ns 0.90
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 2042 ns 1917 ns 1.07
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2833 ns 2458 ns 1.15
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1917 ns 1791 ns 1.07
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23298 ns 23206 ns 1.00
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5291 ns 5333 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5209 ns 5167 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5375 ns 5375 ns 1.00
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5250 ns 5250 ns 1.00
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 240994.5 ns 238219 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 7625 ns 7292 ns 1.05
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7500 ns 7291 ns 1.03
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 7583 ns 7542 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5104.5 ns 5333 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 80254791.5 ns 79904000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 48077979 ns 49166750 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43208687.5 ns 44974542 ns 0.96
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151606750 ns 151504667 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2668752 ns 2718498 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 608153250 ns 496218625 ns 1.23
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 416713916 ns 410097125 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 401894583 ns 397607667 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 688243625 ns 684031750 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14712224 ns 14583158 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 717290334 ns 709703166.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 671671750 ns 675407250 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1014211208 ns 1001028958 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1000127792 ns 995697250 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
