diff --git a/README.md b/README.md index 27189fe..e21a8d5 100644 --- a/README.md +++ b/README.md @@ -16,3 +16,4 @@ Tue Sep 17 10:57:15 UTC 2024 Tue Sep 17 14:31:10 UTC 2024 Wed Sep 18 11:13:08 UTC 2024 Fri Nov 15 13:46:13 UTC 2024 +Mon Dec 2 21:30:54 UTC 2024 diff --git a/howtos/deploy_on_server/index.html b/howtos/deploy_on_server/index.html index 70b9bc7..fbce18f 100644 --- a/howtos/deploy_on_server/index.html +++ b/howtos/deploy_on_server/index.html @@ -154,7 +154,7 @@
}); -To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you more likely want to cross-compile right from your development machine, for every supported target architecture and accelerator.
See Getting Started with ZML if you need more information on how to compile a model.
Here's a quick recap:
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:
--@zml//runtimes:cuda=true
--@zml//runtimes:rocm=true
--@zml//runtimes:tpu=true
--@zml//runtimes:cpu=false
So, to run the OpenLLama model from above on your development machine housing an NVIDIA GPU, run the following:
cd examples
+ Deploying Models on a Server
Checking out your project's repository and compiling it on every remote GPU/TPU machine is inconvenient. Instead, you most likely want to cross-compile right from your development machine, for every supported target architecture and accelerator.
See Getting Started with ZML if you need more information on how to compile a model.
Here's a quick recap:
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD ROCm:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AWS Trainium/Inferentia 2:
--@zml//runtimes:neuron=true
- AVOID CPU:
--@zml//runtimes:cpu=false
So, to run the OpenLLama model from above on your development machine housing an NVIDIA GPU, run the following:
cd examples
bazel run -c opt //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
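These runtime flags can also be combined. As a sketch using only the flags listed above, one binary that supports both CUDA and ROCm machines would be compiled like this (whether you enable several runtimes at once depends on where the binary is meant to run):
cd examples
# build one binary that bundles the CUDA and ROCm runtimes
bazel build -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:rocm=true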
Cross-Compiling and creating a TAR for your server
Currently, ZML lets you cross-compile to one of the following target architectures:
- Linux X86_64:
--platforms=@zml//platforms:linux_amd64
- Linux ARM64:
--platforms=@zml//platforms:linux_arm64
- macOS ARM64:
--platforms=@zml//platforms:macos_arm64
As an example, here is how you build the OpenLLama model from above for CUDA on Linux X86_64:
cd examples
bazel build -c opt //llama:OpenLLaMA-3B \
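Combining the platform flag with the CUDA runtime flag from the lists above, a complete cross-compile invocation could look like the following sketch (append --@zml//runtimes:cpu=false as well if you want to skip the CPU runtime):
cd examples
# cross-compile the CUDA build for a Linux X86_64 server
bazel build -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --platforms=@zml//platforms:linux_amd64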
diff --git a/howtos/dockerize_models/index.html b/howtos/dockerize_models/index.html
index 864e032..c944ff2 100644
--- a/howtos/dockerize_models/index.html
+++ b/howtos/dockerize_models/index.html
@@ -233,7 +233,7 @@ 1. The Manifest
To get started, let's make bazel
This will push the simple_layer image with the tag latest (you can add more) to the docker registry:
bazel run -c opt //simple_layer:push
When dealing with, say, a public and a private container registry, or if you just want to try it out right now, you can always override the repository on the command line:
bazel run -c opt //simple_layer:push -- --repository my.server.com/org/image
-
Adding weights and data
Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?
We'll use the MNIST example to illustrate how to build Docker images that also contain data files.
You can bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist
in the ./examples
folder if you want to try it out.
Note: Please add one more of the following parameters to specify all the platforms your containerized model should support.
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD RoCM:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AVOID CPU:
--@zml//runtimes:cpu=false
Example:
bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
+
Adding weights and data
Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?
We'll use the MNIST example to illustrate how to build Docker images that also contain data files.
You can bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist in the ./examples folder if you want to try it out.
Note: Please add one or more of the following parameters to specify all the platforms your containerized model should support.
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD ROCm:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AWS Trainium/Inferentia 2:
--@zml//runtimes:neuron=true
- AVOID CPU:
--@zml//runtimes:cpu=false
Example:
bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
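If the containerized model should support several accelerators at once, the runtime flags from the note above can be combined in the same invocation. A sketch:
# push an image that bundles the CUDA and Neuron runtimes and skips the CPU runtime
bazel run //mnist:push -c opt \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:neuron=true \
    --@zml//runtimes:cpu=false \
    -- --repository index.docker.io/my_org/zml_mnist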
Manifest and Archive
We only add one more target to the BUILD.bazel to construct the command line for the entrypoint of the container. All other steps basically remain the same.
Let's start with creating the manifest and archive:
load("@aspect_bazel_lib//lib:expand_template.bzl", "expand_template")
load("@aspect_bazel_lib//lib:tar.bzl", "mtree_spec", "tar")
load("@aspect_bazel_lib//lib:transitions.bzl", "platform_transition_filegroup")
@@ -301,7 +301,7 @@ Entrypoint
Our container entrypoint commandline is no
name = "image_",
base = "@distroless_cc_debian12",
# the entrypoint comes from the expand_template rule `entrypoint` above
- entrypoint = ":entrypoint",
+ entrypoint = ":entrypoint",
tars = [":archive"],
)
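Once pushed, the image can be pulled and started like any other container. A sketch, assuming the index.docker.io/my_org/zml_mnist repository from above with the default latest tag, and an NVIDIA host with the NVIDIA Container Toolkit installed:
# pull and run the pushed image, exposing the host GPUs to the container
docker run --rm --gpus all index.docker.io/my_org/zml_mnist:latest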
diff --git a/sources.tar b/sources.tar
index f84e8f7..da6c07c 100755
Binary files a/sources.tar and b/sources.tar differ
diff --git a/tutorials/getting_started/index.html b/tutorials/getting_started/index.html
index 8327859..8530a97 100644
--- a/tutorials/getting_started/index.html
+++ b/tutorials/getting_started/index.html
@@ -173,7 +173,7 @@
bazel run -c opt //llama:Meta-Llama-3-8b
bazel run -c opt //llama:Meta-Llama-3-8b -- --prompt="Once upon a time,"
Run Tests
bazel test //zml:test
-
Running Models on GPU / TPU
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD RoCM:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AVOID CPU:
--@zml//runtimes:cpu=false
The latter, avoiding compilation for CPU, cuts down compilation time.
So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:
cd examples
+
Running Models on GPU / TPU
You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:
- NVIDIA CUDA:
--@zml//runtimes:cuda=true
- AMD ROCm:
--@zml//runtimes:rocm=true
- Google TPU:
--@zml//runtimes:tpu=true
- AWS Trainium/Inferentia 2:
--@zml//runtimes:neuron=true
- AVOID CPU:
--@zml//runtimes:cpu=false
The latter, avoiding compilation for CPU, cuts down compilation time.
So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:
cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
--@zml//runtimes:cuda=true \
-- --prompt="Once upon a time,"
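As mentioned above, skipping the CPU runtime cuts down compilation time; a sketch of the same command with the CPU runtime disabled:
cd examples
# CUDA only, no CPU fallback, so compilation is faster
bazel run -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:cpu=false \
    -- --prompt="Once upon a time,"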
diff --git a/tutorials/write_first_model/index.html b/tutorials/write_first_model/index.html
index 7760aa1..d0c8b18 100644
--- a/tutorials/write_first_model/index.html
+++ b/tutorials/write_first_model/index.html
@@ -163,7 +163,7 @@
const zml = @import("zml");
const asynk = @import("async");
-// shortcut to the async_ function in the asynk module
+// shortcut to the asyncc function in the asynk module
const asyncc = asynk.asyncc;
You will probably use the above lines in all ZML projects. Also note that ZML is async and comes with its own async runtime, thanks to zigcoro.
Defining our Model
We will start with a very simple "Model": one that resembles a "multiply and add" operation.
/// Model definition
@@ -244,9 +244,9 @@
defer zml.aio.unloadBuffers(&model_weights); // for good practice
// Wait for compilation to finish
-const compiled = try compilation.await_();
+const compiled = try compilation.awaitt();
-Compiling is happening in the background via the async_
function. We call async_
with the zml.compileModel
function and its arguments separately. The arguments themselves are basically the shapes of the weights in the BufferStore, the .forward
function name in order to compile Layer.forward
, the shape of the input tensor(s), and the platform for which to compile (we used auto platform).
Creating the Executable Model
Now that we have compiled the module utilizing the shapes, we turn it into an executable.
// pass the model weights to the compiled module to create an executable module
+Compilation happens in the background via the asyncc function. We call asyncc with the zml.compileModel function and its arguments separately. The arguments are essentially the shapes of the weights in the BufferStore, the .forward function name (so that Layer.forward gets compiled), the shape of the input tensor(s), and the platform to compile for (we used the auto platform).
Creating the Executable Model
Now that we have compiled the module utilizing the shapes, we turn it into an executable.
// pass the model weights to the compiled module to create an executable module
var executable = try compiled.prepare(arena, model_weights);
defer executable.deinit();
@@ -384,7 +384,7 @@ Running it
With everything in place now, running the
defer zml.aio.unloadBuffers(&model_weights); // for good practice
// Wait for compilation to finish
- const compiled = try compilation.await_();
+ const compiled = try compilation.awaitt();
// pass the model weights to the compiled module to create an executable
// module