diff --git a/README.md b/README.md
index 27189fe..e21a8d5 100644
--- a/README.md
+++ b/README.md
@@ -16,3 +16,4 @@ Tue Sep 17 10:57:15 UTC 2024
 Tue Sep 17 14:31:10 UTC 2024
 Wed Sep 18 11:13:08 UTC 2024
 Fri Nov 15 13:46:13 UTC 2024
+Mon Dec 2 21:30:54 UTC 2024
diff --git a/howtos/deploy_on_server/index.html b/howtos/deploy_on_server/index.html
index 70b9bc7..fbce18f 100644
--- a/howtos/deploy_on_server/index.html
+++ b/howtos/deploy_on_server/index.html
@@ -154,7 +154,7 @@

-

Deploying Models on a Server

When running models on remote GPU/TPU machines, it is inconvenient to check out your project’s repository and compile it on every target. Instead, you will more likely want to cross-compile directly from your development machine, for every supported target architecture and accelerator.

See Getting Started with ZML if you need more information on how to compile a model.

Here's a quick recap:

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

So, to run the OpenLLama model from above on a development machine with an NVIDIA GPU, run the following:

cd examples
+  

Deploying Models on a Server

When running models on remote GPU/TPU machines, it is inconvenient to check out your project’s repository and compile it on every target. Instead, you will more likely want to cross-compile directly from your development machine, for every supported target architecture and accelerator.

See Getting Started with ZML if you need more information on how to compile a model.

Here's a quick recap:

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • AWS Trainium/Inferentia 2: --@zml//runtimes:neuron=true
  • AVOID CPU: --@zml//runtimes:cpu=false

So, to run the OpenLLama model from above on a development machine with an NVIDIA GPU, run the following:

cd examples
 bazel run -c opt //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
 

Cross-Compiling and creating a TAR for your server

Currently, ZML lets you cross-compile to one of the following target architectures:

  • Linux X86_64: --platforms=@zml//platforms:linux_amd64
  • Linux ARM64: --platforms=@zml//platforms:linux_arm64
  • MacOS ARM64: --platforms=@zml//platforms:macos_arm64

As an example, here is how you build the OpenLLama model from above for CUDA on Linux X86_64:

cd examples
 bazel build -c opt //llama:OpenLLaMA-3B               \
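Combining the runtime and platform flags from the lists above, the complete invocation would look roughly like this. This is a sketch: the exact continuation of the command above is an assumption, only the individual flags are documented.

cd examples
 # cross-compile the OpenLLaMA example for Linux x86_64 with CUDA enabled
 bazel build -c opt //llama:OpenLLaMA-3B               \
           --@zml//runtimes:cuda=true                  \
           --platforms=@zml//platforms:linux_amd64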
diff --git a/howtos/dockerize_models/index.html b/howtos/dockerize_models/index.html
index 864e032..c944ff2 100644
--- a/howtos/dockerize_models/index.html
+++ b/howtos/dockerize_models/index.html
@@ -233,7 +233,7 @@ 

1. The Manifest

To get started, let's make bazel

This will push the simple_layer image with the tag latest (you can add more) to the docker registry:

bazel run -c opt //simple_layer:push
 

When you are dealing with both a public and a private container registry, or if you just want to try it out right now, you can always override the repository on the command line:

bazel run -c opt //simple_layer:push -- --repository my.server.com/org/image
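Under the hood, a push target like //simple_layer:push is typically declared with the oci_push rule from rules_oci. A minimal sketch, where the image target, repository, and tag are placeholders rather than the exact values from the ZML examples:

load("@rules_oci//oci:defs.bzl", "oci_push")

# Sketch only: image target, repository, and tags are placeholders.
oci_push(
    name = "push",
    image = ":image",
    repository = "my.server.com/org/image",
    remote_tags = ["latest"],
)

The -- --repository flag shown above then simply overrides the declared repository at push time.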
-

Adding weights and data

Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?

We'll use the MNIST example to illustrate how to build Docker images that also contain data files.

You can bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist in the ./examples folder if you want to try it out.

Note: Please add one or more of the following parameters to specify all the platforms your containerized model should support.

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • AVOID CPU: --@zml//runtimes:cpu=false

Example:

bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
+

Adding weights and data

Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?

We'll use the MNIST example to illustrate how to build Docker images that also contain data files.

You can bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist in the ./examples folder if you want to try it out.

Note: Please add one or more of the following parameters to specify all the platforms your containerized model should support.

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • AWS Trainium/Inferentia 2: --@zml//runtimes:neuron=true
  • AVOID CPU: --@zml//runtimes:cpu=false

Example:

bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
 

Manifest and Archive

We only add one more target to the BUILD.bazel to construct the command line for the entrypoint of the container. All other steps basically remain the same.

Let's start with creating the manifest and archive:

load("@aspect_bazel_lib//lib:expand_template.bzl", "expand_template")
 load("@aspect_bazel_lib//lib:tar.bzl", "mtree_spec", "tar")
 load("@aspect_bazel_lib//lib:transitions.bzl", "platform_transition_filegroup")
@@ -301,7 +301,7 @@ 

Entrypoint

Our container entrypoint command line is no
    name = "image_",
    base = "@distroless_cc_debian12",
    # the entrypoint comes from the expand_template rule `entrypoint` above
-   entrypoint = ":entrypoint",
+   entrypoint = ":entrypoint",
    tars = [":archive"],
)
diff --git a/sources.tar b/sources.tar
index f84e8f7..da6c07c 100755
Binary files a/sources.tar and b/sources.tar differ
diff --git a/tutorials/getting_started/index.html b/tutorials/getting_started/index.html
index 8327859..8530a97 100644
--- a/tutorials/getting_started/index.html
+++ b/tutorials/getting_started/index.html
@@ -173,7 +173,7 @@

bazel run -c opt //llama:Meta-Llama-3-8b
 bazel run -c opt //llama:Meta-Llama-3-8b -- --prompt="Once upon a time,"

Run Tests

bazel test //zml:test
-

Running Models on GPU / TPU

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • AVOID CPU: --@zml//runtimes:cpu=false

The latter, avoiding compilation for CPU, cuts down compilation time.

So, to run the OpenLLama model from above on a host with an NVIDIA GPU, run the following:

cd examples
+

Running Models on GPU / TPU

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • AWS Trainium/Inferentia 2: --@zml//runtimes:neuron=true
  • AVOID CPU: --@zml//runtimes:cpu=false

The latter, avoiding compilation for CPU, cuts down compilation time.

So, to run the OpenLLama model from above on a host with an NVIDIA GPU, run the following:

cd examples
 bazel run -c opt //llama:OpenLLaMA-3B             \
           --@zml//runtimes:cuda=true              \
           -- --prompt="Once upon a time,"
diff --git a/tutorials/write_first_model/index.html b/tutorials/write_first_model/index.html
index 7760aa1..d0c8b18 100644
--- a/tutorials/write_first_model/index.html
+++ b/tutorials/write_first_model/index.html
@@ -163,7 +163,7 @@ 

const zml = @import("zml"); const asynk = @import("async"); -// shortcut to the async_ function in the asynk module +// shortcut to the asyncc function in the asynk module const asyncc = asynk.asyncc;

You will probably use the above lines in all ZML projects. Also note that ZML is async and comes with its own async runtime, thanks to zigcoro.

Defining our Model

We will start with a very simple "model": one that resembles a "multiply and add" operation.

/// Model definition
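To make this concrete, here is a minimal sketch of what such a multiply-and-add layer could look like. The struct name Layer and its forward function are referenced later in the tutorial; the field names and the tensor methods mul and add are assumptions, not the tutorial's verbatim code:

/// Model definition (sketch)
const Layer = struct {
    weight: zml.Tensor,
    bias: zml.Tensor,

    /// forward computes weight * x + bias
    pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
        return self.weight.mul(x).add(self.bias);
    }
};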
@@ -244,9 +244,9 @@ 

defer zml.aio.unloadBuffers(&model_weights); // for good practice

// Wait for compilation to finish
-const compiled = try compilation.await_();
+const compiled = try compilation.awaitt();
-

Compilation happens in the background via the async_ function. We call async_ with the zml.compileModel function and its arguments separately. The arguments are essentially the shapes of the weights in the BufferStore, the .forward function name (so that Layer.forward gets compiled), the shape of the input tensor(s), and the platform to compile for (we used the auto platform).

Creating the Executable Model

Now that we have compiled the module using the shapes, we turn it into an executable.

// pass the model weights to the compiled module to create an executable module
+

Compilation happens in the background via the asyncc function. We call asyncc with the zml.compileModel function and its arguments separately. The arguments are essentially the shapes of the weights in the BufferStore, the .forward function name (so that Layer.forward gets compiled), the shape of the input tensor(s), and the platform to compile for (we used the auto platform).
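Written out as code, the call described above looks roughly like this. It is a sketch based purely on that description: the argument order inside the tuple and the variable names model_shapes, input_shape, and platform are assumptions, not the tutorial's verbatim code.

// Kick off compilation in the background: asyncc takes the function and
// its argument tuple separately (sketch; argument order is assumed).
var compilation = try asyncc(zml.compileModel, .{
    model_shapes,   // shapes of the weights from the BufferStore
    .forward,       // compile Layer.forward
    .{input_shape}, // shape of the input tensor(s)
    platform,       // platform to compile for (the auto platform)
});

// Later: wait for compilation to finish
const compiled = try compilation.awaitt();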

Creating the Executable Model

Now that we have compiled the module using the shapes, we turn it into an executable.

// pass the model weights to the compiled module to create an executable module
 var executable = try compiled.prepare(arena, model_weights);
 defer executable.deinit();
 
@@ -384,7 +384,7 @@

Running it

With everything in place now, running the
    defer zml.aio.unloadBuffers(&model_weights); // for good practice

    // Wait for compilation to finish
-   const compiled = try compilation.await_();
+   const compiled = try compilation.awaitt();

    // pass the model weights to the compiled module to create an executable
    // module