
Commit

Update GitHub Pages
github-actions committed Dec 2, 2024
1 parent d8b6db9 commit 5b0f669
Showing 6 changed files with 9 additions and 8 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -16,3 +16,4 @@ Tue Sep 17 10:57:15 UTC 2024
Tue Sep 17 14:31:10 UTC 2024
Wed Sep 18 11:13:08 UTC 2024
Fri Nov 15 13:46:13 UTC 2024
Mon Dec 2 21:30:54 UTC 2024
2 changes: 1 addition & 1 deletion howtos/deploy_on_server/index.html
@@ -154,7 +154,7 @@ <h3 class="centered"></h3>
});
</script>
<!-- <h1 :text="$page.title"></h1> -->
<div id="docs"><h1 id="deploying-models-on-a-server">Deploying Models on a Server</h1><p>To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you more likely want to cross-compile right from your development machine, <strong>for every</strong> supported target architecture and accelerator.</p><p>See <a href="/tutorials/getting_started/">Getting Started with ZML</a> if you need more information on how to compile a model.</p><p><strong>Here's a quick recap:</strong></p><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>So, to run the OpenLLama model from above <strong>on your development machine</strong> housing an NVIDIA GPU, run the following:</p><pre><code>cd examples
<div id="docs"><h1 id="deploying-models-on-a-server">Deploying Models on a Server</h1><p>To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you more likely want to cross-compile right from your development machine, <strong>for every</strong> supported target architecture and accelerator.</p><p>See <a href="/tutorials/getting_started/">Getting Started with ZML</a> if you need more information on how to compile a model.</p><p><strong>Here's a quick recap:</strong></p><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>So, to run the OpenLLama model from above <strong>on your development machine</strong> housing an NVIDIA GPU, run the following:</p><pre><code>cd examples
bazel run -c opt //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
</code></pre><h2 id="cross-compiling-and-creating-a-tar-for-your-server">Cross-Compiling and creating a TAR for your server</h2><p>Currently, ZML lets you cross-compile to one of the following target architectures:</p><ul><li>Linux X86_64: <code>--platforms=@zml//platforms:linux_amd64</code></li><li>Linux ARM64: <code>--platforms=@zml//platforms:linux_arm64</code></li><li>MacOS ARM64: <code>--platforms=@zml//platforms:macos_arm64</code></li></ul><p>As an example, here is how you build above OpenLLama for CUDA on Linux X86_64:</p><pre><code>cd examples
bazel build -c opt //llama:OpenLLaMA-3B \
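The hunk above is cut off mid-command. Purely as a hedged sketch, the cross-compilation flags discussed on this page might be combined as follows; this only recombines flags already shown here and does not reconstruct the lines hidden by the truncated hunk (the TAR-packaging step itself is not visible in it):

cd examples
# cross-compile the example for a Linux x86_64 CUDA server,
# skipping the CPU runtime to cut down compilation time
bazel build -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:cpu=false \
    --platforms=@zml//platforms:linux_amd64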
4 changes: 2 additions & 2 deletions howtos/dockerize_models/index.html
@@ -233,7 +233,7 @@ <h3 id="1--the-manifest">1. The Manifest</h3><p>To get started, let's make bazel
</code></pre>
<p>This will push the <code>simple_layer</code> image with the tag <code>latest</code> (you can add more) to the docker registry:</p><pre><code>bazel run -c opt //simple_layer:push
</code></pre><p>When dealing with, say, both a public and a private container registry - or if you just want to try it out <strong>right now</strong> - you can always override the repository on the command line:</p><pre><code>bazel run -c opt //simple_layer:push -- --repository my.server.com/org/image
</code></pre><h2 id="adding-weights-and-data">Adding weights and data</h2><p>Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?</p><p>We'll use the <a href="https://github.com/zml/zml/tree/master/examples/mnist" target="_blank">MNIST example</a> to illustrate how to build Docker images that also contain data files.</p><p>You can <code>bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist</code> in the <code>./examples</code> folder if you want to try it out.</p><p><strong>Note: Please add one more of the following parameters to specify all the platforms your containerized model should support.</strong></p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p><strong>Example:</strong></p><pre><code>bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
</code></pre><h2 id="adding-weights-and-data">Adding weights and data</h2><p>Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?</p><p>We'll use the <a href="https://github.com/zml/zml/tree/master/examples/mnist" target="_blank">MNIST example</a> to illustrate how to build Docker images that also contain data files.</p><p>You can <code>bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist</code> in the <code>./examples</code> folder if you want to try it out.</p><p><strong>Note: Please add one more of the following parameters to specify all the platforms your containerized model should support.</strong></p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p><strong>Example:</strong></p><pre><code>bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
</code></pre><h3 id="manifest-and-archive">Manifest and Archive</h3><p>We only add one more target to the <code>BUILD.bazel</code> to construct the commandline for the <code>entrypoint</code> of the container. All other steps basically remain the same.</p><p>Let's start with creating the manifest and archive:</p><pre><code class="python"><span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:expand_template.bzl&quot;</span>, <span class="string">&quot;expand_template&quot;</span>)
<span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:tar.bzl&quot;</span>, <span class="string">&quot;mtree_spec&quot;</span>, <span class="string">&quot;tar&quot;</span>)
<span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:transitions.bzl&quot;</span>, <span class="string">&quot;platform_transition_filegroup&quot;</span>)
@@ -301,7 +301,7 @@ <h3 id="entrypoint">Entrypoint</h3><p>Our container entrypoint commandline is no
<span class="variable">name</span> <span class="operator">=</span> <span class="string">&quot;image_&quot;</span>,
<span class="variable">base</span> <span class="operator">=</span> <span class="string">&quot;@distroless_cc_debian12&quot;</span>,
<span class="comment"># the entrypoint comes from the expand_template rule `entrypoint` above</span>
<span class="variable">entrypoint</span> <span class="operator">=</span> <span class="string">&quot;:entrypoint&quot;</span>,
<span class="variable">entrypoint</span> <span class="operator">=</span> <span class="string">&quot;:entrypoint&quot;</span>,
<span class="variable">tars</span> <span class="operator">=</span> [<span class="string">&quot;:archive&quot;</span>],
)

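As a hedged illustration of the multi-runtime note in the hunks above, a push that enables more than one accelerator runtime might look like this; it only recombines flags and the placeholder registry path already shown on this page, and is a sketch rather than an additional documented target:

# build and push an MNIST image that supports both NVIDIA and AMD GPUs,
# skipping the CPU runtime
bazel run -c opt //mnist:push \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:rocm=true \
    --@zml//runtimes:cpu=false \
    -- --repository index.docker.io/my_org/zml_mnist

The pushed image can then be pulled and run on the target host as usual; the exact run invocation depends on your registry and container runtime.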
Binary file modified sources.tar
2 changes: 1 addition & 1 deletion tutorials/getting_started/index.html
@@ -173,7 +173,7 @@ <h3 class="centered"></h3>
bazel run -c opt //llama:Meta-Llama-3-8b
bazel run -c opt //llama:Meta-Llama-3-8b -- --prompt=&quot;Once upon a time,&quot;
</code></pre><h2 id="run-tests">Run Tests</h2><pre><code>bazel test //zml:test
</code></pre><h2 id="running-models-on-gpu---tpu">Running Models on GPU / TPU</h2><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>The latter, avoiding compilation for CPU, cuts down compilation time.</p><p>So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:</p><pre><code>cd examples
</code></pre><h2 id="running-models-on-gpu---tpu">Running Models on GPU / TPU</h2><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>The latter, avoiding compilation for CPU, cuts down compilation time.</p><p>So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:</p><pre><code>cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
--@zml//runtimes:cuda=true \
-- --prompt=&quot;Once upon a time,&quot;
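A hedged variant of the run command above, targeting an AMD GPU instead of CUDA and skipping CPU compilation, per the runtime list in this hunk (a sketch that only recombines flags already listed, not an additional documented target):

cd examples
# run the same example on an AMD GPU, skipping the CPU runtime
bazel run -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:rocm=true \
    --@zml//runtimes:cpu=false \
    -- --prompt="Once upon a time,"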
8 changes: 4 additions & 4 deletions tutorials/write_first_model/index.html
@@ -163,7 +163,7 @@ <h3 class="centered"></h3>
<span class="type qualifier">const</span> <span class="variable">zml</span> = <span class="function builtin">@import</span><span class="punctuation bracket">(</span><span class="string">&quot;zml&quot;</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">asynk</span> = <span class="function builtin">@import</span><span class="punctuation bracket">(</span><span class="string">&quot;async&quot;</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>

<span class="comment">// shortcut to the async_ function in the asynk module</span>
<span class="comment">// shortcut to the asyncc function in the asynk module</span>
<span class="type qualifier">const</span> <span class="variable">asyncc</span> = <span class="variable">asynk</span><span class="punctuation delimiter">.</span><span class="field">asyncc</span><span class="punctuation delimiter">;</span>
</code></pre>
<p>You will probably use the above lines in all ZML projects. Also, note that <strong>ZML is async</strong> and comes with its own async runtime, thanks to <a href="https://github.com/rsepassi/zigcoro" target="_blank">zigcoro</a>.</p><h3 id="defining-our-model">Defining our Model</h3><p>We will start with a very simple "Model": one that resembles a "multiply and add" operation.</p><pre><code class="zig"><span class="comment">/// Model definition
@@ -244,9 +244,9 @@ <h3 class="centered"></h3>
<span class="variable">defer</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">aio</span><span class="punctuation delimiter">.</span><span class="function">unloadBuffers</span><span class="punctuation bracket">(</span><span class="operator">&</span><span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="error">;</span> <span class="comment">// for good practice</span>

<span class="comment">// Wait for compilation to finish</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">await_</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">awaitt</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
</code></pre>
<p>Compiling is happening in the background via the <code>async_</code> function. We call <code>async_</code> with the <code>zml.compileModel</code> function and its arguments separately. The arguments themselves are basically the shapes of the weights in the BufferStore, the <code>.forward</code> function name in order to compile <code>Layer.forward</code>, the shape of the input tensor(s), and the platform for which to compile (we used auto platform).</p><h3 id="creating-the-executable-model">Creating the Executable Model</h3><p>Now that we have compiled the module utilizing the shapes, we turn it into an executable.</p><pre><code class="zig"><span class="comment">// pass the model weights to the compiled module to create an executable module</span>
<p>Compiling is happening in the background via the <code>asyncc</code> function. We call <code>asyncc</code> with the <code>zml.compileModel</code> function and its arguments separately. The arguments themselves are basically the shapes of the weights in the BufferStore, the <code>.forward</code> function name in order to compile <code>Layer.forward</code>, the shape of the input tensor(s), and the platform for which to compile (we used auto platform).</p><h3 id="creating-the-executable-model">Creating the Executable Model</h3><p>Now that we have compiled the module utilizing the shapes, we turn it into an executable.</p><pre><code class="zig"><span class="comment">// pass the model weights to the compiled module to create an executable module</span>
<span class="type qualifier">var</span> <span class="variable">executable</span> = <span class="operator">try</span> <span class="variable">compiled</span><span class="punctuation delimiter">.</span><span class="function">prepare</span><span class="punctuation bracket">(</span><span class="variable">arena</span><span class="punctuation delimiter">,</span> <span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="variable">defer</span> <span class="variable">executable</span><span class="punctuation delimiter">.</span><span class="function">deinit</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="error">;</span>
</code></pre>
@@ -384,7 +384,7 @@ <h2 id="running-it">Running it</h2><p>With everything in place now, running the
<span class="keyword">defer</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">aio</span><span class="punctuation delimiter">.</span><span class="function">unloadBuffers</span><span class="punctuation bracket">(</span><span class="operator">&</span><span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span> <span class="comment">// for good practice</span>

<span class="comment">// Wait for compilation to finish</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">await_</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">awaitt</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>

<span class="comment">// pass the model weights to the compiled module to create an executable</span>
<span class="comment">// module</span>

