
Commit

Update GitHub Pages
github-actions committed Dec 2, 2024
1 parent d8b6db9 commit 5b0f669
Showing 6 changed files with 9 additions and 8 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -16,3 +16,4 @@ Tue Sep 17 10:57:15 UTC 2024
Tue Sep 17 14:31:10 UTC 2024
Wed Sep 18 11:13:08 UTC 2024
Fri Nov 15 13:46:13 UTC 2024
Mon Dec 2 21:30:54 UTC 2024
2 changes: 1 addition & 1 deletion howtos/deploy_on_server/index.html
@@ -154,7 +154,7 @@ <h3 class="centered"></h3>
});
</script>
<!-- <h1 :text="$page.title"></h1> -->
<div id="docs"><h1 id="deploying-models-on-a-server">Deploying Models on a Server</h1><p>To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you more likely want to cross-compile right from your development machine, <strong>for every</strong> supported target architecture and accelerator.</p><p>See <a href="/tutorials/getting_started/">Getting Started with ZML</a> if you need more information on how to compile a model.</p><p><strong>Here's a quick recap:</strong></p><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>So, to run the OpenLLama model from above <strong>on your development machine</strong> housing an NVIDIA GPU, run the following:</p><pre><code>cd examples
<div id="docs"><h1 id="deploying-models-on-a-server">Deploying Models on a Server</h1><p>To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you more likely want to cross-compile right from your development machine, <strong>for every</strong> supported target architecture and accelerator.</p><p>See <a href="/tutorials/getting_started/">Getting Started with ZML</a> if you need more information on how to compile a model.</p><p><strong>Here's a quick recap:</strong></p><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling / running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>So, to run the OpenLLama model from above <strong>on your development machine</strong> housing an NVIDIA GPU, run the following:</p><pre><code>cd examples
bazel run -c opt //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true
</code></pre><h2 id="cross-compiling-and-creating-a-tar-for-your-server">Cross-Compiling and creating a TAR for your server</h2><p>Currently, ZML lets you cross-compile to one of the following target architectures:</p><ul><li>Linux X86_64: <code>--platforms=@zml//platforms:linux_amd64</code></li><li>Linux ARM64: <code>--platforms=@zml//platforms:linux_arm64</code></li><li>MacOS ARM64: <code>--platforms=@zml//platforms:macos_arm64</code></li></ul><p>As an example, here is how you build above OpenLLama for CUDA on Linux X86_64:</p><pre><code>cd examples
bazel build -c opt //llama:OpenLLaMA-3B \
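The hunk above is cut off mid-command. Purely as a hedged sketch, the cross-compilation flags discussed on this page might be combined as follows; this only recombines flags already shown here and does not reconstruct the lines hidden by the truncated hunk (the TAR-packaging step itself is not visible in it):

cd examples
# cross-compile the example for a Linux x86_64 CUDA server,
# skipping the CPU runtime to cut down compilation time
bazel build -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:cpu=false \
    --platforms=@zml//platforms:linux_amd64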
4 changes: 2 additions & 2 deletions howtos/dockerize_models/index.html
@@ -233,7 +233,7 @@ <h3 id="1--the-manifest">1. The Manifest</h3><p>To get started, let's make bazel
</code></pre>
<p>This will push the <code>simple_layer</code> image with the tag <code>latest</code> (you can add more) to the docker registry:</p><pre><code>bazel run -c opt //simple_layer:push
</code></pre><p>When dealing with, say, both a public and a private container registry - or if you just want to try it out <strong>right now</strong> - you can always override the repository on the command line:</p><pre><code>bazel run -c opt //simple_layer:push -- --repository my.server.com/org/image
</code></pre><h2 id="adding-weights-and-data">Adding weights and data</h2><p>Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?</p><p>We'll use the <a href="https://github.com/zml/zml/tree/master/examples/mnist" target="_blank">MNIST example</a> to illustrate how to build Docker images that also contain data files.</p><p>You can <code>bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist</code> in the <code>./examples</code> folder if you want to try it out.</p><p><strong>Note: Please add one more of the following parameters to specify all the platforms your containerized model should support.</strong></p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p><strong>Example:</strong></p><pre><code>bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
</code></pre><h2 id="adding-weights-and-data">Adding weights and data</h2><p>Dockerizing a model that doesn't need any weights was easy. But what if you want to create a complete care-free package of a model plus all required weights and supporting files?</p><p>We'll use the <a href="https://github.com/zml/zml/tree/master/examples/mnist" target="_blank">MNIST example</a> to illustrate how to build Docker images that also contain data files.</p><p>You can <code>bazel run -c opt //mnist:push -- --repository index.docker.io/my_org/zml_mnist</code> in the <code>./examples</code> folder if you want to try it out.</p><p><strong>Note: Please add one more of the following parameters to specify all the platforms your containerized model should support.</strong></p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p><strong>Example:</strong></p><pre><code>bazel run //mnist:push -c opt --@zml//runtimes:cuda=true -- --repository index.docker.io/my_org/zml_mnist
</code></pre><h3 id="manifest-and-archive">Manifest and Archive</h3><p>We only add one more target to the <code>BUILD.bazel</code> to construct the commandline for the <code>entrypoint</code> of the container. All other steps basically remain the same.</p><p>Let's start with creating the manifest and archive:</p><pre><code class="python"><span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:expand_template.bzl&quot;</span>, <span class="string">&quot;expand_template&quot;</span>)
<span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:tar.bzl&quot;</span>, <span class="string">&quot;mtree_spec&quot;</span>, <span class="string">&quot;tar&quot;</span>)
<span class="function">load</span>(<span class="string">&quot;@aspect_bazel_lib//lib:transitions.bzl&quot;</span>, <span class="string">&quot;platform_transition_filegroup&quot;</span>)
@@ -301,7 +301,7 @@ <h3 id="entrypoint">Entrypoint</h3><p>Our container entrypoint commandline is no
<span class="variable">name</span> <span class="operator">=</span> <span class="string">&quot;image_&quot;</span>,
<span class="variable">base</span> <span class="operator">=</span> <span class="string">&quot;@distroless_cc_debian12&quot;</span>,
<span class="comment"># the entrypoint comes from the expand_template rule `entrypoint` above</span>
<span class="variable">entrypoint</span> <span class="operator">=</span> <span class="string">&quot;:entrypoint&quot;</span>,
<span class="variable">entrypoint</span> <span class="operator">=</span> <span class="string">&quot;:entrypoint&quot;</span>,
<span class="variable">tars</span> <span class="operator">=</span> [<span class="string">&quot;:archive&quot;</span>],
)

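As a hedged illustration of the multi-runtime note in the hunks above, a push that enables more than one accelerator runtime might look like this; it only recombines flags and the placeholder registry path already shown on this page, and is a sketch rather than an additional documented target:

# build and push an MNIST image that supports both NVIDIA and AMD GPUs,
# skipping the CPU runtime
bazel run -c opt //mnist:push \
    --@zml//runtimes:cuda=true \
    --@zml//runtimes:rocm=true \
    --@zml//runtimes:cpu=false \
    -- --repository index.docker.io/my_org/zml_mnist

The pushed image can then be pulled and run on the target host as usual; the exact run invocation depends on your registry and container runtime.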
Binary file modified sources.tar
2 changes: 1 addition & 1 deletion tutorials/getting_started/index.html
@@ -173,7 +173,7 @@ <h3 class="centered"></h3>
bazel run -c opt //llama:Meta-Llama-3-8b
bazel run -c opt //llama:Meta-Llama-3-8b -- --prompt=&quot;Once upon a time,&quot;
</code></pre><h2 id="run-tests">Run Tests</h2><pre><code>bazel test //zml:test
</code></pre><h2 id="running-models-on-gpu---tpu">Running Models on GPU / TPU</h2><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>The latter, avoiding compilation for CPU, cuts down compilation time.</p><p>So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:</p><pre><code>cd examples
</code></pre><h2 id="running-models-on-gpu---tpu">Running Models on GPU / TPU</h2><p>You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:</p><ul><li>NVIDIA CUDA: <code>--@zml//runtimes:cuda=true</code></li><li>AMD RoCM: <code>--@zml//runtimes:rocm=true</code></li><li>Google TPU: <code>--@zml//runtimes:tpu=true</code></li><li>AWS Trainium/Inferentia 2: <code>--@zml//runtimes:neuron=true</code></li><li><strong>AVOID CPU:</strong> <code>--@zml//runtimes:cpu=false</code></li></ul><p>The latter, avoiding compilation for CPU, cuts down compilation time.</p><p>So, to run the OpenLLama model from above on your host sporting an NVIDIA GPU, run the following:</p><pre><code>cd examples
bazel run -c opt //llama:OpenLLaMA-3B \
--@zml//runtimes:cuda=true \
-- --prompt=&quot;Once upon a time,&quot;
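A hedged variant of the run command above, targeting an AMD GPU instead of CUDA and skipping CPU compilation, per the runtime list in this hunk (a sketch that only recombines flags already listed, not an additional documented target):

cd examples
# run the same example on an AMD GPU, skipping the CPU runtime
bazel run -c opt //llama:OpenLLaMA-3B \
    --@zml//runtimes:rocm=true \
    --@zml//runtimes:cpu=false \
    -- --prompt="Once upon a time,"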
8 changes: 4 additions & 4 deletions tutorials/write_first_model/index.html
@@ -163,7 +163,7 @@ <h3 class="centered"></h3>
<span class="type qualifier">const</span> <span class="variable">zml</span> = <span class="function builtin">@import</span><span class="punctuation bracket">(</span><span class="string">&quot;zml&quot;</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">asynk</span> = <span class="function builtin">@import</span><span class="punctuation bracket">(</span><span class="string">&quot;async&quot;</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>

<span class="comment">// shortcut to the async_ function in the asynk module</span>
<span class="comment">// shortcut to the asyncc function in the asynk module</span>
<span class="type qualifier">const</span> <span class="variable">asyncc</span> = <span class="variable">asynk</span><span class="punctuation delimiter">.</span><span class="field">asyncc</span><span class="punctuation delimiter">;</span>
</code></pre>
<p>You will probably use the above lines in all ZML projects. Also, note that <strong>ZML is async</strong> and comes with its own async runtime, thanks to <a href="https://github.com/rsepassi/zigcoro" target="_blank">zigcoro</a>.</p><h3 id="defining-our-model">Defining our Model</h3><p>We will start with a very simple "Model": one that resembles a "multiply and add" operation.</p><pre><code class="zig"><span class="comment">/// Model definition
@@ -244,9 +244,9 @@ <h3 class="centered"></h3>
<span class="variable">defer</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">aio</span><span class="punctuation delimiter">.</span><span class="function">unloadBuffers</span><span class="punctuation bracket">(</span><span class="operator">&</span><span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="error">;</span> <span class="comment">// for good practice</span>

<span class="comment">// Wait for compilation to finish</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">await_</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">awaitt</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
</code></pre>
<p>Compiling is happening in the background via the <code>async_</code> function. We call <code>async_</code> with the <code>zml.compileModel</code> function and its arguments separately. The arguments themselves are basically the shapes of the weights in the BufferStore, the <code>.forward</code> function name in order to compile <code>Layer.forward</code>, the shape of the input tensor(s), and the platform for which to compile (we used auto platform).</p><h3 id="creating-the-executable-model">Creating the Executable Model</h3><p>Now that we have compiled the module utilizing the shapes, we turn it into an executable.</p><pre><code class="zig"><span class="comment">// pass the model weights to the compiled module to create an executable module</span>
<p>Compiling is happening in the background via the <code>asyncc</code> function. We call <code>asyncc</code> with the <code>zml.compileModel</code> function and its arguments separately. The arguments themselves are basically the shapes of the weights in the BufferStore, the <code>.forward</code> function name in order to compile <code>Layer.forward</code>, the shape of the input tensor(s), and the platform for which to compile (we used auto platform).</p><h3 id="creating-the-executable-model">Creating the Executable Model</h3><p>Now that we have compiled the module utilizing the shapes, we turn it into an executable.</p><pre><code class="zig"><span class="comment">// pass the model weights to the compiled module to create an executable module</span>
<span class="type qualifier">var</span> <span class="variable">executable</span> = <span class="operator">try</span> <span class="variable">compiled</span><span class="punctuation delimiter">.</span><span class="function">prepare</span><span class="punctuation bracket">(</span><span class="variable">arena</span><span class="punctuation delimiter">,</span> <span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="variable">defer</span> <span class="variable">executable</span><span class="punctuation delimiter">.</span><span class="function">deinit</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="error">;</span>
</code></pre>
@@ -384,7 +384,7 @@ <h2 id="running-it">Running it</h2><p>With everything in place now, running the
<span class="keyword">defer</span> <span class="variable">zml</span><span class="punctuation delimiter">.</span><span class="field">aio</span><span class="punctuation delimiter">.</span><span class="function">unloadBuffers</span><span class="punctuation bracket">(</span><span class="operator">&</span><span class="variable">model_weights</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span> <span class="comment">// for good practice</span>

<span class="comment">// Wait for compilation to finish</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">await_</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>
<span class="type qualifier">const</span> <span class="variable">compiled</span> = <span class="operator">try</span> <span class="variable">compilation</span><span class="punctuation delimiter">.</span><span class="function">awaitt</span><span class="punctuation bracket">(</span><span class="punctuation bracket">)</span><span class="punctuation delimiter">;</span>

<span class="comment">// pass the model weights to the compiled module to create an executable</span>
<span class="comment">// module</span>

