Pex zip-creation takes a very long time for torch>=2 #2292

Closed
tgolsson opened this issue Nov 16, 2023 · 15 comments · Fixed by #2298

Comments

@tgolsson
Contributor

Hey!

Not sure if this is actionable, but maybe there's something here that can be done. I was investigating another issue today and ended up seeing a very slow Pants package step (~5 minutes). The issue reproduces with the simple command line pex -vvv torch>=2 -o t2.2.pex. This takes ~280 seconds on my machine, of which ~210-220 seconds are spent purely in the zip step:

<snip>
pex: Building pex: 70298.8ms
pex:   Adding distributions from pexes: : 0.1ms
pex:   Resolving distributions for requirements: torch: 70294.7ms
pex:     Resolving requirements.: 70294.6ms
pex:       Resolving for:
  /usr/bin/python3.10: 55574.5ms
pex:       Calculating project names for direct requirements:
  PyPIRequirement(line=LogicalLine(raw_text='torch', processed_text='torch', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('')>, marker=None), editable=False): 0.1ms
pex:       Installing 22 distributions: 9352.3ms
pex:       Checking install: 2.7ms
pex:   Configuring PEX dependencies: 3.4ms
pex: Zipping PEX file.: 213135.5ms

This turns out to be a 2.5 GB PEX, which admittedly is on the fat side. Unzipping this beast takes ~30 seconds, and zipping it with regular zip takes ~230 seconds. zip -1 takes ~100 seconds and adds ~10% to the size. zip -0 takes 12 seconds but doubles the size. Seeing as compression seems to account for the majority of the runtime, I did a very quick hack (outside of Pex) where I moved the compress step to a process pool (since it's CPU-heavy). With that, I get ~30 seconds at level 1, or ~60 seconds at level 6: a 3-4x speedup. It may be possible to push this a bit higher by playing with ordering.
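
A minimal sketch of the idea of compressing entries concurrently, not the hack described above. It only covers the CPU-heavy deflate step and does not write a zip file. It swaps the process pool for a thread pool, on the assumption that CPython's zlib releases the GIL while compressing; the function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def deflate_raw(data: bytes, level: int = 6) -> bytes:
    """Compress one blob as a raw DEFLATE stream (wbits=-15: no zlib header),
    which is the stream format zip entries carry."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def deflate_many(blobs: Sequence[bytes], level: int = 6, workers: int = 4) -> List[bytes]:
    """Compress each blob independently on a worker pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda blob: deflate_raw(blob, level), blobs))


def inflate_raw(data: bytes) -> bytes:
    """Inverse of deflate_raw; used here only to check the round trip."""
    return zlib.decompress(data, wbits=-15)


if __name__ == "__main__":
    # Synthetic stand-ins for file contents; real .so files compress differently.
    blobs = [bytes([i]) * 1_000_000 for i in range(8)]
    compressed = deflate_many(blobs, level=1)
    assert [inflate_raw(c) for c in compressed] == blobs
    print(sum(map(len, compressed)), "compressed bytes")
```

Whether this yields anything like the 3-4x reported above would need measuring on real wheel contents.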

I also played around with the store-only-by-suffix capabilities, but it seems like the .so's make up the bulk of both the compression potential and time: only compressing text-like files gives a ~4.3 GB zip in 20 seconds.
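
For illustration, a sketch of choosing the compression method per entry by suffix with Python's zipfile module. The suffix list and the function name are assumptions, not what was used for the numbers above.

```python
import io
import zipfile
from typing import Dict, Iterable

# Hypothetical choice of "text-like" suffixes; everything else is STORED.
TEXT_SUFFIXES = (".py", ".pyi", ".txt", ".json")


def build_zip(entries: Dict[str, bytes], compress_suffixes: Iterable[str] = TEXT_SUFFIXES) -> bytes:
    """Write entries to an in-memory zip, deflating only names with a listed suffix."""
    suffixes = tuple(compress_suffixes)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, data in entries.items():
            method = zipfile.ZIP_DEFLATED if name.endswith(suffixes) else zipfile.ZIP_STORED
            zf.writestr(name, data, compress_type=method)
    return buffer.getvalue()
```

With this split, large binaries such as .so files are copied through uncompressed, which matches the observation that skipping them saves most of the time but also gives up most of the size reduction.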

With all that said, I'm mostly curious if this is something that has been discussed elsewhere (found nothing while searching), and what kind of solution might be palatable relative to the gains that can be made. I'm willing to contribute something based on the work I've done so far, or investigate other suggested approaches.

@jsirois
Member

jsirois commented Nov 16, 2023

This has come up before. Two concrete results are the support for --layout packed introduced in #1431 / #1438 and the --no-compress Pex build option introduced in #1705. The associated issues have more discussion.

If neither --layout packed, which amortizes the slow zip to once per wheel and is used by Pants internally for this and other reasons, nor --no-compress are satisfactory, the only other approaches I can see are:

  1. Speed up zip.
  2. Cache zips as is done for --layout packed, but made usable for monolithic PEX zips.

@cosmicexplorer explored both and came up wanting. I think #2158 is probably the best entrypoint into that work.

@tgolsson
Contributor Author

I'll have a peek at those, thanks. We already use layout=packed (+execution_mode=venv) in some situations. In the specific case where I hit this I was running a python_source where I don't have control over that, and I'm not sure what the default is. The timings seem to end up the same as with the command posted though.

--no-compress I think could work; if I can pass it into pants somewhere. Most of our pex building (with gpu wheels) is either to execute it immediately or to unpack it into a container. We have only one use-case for pex-at-rest, and that is a fraction of the size of these big GPU packages.

I still do think there is great value to being performant "by default" though, but maybe my effort is better invested into contributing to the already existing work by @cosmicexplorer -- will see if there's anything I can do there.

@jsirois
Member

jsirois commented Nov 16, 2023

> I still do think there is great value to being performant "by default" though

I agree there, but the only real solution for that is faster zip support. FWICT that is a problem for native code and not really related to Pex at all. With that implemented though, Pex - and many other tools - could benefit.

To be honest though, I think trying to make Pex - or any zipapp implementation - faster for behemoths like pytorch is fighting the wrong battle altogether. I imagine a much "simpler" way to do this is to not use a zipapp. For example, one might imagine a scie that contained all the resolved wheels for a zipapp, but not pre-installed wheels like PEXes contain, the actual wheel files downloaded from PyPI. The scie could then use PBS's Python distributions support for -mvenv to create a venv and install the contained wheels. This would mean there is 0 compression time or effort spent packaging the scie since the wheels are used as-is and just cat'ed to the scie and there is only the 1 time install time of unzipping.
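
A rough sketch of the "create a venv, then install the shipped wheels" step, assuming the raw .whl files have already been unpacked into a directory. It does not model a scie or a PBS distribution; it uses the running interpreter's venv module as a stand-in, and the function names are hypothetical.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List, Sequence


def venv_python(env_dir: Path) -> Path:
    """Interpreter path inside a venv (the layout differs on Windows)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def pip_install_command(env_dir: Path, wheels: Sequence[Path]) -> List[str]:
    """Build a pip command that installs exactly the given wheel files.

    --no-index keeps pip off the network and --no-deps trusts that the
    shipped wheels are already a complete resolve.
    """
    return [
        str(venv_python(env_dir)), "-m", "pip", "install",
        "--no-index", "--no-deps", *map(str, wheels),
    ]


def install_wheels(env_dir: Path, wheel_dir: Path) -> None:
    """Create a venv at env_dir and install every .whl found in wheel_dir."""
    wheels = sorted(wheel_dir.glob("*.whl"))
    venv.EnvBuilder(with_pip=True).create(env_dir)
    subprocess.run(pip_install_command(env_dir, wheels), check=True)
```

Nothing is compressed at packaging time in this scheme; the only unzip happens once, when pip installs each wheel.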

@jsirois
Member

jsirois commented Nov 16, 2023

Alternatively, instead of the scie containing raw wheel files, a PEX could. Pex would then need to learn how to install wheels though at runtime. Currently it lets Pip do this at build time. In this way the whl contents of a PEX could be stored as STORED by default.

@tgolsson
Contributor Author

> I agree there, but the only real solution for that is faster zip support. FWICT that is a problem for native code and not really related to Pex at all. With that implemented though, Pex - and many other tools - could benefit.

That is also an option, and it looks like it was explored fairly well. I'll see if that can be landed; it'd definitely be good. My approach is native Python, but probably a lot hackier since it depends heavily on zipfile internals.

> To be honest though, I think trying to make Pex - or any zipapp implementation - faster for behemoths like pytorch is fighting the wrong battle altogether. I imagine a much "simpler" way to do this is to not use a zipapp. For example, one might imagine a scie that contained all the resolved wheels for a zipapp, but not pre-installed wheels like PEXes contain, the actual wheel files downloaded from PyPI. The scie could then use PBS's Python distributions support for -mvenv to create a venv and install the contained wheels. This would mean there is 0 compression time or effort spent packaging the scie since the wheels are used as-is and just cat'ed to the scie and there is only the 1 time install time of unzipping.

I think my stance on torch is that whatever they do, doing the opposite is likely better. My life (and yours, by extension) would be a lot better if we didn't have to think about why they decide to ship a whole copy of CUDA in their wheels, or why their native component is larger than the Linux kernel when built 🤷 Inexplicably, the situation is even worse now that more of CUDA is on PyPI.

> Alternatively, instead of the scie containing raw wheel files, a PEX could. Pex would then need to learn how to install wheels though at runtime. Currently it lets Pip do this at build time. In this way the whl contents of a PEX could be stored as STORED by default.

Hmm. That doesn't sound half bad, at least for some use-cases. I guess it'd be almost the same size as well, since zip only uses local compression. A wheel install is pretty much guaranteed to be isolated, right? I'm not sure I can fully see the implications for Pants though, or how it'd end up working in every situation (pants package vs run vs export...).

@jsirois
Member

jsirois commented Nov 18, 2023

> Hmm. That doesn't sound half bad, at least for some use-cases. I guess it'd be almost the same size as well, since zip only uses local compression. A wheel install is pretty much guaranteed to be isolated, right? I'm not sure I can fully see the implications for Pants though, or how it'd end up working in every situation (pants package vs run vs export...).

This would be opaque to all Pex users at runtime. The PEX zipapp would use STORED unadulterated .whl files instead of today's DEFLATED installed wheel chroots and the packed layout would use .deps/X.whl unadulterated .whl files instead of today's zipped-up installed wheel chroots. At runtime, new Pex installer code would install from these internal files (unzip + spread as per https://packaging.python.org/en/latest/specifications/binary-distribution-format/#installing-a-wheel-distribution-1-0-py32-none-any-whl ... plus a little more since that spec is actually wanting for how console scripts are actually handled in the wild) into the ~/.pex/installed_wheels (and then create a venv if using --venv from there), exactly as today.

I really do think this is the right way to go. Don't speed up zipping, avoid unzipping (installing wheels at build time) + zipping (back into a PEX zipapp or packed layout ~wheel zips) altogether. There will still be an unzip on a cold cache for the 1st boot at runtime, but since zipfile.ZipFile(zipfile.ZipFile("the.pex").open(".deps/X.whl")).extractall("here") works and is efficient, this should be ~the same PEX 1st boot install time as today.
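
The nested-zip one-liner above can be tried on a toy archive. This sketch builds a tiny stand-in wheel, stores it uncompressed in an outer zip under .deps/, and extracts it straight from the outer archive. The wheel name and contents are made up, and it assumes Python 3.7+, where the file object returned by ZipFile.open is seekable.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def make_wheel_bytes() -> bytes:
    """Build a tiny stand-in for a .whl (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as whl:
        whl.writestr("demo/__init__.py", "VALUE = 42\n")
    return buf.getvalue()


def make_pex_like_zip(path: Path, wheel_name: str = "demo-1.0-py3-none-any.whl") -> str:
    """Write an outer zip holding the wheel as a STORED entry; return the entry name."""
    inner = ".deps/" + wheel_name
    with zipfile.ZipFile(path, "w") as outer:
        outer.writestr(inner, make_wheel_bytes(), compress_type=zipfile.ZIP_STORED)
    return inner


def extract_inner_wheel(pex_path: Path, inner_name: str, dest: Path) -> None:
    """Open the inner .whl directly from the outer zip and extract it to dest."""
    with zipfile.ZipFile(pex_path) as outer:
        with outer.open(inner_name) as wheel_fp:
            with zipfile.ZipFile(wheel_fp) as wheel:
                wheel.extractall(dest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pex = Path(tmp) / "the.pex"
        inner = make_pex_like_zip(pex)
        extract_inner_wheel(pex, inner, Path(tmp) / "here")
        print(sorted(p.name for p in (Path(tmp) / "here" / "demo").iterdir()))
```

Because the outer entry is STORED, reading the inner archive involves no decompression at the outer level; only the wheel's own entries are inflated.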

I experimented enough writing a PEP-427 installer today to see it works, but you need to handle generating console scripts since .whls in the wild, for the most part, don't actually carry these in proj-rev.data/scripts/... as you'd hope they would given PEP-427.
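
To make the console script point concrete, here is a simplified sketch of generating scripts from a wheel's entry_points.txt. It is not Pex's installer: the template, function names, and sample entry point are hypothetical, and it only handles the module:attr form.

```python
import configparser
from typing import Dict

# Hypothetical launcher template; real installers differ in the details.
SCRIPT_TEMPLATE = """\
#!{python}
import sys
from {module} import {root}
if __name__ == "__main__":
    sys.exit({call}())
"""


def parse_console_scripts(entry_points_txt: str) -> Dict[str, str]:
    """Return {script name: "module:attr"} from the [console_scripts] section."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(entry_points_txt)
    if not parser.has_section("console_scripts"):
        return {}
    return dict(parser.items("console_scripts"))


def render_script(entry_point: str, python: str) -> str:
    """Render launcher source for one "module:attr" entry point."""
    target = entry_point.split("[", 1)[0].strip()  # drop any trailing "[extras]"
    module, _, attrs = target.partition(":")
    module, attrs = module.strip(), attrs.strip()
    if not module or not attrs:
        raise ValueError("expected 'module:attr', got %r" % entry_point)
    root = attrs.split(".", 1)[0]  # import only the top-level object
    return SCRIPT_TEMPLATE.format(python=python, module=module, root=root, call=attrs)


if __name__ == "__main__":
    sample = "[console_scripts]\ndemo = demo.cli:main\n"
    for name, target in parse_console_scripts(sample).items():
        print("# bin/" + name)
        print(render_script(target, "/usr/bin/python3"))
```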

@jsirois
Member

jsirois commented Nov 18, 2023

@tgolsson I won't have solid time until the 23-28th, but I think I can get this knocked out and released then. I'm not sure exactly how to spell the feature activation, perhaps two new --layout options - one for zipapp and one for spread, but that's not too important as long as no existing users / PEX_ROOT caches are broken.

@tgolsson
Contributor Author

That sounds very good. My concern with pants is mostly how far away from pants <goal> a potential error can occur, since I assume there are issues that could surface only when installing wheels. But since adding this feature to Pants would require work anyway, that's not going to be an immediate problem - and I'm guessing this would be opt-in per target either way.

@kaos
Collaborator

kaos commented Nov 20, 2023

It also seems like a good feature for Pex, regardless of Pants usage.

@jsirois jsirois self-assigned this Nov 24, 2023
jsirois added a commit to jsirois/pex that referenced this issue Nov 27, 2023
This sets the stage for doing runtime installation of wheels without
needing to ship a copy of Pip in every PEX file. To prove the
robustness, convert build time installation of wheel chroots to this
mechanism.

Work towards pex-tool#2292
@jsirois
Member

jsirois commented Nov 28, 2023

Noting I did not complete this during the current work stretch. It will be picked back up on December 10th when I start my next work stretch.

jsirois added a commit that referenced this issue Dec 4, 2023
This sets the stage for doing runtime installation of wheels without
needing to ship a copy of Pip in every PEX file. To help prove the
robustness, convert the current build time installation of wheel chroots
to this mechanism.

Work towards #2292
@jsirois
Member

jsirois commented Dec 13, 2023

This should completely side-step the need for #2158 since it does better than that approach ever could by avoiding zipping altogether (and unzipping as well!).

@jsirois
Member

jsirois commented Dec 13, 2023

Ok, circling back to the OP using #2298:

  1. Status quo:
    $ rm -rf ~/.pex/installed_wheels/
    $ time python3.11 -mpex -v torch==2.1.1 -o t2.2.pex
    ...
    pex: Building pex: 20905.4ms
    pex:   Adding distributions from pexes: : 0.0ms
    pex:   Resolving distributions for requirements: torch==2.1.1: 20902.6ms
    pex:     Resolving requirements.: 20902.5ms
    pex:       Resolving for:
      /usr/bin/python3.11: 8135.2ms
    pex:       Calculating project names for direct requirements:
      PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
    pex:       Installing 22 distributions: 10994.5ms
    pex:       Checking install: 2.2ms
    pex:   Configuring PEX dependencies: 2.3ms
    Saving PEX file to t2.2.pex
    Previous binary unexpectedly exists, cleaning: t2.2.pex
    pex: Zipping PEX file.: 167895.1ms
    /home/jsirois/dev/pantsbuild/jsirois-pex/pex/pex_builder.py:113: PEXWarning: The PEX zip at t2.2.pex~ is not a valid zipapp: Could not find the `__main__` module.
    This is likely due to the zip requiring ZIP64 extensions due to size or the
    number of file entries or both. You can work around this limitation in Python's
    `zipimport` module by re-building the PEX with `--layout packed` or
    `--layout loose`.
      pex_warnings.warn(message)
    
  2. Using --no-pre-install-wheels:
    $ rm -rf ~/.pex/installed_wheels/  
    $ python3.11 -mpex -v torch==2.1.1 --no-pre-install-wheels -o t2.2.pex
    ...
    pex: Building pex: 10125.3ms
    pex:   Adding distributions from pexes: : 0.0ms
    pex:   Resolving distributions for requirements: torch==2.1.1: 10123.1ms
    pex:     Resolving requirements.: 10123.1ms
    pex:       Resolving for:
      /usr/bin/python3.11: 8274.5ms
    pex:       Calculating project names for direct requirements:
      PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
    pex:       Checking build: 2.1ms
    pex:   Configuring PEX dependencies: 1.7ms
    Saving PEX file to t2.2.pex
    pex: Zipping PEX file.: 3173.1ms
    /home/jsirois/dev/pantsbuild/jsirois-pex/pex/pex_builder.py:113: PEXWarning: The PEX zip at t2.2.pex~ is not a valid zipapp: Could not find the `__main__` module.
    This is likely due to the zip requiring ZIP64 extensions due to size or the
    number of file entries or both. You can work around this limitation in Python's
    `zipimport` module by re-building the PEX with `--layout packed` or
    `--layout loose`.
      pex_warnings.warn(message)
    

So that's:

| | Status quo | Using --no-pre-install-wheels |
|---|---|---|
| Pre-install time (~unzip) | 10.99s | N/A |
| Zip time | 167.89s | 3.17s |
| Size (bytes) | 2680106601 | 2677995839 |

Of course, this is not a great example since the resulting PEX cannot be run as the elided warning indicates in both cases; so we can't examine the tradeoff in the 1st boot runtime penalty for installing the wheels just in time.

@jsirois
Member

jsirois commented Dec 13, 2023

And, using the OP, but with --layout packed --venv --venv-site-packages-copies, which is required to work around the zipapp size issue and to work around the indirect nvidia dependencies' failure to properly use namespace packages:

  1. Status quo cold:
    $ rm -rf ~/.pex/installed_wheels/ ~/.pex/packed_wheels/
    $ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed -o t2.2.pex
    ...
    pex: Building pex: 20589.6ms
    pex:   Adding distributions from pexes: : 0.0ms
    pex:   Resolving distributions for requirements: torch==2.1.1: 20586.9ms
    pex:     Resolving requirements.: 20586.9ms
    pex:       Resolving for:
      /usr/bin/python3.11: 8686.9ms
    pex:       Calculating project names for direct requirements:
      PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
    pex:       Installing 22 distributions: 10215.5ms
    pex:       Checking install: 1.7ms
    pex:   Configuring PEX dependencies: 2.2ms
    Saving PEX file to t2.2.pex
    pex: Zipping PEX .bootstrap/ code.: 86.5ms
    pex: Zipping 22 distributions.: 172517.1ms
    $ du -sb t2.2.pex/
    2679282217      t2.2.pex/
    
  2. Status quo warm:
    $ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed -o t2.2.pex
    ...
    pex: Building pex: 12982.0ms
    pex:   Adding distributions from pexes: : 0.1ms
    pex:   Resolving distributions for requirements: torch==2.1.1: 12979.3ms
    pex:     Resolving requirements.: 12979.2ms
    pex:       Resolving for:
      /usr/bin/python3.11: 8217.2ms
    pex:       Calculating project names for direct requirements:
      PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
    pex:       Installing 22 distributions: 3051.5ms
    pex:       Checking install: 1.8ms
    pex:   Configuring PEX dependencies: 2.2ms
    Saving PEX file to t2.2.pex
    pex: Zipping PEX .bootstrap/ code.: 0.0ms
    pex: Zipping 22 distributions.: 0.4ms
    $ du -sb t2.2.pex/
    2679282217      t2.2.pex/
    
  3. Using --no-pre-install-wheels (~same for warm and cold cases):
    $ python3.11 -mpex -v torch==2.1.1 --venv --venv-site-packages-copies --layout packed --no-pre-install-wheels -o t2.2.whls.pex
    ...
    pex: Building pex: 10429.3ms
    pex:   Adding distributions from pexes: : 0.0ms
    pex:   Resolving distributions for requirements: torch==2.1.1: 10427.3ms
    pex:     Resolving requirements.: 10427.2ms
    pex:       Resolving for:
      /usr/bin/python3.11: 8666.5ms
    pex:       Calculating project names for direct requirements:
      PyPIRequirement(line=LogicalLine(raw_text='torch==2.1.1', processed_text='torch==2.1.1', source='<string>', start_line=1, end_line=1), requirement=Requirement(name='torch', url=None, extras=frozenset(), specifier=<SpecifierSet('==2.1.1')>, marker=None), editable=False): 0.1ms
    pex:       Checking build: 1.7ms
    pex:   Configuring PEX dependencies: 1.7ms
    Saving PEX file to t2.2.whls.pex
    pex: Zipping PEX .bootstrap/ code.: 91.7ms
    pex: Copying 22 distributions.: 0.2ms
    $ du -sb t2.2.whls.pex/
    2678537958      t2.2.whls.pex/
    

So that's:

| | Status quo (cold) | Status quo (warm) | Using --no-pre-install-wheels |
|---|---|---|---|
| Pre-install time (~unzip) | 10.22s | N/A | N/A |
| Zip / Copy time | 172.52s | 0.4ms | 0.2ms |
| Size (bytes) | 2679282217 | 2679282217 | 2678537958 |

And at runtime:

$ hyperfine \
    -w2 \
    -p 'rm -rf ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p 'rm -rf ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p '' \
    -p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p 'rm -rf ~/.pex/installed_wheels ~/.pex/unzipped_pexes ~/.pex/venvs' \
    -p '' \
    -n 'Status quo warm 1st' \
    -n 'Status quo warm 1st parallel' \
    -n 'Status quo cold 1st' \
    -n 'Status quo cold 1st parallel' \
    -n 'Status quo hot' \
    -n 'With --no-pre-install-wheels 1st' \
    -n 'With --no-pre-install-wheels 1st parallel' \
    -n 'With --no-pre-install-wheels hot' \
    't2.2.pex/__main__.py -c "import torch"' \
    'PEX_MAX_INSTALL_JOBS=0 t2.2.pex/__main__.py -c "import torch"' \
    't2.2.pex/__main__.py -c "import torch"' \
    'PEX_MAX_INSTALL_JOBS=0 t2.2.pex/__main__.py -c "import torch"' \
    't2.2.pex/__main__.py -c "import torch"' \
    't2.2.whls.pex/__main__.py -c "import torch"' \
    'PEX_MAX_INSTALL_JOBS=0 t2.2.whls.pex/__main__.py -c "import torch"' \
    't2.2.whls.pex/__main__.py -c "import torch"'
Benchmark 1: Status quo warm 1st
  Time (mean ± σ):      5.765 s ±  0.040 s    [User: 5.017 s, System: 0.734 s]
  Range (min … max):    5.717 s …  5.853 s    10 runs

Benchmark 2: Status quo warm 1st parallel
  Time (mean ± σ):      5.991 s ±  0.035 s    [User: 7.267 s, System: 0.885 s]
  Range (min … max):    5.952 s …  6.054 s    10 runs

Benchmark 3: Status quo cold 1st
  Time (mean ± σ):     26.737 s ±  0.338 s    [User: 24.027 s, System: 2.683 s]
  Range (min … max):   26.307 s … 27.365 s    10 runs

Benchmark 4: Status quo cold 1st parallel
  Time (mean ± σ):     12.790 s ±  0.141 s    [User: 30.314 s, System: 3.424 s]
  Range (min … max):   12.549 s … 12.969 s    10 runs

Benchmark 5: Status quo hot
  Time (mean ± σ):     889.1 ms ±   4.9 ms    [User: 815.3 ms, System: 68.5 ms]
  Range (min … max):   883.1 ms … 898.3 ms    10 runs

Benchmark 6: With --no-pre-install-wheels 1st
  Time (mean ± σ):     29.602 s ±  0.137 s    [User: 26.534 s, System: 3.034 s]
  Range (min … max):   29.480 s … 29.955 s    10 runs

Benchmark 7: With --no-pre-install-wheels 1st parallel
  Time (mean ± σ):     14.062 s ±  0.245 s    [User: 34.360 s, System: 3.842 s]
  Range (min … max):   13.780 s … 14.540 s    10 runs

Benchmark 8: With --no-pre-install-wheels hot
  Time (mean ± σ):     882.1 ms ±   4.0 ms    [User: 810.3 ms, System: 66.7 ms]
  Range (min … max):   874.7 ms … 889.1 ms    10 runs

Summary
  With --no-pre-install-wheels hot ran
    1.01 ± 0.01 times faster than Status quo hot
    6.54 ± 0.05 times faster than Status quo warm 1st
    6.79 ± 0.05 times faster than Status quo warm 1st parallel
   14.50 ± 0.17 times faster than Status quo cold 1st parallel
   15.94 ± 0.29 times faster than With --no-pre-install-wheels 1st parallel
   30.31 ± 0.41 times faster than Status quo cold 1st
   33.56 ± 0.22 times faster than With --no-pre-install-wheels 1st

So, in summary, that's (assuming resolve time for the build and run cases are equal and so are ignored):

| | Status quo | With --no-pre-install-wheels | --no-pre-install-wheels savings |
|---|---|---|---|
| Cold build and run 1st local machine | 188.51s | 29.80s | 84% faster |
| Cold run 1st remote machine | 26.74s | 29.60s | 11% slower |
| Cold run 1st remote machine parallel | 12.79s | 14.06s | 10% slower |
| Size (bytes) | 2679282217 | 2678537958 | 0.02% smaller |
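
The timing percentages in the savings column can be re-derived from the two timing columns. This sketch takes the table's figures as given and assumes each percentage is measured relative to the status quo time.

```python
def percent_change(status_quo: float, candidate: float) -> float:
    """Time saved relative to the status quo, in percent.

    Positive means the candidate is faster; negative means it is slower.
    """
    return (status_quo - candidate) / status_quo * 100.0


# (status quo seconds, --no-pre-install-wheels seconds), copied from the table.
rows = {
    "Cold build and run 1st local machine": (188.51, 29.80),
    "Cold run 1st remote machine": (26.74, 29.60),
    "Cold run 1st remote machine parallel": (12.79, 14.06),
}

for label, (quo, new) in rows.items():
    change = percent_change(quo, new)
    direction = "faster" if change > 0 else "slower"
    print(f"{label}: {abs(change):.0f}% {direction}")
# Cold build and run 1st local machine: 84% faster
# Cold run 1st remote machine: 11% slower
# Cold run 1st remote machine parallel: 10% slower
```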

This means, for local, internal-only use --no-pre-install-wheels is always a win. Important examples are Pants's Python backend use case and @cosmicexplorer's case in #2158 of local iteration on an ML / data science project.

For cases where remote deployment cold 1st run start time is important (legacy lambdex use cases come to mind), --no-pre-install-wheels will always be a small loss.

For other cases the perf is a wash and more localized analysis is needed to decide which set of options to use.

jsirois added a commit to jsirois/pex that referenced this issue Dec 13, 2023
Working through the perf analysis in pex-tool#2292 brought these to light.
@jsirois
Member

jsirois commented Dec 13, 2023

The analysis above is at the extreme end of PEX sizes (~2GB). I'll add the same analysis below for the extreme small end (a cowsay PEX) to button this up, assuming ~linearity between the two extremes.

@jsirois
Member

jsirois commented Dec 14, 2023

Ok, for a small case I used cowsay and ansicolors deps with this 93 byte main.py and driver scripts:

app/src/main.py
import colors
import cowsay


if __name__ == "__main__":
    cowsay.tux(colors.blue("Moo?"))
app/build-cowsay.sh
#!/usr/bin/env bash

set -euo pipefail

PYTHON="${PYTHON:-python3.11}"

PEX_DIR="$(git rev-parse --show-toplevel)"
APP_DIR="${PEX_DIR}/app"

cd "${PEX_DIR}"

DEPS="${DEPS:-cowsay ansicolors}"

venv="$(mktemp -d)"
"${PYTHON}" -mvenv "${venv}"
"${venv}/bin/python" -mpip --disable-pip-version-check -q wheel --wheel-dir "${APP_DIR}/wheels" ${DEPS[*]}

function build_pex() {
    echo "${PYTHON} -mpex --no-pypi -f ${APP_DIR}/wheels -D ${APP_DIR}/src -m main ${DEPS[*]} ${@}"
}

hyperfine \
    -w2 \
    -p 'rm -rf ~/.pex' \
    -p 'rm -rf ~/.pex' \
    -p 'rm -rf ~/.pex' \
    -p 'rm -rf ~/.pex' \
    -p 'rm -rf ~/.pex' \
    -p 'rm -rf ~/.pex' \
    -p '' \
    -p '' \
    -p '' \
    -p '' \
    -p '' \
    -p '' \
    -n 'Build zipappi (cold)' \
    -n 'Build .whl zipapp (cold)' \
    -n 'Build packed (cold)' \
    -n 'Build .whl packed (cold)' \
    -n 'Build loose (cold)' \
    -n 'Build .whl loose (cold)' \
    -n 'Build zipappi (warm)' \
    -n 'Build .whl zipapp (warm)' \
    -n 'Build packed (warm)' \
    -n 'Build .whl packed (warm)' \
    -n 'Build loose (warm)' \
    -n 'Build .whl loose (warm)' \
    "$(build_pex --layout zipapp -o ${APP_DIR}/cowsay.zipapp.pex)" \
    "$(build_pex --layout zipapp --no-pre-install-wheels -o ${APP_DIR}/cowsay.zipapp.whls.pex)" \
    "$(build_pex --layout packed -o ${APP_DIR}/cowsay.packed.pex)" \
    "$(build_pex --layout packed --no-pre-install-wheels -o ${APP_DIR}/cowsay.packed.whls.pex)" \
    "$(build_pex --layout loose -o ${APP_DIR}/cowsay.loose.pex)" \
    "$(build_pex --layout loose --no-pre-install-wheels -o ${APP_DIR}/cowsay.loose.whls.pex)" \
    "$(build_pex --layout zipapp -o ${APP_DIR}/cowsay.zipapp.pex)" \
    "$(build_pex --layout zipapp --no-pre-install-wheels -o ${APP_DIR}/cowsay.zipapp.whls.pex)" \
    "$(build_pex --layout packed -o ${APP_DIR}/cowsay.packed.pex)" \
    "$(build_pex --layout packed --no-pre-install-wheels -o ${APP_DIR}/cowsay.packed.whls.pex)" \
    "$(build_pex --layout loose -o ${APP_DIR}/cowsay.loose.pex)" \
    "$(build_pex --layout loose --no-pre-install-wheels -o ${APP_DIR}/cowsay.loose.whls.pex)"

du -sbl ${APP_DIR}/cowsay.* | sort -n
app/perf-cowsay.sh
#!/usr/bin/env bash

set -euo pipefail

PEX_DIR="$(git rev-parse --show-toplevel)"
APP_DIR="${PEX_DIR}/app"

cd "${APP_DIR}"


hyperfine \
    -w2 \
    -p 'rm -rf ~/.pex' \
    -n 'Run zipapp cold' \
    -n 'Run .whl zipapp cold' \
    -n 'Run packed cold' \
    -n 'Run .whl packed cold' \
    -n 'Run loose cold' \
    -n 'Run .whl loose cold' \
    -n 'Run zipapp cold (parallel)' \
    -n 'Run .whl zipapp coldi (parallel)' \
    -n 'Run packed cold (parallel)' \
    -n 'Run .whl packed cold (parallel)' \
    -n 'Run loose cold (parallel)' \
    -n 'Run .whl loose cold (parallel)' \
    "./cowsay.zipapp.pex" \
    "./cowsay.zipapp.whls.pex" \
    "cowsay.packed.pex/__main__.py" \
    "cowsay.packed.whls.pex/__main__.py" \
    "cowsay.loose.pex/__main__.py" \
    "cowsay.loose.whls.pex/__main__.py" \
    "PEX_MAX_INSTALL_JOBS=0 ./cowsay.zipapp.pex" \
    "PEX_MAX_INSTALL_JOBS=0 ./cowsay.zipapp.whls.pex" \
    "PEX_MAX_INSTALL_JOBS=0 cowsay.packed.pex/__main__.py" \
    "PEX_MAX_INSTALL_JOBS=0 cowsay.packed.whls.pex/__main__.py" \
    "PEX_MAX_INSTALL_JOBS=0 cowsay.loose.pex/__main__.py" \
    "PEX_MAX_INSTALL_JOBS=0 cowsay.loose.whls.pex/__main__.py"
$ ./build-cowsay.sh && ./perf-cowsay.sh
Benchmark 1: Build zipappi (cold)
  Time (mean ± σ):      1.146 s ±  0.028 s    [User: 1.075 s, System: 0.161 s]
  Range (min … max):    1.110 s …  1.189 s    10 runs

Benchmark 2: Build .whl zipapp (cold)
  Time (mean ± σ):      1.047 s ±  0.026 s    [User: 0.914 s, System: 0.131 s]
  Range (min … max):    1.011 s …  1.081 s    10 runs

Benchmark 3: Build packed (cold)
  Time (mean ± σ):      1.125 s ±  0.016 s    [User: 1.073 s, System: 0.136 s]
  Range (min … max):    1.109 s …  1.167 s    10 runs

Benchmark 4: Build .whl packed (cold)
  Time (mean ± σ):      1.034 s ±  0.008 s    [User: 0.893 s, System: 0.140 s]
  Range (min … max):    1.017 s …  1.042 s    10 runs

Benchmark 5: Build loose (cold)
  Time (mean ± σ):      1.077 s ±  0.010 s    [User: 1.030 s, System: 0.131 s]
  Range (min … max):    1.062 s …  1.094 s    10 runs

Benchmark 6: Build .whl loose (cold)
  Time (mean ± σ):     995.2 ms ±  17.7 ms    [User: 852.2 ms, System: 142.5 ms]
  Range (min … max):   972.2 ms … 1028.8 ms    10 runs

Benchmark 7: Build zipappi (warm)
  Time (mean ± σ):     413.8 ms ±  12.5 ms    [User: 370.8 ms, System: 43.0 ms]
  Range (min … max):   399.5 ms … 437.5 ms    10 runs

Benchmark 8: Build .whl zipapp (warm)
  Time (mean ± σ):     401.1 ms ±   5.4 ms    [User: 345.5 ms, System: 55.5 ms]
  Range (min … max):   396.0 ms … 415.1 ms    10 runs

Benchmark 9: Build packed (warm)
  Time (mean ± σ):     351.6 ms ±   2.9 ms    [User: 314.1 ms, System: 37.3 ms]
  Range (min … max):   348.6 ms … 357.1 ms    10 runs

Benchmark 10: Build .whl packed (warm)
  Time (mean ± σ):     354.5 ms ±  11.4 ms    [User: 315.7 ms, System: 38.5 ms]
  Range (min … max):   343.2 ms … 372.2 ms    10 runs

Benchmark 11: Build loose (warm)
  Time (mean ± σ):     358.2 ms ±   2.5 ms    [User: 307.2 ms, System: 50.5 ms]
  Range (min … max):   354.3 ms … 364.1 ms    10 runs

Benchmark 12: Build .whl loose (warm)
  Time (mean ± σ):     365.2 ms ±  19.3 ms    [User: 314.1 ms, System: 51.2 ms]
  Range (min … max):   352.7 ms … 415.4 ms    10 runs

Summary
  Build packed (warm) ran
    1.01 ± 0.03 times faster than Build .whl packed (warm)
    1.02 ± 0.01 times faster than Build loose (warm)
    1.04 ± 0.06 times faster than Build .whl loose (warm)
    1.14 ± 0.02 times faster than Build .whl zipapp (warm)
    1.18 ± 0.04 times faster than Build zipappi (warm)
    2.83 ± 0.06 times faster than Build .whl loose (cold)
    2.94 ± 0.03 times faster than Build .whl packed (cold)
    2.98 ± 0.08 times faster than Build .whl zipapp (cold)
    3.06 ± 0.04 times faster than Build loose (cold)
    3.20 ± 0.05 times faster than Build packed (cold)
    3.26 ± 0.08 times faster than Build zipappi (cold)
709130  /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.zipapp.whls.pex
714166  /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.zipapp.pex
721772  /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.packed.whls.pex
723960  /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.packed.pex
2543013 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.loose.whls.pex
2670261 /home/jsirois/dev/pantsbuild/jsirois-pex/app/cowsay.loose.pex
Benchmark 1: Run zipapp cold
  Time (mean ± σ):     433.1 ms ±  17.8 ms    [User: 383.9 ms, System: 48.6 ms]
  Range (min … max):   417.3 ms … 476.7 ms    10 runs

Benchmark 2: Run .whl zipapp cold
  Time (mean ± σ):     511.4 ms ±   8.2 ms    [User: 469.1 ms, System: 41.9 ms]
  Range (min … max):   497.8 ms … 524.0 ms    10 runs

Benchmark 3: Run packed cold
  Time (mean ± σ):     422.3 ms ±   5.1 ms    [User: 375.7 ms, System: 46.3 ms]
  Range (min … max):   413.4 ms … 429.8 ms    10 runs

Benchmark 4: Run .whl packed cold
  Time (mean ± σ):     504.6 ms ±   7.0 ms    [User: 455.2 ms, System: 49.0 ms]
  Range (min … max):   493.8 ms … 515.9 ms    10 runs

Benchmark 5: Run loose cold
  Time (mean ± σ):     239.7 ms ±   6.5 ms    [User: 212.8 ms, System: 26.5 ms]
  Range (min … max):   231.2 ms … 256.2 ms    12 runs

Benchmark 6: Run .whl loose cold
  Time (mean ± σ):     332.3 ms ±   5.1 ms    [User: 285.4 ms, System: 46.7 ms]
  Range (min … max):   326.7 ms … 340.5 ms    10 runs

Benchmark 7: Run zipapp cold (parallel)
  Time (mean ± σ):     550.6 ms ±   4.4 ms    [User: 551.2 ms, System: 55.1 ms]
  Range (min … max):   544.3 ms … 556.6 ms    10 runs

Benchmark 8: Run .whl zipapp coldi (parallel)
  Time (mean ± σ):     586.3 ms ±   5.2 ms    [User: 616.6 ms, System: 65.1 ms]
  Range (min … max):   581.7 ms … 595.8 ms    10 runs

Benchmark 9: Run packed cold (parallel)
  Time (mean ± σ):     545.6 ms ±   8.2 ms    [User: 551.4 ms, System: 50.6 ms]
  Range (min … max):   536.5 ms … 561.9 ms    10 runs

Benchmark 10: Run .whl packed cold (parallel)
  Time (mean ± σ):     580.6 ms ±   4.8 ms    [User: 608.2 ms, System: 64.9 ms]
  Range (min … max):   573.0 ms … 588.4 ms    10 runs

Benchmark 11: Run loose cold (parallel)
  Time (mean ± σ):     232.4 ms ±   2.3 ms    [User: 211.8 ms, System: 20.3 ms]
  Range (min … max):   229.4 ms … 237.2 ms    12 runs

Benchmark 12: Run .whl loose cold (parallel)
  Time (mean ± σ):     411.7 ms ±   2.4 ms    [User: 449.2 ms, System: 56.2 ms]
  Range (min … max):   407.8 ms … 416.1 ms    10 runs

Summary
  Run loose cold (parallel) ran
    1.03 ± 0.03 times faster than Run loose cold
    1.43 ± 0.03 times faster than Run .whl loose cold
    1.77 ± 0.02 times faster than Run .whl loose cold (parallel)
    1.82 ± 0.03 times faster than Run packed cold
    1.86 ± 0.08 times faster than Run zipapp cold
    2.17 ± 0.04 times faster than Run .whl packed cold
    2.20 ± 0.04 times faster than Run .whl zipapp cold
    2.35 ± 0.04 times faster than Run packed cold (parallel)
    2.37 ± 0.03 times faster than Run zipapp cold (parallel)
    2.50 ± 0.03 times faster than Run .whl packed cold (parallel)
    2.52 ± 0.03 times faster than Run .whl zipapp coldi (parallel)

The summary is:

  • Roughly, .whl builds are slightly faster than the status quo, as expected (no unzipping is required, nor, for zipapp and packed, re-zipping).
  • Roughly, .whl 1st cold runs are slightly slower than the status quo, as expected (extra install step at runtime).
  • Forcing parallelization makes things slower. In general, knowing when this will pay off requires experimentation with the PEX and the deploy target machine in hand.
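
To put rough numbers on the last two bullets, here is a small sketch that recomputes the cold-run deltas from the hyperfine means reported above. The dictionary layout and the helper names (`whl_penalty`, `parallel_slowdown`) are illustrative assumptions, not part of Pex or hyperfine; only the millisecond values come from the benchmark output.

```python
# Mean cold-run times in ms, copied from the hyperfine output above.
# Each tuple is (pre-installed wheels, raw .whl files).
serial_ms = {
    "zipapp": (433.1, 511.4),
    "packed": (422.3, 504.6),
    "loose": (239.7, 332.3),
}
parallel_ms = {
    "zipapp": (550.6, 586.3),
    "packed": (545.6, 580.6),
    "loose": (232.4, 411.7),
}


def whl_penalty(times):
    """Extra first-boot time (ms) paid by the raw .whl variant of each layout."""
    return {layout: round(whl - pre, 1) for layout, (pre, whl) in times.items()}


def parallel_slowdown(serial, parallel):
    """Ratio of parallel to serial mean time for the raw .whl variant."""
    return {
        layout: round(parallel[layout][1] / serial[layout][1], 2)
        for layout in serial
    }


penalties = whl_penalty(serial_ms)
slowdowns = parallel_slowdown(serial_ms, parallel_ms)
print(penalties)
print(slowdowns)
```

For this small cowsay PEX the serial .whl penalty is under 100 ms per layout, and every parallel .whl run is slower than its serial counterpart, which matches the summary above.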

jsirois added a commit that referenced this issue Dec 14, 2023
…2298)

The `--no-pre-install-wheels` option causes built PEXes to use raw
`.whl` files. For `--layout zipapp` this means a single `.whl` file is
`STORED` per dep, and for `--layout {packed,loose}` this means the loose
`.deps/` dir contains raw `.whl` files. This speeds up all PEX builds by
avoiding pre-installing wheel deps (~unzipping into the `PEX_ROOT`) and
then, in the case of zipapp and packed layout, re-zipping. For large
dependencies the time savings can be dramatic.

Not pre-installing wheels comes with a PEX boot cold-start performance
tradeoff since installation now needs to be done at runtime. This is
generally a penalty of O(100ms), but that penalty can be erased for some
deployment scenarios with the new `--max-install-jobs` build option / 
`PEX_MAX_INSTALL_JOBS` runtime env var. By default, runtime installs are
performed serially, but this new option can be set to use multiple
parallel install processes, which can speed up cold boots for large
dependencies.

Fixes #2292
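
As a rough illustration of how the options named in this commit message fit together, the sketch below assembles a `pex` command line without running it. The helper `pex_build_argv`, the `cowsay` requirement, the output file name, and the job count of 4 are all assumptions chosen for the example; only `--no-pre-install-wheels`, `--layout`, `--max-install-jobs`, and `PEX_MAX_INSTALL_JOBS` come from the text above.

```python
import shlex


def pex_build_argv(requirement, output, layout="zipapp", max_install_jobs=None):
    """Assemble (but do not run) a pex invocation that keeps raw .whl files."""
    argv = [
        "pex",
        requirement,
        "--layout", layout,
        "--no-pre-install-wheels",  # ship raw .whl files instead of installed chroots
        "-o", output,
    ]
    if max_install_jobs is not None:
        # Build-time default for how many install processes a cold boot may use.
        argv += ["--max-install-jobs", str(max_install_jobs)]
    return argv


# Hypothetical values: 4 jobs is only an example, not a recommendation.
argv = pex_build_argv("cowsay", "cowsay.zipapp.whls.pex", max_install_jobs=4)
print(shlex.join(argv))

# The same knob as a runtime environment variable for an already-built PEX.
runtime_env = {"PEX_MAX_INSTALL_JOBS": "4"}
```

Whether a parallel install actually helps depends on the PEX contents and the target machine, so the job count is best treated as something to measure rather than assume.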
huonw added a commit to pantsbuild/pants that referenced this issue Apr 15, 2024
…s of internal pexes (#20670)

This has all internal PEXes be built with settings to improve
performance:

- with `--no-pre-install-wheels`, to package `.whl` directly rather than
unpack and install them. (NB. this requires Pex 2.3.0 to pick up
pex-tool/pex#2392)
- with `PEX_MAX_INSTALL_JOBS`, to use more concurrency for install, when
available

This is designed to be a performance improvement for any processing
where Pants synthesises a PEX internally, like `pants run
path/to/script.py` or `pants test ...`.
pex-tool/pex#2292 has benchmarks for the PEX
tool itself.

For benchmarks, I did some more purposeful ones with tensorflow (PyTorch
seems a bit awkward to set up and Tensorflow is still huge), using
https://gist.github.com/huonw/0560f5aaa34630b68bfb7e0995e99285 . I did 3
runs each of two goals, with 2.21.0.dev4 and with `PANTS_SOURCE`
pointing to this PR, and pulled the numbers out by finding the relevant
log lines:

- `pants --no-local-cache --no-pantsd --named-caches-dir=$(mktemp -d)
test example_test.py`. This involves building 4 separate PEXes partially
in parallel, partially sequentially: `requirements.pex`,
`local_dists.pex`, `pytest.pex`, and then `pytest_runner.pex`. The first
and last are the interesting ones for this test.
- `pants --no-local-cache --no-pantsd --named-caches-dir=$(mktemp -d)
run script.py`. This just builds the requirements into `script.pex`.

(NB. these are potentially unrealistic in that they're running with all
caching turned off or cleared, so are truly a worst case. This means
they're downloading tensorflow wheels and all the others, each time,
which takes about 30s on my 100Mbit/s connection. Faster connections
will thus see a higher ratio of benefit.)

| goal                | period                       | before (s) | after (s) |
|---------------------|------------------------------|-----------:|----------:|
| `run script.py`     | building requirements        |      74-82 |     49-52 |
| `test some_test.py` | building requirements        |      67-71 |     30-36 |
|                     | building pytest runner       |        8-9 |     17-18 |
|                     | total to start running tests |      76-80 |     53-58 |
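
A quick way to read the table above is to turn each before/after range into a speedup ratio. The sketch below uses range midpoints, which is an assumption made here for simplicity, and the names `timings`, `midpoint`, and `speedup` are illustrative. The last line also subtracts the roughly 30 s download time mentioned earlier, assuming it is the same before and after, to show why a faster connection would see a higher ratio.

```python
# (before, after) ranges in seconds, copied from the table above.
timings = {
    "run: building requirements": ((74, 82), (49, 52)),
    "test: building requirements": ((67, 71), (30, 36)),
    "test: building pytest runner": ((8, 9), (17, 18)),
    "test: total to start running tests": ((76, 80), (53, 58)),
}


def midpoint(bounds):
    """Middle of a (low, high) range in seconds."""
    low, high = bounds
    return (low + high) / 2


def speedup(before, after, fixed_overhead_s=0.0):
    """Before/after ratio of range midpoints, optionally ignoring a fixed cost."""
    return round(
        (midpoint(before) - fixed_overhead_s) / (midpoint(after) - fixed_overhead_s),
        2,
    )


speedups = {name: speedup(before, after) for name, (before, after) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio}x")

# Assumption: about 30 s of each `run script.py` measurement is download time.
run_without_download = speedup(
    *timings["run: building requirements"], fixed_overhead_s=30.0
)
print(f"run, excluding download: {run_without_download}x")
```

A ratio below 1 marks a step that got slower. Here that is the pytest runner, presumably because wheel installation now happens when the PEX first runs, while the totals still improve.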
 
I also did more ad hoc ones on a real-world work repo of mine, which
doesn't use any of the big ML libraries, just running some basic goals
once.

| goal                                              | period                                  | before (s) | after (s) |
|---------------------------------------------------|-----------------------------------------|-----------:|----------:|
| `pants export` on largest resolve                 | building requirements                   |         66 |        35 |
|                                                   | total                                   |         82 |        54 |
| "random" `pants test path/to/file.py` (1 attempt) | building requirements and pytest runner |         49 |        38 |

Fixes #15062