-
-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consume packed wheel cache in zipapp creation #2175
Conversation
Reading @jsirois's thoughts at #2158 (comment) again:
I'll take a look at pyoxy and friends to convert the crate into a python distribution. I'll leave this as a draft until then, and I'll re-ping reviewers at that time. |
Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/. |
Great. Yeah @cosmicexplorer that's exactly what I meant. I've been away climbing here and haven't taken a look yet, but if this all looks good, consuming as a native wheel was the idea. |
Ok, just had a lot of fun converting this into a pyo3 package and the result runs even faster! This branch isn't passing because I've added some hacks to get it running, but I will be working to remove those! |
e6ea967
to
fb4bbae
Compare
954476e
to
4d04953
Compare
add change to do parallel zipping only, no crawling modify cli arg format for medusa-zip tool update cli arg format fix non-exhaustive CopyMode usage [BROKEN] add first run of complete medusa zip with cli arg! the resulting zip cannot be zipimported yet.... medusa zipping works great now, let's revert .zip() changes bump medusa options bump more medusa options use the merged medusa command lines now manage a cache of parallel intermediate zip generation jobs! small fix much closer to mergeable now working much more complex control flow between the medusa-zip cli move medusa zip to medusa.py medusa works for packed apps now too works for everything, but kind of crazy close stdin after writing to the child process factor out a ridiculous amount of boilerplate add back the non-medusa impl for packed layouts implement a "normal" version which uses atomic directories revert unintentional whitespace changes separate the serial and parallel pex creations remove the attempts at parallelism add --medusa-path remove unused code make the medusa hook work when not provided add back a tracer revert some changes that make things harder to read revert some changes i shouldn't need make medusa work with the medusa-zip package and not subprocesses! update after adding defaults in medusa-zip python package remove -f arg for resolving medusa-zip [BROKEN] possibly obsolete! fix cli arg add stitched layout create stitch copymode no initial stitch impl add merge_zip_file() method move packed wheel caching into common methods initial impl of merging zips with entry renaming make MergeableZipFile into a subclass of ZipFile fix header offset calculation tmp fix mypy remove unused imports fix buffering for file handles in py2 Revert "tmp" This reverts commit 8ad12de71a455c3918434cd81dcaf319fa43d9b4.
Oops! Didn't realize that was the current behavior! However, regarding the following:
Would it be useful to put |
I'm advocating for don't muck with essentially unrelated stuff in this PR. Please keep the status quo.
Your contributions are very welcome, but I'm having flashbacks of the wild sprawling nature these things tended to have and that part is very unwelcome as a reviewer, especially a plodding, single threaded reviewer like me. Please keep the exposed products of your work as focused as possible regardless of how sprawling your internal work process may be. |
Please see #2175 (comment), where you'll find I had a simple misunderstanding. I understand you are not at liberty to review in depth right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Danny. This looks good to me mod some details.
copy_file_range(source_zf.fp, self.fp, zinfo.compress_size) | ||
|
||
# (4) Hack the synthesized ZipInfo onto the destination zip's infos. | ||
self.filelist.append(zinfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the hack tally is self.{fp,filelist,NameToInfo}
afaict. Techincally "public", but certainly not documented as such. CI proves this is OK across CPython 2.7, 3.5, 3.7, 3.11 and 3.12 as well as PyPy 2.7 and 3.9. CI will continue to be extensive; so I'm not too worried in the medium term, but it does seem like this should all be behind a flag given the combination of the hackyness and the new cache burden (which you thankfully eliminated for --layout packed
users like Pants, but still exists for folks who never use(d) that option). You seemed to be on board with a feature flag ealier - does that still seem reasonable?
Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps no flag is just fine. I'm a bit conflicted here. I still do not see the real use case for this feature, I'd personally use --layout packed
for both iteration and distribution (via rsync over ssh), but Pex has never treated new caches as opt-in. Its cache behavior is currently uniformly untunable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seemed to be on board with a feature flag ealier - does that still seem reasonable?
Yes, absolutely! Pip still has the zip file magic I proposed to them behind --use-feature=fast-deps
for similar reasons, although that's had a lot more people contribute to it since then.
the new cache burden (which is you thankfully eliminated for --layout packed users like Pants, but still exists for folks who never use(d) that option).
Yes: for this reason alone I think it should be behind a feature flag.
Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.
Well yes, that definitely hasn't evaporated; in fact, it's quite useful, and indeed takes advantage of rust's structured parallelism, and it's a CLI, a rust library, and a pyo3 python library: https://github.com/cosmicexplorer/medusa-zip. I also got the hang of using github actions and cibuildwheels in the meantime and it's now published to crates.io ({,lib,py}medusa-zip
) and pypi (medusa-zip
) upon every tagged release. I've been continuing to speak with @jonjohnsonjr regarding the similarity of this work to Chainguard's use case efficiently merging tarballs and/or containers (see chainguard-dev/apko#781) and considering whether we can share any code. libarchive
exists, but its goal is compatibility with the widest range of formats, not performance, and it seems plausible that a general framework for efficiently merging the contents of archives might have applicability to a wide range of build tools and package managers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outside Python.
I might propose this ZipFile#merge_archive()
method to python-ideas
too, although it seems possibly too specialized to be in the stdlib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems too specialized. FWICT they have been on a somewhat lazy warpath to do less in the stdlib, not more.
Yes, but
The |
Just noticed this:
I actually mean it. The docs are in a sorry state. They are horrible. I have not spent any time on them in the ~5 years I've been working as the primary Pex maintainer; meanwhile I have spent time adding a lot of features for Pants; so the docs are severely out of date. I do plan to circle back though. I've vetted out a whole system over at https://github.com/a-scie/lift based on Sphinx that I will be applying here along with fully re-worked content. |
a3d3415
to
e587fd3
Compare
a223a94
to
738c007
Compare
On it! #2209 |
738c007
to
713d0ed
Compare
713d0ed
to
31592ec
Compare
@jsirois you already approved this, but I assumed you'd want to review again after I introduced the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look more closely here in an hour or so (this is from phone); then I'm AFK until August 18th.
@@ -664,17 +672,34 @@ def build( | |||
else: | |||
shutil.rmtree(tmp_pex, True) | |||
|
|||
if not compress: | |||
if cache_dists is True: | |||
pex_warnings.warn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, say there is some legitimate use case, aka the user knows better. You've now pissed them off by nagging them. It's much cleaner if you just fail fast if that makes sense to do. I'm not commenting on whether or not there is a legitimate use case, my intent is just to point out the cost of combinatorial feature explosion and the nanny vs. document vs. trust the user tradeoffs.
Re the commit description, I'm still not a fan of the formula - you can say the exact same thing without the damn template. But that's just my grumpy dislike of the 5 paragraph essay. More importantly you just whiff the actual alternative, which is the packed layout. I think you should address why that doesn't work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding the uncompressed option to the existing packed layout is a mistake. Seems good otherwise.
"PEX's bootstrap and dependency zip files. Does nothing for loose layout PEXes." | ||
"PEX's bootstrap and dependency zip files. " | ||
"Uncompressed PEX files are much faster to create from an empty cache, but are no " | ||
"faster after the cache has been populated, and uncompressed cache entries will " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading a second time there is just no reason to cache uncompressed at all. I definitely vote for fail fast. It seems compress=False
, cached_dists=True
should not be allowed at all. Further, there is no reason for the current packed layout behavior to be toggleable. If you're building a packed layout, the zips should be compressed zips since the whole point is cache friendliness. If there is no call to add a degree of freedom, don't add it because, at least in Pex, you can never take it back.
zinfo.compress_type = zipfile.ZIP_STORED | ||
data = b"" | ||
else: | ||
zinfo.file_size = st.st_size | ||
zinfo.compress_type = zipfile.ZIP_DEFLATED | ||
# File contents may be compressed or decompressed. Decompressed is significantly faster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a gratuitous comment. You're selling here or else speaking to an end user. Clearly this is not the place to speak to the user and the coder doesn't need to be sold to, if a function takes a param ... it takes a param. If its a public function maybe sell in the API doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, I'll be offline ~presently. Hopefully this approve over-rides the prior request changes. I would like to see the change, but I trust you to do the right thing.
Thanks for putting up with my opinions here. You've started a contribution spike and I'm grateful for that, but I'm pretty open about not viewing all contribution as a good thing by default. As such I'm just trying to set up my expectations / the current Pex code base opinions clearly to help make further contribution more straightforward for us both. My time on this project is limited as I'm sure yours is too; so making things efficient seems to benefit us both.
@cosmicexplorer although this PR is approved, another thing to weigh is #2298 which provides even greater speedups with 0 hacks by skipping zipping entirely with a bonus of skipping unzipping (pre-installing) as well. |
Hey @jsirois, thanks so much for your thoughtful review. I'll be rebasing and responding to your comments, although I understand that pex has changed a bit in the meantime. I think this is a useful change and I have the time to see it through. |
Ah, I'll read through #2298 before doing that. I know you've been pushing back on adding this PR's complexity to pex for a while, and I've since found that some of the tricks here are probably better suited for the rust zip crate. I'll read that change and re-evaluate this now. Thanks. |
Ok, it seems like this change will still provide a speedup in zipapp creation, without incurring the (very slight = O(100ms)) cold-start performance slowdown of |
Ok, I totally buy this comment: #2175 (comment). I think I'm willing to close this and move the zip file magic to my work on the zip crate. I was really surprised/impressed that we were able to do this, but |
Closes #2158.
Problem
#1675 has another repro.
Generating a
--compress --layout zipapp
(the default) regularly exhibits pathologically slow performance (46s zip / 1m1s total) on many real-world inputs (especially tensorflow). This occurs even when all dists are already downloaded from pypi:Alternatives
#1705 introduced
--no-compress
, which drastically improves the zip performance (2.2s zip => 20.6x speedup / 19.3s total = 2.4x speedup):but as #1686 describes, it also introduces a significant tradeoff in file size (1.6G => 2.5x blowup):
The cache entries are also many times larger:
Solution
pex.ziputils.MergeableZipFile
to incorporate the contents of other zip files without decompressing them.pex.common.copy_file_range()
to copy over a limited range of the contents of another file handle.--cache-dists
option to toggle the use (and/or creation) of the packed wheel caches for.bootstrap/
and.deps/
subdirectories.--layout packed
caches for--layout zipapp
when--cache-dists
is provided.