-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Avoid unnecessary URL processing while parsing links #13132
Conversation
As an example for where this matters, with #13128 already applied, this saves about 1600 ms while collecting and resolving a list of homeassistant dependencies: Command:
And while this depends on network performance (so please look at the time elapsed, not the percentages which may be off), the entire command takes ~16-18 seconds, so the savings are significant. |
src/pip/_internal/models/link.py
Outdated
# correct to parse the file URL and check if it has a scheme, but the | ||
# slow URL parsing urljoin() does is what we're trying to avoid in the | ||
# first place, so we only check for the http[s]:// prefix.) | ||
if file_url.startswith(("https://", "http://")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we don't need to support other schemes here? What about file://
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire function is meant to be an optimization. We used to call urljoin()
for every single file URL, but that was pointless for PyPI simple pages as their file URLs are already absolute (and thus urljoin()
would return said URL unchanged). So it's not a matter of support, but rather is it worth to optimize for file://
? How common are file://
URLs in a simple index response? They'd only make sense with a local dev server (maybe devpi?)
I did just check and apparently relative file://
URLs are not a thing, so it'd be totally fine to also check for that scheme, but I'm curious to whether you know where they'd show up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not optimising file://
is fine to me but we should make this clear in the function name (e.g. call this _absolute_http_url
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this function still return an absolute url regardless? I thought this optimization was an implementation detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and it seems like devpi doesn't serve absolute URLs, so I think it's fair to leave it out to avoid paying the (small) overhead to optimize the rare file://
URLs.
83f9a08
to
66e5e73
Compare
There are three optimizations in this commit, in descending order of impact: - If the file URL in the "project detail" response is already absolute, then avoid calling urljoin() as it's expensive (mostly because it calls urlparse() on both of its URL arguments) and does nothing. While it'd be more correct to check whether the file URL has a scheme, we'd need to parse the URL which is what we're trying to avoid in the first place. Anyway, by simply checking if the URL starts with http[s]://, we can avoid slow urljoin() calls for PyPI responses. - Replacing urllib.parse.urlparse() with urllib.parse.urlsplit() in _ensure_quoted_url(). The URL parsing functions are equivalent for our needs[^1]. However, urlsplit() isfaster, and we achieve better cache utilization of its internal cache if we call it directly[^2]. - Calculating the Link.path property in advance as it's very hot. [^1]: we don't care about URL parameters AFAIK (which are different than the query component!) [^2]: urlparse() calls urlsplit() internally, but it passes the authority parameter (unlike any of our calls) so it bypasses the cache. Co-authored-by: Stéphane Bidoul <[email protected]>
11e3065
to
3a847b9
Compare
Thanks for the review @sbidoul. To be safe, I threw the optimized |
There are three optimizations in this commit, in descending order of impact:
If the file URL in the "project detail" response is already absolute, then avoid calling
urljoin()
as it's expensive (mostly because it callsurlparse()
on both of its URL arguments) and does nothing. While it'd be more correct to check whether the file URL has a scheme, we'd need to parse the URL which is what we're trying to avoid in the first place. Anyway, by simply checking if the URL starts withhttp[s]://
, we can avoid slowurljoin()
calls for PyPI responses.Replacing
urllib.parse.urlparse()
withurllib.parse.urlsplit()
in_ensure_quoted_url()
. The URL parsing functions are equivalent for our needs1. However,urlsplit()
isfaster, and we achieve better cache utilization of its internal cache if we call it directly2.Calculating the Link.path property in advance as it's very hot.
Footnotes
we don't care about URL parameters AFAIK (which are different than the query component!) ↩
urlparse()
callsurlsplit()
internally, but it passes theauthority
parameter (unlike any of our calls) so it bypasses the cache. ↩