Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make URL from extract_link safe and add missing scheme #19

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

PyExplorer
Copy link

Problem:
In some cases, the extract_link function returns links that are not fully valid or not entirely safe to use in requests.
Suggested fix:
There is an additional function that makes the link safe using the safe_url_string function, and also adds a scheme for cases where we have a link that looks valid but lacks a scheme and base URL. (e.g. //example.com)

@PyExplorer PyExplorer requested a review from wRAR December 6, 2024 10:18
@PyExplorer
Copy link
Author

Hi @wRAR please take a look at the suggested fix.

zyte_parsers/utils.py Outdated Show resolved Hide resolved
Co-authored-by: Andrey Rakhmatullin <[email protected]>
@PyExplorer PyExplorer requested a review from wRAR December 6, 2024 11:43
@PyExplorer
Copy link
Author

Hi @lopuhin , could you also take a look?

Copy link
Contributor

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PyExplorer looks good to me

Comment on lines 95 to 96
# add scheme (https) when missing schema and no base url
safe_link = add_https_to_url(safe_link)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor: I think "no base url" part of the comment does not quite apply, as this can have effect even if base URL is supplied. Also if we take that away we'd just repeat the function name so maybe we could remove the comment completely.

Copy link
Author

@PyExplorer PyExplorer Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, in case of existing base_url this link gets scheme from base_url here strip_urljoin.
I just wanted to point out that this append scheme is kind of an add-on to strip_urljoin.
Do you think it's better to remove this to avoid confusion?
UPDATE: For the simplicity I removed this function.



def extract_link(
a_node: SelectorOrElement, base_url: str, force_safe: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand you're adding force_safe to make new behavior optional, I wonder if it's possible to apply that unconditionally to keep API simpler, or there are any cases where you are worried it can break working code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actually I don't know where and in what cases this feature is used and decided to add it in such a way to prevent any unpredictable problems. And also for the cases if someone wants to use this without changing the link at all.

@PyExplorer PyExplorer requested a review from lopuhin December 9, 2024 18:53
Comment on lines 59 to 67
def add_https_to_url(url: str) -> str:
if url.startswith(("http://", "https://")):
return url

parsed_url = urlparse(url)
if not parsed_url.scheme and parsed_url.netloc:
parsed_url = parsed_url._replace(scheme="https")

return str(urlunparse(parsed_url))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why do we need this function? If a URL is written without http or https, it won't work anyways in the browser, right (unless it's a relative URL, which uses different rules - adding http/https is wrong for them)?

If so, why are we adding the missing schema? My concern is that it'd change some invalid URLs to URLs which look valid, but which are not.

Copy link
Contributor

@kmike kmike Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, or is it the // example, i.e. the relative schema ("use whatever protocol the page is using")? In this case, adding https could be not 100% precise, as it should be the protocol of the web page itself.

Copy link
Author

@PyExplorer PyExplorer Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea is to handle such cases like //example.com when base_url here https://github.com/zytedata/zyte-parsers/pull/19/files#diff-0db3ea2158d725e2de6295c907fbe44928efc0c94057ef6ca85dbbee4c733819R83 is not provided.
But this function doesn't seem to be a very important feature at the moment and we can easily ignore it.
UPDATE: For the simplicity I removed this function.

@PyExplorer PyExplorer requested a review from kmike December 12, 2024 06:44
@PyExplorer
Copy link
Author

PyExplorer commented Dec 12, 2024

@kmike, @lopuhin could you please take a look once again? I removed add_https_to_url

Copy link
Contributor

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PyExplorer looks good to me. Perhaps we could make this default eventually, but it's better to have it here and it's nice that we have more tests now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants