Make URL from extract_link safe and add missing scheme #19

PyExplorer · 2024-12-06T10:18:30Z

Problem:
In some cases, the extract_link function returns links that are not fully valid or not safe to use in requests.
Suggested fix:
Add a function that makes the link safe using the safe_url_string function and also adds a scheme when the link looks valid but lacks one and no base URL is available (e.g. //example.com).

PyExplorer · 2024-12-06T10:19:31Z

Hi @wRAR please take a look at the suggested fix.

zyte_parsers/utils.py

Co-authored-by: Andrey Rakhmatullin <[email protected]>

PyExplorer · 2024-12-06T11:43:51Z

Hi @lopuhin , could you also take a look?

lopuhin

Thanks @PyExplorer looks good to me

lopuhin · 2024-12-06T12:22:49Z

zyte_parsers/utils.py

+    # add scheme (https) when missing schema and no base url
+    safe_link = add_https_to_url(safe_link)


Super minor: I think the "no base url" part of the comment does not quite apply, as this can have an effect even if a base URL is supplied. Also, if we take that away we'd just repeat the function name, so maybe we could remove the comment completely.

Ah, when base_url is provided, the link gets its scheme from base_url in strip_urljoin.
I just wanted to point out that appending the scheme is a kind of add-on to strip_urljoin.
Do you think it's better to remove this to avoid confusion?
UPDATE: For simplicity, I removed this function.
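
To illustrate the point above about the scheme coming from base_url, here is a minimal sketch using only the standard library. The helper join_stripped is a hypothetical stand-in for strip_urljoin (its real implementation is not shown in this thread); the sketch assumes it behaves roughly like urljoin applied to a whitespace-stripped link.

```python
from urllib.parse import urljoin


def join_stripped(base_url: str, href: str) -> str:
    """Hypothetical stand-in for strip_urljoin: strip whitespace, then resolve."""
    return urljoin(base_url, href.strip())


# A scheme-relative link inherits the scheme of the base URL.
print(join_stripped("https://example.org/page", " //example.com/a "))
# -> https://example.com/a
print(join_stripped("http://example.org/page", "//example.com/a"))
# -> http://example.com/a

# Without a base URL there is nothing to inherit, so the link stays as it is.
print(join_stripped("", "//example.com/a"))
# -> //example.com/a
```

Only the last case, where no base URL is available, is the one an extra "add scheme" step could affect.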

lopuhin · 2024-12-06T12:24:57Z

zyte_parsers/utils.py

+
+
+def extract_link(
+    a_node: SelectorOrElement, base_url: str, force_safe: bool = False


As I understand it, you're adding force_safe to make the new behavior optional. I wonder if it's possible to apply it unconditionally to keep the API simpler, or are there cases where you are worried it could break working code?

Yes. I don't actually know where and how this function is used, so I made the behavior optional to prevent unpredictable problems. It also covers cases where someone wants to use the function without changing the link at all.
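
The opt-in design discussed above could be sketched roughly as follows. This is not the code from the PR: resolve_link is a hypothetical name, and urllib.parse.quote is used here only as a simplified stand-in for safe_url_string, which handles more cases. The point is just that the default path leaves the link unchanged.

```python
from urllib.parse import quote, urljoin

# Characters left untouched by the stand-in escaping step. "%" is included so
# that existing percent-escapes are not encoded a second time.
_KEEP = ":/?#[]@!$&'()*+,;=%~"


def resolve_link(href: str, base_url: str, force_safe: bool = False) -> str:
    """Hypothetical helper: resolve href against base_url, optionally escaping it."""
    link = urljoin(base_url, href.strip())
    if force_safe:
        # Assumption: a simplified replacement for safe_url_string.
        link = quote(link, safe=_KEEP)
    return link


# Default behavior: the resolved link is returned unchanged.
print(resolve_link("/a b", "https://example.com"))
# -> https://example.com/a b

# Opt-in behavior: unsafe characters are percent-encoded.
print(resolve_link("/a b", "https://example.com", force_safe=True))
# -> https://example.com/a%20b
```

Keeping the flag off by default means existing callers see no change in output, which matches the concern raised in the reply above.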

kmike · 2024-12-09T21:14:49Z

zyte_parsers/utils.py

+def add_https_to_url(url: str) -> str:
+    if url.startswith(("http://", "https://")):
+        return url
+
+    parsed_url = urlparse(url)
+    if not parsed_url.scheme and parsed_url.netloc:
+        parsed_url = parsed_url._replace(scheme="https")
+
+    return str(urlunparse(parsed_url))


Hm, why do we need this function? If a URL is written without http or https, it won't work in the browser anyway, right (unless it's a relative URL, which follows different rules, so adding http/https is wrong for it)?

If so, why are we adding the missing scheme? My concern is that it would turn some invalid URLs into URLs which look valid but are not.

Hm, or is it the // example, i.e. a scheme-relative URL ("use whatever protocol the page is using")? In that case, adding https might not be 100% precise, as it should be the protocol of the web page itself.

Yes, the idea is to handle cases like //example.com when base_url (see https://github.com/zytedata/zyte-parsers/pull/19/files#diff-0db3ea2158d725e2de6295c907fbe44928efc0c94057ef6ca85dbbee4c733819R83) is not provided.
But this function doesn't seem to be a very important feature at the moment, and we can easily do without it.
UPDATE: For simplicity, I removed this function.
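
For reference, the behavior under discussion can be checked by running the function from the diff above on a few inputs. The function body is copied from the reviewed diff (with the imports it needs); the sample inputs are illustrative assumptions, not test cases from the PR.

```python
from urllib.parse import urlparse, urlunparse


def add_https_to_url(url: str) -> str:
    # Copied from the diff under review (later removed from the PR).
    if url.startswith(("http://", "https://")):
        return url

    parsed_url = urlparse(url)
    # Only scheme-relative URLs have a netloc but no scheme.
    if not parsed_url.scheme and parsed_url.netloc:
        parsed_url = parsed_url._replace(scheme="https")

    return str(urlunparse(parsed_url))


for sample in (
    "//example.com",       # scheme-relative: gets https
    "example.com/path",    # parsed as a relative path, left unchanged
    "/relative/path",      # relative URL, left unchanged
    "http://example.com",  # already has a scheme, early return
):
    print(f"{sample!r} -> {add_https_to_url(sample)!r}")
```

So the function only touches the //example.com form. A bare example.com/path is treated by urlparse as a path rather than a host, so it is not turned into something that merely looks valid. The remaining imprecision is the one raised above: https is assumed even when the page itself was served over http.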

PyExplorer · 2024-12-12T06:45:07Z

@kmike, @lopuhin could you please take a look once again? I removed add_https_to_url

lopuhin

Thanks @PyExplorer looks good to me. Perhaps we could make this default eventually, but it's better to have it here and it's nice that we have more tests now.

PyExplorer added 3 commits December 6, 2024 13:05

add tests for extract_link

0d4669d

make link from extract_link safe and with missing scheme

138fb39

add early return to add_https_to_url

0fcfaba

PyExplorer requested a review from wRAR December 6, 2024 10:18

PyExplorer added 2 commits December 6, 2024 13:22

format

4b854dc

add early return if link is empty/None

8c77922

wRAR reviewed Dec 6, 2024

zyte_parsers/utils.py (outdated)

Update zyte_parsers/utils.py

6bbc25d

Co-authored-by: Andrey Rakhmatullin <[email protected]>

PyExplorer requested a review from wRAR December 6, 2024 11:43

lopuhin approved these changes Dec 6, 2024

PyExplorer requested a review from lopuhin December 9, 2024 18:53

kmike reviewed Dec 9, 2024

PyExplorer added 2 commits December 10, 2024 09:55

tune add_https_to_url

209fbe7

Merge remote-tracking branch 'origin/extract_url_safe_https' into ext…

ef9925d

…ract_url_safe_https

PyExplorer requested a review from kmike December 12, 2024 06:44

remove add_https_to_url

9a31516

lopuhin approved these changes Dec 12, 2024
