-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make URL from extract_link safe and add missing scheme #19
Open
PyExplorer
wants to merge
9
commits into
main
Choose a base branch
from
extract_url_safe_https
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
0d4669d
add tests for extract_link
PyExplorer 138fb39
make link from extract_link safe and with missing scheme
PyExplorer 0fcfaba
add early return to add_https_to_url
PyExplorer 4b854dc
format
PyExplorer 8c77922
add early return if link is empty/None
PyExplorer 6bbc25d
Update zyte_parsers/utils.py
PyExplorer 209fbe7
tune add_https_to_url
PyExplorer ef9925d
Merge remote-tracking branch 'origin/extract_url_safe_https' into ext…
PyExplorer 9a31516
remove add_https_to_url
PyExplorer File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,72 @@ | ||
from pathlib import Path | ||
|
||
import pytest | ||
from parsel import Selector # noqa: F401 | ||
|
||
from zyte_parsers.utils import extract_link, fromstring | ||
|
||
TEST_DATA_ROOT = Path(__file__).parent / "data" | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"html_input, base_url, expected_output", | ||
[ | ||
("<a href=' http://example.com'>", "", "http://example.com"), | ||
("<a href='foo'>", "http://example.com", "http://example.com/foo"), | ||
("<a href='/foo '>", "http://example.com", "http://example.com/foo"), | ||
("<a href='//foo '>", "http://example.com", "http://foo"), | ||
( | ||
"<a href='//example.com/foo'>", | ||
"http://example.com", | ||
"http://example.com/foo", | ||
), | ||
# Selector | ||
( | ||
Selector(text="<a href='http://example.com'>").css("a")[0], | ||
"", | ||
"http://example.com", | ||
), | ||
# no base url | ||
("<a href='foo'>", "", "foo"), | ||
("<a href='/foo '>", "", "/foo"), | ||
("<a href='//foo '>", "", "//foo"), | ||
("<a href='' data-url='http://example.com'>", "", "http://example.com"), | ||
("<a href='http://example.com'>", "", "http://example.com"), | ||
# invalid url | ||
("<a href='javascript:void(0)'>", "", None), | ||
("<a href=''>", "http://example.com", None), | ||
], | ||
) | ||
def test_extract_link(html_input, base_url, expected_output): | ||
a_node = fromstring(html_input) if isinstance(html_input, str) else html_input | ||
result = extract_link(a_node, base_url) | ||
assert result == expected_output | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"html_input, base_url, expected_output", | ||
[ | ||
# Spaces in the path | ||
( | ||
"<a href='/path/to/resource with spaces'>", | ||
"http://example.com", | ||
"http://example.com/path/to/resource%20with%20spaces", | ||
), | ||
# Missing schema and base_url | ||
( | ||
"<a href='//example.com/foo'>", | ||
"", | ||
"//example.com/foo", | ||
), | ||
# no base url | ||
("<a href='foo'>", "", "foo"), | ||
("<a href='/foo '>", "", "/foo"), | ||
("<a href='//foo '>", "", "//foo"), | ||
("<a href='' data-url='http://example.com'>", "", "http://example.com"), | ||
("<a href='http://example.com'>", "", "http://example.com"), | ||
], | ||
) | ||
def test_extract_safe_link(html_input, base_url, expected_output): | ||
a_node = fromstring(html_input) if isinstance(html_input, str) else html_input | ||
result = extract_link(a_node, base_url, force_safe=True) | ||
assert result == expected_output |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand you're adding
force_safe
to make new behavior optional, I wonder if it's possible to apply that unconditionally to keep API simpler, or there are any cases where you are worried it can break working code?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, actually I don't know where and in what cases this feature is used and decided to add it in such a way to prevent any unpredictable problems. And also for the cases if someone wants to use this without changing the link at all.