-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathCHANGELOG.txt
476 lines (378 loc) · 20.9 KB
/
CHANGELOG.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
1.0
1.0 will be a complete api overhaul
0.12.1
* typing
* added `METADATA_PARSER_FUTURE` environment variable
`export METADATA_PARSER_FUTURE=1` to enable
* is_parsed_valid_url can accept a ParseResultBytes object now
0.12.0
* drop python 2.7
* initial typing support
0.11.0 | UNRELEASED
* BREAKING CHANGES
Due to the following breaking changes, the version was bumped to 0.11.0
* `MetadataParser.fetch_url` now returns a third item.
* COMPATIBLE CHANGES
The following changes are backwards compatible to the 0.10.x releases
* a test-suite for an application leveraging `metadata_parser` experienced
some issues due to changes in the Responses package used to mock tests.
to better faciliate against that, a new change were made:
MetadataParser now has 2 subclassable attributes for items that should
or should not be parsed:
+ _content_types_parse = ("text/html",)
+ _content_types_noparse = ("application/json",)
Previously, these values were hardcoded into the logic.
* some error log messages were reformatted for clarity
* some error log messages were incorrectly reformatted by black
* added logging for NotParseable situations involving redirects
* added a `.response` attribute to NotParsable errors to help debug
redirects
* added a new ResponseHistory class to track redirects
* it is computed and returned during `MetadataParser.fetch_url`
* `MetadataParser.parse(` optionally accepts it, and will stash
it into ParsedResult
* `ParsedResult`
* ResponseHistory is not stashed in the metadata stash, but a new namespace
* `.response_history` will either be `ResponseHistory` or None
* improving docstrings
* added `decode_html` helper
* extended MetadataParser to allow registration of a defcault_encoder for results
* style cleanup
0.10.5
packaging fixes
migrated 'types.txt' out of distribution; it remains in github source
updated some log lines with the url
introduced some new log lines
added `METADATA_PARSER__DISABLE_TLDEXTRACT` env
merged, but reverted PR#34 which addresses Issue#32
0.10.4
* black via pre-commit
* upgraded black; 20.8b1
* integrated with pre-commit
* github actions and tox
* several test files were not in git!
0.10.3
updated docs on bad data
black formatting
added pyproject.toml
moved BeautifulSoup generation into it's own method, so anyone can subclass to customize
:fixes: https://github.com/jvanasco/metadata_parser/issues/25
some internal variable changes thanks to flake8
0.10.2
added some docs on encoding
0.10.1
clarifying some inline docs
BREAKING CHANGE: `fetch_url` now returns a tuple of `(html, encoding)
now tracking in ParsedResult: encoding
ParsedResult.metadata['_internal']['encoding'] = resp.encoding.lower() if resp.encoding else None
`.parse` now accepts `html_encoding`
refactored url fetching to use context managers
refactored url fetching to only insert our hooks when needed
adjusted test harness to close socket connections
0.10.0
better Python3 support by using the six library
0.9.23
added tests for url entities
better grabbing of the charset
better grabbing of some edge cases
0.9.22
removed internal calls to the deprecated `get_metadata`, replacing them with `get_metadatas`.
this will avoid emitting a deprecation warning, allowing users to migrate more easily
0.9.21
* requests_toolbelt is now required
** this is to solve PR#16 / Issue#21
** the toolbelt and built-in versions of get_encodings_from_content required different workarounds
* the output of urlparse is now cached onto the parser instance.
** perhaps this will be global cache in the future
* MetadataParser now accepts `cached_urlparser`
** default: True
options: True: use a instance of UrlParserCacheable(maxitems=30)
: INT: use a instance of UrlParserCacheable(maxitems=cached_urlparser)
: None/False/0 - use native urlparse
: other truthy values - use as a custom urlparse
* addressing issue #17 (https://github.com/jvanasco/metadata_parser/issues/17) where `get_link_` logic does not handle schemeless urls.
** `MetadataParser.get_metadata_link` will now try to upgrade schemeless links (e.g. urls that start with "//")
** `MetadataParser.get_metadata_link` will now check values against `FIELDS_REQUIRE_HTTPS` in certain situations to see if the value is valid for http
** `MetadataParser.schemeless_fields_upgradeable` is a tuple of the fields which can be upgradeable. this defaults to a package definition, but can be changed on a per-parser bases.
The defaults are:
'image',
'og:image', 'og:image:url', 'og:audio', 'og:video',
'og:image:secure_url', 'og:audio:secure_url', 'og:video:secure_url',
** `MetadataParser.schemeless_fields_disallow` is a tuple of the fields which can not be upgradeable. this defaults to a package definition, but can be changed on a per-parser bases.
The defaults are:
'canonical',
'og:url',
** `MetadataParser.get_url_scheme()` is a new method to expose the scheme of the active url
** `MetadataParser.upgrade_schemeless_url()` is a new method to upgrade schemeless links
it accepts two arguments: url and field(optional)
if present, the field is checked against the package tuple FIELDS_REQUIRE_HTTPS to see if the value is valid for http
'og:image:secure_url',
'og:audio:secure_url',
'og:video:secure_url',
0.9.20
* support for deprecated `twitter:label` and `twitter:data` metatags, which use "value" instead of "content".
* new param to `__init__` and `parse`: `support_malformed` (default `None`).
if true, will support malformed parsing (such as consulting "value" instead of "content".
functionality extended from PR #13 (https://github.com/jvanasco/metadata_parser/pull/13) from https://github.com/amensouissi
0.9.19
* addressing https://github.com/jvanasco/metadata_parser/issues/12
on pages with duplicate metadata keys, additional elements are ignored
when parsing the document, duplicate data was not kept.
* `MetadataParser.get_metadata` will always return a single string (or none)
* `MetadataParser.get_metadatas` has been introduced. this will always return an array.
* the internal parsed_metadata store will now store data in a mix of arrays and strings, keeping it backwards compatible
* This new version benches slightly slower because of the mixed format but preserves a smaller footprint.
* the parsed result now contains a version record for tracking the format `_v`.
* standardized single/double quoting
* cleaned up some line
* the library will try to coerce strategy= arguments into the right format
* when getting dublin core data, the result could either be a string of a dict. there's no good way to handle this.
* added tests for encoders
* greatly expanded tests
0.9.18
* removed a stray debug line
0.9.17
* added `retry_dropped_without_headers` option
0.9.16
* added `fix_unicode_url()`
* Added `allow_unicode_url` (default True) to the following calls:
`MetadataParser.get_url_canonical`
`MetadataParser.get_url_opengraph`
`MetadataParser.get_discrete_url`
This functionality will try to recode canonical urls with unicode data into percent-encoded streams
0.9.15
* Python3 support returned
0.9.14
* added some more tests to ensure encoding detected correctly
* stash the soup sooner when parsing, to aid in debugging
0.9.13
* doing some work to guess encoding...
* internal: now using `resp` instead of `r`, it is easier for pdb debugging
* the peername check was changed to be a hook, so it can be processed more immediately
* the custom session redirect test was altered
* changed the DummyResponse encoding fallback to `ENCODING_FALLBACK` which is Latin (not utf8)
this is somewhat backwards incompatible with this library, but maintains compatibility with the underlying `requests` library
0.9.12
* added more attributes to DummyResponse:
** `content`
** `headers`
0.9.11
* some changes to how we handle upgrading bad canonicals
upgrades will no longer happen IF they specify a bad domain.
upgrades from localhost will still transfer over
0.9.10
* slight reorder internally of TLD extract support
0.9.9
* inspecting `requests` errors for a response and using it if possible
* this will now try to validate urls if the `tldextract` library is present.
this feature can be disabled with a global toggle
import metadata_parser
metadata_parser.USE_TLDEXTRACT = False
0.9.7
* changed some internal variable names to better clarify difference between a hostname and netloc
0.9.7
updated the following functions to test for RFC valid characters in the url string
some websites, even BIG PROFESSIONAL ONES, will put html in here.
idiots? amateurs? lazy? doesn't matter, they're now our problem. well, not anymore.
* get_url_canonical
* get_url_opengraph
* get_metadata_link
0.9.6
this is being held for an update to the `requests` library
* made the following arguments to `MetadataParser.fetch_url()` default to None - which will then default to the class setting. they are all passed-through to `requests.get`
** `ssl_verify`
** `allow_redirects`
** `requests_timeout`
* removed `force_parse` kwarg from `MetadataParser.parser`
* added 'metadata_parser.RedirectDetected' class. if allow_redirects is False, a detected redirect will raise this.
* added 'metadata_parser.NotParsableRedirect' class. if allow_redirects is False, a detected redirect will raise this if missing a Location.
* added `requests_session` argument to `MetadataParser`
* starting to use httpbin for some tests
* detecting JSON documents
* extended NotParseable exceptions with the MetadataParser instance as `metadataParser`
* added `only_parse_http_ok` which defaults to True (legacy). submitting False will allow non-http200 responses to be parsed.
* shuffled `fetch_url` logic around. it will now process more data before a potential error.
* working on support for custom request sessions that can better handle redirects (requires patch or future version of requests)
* caching the peername onto the response object as `_mp_peername` [ _m(etadata)p(arser)_peername ]. this will allow it to be calculated in a redirect session hook. (see tests/sessions.py)
* added `defer_fetch` argument to `MetadataParser.__init__`, default ``False``. If ``True``, this will overwrite the instance's `deferred_fetch` method to actually fetch the url. this strategy allows for the `page` to be defined and response history caught. Under this situation, a 301 redirecting to a 500 can be observed; in the previous versions only the 500 would be caught.
* starting to encapsulate everything into a "parsed result" class
* fixed opengraph minimum check
* added `MetadataParser.is_redirect_unique`
* added `DummyResponse.history`
0.9.5
* failing to load a document into BeautifulSoup will now catch the BS error and raise NotParsable
0.9.4
* created `MetadataParser.get_url_canonical`
* created `MetadataParser.get_url_opengraph`
* `MetadataParser.get_discrete_url` now calls `get_url_canonical` and `get_url_opengraph`
0.9.3
* fixed packaging error. removed debug "print" statements
0.9.2
* upgrade nested local canonical rels correctly
0.9.1
* added a new `_internal` storage namespace to the `MetadataParser.metadata` payload.
this simply stashes the `MetadataParser.url` and `MetadataParser.url_actual` attributes to makes objects easier to encode for debugging
* the twitter parsing was incorrectly looking for 'value' not 'content' as in the current spec
* tracking the shortlink on a page
0.9.0
- This has a default behavior change regarding `get_discrete_url()` .
- `is_parsed_valid_url()` did not correctly handle `require_public_netloc=True`, and would allow for `localhost` values to pass
- new kwarg `allow_localhosts` added to
* is_parsed_valid_url
* is_url_valid
* url_to_absolute_url
* MetadataParser.__init__
* MetadataParser.absolute_url
* MetadataParser.get_discrete_url
* MetadataParser.get_metadata_link
- new method `get_fallback_url`
- `url_to_absolute_url` will return `None` if not supplied with a fallback and test url. Previously an error in parsing would occur
- `url_to_absolute_url` tries to do a better job at determining the intended url when given a malformed url.
0.8.3
- packaging fixes
0.8.2
- incorporated fix in https://github.com/jvanasco/metadata_parser/pull/10 to handle windows support of socket objects
- cleaned up some tests
- added `encode_ascii` helper
- added git-ignored `tests/private` directory for non-public tests
- added an `encoder` argument to `get_metadata` for encoding values
0.8.1
added 2 new properties to a computed MetadataParser object:
is_redirect = None
is_redirect_same_host = None
in the case of redirects, we only have the peername available for the final URL (not the source)
if a response is a redirect, it may not be for the same host -- and the peername would correspond to the destination URL -- not the origin
0.8.0
this bump introduces 2 new arguments and some changed behavior:
- `search_head_only=None`. previously the meta/og/etc data was only searched in the document head (where expected as per HTML specs).
after indexing millions of pages, many appeared to implement this incorrectly of have html that is so off specification that
parsing libraries can't correctly read it (for example, Twitter.com).
This is currently implemented to default from None to True, but future versions will default to `False`.
This is marked for a future default of `search_head_only=False`
- `raise_on_invalid`. default False. If True, this will raise a new exception: InvalidDocument if the response
does not look like a proper html document
0.7.4
- more aggressive attempts to get the peername.
0.7.3
- this will now try to cache the `peername` of the request (ie, the remote server) onto the peername attribute
0.7.2
- applying a `strip()` to the "title". bad authors/cms often have whitespace.
0.7.1
- added kwargs to docstrings
- `get_metadata_link` behavior has been changed as follows:
* if an encoded uri is present (starts with `data:image/`)
** this will return None by default
** if a kwarg of `allow_encoded_uri=True` is submitted, will return the encoded url (without a url prefix)
0.7.0
- merged https://github.com/jvanasco/metadata_parser/pull/9 from xethorn
- nested all commands to `log` under `__debug__` to avoid calls on production when PYTHONOPTIMIZE is set
0.6.18
- migrated version string into __init__.py
0.6.17
- added a new `DummyResponse` class to mimic popular attributes of a `requests.response` object when parsing from HTML files
0.6.16
- incorporated pull8 (https://github.com/jvanasco/metadata_parser/pull/8) which fixes issue5 (https://github.com/jvanasco/metadata_parser/issues/5) with comments
0.6.15
- fixed README which used old api in the example
0.6.14
- there was a typo and another bug that passed some tests on BeautifulSoup parsing. they have been fixed. todo- migrate tests to public repo
0.6.13
- trying to integrate a "safe read"
0.6.12
- now passing "stream=True" to requests.get. this will fetch the headers first, before looping through the response. we can avoid many issues with this approach
0.6.11
- now correctly validating urls with ports. had to restructure a lot of the url validation
0.6.10
- changed how some nodes are inspected. this should lead to fewer errors
0.6.9
- added a new method `get_metadata_link()`, which applies link transformations to a metadata in an attempt to ensure a valid link
0.6.8
- added a kwarg `requests_timeout` to proxy a timeout value to `requests.get()`
0.6.7
- added a lockdown to `is_parsed_valid_url` titled `http_only` -- requires http/https for the scheme
0.6.6
- protecting against bad doctypes, like nasa.gov
-- added `force_doctype` to __init__. defaults to False. this will change the doctype to get around to bs4/lxml issues
-- this is defaulted to False.
0.6.5
- keeping the parsed BS4 document; a user may wish to perform further operations on it.
-- `MetadataParser.soup` attribute holds BS4 document
0.6.4
- flake8 fixes. purely cosmetic.
0.6.3
- no changes. `sdist upload` was picking up a reference file that wasn't in github; that file killed the distribution install
0.6.2
- formatting fixes via flake8
0.6.1
- Lightweight, but functional, url validation
-- new 'init' argument (defaults to True) : `require_public_netloc`
-- this will ensure a url's hostname/netloc is either an IPV4 or "public DNS" name
-- if the url is entirely numeric, requires it to be IPV4
-- if the url is alphanumeric, requires a TLD + Domain ( exception is "localhost" )
-- this is NOT RFC compliant, but designed for "Real Life" use cases.
0.6.0
- Several fixes to improve support of canonical and absolute urls
-- replaced REGEX parsing of urls with `urlparse` parsing and inspection; too many edge cases got in
-- refactored `MediaParser.absolute_url` , now proxies a call to new function `url_to_absolute_url`
-- refactored `MediaParser.get_discrete_url` , now cleaner and leaner.
-- refactored how some tests run, so there is cleaner output
0.5.8
- trying to fix some issues with distribution
0.5.7
- trying to parse unparsable pages was creating an error
-- `MetadataParser.init` now accepts `only_parse_file_extensions` -- list of the only file extensions to parse
-- `MetadataParser.init` now accepts `force_parse_invalid_content_type` -- forces to parse invalid content
-- `MetadataParser.fetch_url` will only parse "text/html" content by default
0.5.6
- trying to ensure we return a valid url in get_discrete_url()
- adding in some proper unit tests; migrating from the private demo's slowly ( the private demo's hit a lot of internal files and public urls ; wouldn't be proper to make these public )
- setting `self.url_actual = url` on __init__. this will get overridden on a `fetch`, but allows for a fallback on html docs passed through
0.5.5
- Dropped BS3 support
- test Python3 support ( support added by Paul Bonser [ https://github.com/pib ] )
0.5.4
- Pull Request - https://github.com/jvanasco/metadata_parser/pull/1
Credit to Paul Bonser [ https://github.com/pib ]
0.5.3
- added a few `.strip()` calls to clean up metadata values
0.5.2
- fixed an issue on html title parsing. the old method incorrectly regexed on a BS4 tag, not tag contents, creating character encoding issues.
0.5.1
- missed the ssl_verify command
0.5.0
- migrated to the requests library
0.4.13
- trapping all errors in httplib and urrlib2 ; raising as an NotParsable and sticking the original error into the `raised` attribute.
this will allow for cleaner error handling
- we *really* need to move to requests.py
0.4.12
- created a workaround for sharethis hashbang urls, which urllib2 doesn't like
- we need to move to requests.py
0.4.11
- added more relaxed controls for parsing safe files
0.4.10
- fixed force_parse arg on init
- added support for more filetypes
0.4.9
- support for gzip documents that pad with extra data ( spec allows, python module doesn't )
- ensure proper document format
0.4.8
- added support for twitter's own og style markup
- cleaned up the beautifulsoup finds for og data
- moved 'try' from encapsulating 'for' blocks to encapsulating the inner loop. this will pull more data out if an error occurs.
0.4.7
- cleaned up some code
0.4.6
- realized that some servers return gzip content, despite not advertising that this client accepts that content ; fixed by using some ideas from mark pilgrim's feedparser. metadata_parser now advertises gzip and zlib, and processes it as needed
0.4.5
- fixed a bug that prevented toplevel directories from being parsed
0.4.4
- made redirect/masked/shortened links have better dereferenced url support
0.4.2
- Wrapped title tag traversal with an AttributeException try block
- Wrapped canonical tag lookup with a KeyError try block, defaulting to 'href' then 'content'
- Added support for `url_actual` and `url_info` , which persist the data from the urllib2.urlopen object's `geturl()` and `info()`
- `get_discrete_url` and `absolute_url` use the underlying url_actual data
- added support for passing data and headers into urllib2 requests
0.4.1
Initial Release