forked from yasserg/crawler4j
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCHANGES.txt
110 lines (97 loc) · 5.78 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
crawler4j Change Log
Version 4.1 (2015)
Enhancements
- Issue 281: Upgrade try-catch to java7 "try with resources"
- Issue 214: Always log if exception happens + Some code refactoring
- Issue 326: PageFetcher is unreadable
- Issue 325: Remove Non official http status codes
- Issue 238: fetchHeader Does a HTTP GET
- Issue 149: Proper compression support in the PageFetcher
- Issue 323: Upgrade CustomFetchStatus
- Issue 322: WebCrawler should throw exceptions instead of returning at the middle of the method
- Issue 321: Change logging from System.out.print to SLF4J
- Issue 301: Upgrade the Pattern constant on the crawler examples
BugFixes
- Issue 329: Fixed relative path to the tld file (even when packed in the jar file)
- Issue 324: NullPointerException when crawling links with no HREF
- Issue 175: Missing IF-Statement causes crawler to throw a NullPointerException while syncing
Version 4.0 (December 2014)
Enhancements
- Issue 272: Parse Binary Content parsing, Now the crawler can really parse binary content. Based on Tero Jankilla's code
- Issue 311: Add functionality to retrieve links from binary and text only files
- Issue 88 & Issue 171: Crawling Password Protected Sites
- Issue 213: Changed from log4j implementation to slf4j API - Removed any reference to log4j - Logging Implementations should be the user's choice and not the library choice. Instead I have inserted the slf4j API.
- Issue 213: Upgrade all logging statements to use {} of slf4j - Slf4j optimizes string concatenation in logging statements by using {} placeholder. Upgrade all logging statements to use it.
- Issue 312: Crawler should follow links in plain text files
- Issue 278: Add hooks in the webcrawler for better error handling
- Issue 239: Add an option to tweak the URL before processing the page Added hooks to several errors where by default we log the error but allow the user to override those methods in order to do anything they want with those URLs
- Issue 276: Don't let a crawled URL to be dropped without proper logging. Each URL should be logged somehow so the user will know it was crawled. We shouldn't just drop URLs without logging them
- Updated pom.xml project Dependencies were updated to their latest
- Added log messages to PageFetcher These logs were taken from: ind9/crawler4j 153eb752bef5a57bd39016807423683ab22f3913
- Issue 282: Add CHANGES.TXT with the changelog to the root
- Issue 288: Upgrade Unit Tests to v4
- Issue 289: Parsing a binary content shouldn't throw a general parsing error
- Issue 236: Please default includeHttpsPages to true
- Issue 273: Tabbing looks messed up in several places
- Issue 290: We should support all redirect status codes
- Issue 291: HtmlParseData should hold a unique list of URLs
- Issue 133: Get the content type and prevent crawling for example feeds
- Issue 160: Add more context in shouldVisit
- Issue 205: Remove eclipse generated files from the repository
- Issues 293 & 294: Grab the TLD list from the online URL Save the TLD list as a compressed file backup / fallback
- Issue 295: Add meta tags into the parsed html object
- Issue 297: Add tag name to WebUrl
- Issue 225: Meta refresh does not work correctly ?
- Issue 298: Fatal Transport Error when crawling robots.txt
- Issue 303: Create a log configuration file default
- Issue 174: crawler4j should support crawling all https page
- Issue 302: Update deprecated methods/classes in PageFetcher
- Issue 304: Crawl Site Maps
- Issue 216: crawl JSON content instead of HTML
BugFixes
- Issue 284: Catching any exception and hiding the log. Upgraded logging in webcrawler Clarified a little statuscode 1005
- Issue 251: Fix a typo, Fix a typo in line #92. 'cuncurrent' -> 'concurrent'
- Issue 279: TikaException is thrown while crawling several PDFs in a row The problem was the wrong re-use of the metadata
- Issue 139: A URL with a backslash was rejected
- Issue 231: Memory leakage in crawler4j caused by database environment, Closing the environment solves the leak.
- Issue 206: StringIndexOutOfBoundsException in WebURL
- Issue 285: WebURL.java causes IndexOutOfBoundException
- Issue 131: Internal error in WebURL
- Issue 296: Threads not being killed in graceful shutdown
- Issue 299: NullPointerException when trying to crawl different URLs
Version 3.5 (March 2013)
- Fixed issues #184, #169, #136, #156, #191, #192, #127
- Fixed issue #101 in normalizing Urls
- Fixed issue #143 : Anchor text is now provided correctly
- Access to response headers is now provided as a field in Page object
- Added support for handling content fetch and parsing issues.
- Added meta refresh and location to outgoing links
- Upgraded to Java 7
- Updated dependencies to their latest versions
Version 3.3
- Updated URL canonicalizer based on RFC1808 and fixed bugs
- Added Parent Url and Path to WebURL
- Added support for domain and sub-domain detection.
- Added support for handling http status codes for fetched pages
Version 3.1
- Fixed thread-safety bug in page fetcher.
- Improved URL canonicalizer.
Version 3.0
- Major re-design of crawler4j to add many new features and make it extendable.
- Multiple distinct crawlers can now run concurrently.
- Maven support
- Automatic detection of character encoding of pages to support crawling of non UTF-8 pages.
- Implemented graceful shutdown of the crawlers.
- Crawlers are now configurable through a config objects and the static crawler4j.properties file is removed.
- Fixed several bugs in handling redirects, resuming crawling, ...
Version 2.6.1
- Added Javadocs
- Fixed bug in parsing robots.txt files
Version 2.6
- Added support for robots.txt files
- Added the feature to specify maximum number of pages to crawl
Version 2.5
- Added support for specifying maximum crawl depth
- Fixed bug in terminating all threads
- Fixed bug in handling redirected pages
- Fixed bug in handling pages with base url