I've got a requirement to avoid duplicate downloads when crawling for specific file types. This can be achieved simply by comparing checksums; since it's not security-relevant, MD5 hashes should do.
Since in crawler4j each file is represented by a `Page` whose stream is consumed in `Page.toByteArray` anyway, it would be a performance advantage to compute the hash there on the fly by decorating the `InputStream` with a `DigestInputStream`, instead of touching the byte array again afterwards.
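For illustration, here is a minimal, self-contained sketch of the decorating idea. The `Md5OnTheFly` class and its `toByteArray` helper are hypothetical stand-ins, not crawler4j's actual `Page.toByteArray`; only the `DigestInputStream`/`MessageDigest` usage is the point:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Md5OnTheFly {

    /** Consumes the stream into a byte array while feeding the MD5 digest. */
    static byte[] toByteArray(InputStream in, MessageDigest md5) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Decorate the stream: every byte read below also updates the digest,
        // so no second pass over the byte array is needed.
        try (DigestInputStream digestIn = new DigestInputStream(in, md5)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = digestIn.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        // Stand-in for the page's HTTP content stream.
        InputStream content =
                new ByteArrayInputStream("hello crawler".getBytes(StandardCharsets.UTF_8));

        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] body = toByteArray(content, md5);

        // Hex-encode the digest; comparing these strings across downloads
        // is enough to detect duplicates.
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(body.length + " bytes, md5=" + hex);
    }
}
```

The cost is a single `MessageDigest.update` per chunk during a copy that happens anyway, which is why doing it inside the consuming read loop is cheaper than re-hashing the finished byte array.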
@yasserg What do you think? Do you want me to provide a PR? Would you prefer to make this feature an opt-in?
cnsgithub added a commit to cnsgithub/crawler4j that referenced this issue on Mar 29, 2019.