I've got a requirement to avoid duplicate downloads when crawling for specific file types. This can be achieved simply by comparing checksums; since it's not security-relevant, MD5 hashes should do.
Since in crawler4j each file is represented by a `Page` whose stream is consumed in `Page.toByteArray` anyway, it would be a performance advantage to compute the hash there on the fly by decorating the `InputStream` with a `DigestInputStream`, instead of touching the byte array again afterwards.
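For illustration, here is a minimal, self-contained sketch of the decorating idea. The `Md5OnTheFly` class and its `toByteArray` helper are hypothetical stand-ins, not crawler4j's actual `Page.toByteArray`; only the `DigestInputStream`/`MessageDigest` usage is the point:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Md5OnTheFly {

    /** Consumes the stream into a byte array while feeding the MD5 digest. */
    static byte[] toByteArray(InputStream in, MessageDigest md5) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Decorate the stream: every byte read below also updates the digest,
        // so no second pass over the byte array is needed.
        try (DigestInputStream digestIn = new DigestInputStream(in, md5)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = digestIn.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        // Stand-in for the page's HTTP content stream.
        InputStream content =
                new ByteArrayInputStream("hello crawler".getBytes(StandardCharsets.UTF_8));

        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] body = toByteArray(content, md5);

        // Hex-encode the digest; comparing these strings across downloads
        // is enough to detect duplicates.
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(body.length + " bytes, md5=" + hex);
    }
}
```

The cost is a single `MessageDigest.update` per chunk during a copy that happens anyway, which is why doing it inside the consuming read loop is cheaper than re-hashing the finished byte array.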
@yasserg What do you think? Do you want me to provide a PR? Would you prefer to make this feature an opt-in?
cnsgithub added a commit to cnsgithub/crawler4j that referenced this issue on Mar 29, 2019.