[Feature] On-the-fly checksum calculation #399

Open
cnsgithub opened this issue Mar 29, 2019 · 1 comment

Comments

@cnsgithub

I have a requirement to avoid duplicate downloads when searching for special file types. This can be achieved by simply comparing checksums. Since this is not security-relevant, MD5 hashes should suffice.

Since crawler4j represents each file as a Page, and its content is consumed anyway in Page.toByteArray, it would be a performance advantage to calculate the checksum there on the fly by decorating the InputStream with a DigestInputStream, instead of making a second pass over the byte array afterwards.
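To illustrate the idea, here is a minimal standalone sketch of reading a stream into a byte array while the digest is updated as a side effect of the same read. It is not crawler4j code: the class `OnTheFlyChecksum`, the holder `Content`, and the method `readWithMd5` are hypothetical names chosen for this example, and how this would be wired into Page.toByteArray is left open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OnTheFlyChecksum {

    /** Hypothetical holder for the content and its checksum; not part of crawler4j. */
    static final class Content {
        final byte[] bytes;
        final String md5Hex;

        Content(byte[] bytes, String md5Hex) {
            this.bytes = bytes;
            this.md5Hex = md5Hex;
        }
    }

    /** Reads the whole stream once; the MD5 digest is updated by the same reads. */
    static Content readWithMd5(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Every byte read through the decorator is also fed into the digest.
        try (DigestInputStream digestIn = new DigestInputStream(in, md5)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = digestIn.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            // digest() finalizes the hash; no second pass over the byte array is needed.
            return new Content(out.toByteArray(), toHex(md5.digest()));
        }
    }

    /** Converts the raw digest bytes to a lowercase hex string. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        Content content = readWithMd5(new ByteArrayInputStream(data));
        System.out.println(content.bytes.length + " bytes, md5=" + content.md5Hex);
    }
}
```

Duplicate detection could then keep the hex strings of already seen files in a set and skip any file whose checksum is already present.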

@yasserg What do you think? Do you want me to provide a PR? Would you prefer to make this feature an opt-in?

cnsgithub added a commit to cnsgithub/crawler4j that referenced this issue Mar 29, 2019
@cnsgithub
Author

PR: #400
