Performance gains from even more small hash tests? #230
Comments
They are also configurable.
Can you give me an example of a situation where files of the same size match at the beginning and the end, but not in the middle? Even matching the ends turns out to filter out very few files. I believe such situations would be extremely rare, and that wouldn't justify the added complexity and cost.
@clunkclunk the algorithm you describe is implemented by https://github.com/kornelski/dupe-krill
For reference, here is how I typically speed up dedupes: I set a large prefix size and a large suffix size, and skip the content hash altogether. I also skip small files, where a false match is more likely and there's less benefit from this approach. I've processed many thousands of media files this way and seen zero false matches on cursory manual inspection. For what it's worth, I do see value in random / "statistical" checks, though as mentioned it would add a lot of complexity. What I like about that idea is that it gives you a quantifiable statistical confidence instead of just the beginning and the end: you set a threshold you're comfortable with that is mostly independent of the kind of file, rather than relying on the middle of the file not having been modified.
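In code form, the idea is roughly this (a minimal, std-only Rust sketch; the 16 MiB block size, the 1 MiB small-file cutoff, and `DefaultHasher` are illustrative stand-ins, not fclones' actual options or defaults):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{Read, Seek, SeekFrom};
use std::path::PathBuf;

const BLOCK: u64 = 16 * 1024 * 1024; // "large" prefix/suffix size (illustrative)
const MIN_SIZE: u64 = 1024 * 1024;   // skip files smaller than this (illustrative)

// Hash `len` bytes starting at `offset`.
fn block_hash(file: &mut File, offset: u64, len: u64) -> std::io::Result<u64> {
    let mut buf = vec![0u8; len as usize];
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(&mut buf)?;
    let mut h = DefaultHasher::new();
    h.write(&buf);
    Ok(h.finish())
}

// Group files by (size, prefix hash, suffix hash) and stop there:
// no full-content hash is ever computed.
fn group_without_content_hash(
    paths: &[PathBuf],
) -> std::io::Result<HashMap<(u64, u64, u64), Vec<PathBuf>>> {
    let mut groups: HashMap<(u64, u64, u64), Vec<PathBuf>> = HashMap::new();
    for path in paths {
        let size = std::fs::metadata(path)?.len();
        if size < MIN_SIZE {
            continue; // small files: little to gain, so leave them out entirely
        }
        let mut file = File::open(path)?;
        let block = BLOCK.min(size);
        let prefix = block_hash(&mut file, 0, block)?;
        let suffix = block_hash(&mut file, size - block, block)?;
        groups.entry((size, prefix, suffix)).or_default().push(path.clone());
    }
    Ok(groups)
}
```

Any group with more than one entry is then treated as duplicates without a content hash, which is exactly where the small false-match risk comes from.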
I can't say which files they are (I can't seem to find a verbose option), but when scanning a very large directory I see a number of files pass steps 4 and 5 and then get eliminated by the full content hash. These should all be zip and rar files. I should open a ticket asking for a verbose option that shows what's unmatched between steps 4 and 5, but I definitely have files with the same size, prefix, and suffix that don't have the same content - or maybe I just need to update to the latest version. Another option would be random chunk matching over a configurable percentage of the file size, using a seed based on the file size to pick which chunks.
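Roughly what I mean, as a Rust sketch (the splitmix64 mixer, the 4 KiB chunk size, and the percentage knob are illustrative assumptions, not anything fclones currently does):

```rust
const CHUNK: u64 = 4096; // assumed sampling granularity

// Tiny deterministic PRNG (splitmix64), inlined so no external crates are needed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Pick chunk offsets covering roughly `percent` of the file,
/// seeded by the file size so equal-size files sample identical positions.
fn sample_offsets(file_size: u64, percent: u64) -> Vec<u64> {
    let chunks = (file_size / CHUNK).max(1);
    let wanted = ((chunks * percent) / 100).max(1);
    let mut state = file_size; // the seed: same size => same offsets
    let mut offsets: Vec<u64> = (0..wanted)
        .map(|_| (splitmix64(&mut state) % chunks) * CHUNK)
        .collect();
    offsets.sort_unstable();
    offsets.dedup();
    offsets
}
```

Because the seed is the file size, two same-size files always get sampled at identical offsets, so their chunk hashes are directly comparable; e.g. `sample_offsets(size, 5)` would hash roughly 5% of each candidate.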
I really dig fclones - thank you for writing it!
This is purely a thought and question right now. I've not actually tested it on any real data, but it came to me when I was looking for some duplicates in a data set that contained a lot of relatively large files (10 to 80 GB - usually video files or backup images).
The algorithm's steps 4 and 5 calculate a hash of a tiny block of data at the start and the end of the file respectively, and prune the candidate list based on those hashes. Given the description of 'tiny', I assume those reads are very short and the hash calculation very fast. The remaining files then get a hash of their entire contents, which for large files can take some time; I assume it's fairly I/O-bound on most systems, especially on spinning disks.
Why not add more 'tiny block hashes' at various points in the file, beyond the first and last blocks, to rule out non-matches before the time-consuming whole-file hash? We're already hashing the start and the end of the file; how much overhead would it add to also hash a block at every 10% increment, compare those nine hashes (10%, 20%, and so on up to 90%) to filter the list further, and only then proceed with the full-file hash where necessary?
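Something along these lines (a self-contained Rust sketch; the 4 KiB block size and the standard-library hasher are just stand-ins for whatever fclones actually uses):

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

const TINY: u64 = 4096; // assumed "tiny block" size; purely illustrative

// Hash a tiny block at 0%, 10%, ..., 90% of the file, plus one at the very end
// (the existing suffix check).
fn multipoint_hashes(path: &Path) -> std::io::Result<Vec<u64>> {
    let size = std::fs::metadata(path)?.len();
    let mut file = File::open(path)?;
    let mut offsets: Vec<u64> = (0..10u64).map(|i| size * i / 10).collect();
    offsets.push(size.saturating_sub(TINY)); // end-of-file block
    let mut hashes = Vec::with_capacity(offsets.len());
    for off in offsets {
        file.seek(SeekFrom::Start(off))?;
        let mut buf = vec![0u8; TINY.min(size - off).max(1) as usize];
        let n = file.read(&mut buf)?;
        let mut h = DefaultHasher::new();
        h.write(&buf[..n]);
        hashes.push(h.finish());
    }
    Ok(hashes)
}
```

Two same-size files whose eleven hashes differ could be pruned at this point; only the survivors would need the expensive full read.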
(side note - how big are the 'tiny blocks' referenced in the algorithm?)
Also, in some situations, matching hashes at 11 points (start, end, and nine points at 10% intervals) may give enough confidence to optionally skip the whole-file hash check entirely. I know a portion of my duplicate files are non-critical and are mostly duplicates from simple user mistakes (copying things to multiple places), and those 11 checks would be enough confidence for me to call it a match.