Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance gains from even more small hash tests? #230

Open
clunkclunk opened this issue Sep 17, 2023 · 5 comments
Open

Performance gains from even more small hash tests? #230

clunkclunk opened this issue Sep 17, 2023 · 5 comments
Labels
question Further information is requested

Comments

@clunkclunk
Copy link

I really dig fclones - thank you for writing it!

This is purely a thought and question right now. I've not actually tested it on any real data, but it came to me when I was looking for some duplicates in a data set that contained a lot of relatively large files (10 to 80 GB - usually video files or backup images).

The algorithm's steps 4 and 5 are to calculate a hash of a tiny block of data at the start and the end of the file respectively, pruning the results based on those hashes. Based on the description of 'tiny', I assume the file reads on these are very short and the calculation very fast. Then the remaining files get a hash of the entire file contents. For large files, this can take some time. I assume it's fairly I/O bound on most systems, especially spinning disks.

Why not have some more 'tiny block hashes' at various points in the file aside from the first and last tiny blocks to determine uniqueness before doing the time consuming entire file hash? We're already calculating at the start and the end of the file, what kind of overhead would it be to add a check at every 10% increment in the file, then comparing the nine hashes (10%, 20%, etc all the way up to 90%), to be able to filter the list further, then proceeding with the full file hash if necessary?

(side note - how big are the 'tiny blocks' referenced in the algorithm?)

Also for some situations, matching hashes at 11 points (start, end and nine points at 10% intervals) on the file may be enough confidence to optionally skip the whole file hash check. I know a portion of my duplicate files are non-critical files and mostly are duplicates based on simple user mistakes copying things in multiple places, and 10 checks would be enough confidence to call it a match.

@kapitainsky
Copy link
Contributor

kapitainsky commented Sep 17, 2023

(side note - how big are the 'tiny blocks' referenced in the algorithm?)

They are also configurable:

      --max-prefix-size <BYTES>
          Maximum prefix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs

      --max-suffix-size <BYTES>
          Maximum suffix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs

@pkolaczk pkolaczk added the question Further information is requested label Oct 18, 2023
@pkolaczk pkolaczk reopened this Oct 18, 2023
@pkolaczk
Copy link
Owner

Why not have some more 'tiny block hashes' at various points in the file aside from the first and last tiny blocks to determine uniqueness before doing the time consuming entire file hash?

Can you give me an example of a situation where files of the same size would match with the beginning and end, but not the middle? Even matching the ends turns out to filter out very few files. I believe such situations would be extremely rare and that wouldn't justify the added complexity and cost.

@gcflymoto
Copy link

@clunkclunk the algorithm you describe is implemented by https://github.com/kornelski/dupe-krill

@johnpyp
Copy link
Contributor

johnpyp commented Oct 27, 2023

For reference on a way to speed up dedupes, I typically use this is like so:

fclones group --cache /my/media/folder --min 500MiB --max-prefix-size 128MiB --max-suffix-size 64MiB --skip-content-hash 

I set a large prefix size, suffix size, and skip the content hash alltogether. As well as skip small files where there's a higher false negative likelihood and less benefit to this approach. I've processed multiple thousands of media files and seen zero false negatives with cursory manual investigation.


For what it's worth, I do see value in random / "statistical" checks, though like mentioned it would add a lot of complexity. What I like about it is that it gives you a "reliable" statistical confidence instead of just the beginning and end. You get to set a threshold of how comfortable you are that is mostly independent of the kind of file, rather than relying on the fact that the middle chunks haven't been tampered with.

@bmfrosty
Copy link

I can't say what files they are (I can't seem to find a verbose option), but I see this when scanning a very large directory:

[2023-10-27 13:54:46.198] fclones:  info: Found 2303 (276.5 GB) candidates after grouping by prefix
[2023-10-27 13:54:53.994] fclones:  info: Found 2298 (274.9 GB) candidates after grouping by suffix
[2023-10-27 14:50:36.020] fclones:  info: Found 2186 (271.0 GB) redundant files

These should all be zip and rar files.

I should open up a ticket for a verbose option for what's unmatched between steps 4 and 5, but I definitely have files with the same size, prefix, and suffix that don't have the same content - or maybe just update to the latest version.

Another option would be to do random chunk matching of a configurable percentage of the file size using a seed based on the file size to pick which chunks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

6 participants