-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent persistent storage directory collisions for similiar repository URLs #72
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a question
// names easily distinguishable. | ||
$hashedUrl = hash('sha256', $url); | ||
|
||
return $escapedUrl . '-' . substr($hashedUrl, 0, 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of only using the first 8 chars? Presumably we are using sha256
because we want a long hash that is very unlikely to conflict? doesn't using only part of it defeat that purpose. Why use 1 of the hash_algos()
that just produces a shorter string. I don't think we worried about attacks where people try take advantage of the possible matches, correect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that we want just enough of the hash to virtually guarantee uniqueness, but to keep the final generated directory name understandable by humans. Maybe crc32b
would suffice? (I'd certainly prefer to remove the string manipulation if possible.)
I'll wait for @davidstrauss to make the final call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I'll regret suggesting we can truncate the SHA-256 output substantially. It's the kind of decision that I'd want to do deliberately if we do it at all, and I'm not certain how to compute the bytes of collision prevention we need here. On the other hand, I know it's safe to use the mangled name + SHA2, as the SHA2 alone would be defensibly unique.
If we're worried about an unwieldy file name, there are other serializations of SHA-256 output that have greater density than hex. So, maybe get the hash in binary and base64_encode()
it? The output should be about 34-40 characters.
Finally, I'd consider just rawurlencode()
and no hash at all. That escaping is RFC-compliant to the point where you shouldn't see the implementation's output drift. Percent encoding isn't particularly obnoxious to use for files/shell purposes, as the percent character isn't as special in that context as other escaping/quoting options.
You're welcome to run forward with any of these options (full hex SHA256, base64 SHA256, or robust escaping) without waiting on further input from me. :-)
Addresses #24 by implementing this approach: