
Loading entry points can be much faster #1631

Open
ShabbyX opened this issue Jan 7, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@ShabbyX
Contributor

ShabbyX commented Jan 7, 2025

As @tycho demonstrates in this repo, loading entry points can be made much faster with this patch.

The basic principle is to hash the entry point names (offline) and the name being looked up, and then do a numerical lookup instead of strcmp compares. One could take this a step further: sort the pre-generated list of hashes and do a binary-search lookup.
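A minimal sketch of the idea, assuming an FNV-1a hash and a table sorted by hash (the names, sizes, and helpers here are illustrative, not the loader's or the patch's actual code; the real table would be generated offline from vk.xml):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct entry { uint64_t hash; int index; };

/* FNV-1a: one possible string hash; the actual patch may use a different one. */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    while (*s)
        h = (h ^ (unsigned char)*s++) * 1099511628211ULL;
    return h;
}

static int cmp_entry(const void *a, const void *b)
{
    uint64_t ha = ((const struct entry *)a)->hash;
    uint64_t hb = ((const struct entry *)b)->hash;
    return (ha > hb) - (ha < hb);
}

/* In the real loader this table would be pre-generated and pre-sorted;
 * here it is built and sorted at startup for brevity. */
static struct entry table[3];

static void build_table(void)
{
    const char *names[] = { "vkCreateDevice", "vkCreateInstance", "vkQueueSubmit" };
    for (int i = 0; i < 3; i++)
        table[i] = (struct entry){ hash_name(names[i]), i };
    qsort(table, 3, sizeof(table[0]), cmp_entry);
}

/* Numerical binary search instead of a chain of strcmp calls.
 * A real implementation would strcmp once on a hash hit to rule out
 * collisions. Returns -1 for unknown names. */
static int lookup(const char *name)
{
    struct entry key = { hash_name(name), 0 };
    struct entry *e = bsearch(&key, table, 3, sizeof(table[0]), cmp_entry);
    return e ? e->index : -1;
}
```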

@ShabbyX ShabbyX added the enhancement New feature or request label Jan 7, 2025
@spencer-lunarg
Contributor

cc @charles-lunarg

@charles-lunarg
Collaborator

What is the measured time difference?

And an even better solution is to create a hash table at compile time so that no iteration needs to occur at runtime (except for unknown functions since those aren't knowable at compile time).
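One way to sketch that suggestion: a fixed-size open-addressing table whose layout and contents would be emitted by the generator at build time, so a lookup probes at most a few slots instead of iterating a list. Everything below (sizes, names, the hash choice) is illustrative, not the loader's actual design:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HT_SIZE 8 /* power of two; a generator would size this to fit all known names */

struct ht_slot { uint64_t hash; int index; }; /* hash == 0 marks an empty slot */
static struct ht_slot ht[HT_SIZE];

static uint64_t hash_name(const char *s) /* FNV-1a, one possible choice */
{
    uint64_t h = 1469598103934665603ULL;
    while (*s)
        h = (h ^ (unsigned char)*s++) * 1099511628211ULL;
    return h;
}

/* Simulates the offline generation step: in the loader this table would be
 * emitted into a header at build time, not filled in at runtime. */
static void ht_build(void)
{
    const char *names[] = { "vkCreateInstance", "vkCreateDevice", "vkQueueSubmit" };
    for (int i = 0; i < 3; i++) {
        uint64_t h = hash_name(names[i]);
        uint64_t slot = h & (HT_SIZE - 1);
        while (ht[slot].hash != 0)              /* linear probing */
            slot = (slot + 1) & (HT_SIZE - 1);
        ht[slot] = (struct ht_slot){ h, i };
    }
}

/* Expected O(1): probe until a matching hash or an empty slot. */
static int ht_lookup(const char *name)
{
    uint64_t h = hash_name(name);
    uint64_t slot = h & (HT_SIZE - 1);
    while (ht[slot].hash != 0) {
        if (ht[slot].hash == h)
            return ht[slot].index; /* a real impl would strcmp once to confirm */
        slot = (slot + 1) & (HT_SIZE - 1);
    }
    return -1; /* unknown function: falls through to the dynamic path */
}
```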

@ShabbyX
Contributor Author

ShabbyX commented Jan 7, 2025

> What is the measured time difference?

Please see the link to the repo's README in the OP.

> And an even better solution is to create a hash table at compile time

Yes, indeed that is the idea: there would be a table of hashes of all known functions (derived from the XML) baked into the repo. The hash computed at runtime is of the function name being looked up. You'd still need an iteration, as in trampoline_get_proc_addr.

@tycho

tycho commented Jan 7, 2025

Yes, a compiled-in hash table would be ideal. This was just a POC to show that strcmp is guilty of massive overhead in libvulkan, and changing it benefits everyone greatly.

The time difference is in the page linked above, but I'll show here for completeness.

With stock libvulkan and using Volk as the API loader:

| Task | Iterations | Total Time (µs) | Average Time (µs) |
| --- | --- | --- | --- |
| Load instance functions | 200 | 14186 | 70.93 |
| Load device functions | 200 | 116361 | 581.805 |
| Teardown and full init | 20 | 13179 | 658.95 |

With libvulkan using my POC patch, and again using Volk:

| Task | Iterations | Total Time (µs) | Average Time (µs) |
| --- | --- | --- | --- |
| Load instance functions | 200 | 2275 | 11.375 |
| Load device functions | 200 | 13006 | 65.03 |
| Teardown and full init | 20 | 1530 | 76.5 |

The "average time" for each row is the average time to complete one call of e.g. volkLoadInstanceOnly, volkLoadDevice, etc. The last row is doing everything (volkInitialize, volkLoadInstanceOnly, volkLoadDevice, volkFinalize in sequence).
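For context, the per-row averages can be reproduced with a harness along these lines. The `task` function is a hypothetical stand-in for calls like volkLoadInstanceOnly; this is a sketch of the methodology described above, not tycho's actual benchmark code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in microseconds (POSIX clock_gettime). */
static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000ULL;
}

/* Hypothetical stand-in for the work under test, e.g. volkLoadDevice. */
static void task(void) { /* load entry points, etc. */ }

/* Average time per call in microseconds: total elapsed / iterations,
 * matching the "Average Time" column in the tables. */
static double average_us(void (*fn)(void), int iterations)
{
    uint64_t start = now_us();
    for (int i = 0; i < iterations; i++)
        fn();
    return (double)(now_us() - start) / (double)iterations;
}
```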

@tycho

tycho commented Jan 7, 2025

Also, my patch doesn't kill every one of the strcmp() chains, just the ones in the API loading path. I believe there are some loader extensions that also use chains of strcmp().

@charles-lunarg
Collaborator

Ahh, I shot from the hip when I sent this response. I quickly scanned through the readme originally and only saw the comparison between glad & volk. It took a minute to piece together which results were 'unpatched' vs 'patched'.

These findings confirm my suspicion that the time taken by strcmp is not ideal, though not a deal breaker either. A lot of init time is spent inside vkCreateInstance & vkCreateDevice, both in calling the create functions on all drivers and in setting up the internal function dispatch tables, an overhead that grows with each new function added to the tables.

The patch is a wonderful proof of concept for the viability of this idea. I had always wanted to implement it, but never found the time nor strong reason.

Side note from reading the readme:

> vkEnumerateInstanceExtensionProperties
> vkEnumerateDeviceExtensionProperties
>
> These APIs are unfortunately very expensive to call, because they end up loading and unloading ICDs each time they are called.

vkEnumerateInstanceExtensionProperties is expensive for this reason. vkEnumerateDeviceExtensionProperties, by contrast, occurs after all drivers & layers that are to be loaded have been loaded. The loader has already sped up vkEnumerateInstanceExtensionProperties & vkCreateInstance by caching loaded drivers, which reduces the dlopen/dlclose overhead to a single occurrence. But this only applies to drivers, not layers, and because the API was designed to not have global state, the loader doesn't cache the current state of the filesystem between global API calls.

@charles-lunarg
Collaborator

> Also, my patch doesn't kill every one of the strcmp() chains, just the ones in the API loading path. I believe there are some loader extensions that also use chains of strcmp().

Makes sense, especially considering that the first hundred or so strcmp calls run EVERY single time, while the unknown-function support and more dynamic logic sits near the end and is rarely reached.
