
Loading entry points can be much faster #1631

Open
ShabbyX opened this issue Jan 7, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@ShabbyX
Contributor

ShabbyX commented Jan 7, 2025

As @tycho demonstrates in this repo, loading entry points can be made much faster with this patch.

The basic principle is to hash the entry point names (offline) and the name being looked up, and then do a numerical lookup instead of strcmp compares. One could take this a step further: sort the pre-generated list of hashes and do a binary-search lookup.
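A minimal sketch of the idea, assuming an FNV-1a hash and a table sorted by hash (the names, sizes, and helpers here are illustrative, not the loader's or the patch's actual code; the real table would be generated offline from vk.xml):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct entry { uint64_t hash; int index; };

/* FNV-1a: one possible string hash; the actual patch may use a different one. */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    while (*s)
        h = (h ^ (unsigned char)*s++) * 1099511628211ULL;
    return h;
}

static int cmp_entry(const void *a, const void *b)
{
    uint64_t ha = ((const struct entry *)a)->hash;
    uint64_t hb = ((const struct entry *)b)->hash;
    return (ha > hb) - (ha < hb);
}

/* In the real loader this table would be pre-generated and pre-sorted;
 * here it is built and sorted at startup for brevity. */
static struct entry table[3];

static void build_table(void)
{
    const char *names[] = { "vkCreateDevice", "vkCreateInstance", "vkQueueSubmit" };
    for (int i = 0; i < 3; i++)
        table[i] = (struct entry){ hash_name(names[i]), i };
    qsort(table, 3, sizeof(table[0]), cmp_entry);
}

/* Numerical binary search instead of a chain of strcmp calls.
 * A real implementation would strcmp once on a hash hit to rule out
 * collisions. Returns -1 for unknown names. */
static int lookup(const char *name)
{
    struct entry key = { hash_name(name), 0 };
    struct entry *e = bsearch(&key, table, 3, sizeof(table[0]), cmp_entry);
    return e ? e->index : -1;
}
```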

@ShabbyX ShabbyX added the enhancement New feature or request label Jan 7, 2025
@spencer-lunarg
Contributor

cc @charles-lunarg

@charles-lunarg
Collaborator

What is the measured time difference?

And an even better solution is to create a hash table at compile time so that no iteration needs to occur at runtime (except for unknown functions since those aren't knowable at compile time).
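One way to sketch that suggestion: a fixed-size open-addressing table whose layout and contents would be emitted by the generator at build time, so a lookup probes at most a few slots instead of iterating a list. Everything below (sizes, names, the hash choice) is illustrative, not the loader's actual design:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HT_SIZE 8 /* power of two; a generator would size this to fit all known names */

struct ht_slot { uint64_t hash; int index; }; /* hash == 0 marks an empty slot */
static struct ht_slot ht[HT_SIZE];

static uint64_t hash_name(const char *s) /* FNV-1a, one possible choice */
{
    uint64_t h = 1469598103934665603ULL;
    while (*s)
        h = (h ^ (unsigned char)*s++) * 1099511628211ULL;
    return h;
}

/* Simulates the offline generation step: in the loader this table would be
 * emitted into a header at build time, not filled in at runtime. */
static void ht_build(void)
{
    const char *names[] = { "vkCreateInstance", "vkCreateDevice", "vkQueueSubmit" };
    for (int i = 0; i < 3; i++) {
        uint64_t h = hash_name(names[i]);
        uint64_t slot = h & (HT_SIZE - 1);
        while (ht[slot].hash != 0)              /* linear probing */
            slot = (slot + 1) & (HT_SIZE - 1);
        ht[slot] = (struct ht_slot){ h, i };
    }
}

/* Expected O(1): probe until a matching hash or an empty slot. */
static int ht_lookup(const char *name)
{
    uint64_t h = hash_name(name);
    uint64_t slot = h & (HT_SIZE - 1);
    while (ht[slot].hash != 0) {
        if (ht[slot].hash == h)
            return ht[slot].index; /* a real impl would strcmp once to confirm */
        slot = (slot + 1) & (HT_SIZE - 1);
    }
    return -1; /* unknown function: falls through to the dynamic path */
}
```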

@ShabbyX
Contributor Author

ShabbyX commented Jan 7, 2025

> What is the measured time difference?

Please see the link to the repo's README in the OP.

> And an even better solution is to create a hash table at compile time

Yes, indeed that is the idea: there would be a table of hashes of all known functions (derived from the XML) baked into the repo. The hash computed at runtime is of the function name being looked up. You'd still need an iteration, as in trampoline_get_proc_addr.

@tycho

tycho commented Jan 7, 2025

Yes, a compiled-in hash table would be ideal. This was just a POC to show that strcmp is guilty of massive overhead in libvulkan, and changing it benefits everyone greatly.

The time difference is in the page linked above, but I'll show here for completeness.

With stock libvulkan and using Volk as the API loader:

| Task | Iterations | Total Time (µs) | Average Time (µs) |
| --- | --- | --- | --- |
| Load instance functions | 200 | 14186 | 70.93 |
| Load device functions | 200 | 116361 | 581.805 |
| Teardown and full init | 20 | 13179 | 658.95 |

With libvulkan using my POC patch, and again using Volk:

| Task | Iterations | Total Time (µs) | Average Time (µs) |
| --- | --- | --- | --- |
| Load instance functions | 200 | 2275 | 11.375 |
| Load device functions | 200 | 13006 | 65.03 |
| Teardown and full init | 20 | 1530 | 76.5 |

The "average time" for each row is the average time to complete one call of e.g. volkLoadInstanceOnly, volkLoadDevice, etc. The last row is doing everything (volkInitialize, volkLoadInstanceOnly, volkLoadDevice, volkFinalize in sequence).
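For context, the per-row averages can be reproduced with a harness along these lines. The `task` function is a hypothetical stand-in for calls like volkLoadInstanceOnly; this is a sketch of the methodology described above, not tycho's actual benchmark code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in microseconds (POSIX clock_gettime). */
static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000ULL;
}

/* Hypothetical stand-in for the work under test, e.g. volkLoadDevice. */
static void task(void) { /* load entry points, etc. */ }

/* Average time per call in microseconds: total elapsed / iterations,
 * matching the "Average Time" column in the tables. */
static double average_us(void (*fn)(void), int iterations)
{
    uint64_t start = now_us();
    for (int i = 0; i < iterations; i++)
        fn();
    return (double)(now_us() - start) / (double)iterations;
}
```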

@tycho

tycho commented Jan 7, 2025

Also, my patch doesn't kill every one of the strcmp() chains, just the ones in the API loading path. I believe there are some loader extensions that also use chains of strcmp().

@charles-lunarg
Collaborator

Ahh, I shot from the hip when I sent this response. I quickly scanned through the readme originally and only saw the comparison between glad & volk. It took a minute to piece together which results were 'unpatched' vs 'patched'.

These findings confirm my suspicion that the time taken by strcmp is not ideal, though not a deal breaker either. A lot of init time is spent inside vkCreateInstance & vkCreateDevice, both in calling the create functions on all drivers and in setting up the internal function dispatch tables, an overhead that grows with each new function added to the tables.

The patch is a wonderful proof of concept for the viability of this idea. I had always wanted to implement it, but never found the time nor strong reason.

Side note from reading the readme:

> vkEnumerateInstanceExtensionProperties
> vkEnumerateDeviceExtensionProperties
>
> These APIs are unfortunately very expensive to call, because they end up loading and unloading ICDs each time they are called.

vkEnumerateInstanceExtensionProperties is expensive for this reason. vkEnumerateDeviceExtensionProperties, by contrast, occurs after all drivers & layers that are to be loaded have been loaded. The loader has already sped up vkEnumerateInstanceExtensionProperties & vkCreateInstance by caching loaded drivers, which reduces the dlopen/dlclose overhead to a single occurrence. But this only applies to drivers, not layers, and because the API was designed to not have global state, the loader doesn't cache the current state of the filesystem between global API calls.

@charles-lunarg
Collaborator

> Also, my patch doesn't kill every one of the strcmp() chains, just the ones in the API loading path. I believe there are some loader extensions that also use chains of strcmp().

Makes sense, especially considering that the first hundred or so strcmp calls run EVERY single time, while the unknown-function support and more dynamic logic sits near the end and is rarely reached.
