Force All Computations to Run on GPU during Partial Offloading #11442
TheAiSandbox
started this conversation in Ideas
Replies: 1 comment
-
This will also significantly speed up token generation when …
-
I propose adding a command-line option to force all computations onto the GPU during partial offloading, using CPU RAM as a weight buffer: layers that do not fit in VRAM would be streamed to the GPU for computation instead of being evaluated on the CPU. This would allow us to maintain high performance and efficiency even when dealing with models that are too large for the available VRAM.
Motivation
High Bandwidth PCIe 5.0: The upcoming Nvidia and AMD consumer graphics cards support PCIe 5.0 x16, offering up to 64 GB/s of bandwidth. The increased bandwidth allows for faster data transfer between CPU RAM and the GPU, which can mitigate much of the overhead of streaming model layers across the PCIe bus during token generation. Additional PCIe slots (for multi-GPU setups) further increase the aggregate bandwidth, and systems on older PCIe generations would still benefit, just with proportionally longer transfer times.
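As a back-of-the-envelope illustration (the model size, quantization, and offload split below are illustrative assumptions, not figures from this discussion; only the 64 GB/s PCIe 5.0 x16 figure comes from the proposal):

```python
# Rough per-token cost of streaming CPU-resident weights to the GPU.
# All model figures below are illustrative assumptions.
PCIE5_X16_GBPS = 64.0   # GB/s, PCIe 5.0 x16 (from the proposal)
PCIE4_X16_GBPS = 32.0   # GB/s, PCIe 4.0 x16, for comparison

def stream_time_s(offloaded_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to copy the CPU-resident portion of the weights to the GPU
    once per generated token (ignoring overlap with compute)."""
    return offloaded_gb / bandwidth_gbps

# Example: a ~40 GB quantized model with 30 GB held in system RAM
# after partial offload.
offloaded = 30.0
print(f"PCIe 5.0 x16: {stream_time_s(offloaded, PCIE5_X16_GBPS):.2f} s/token")
print(f"PCIe 4.0 x16: {stream_time_s(offloaded, PCIE4_X16_GBPS):.2f} s/token")
```

This is a lower bound on per-token transfer time if every offloaded weight must cross the bus each token; overlapping transfers with compute, or batching (see below), reduces the effective cost.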
Speculative Decoding: Speculative decoding can benefit significantly from the additional compute provided by the GPU, because the larger target model verifies the draft tokens in a single batched forward pass; batched evaluation is compute-bound, and therefore far more effective on a GPU than on a CPU.
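Batching also interacts favorably with the weight streaming this proposal requires: one batched verification pass fetches the offloaded weights once for the whole draft batch. A minimal sketch, with illustrative figures (only the 64 GB/s bandwidth is from the proposal, and full draft acceptance is assumed for simplicity):

```python
# Speculative decoding amortizes weight streaming: verifying k draft tokens
# in one batched forward pass fetches the offloaded weights once, not k times.
OFFLOADED_GB = 30.0   # weights resident in CPU RAM (assumed)
PCIE5_GBPS = 64.0     # PCIe 5.0 x16 bandwidth (from the proposal)

def transfer_per_token_s(draft_batch: int) -> float:
    """PCIe transfer time amortized over a batch of verified draft tokens,
    optimistically assuming all drafts are accepted."""
    return (OFFLOADED_GB / PCIE5_GBPS) / draft_batch

for k in (1, 4, 8):
    print(f"batch={k}: {transfer_per_token_s(k):.3f} s of transfer per token")
```

In practice the acceptance rate of draft tokens lowers the amortization factor, but the trend holds: larger verified batches spread the fixed streaming cost over more generated tokens.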
Mixture-of-Experts Models + Multi-Token Prediction: The growing popularity of mixture-of-experts models and multi-token prediction methods, thanks to the new releases from DeepSeek, suggests a potential for much higher throughput even with very large parameter counts.
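MoE models are a particularly good fit for streaming, since only the active experts' weights need to cross the bus each token. For scale (the total/active parameter counts are DeepSeek-V3's published figures; the quantization level is an illustrative assumption):

```python
# MoE models activate only a subset of parameters per token, so per-token
# PCIe traffic scales with active parameters, not total parameters.
TOTAL_PARAMS_B = 671    # DeepSeek-V3 total parameters, billions (published)
ACTIVE_PARAMS_B = 37    # parameters activated per token, billions (published)
BITS_PER_WEIGHT = 4.5   # illustrative quantization assumption
PCIE5_GBPS = 64.0       # PCIe 5.0 x16 (from the proposal)

active_gb = ACTIVE_PARAMS_B * BITS_PER_WEIGHT / 8  # GB of active weights
print(f"Active weights per token: ~{active_gb:.1f} GB "
      f"(vs ~{TOTAL_PARAMS_B * BITS_PER_WEIGHT / 8:.0f} GB total)")
# Transfer-time lower bound, ignoring compute overlap, expert reuse
# across consecutive tokens, and any experts already cached in VRAM.
print(f"Lower-bound transfer time at 64 GB/s: {active_gb / PCIE5_GBPS:.2f} s")
```

Expert reuse across consecutive tokens and caching hot experts in VRAM would reduce the real traffic further; this just bounds the worst case.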
Taken together, these advancements make it conceivable that, with this enhancement, we could get usable tokens/second from very large models via partial offloading between GPU VRAM and CPU RAM.
Implementation
This is partially inspired by the UMbreLLa project, which demonstrated the feasibility of a very similar approach:
https://github.com/Infini-AI-Lab/UMbreLLa