Force All Computations to Run on GPU during Partial Offloading #11442
TheAiSandbox
started this conversation in Ideas
Replies: 1 comment
-
This will also significantly speed up token generation when …
-
I propose adding a command-line option to force all computations onto the GPU during partial offloading, using CPU RAM as a weight buffer: layers that do not fit in VRAM would be streamed to the GPU for computation instead of being evaluated on the CPU. This would allow us to maintain high performance and efficiency even when dealing with models that are too large for the available VRAM.
Motivation
High Bandwidth PCIe 5.0: The upcoming Nvidia and AMD consumer graphics cards support PCIe 5.0 x16, offering up to 64 GB/s of bandwidth. The increased bandwidth allows for faster data transfer between CPU RAM and the GPU, which can mitigate much of the overhead of streaming model layers across the PCIe bus during token generation. Additional PCIe slots (for multi-GPU setups) further increase the aggregate bandwidth, and systems on older PCIe generations would still benefit, just with proportionally longer transfer times.
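As a back-of-the-envelope illustration (the model size, quantization, and offload split below are illustrative assumptions, not figures from this discussion; only the 64 GB/s PCIe 5.0 x16 figure comes from the proposal):

```python
# Rough per-token cost of streaming CPU-resident weights to the GPU.
# All model figures below are illustrative assumptions.
PCIE5_X16_GBPS = 64.0   # GB/s, PCIe 5.0 x16 (from the proposal)
PCIE4_X16_GBPS = 32.0   # GB/s, PCIe 4.0 x16, for comparison

def stream_time_s(offloaded_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to copy the CPU-resident portion of the weights to the GPU
    once per generated token (ignoring overlap with compute)."""
    return offloaded_gb / bandwidth_gbps

# Example: a ~40 GB quantized model with 30 GB held in system RAM
# after partial offload.
offloaded = 30.0
print(f"PCIe 5.0 x16: {stream_time_s(offloaded, PCIE5_X16_GBPS):.2f} s/token")
print(f"PCIe 4.0 x16: {stream_time_s(offloaded, PCIE4_X16_GBPS):.2f} s/token")
```

This is a lower bound on per-token transfer time if every offloaded weight must cross the bus each token; overlapping transfers with compute, or batching (see below), reduces the effective cost.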
Speculative Decoding: Speculative decoding can benefit significantly from the additional compute provided by the GPU, because the larger target model verifies the draft tokens in a single batched forward pass; batched evaluation is compute-bound, and therefore far more effective on a GPU than on a CPU.
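Batching also interacts favorably with the weight streaming this proposal requires: one batched verification pass fetches the offloaded weights once for the whole draft batch. A minimal sketch, with illustrative figures (only the 64 GB/s bandwidth is from the proposal, and full draft acceptance is assumed for simplicity):

```python
# Speculative decoding amortizes weight streaming: verifying k draft tokens
# in one batched forward pass fetches the offloaded weights once, not k times.
OFFLOADED_GB = 30.0   # weights resident in CPU RAM (assumed)
PCIE5_GBPS = 64.0     # PCIe 5.0 x16 bandwidth (from the proposal)

def transfer_per_token_s(draft_batch: int) -> float:
    """PCIe transfer time amortized over a batch of verified draft tokens,
    optimistically assuming all drafts are accepted."""
    return (OFFLOADED_GB / PCIE5_GBPS) / draft_batch

for k in (1, 4, 8):
    print(f"batch={k}: {transfer_per_token_s(k):.3f} s of transfer per token")
```

In practice the acceptance rate of draft tokens lowers the amortization factor, but the trend holds: larger verified batches spread the fixed streaming cost over more generated tokens.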
Mixture-of-Experts Models + Multi-Token Prediction: The growing popularity of mixture-of-experts models and multi-token prediction methods, thanks to the new releases from DeepSeek, suggests a potential for much higher throughput even with very large parameter counts.
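MoE models are a particularly good fit for streaming, since only the active experts' weights need to cross the bus each token. For scale (the total/active parameter counts are DeepSeek-V3's published figures; the quantization level is an illustrative assumption):

```python
# MoE models activate only a subset of parameters per token, so per-token
# PCIe traffic scales with active parameters, not total parameters.
TOTAL_PARAMS_B = 671    # DeepSeek-V3 total parameters, billions (published)
ACTIVE_PARAMS_B = 37    # parameters activated per token, billions (published)
BITS_PER_WEIGHT = 4.5   # illustrative quantization assumption
PCIE5_GBPS = 64.0       # PCIe 5.0 x16 (from the proposal)

active_gb = ACTIVE_PARAMS_B * BITS_PER_WEIGHT / 8  # GB of active weights
print(f"Active weights per token: ~{active_gb:.1f} GB "
      f"(vs ~{TOTAL_PARAMS_B * BITS_PER_WEIGHT / 8:.0f} GB total)")
# Transfer-time lower bound, ignoring compute overlap, expert reuse
# across consecutive tokens, and any experts already cached in VRAM.
print(f"Lower-bound transfer time at 64 GB/s: {active_gb / PCIE5_GBPS:.2f} s")
```

Expert reuse across consecutive tokens and caching hot experts in VRAM would reduce the real traffic further; this just bounds the worst case.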
Taken together, these advancements make it conceivable that, with this enhancement, we could get usable tokens/second from very large models via partial offloading between GPU VRAM and CPU RAM.
Implementation
This is partially inspired by the UMbreLLa project, which demonstrated the feasibility of a very similar approach:
https://github.com/Infini-AI-Lab/UMbreLLa