
BnB a disk space hog #1129

Closed
poedator opened this issue Mar 13, 2024 · 9 comments
@poedator
Contributor

poedator commented Mar 13, 2024

System Info

Somehow BnB likes to bring with it libraries for every possible CUDA version. That makes it the largest library in my env after torch, with 300+ MB of disk use (in each env!). Is this really necessary? Is there a magic install parameter to avoid this?

Below are the largest files in bnb folder in my env:

ncdu 1.12                                                                   
--- /home/****/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes --------------
 Total disk usage: 324.4 MiB  Apparent size: 324.2 MiB  Items: 110                                                                        
   25.3 MiB [##########]  libbitsandbytes_cuda118_nocublaslt.so
   24.6 MiB [######### ]  libbitsandbytes_cuda123_nocublaslt.so
   24.6 MiB [######### ]  libbitsandbytes_cuda122_nocublaslt.so
   24.5 MiB [######### ]  libbitsandbytes_cuda121_nocublaslt.so
   24.5 MiB [######### ]  libbitsandbytes_cuda120_nocublaslt.so
   20.0 MiB [#######   ]  libbitsandbytes_cuda114_nocublaslt.so
   20.0 MiB [#######   ]  libbitsandbytes_cuda115_nocublaslt.so
   19.8 MiB [#######   ]  libbitsandbytes_cuda117_nocublaslt.so
   19.3 MiB [#######   ]  libbitsandbytes_cuda111_nocublaslt.so
   14.2 MiB [#####     ]  libbitsandbytes_cuda118.so
   13.9 MiB [#####     ]  libbitsandbytes_cuda123.so
   13.9 MiB [#####     ]  libbitsandbytes_cuda122.so
   13.8 MiB [#####     ]  libbitsandbytes_cuda121.so
   13.8 MiB [#####     ]  libbitsandbytes_cuda120.so
   10.6 MiB [####      ]  libbitsandbytes_cuda110_nocublaslt.so
    8.9 MiB [###       ]  libbitsandbytes_cuda114.so
    8.9 MiB [###       ]  libbitsandbytes_cuda115.so
    8.7 MiB [###       ]  libbitsandbytes_cuda117.so
    8.6 MiB [###       ]  libbitsandbytes_cuda111.so
    5.7 MiB [##        ]  libbitsandbytes_cuda110.so

Also compare it with the GPTQ libs:

$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/auto_gptq -s
832K    /home/optimus/conda/envs/py38/lib/python3.8/site-packages/auto_gptq
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/optimum -s
3.4M    /home/optimus/conda/envs/py38/lib/python3.8/site-packages/optimum
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes -s
325M    /home/optimus/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes

Reproduction

Install bnb with pip and check the disk usage.

Expected behavior

The installed package should take much less disk space.

@matthewdouglas
Member

I have a draft PR, #1103, as a consideration to help slim this down, but it still needs testing and validation.

There's also some discussion on this here: #1032 (comment)

As of the latest v0.43.0 release, we reduced the shipped binaries to cover only CUDA 11.7 - 12.3, but there's more work to be done.

@Titus-von-Koeller
Collaborator

Yeah, we're working on slimming this down, but there's a clear trade-off between ease of installation and disk space. Two main factors add to the volume: CUDA version support, and the binaries being "fat binaries", i.e. the binary for each CUDA version is much "fatter" because it includes the symbols for all compute capabilities. Neither the CUDA version nor the compute capability is detected by pip (please correct me if I'm wrong), so we can't package different wheels while still enabling a simple pip install bitsandbytes.

With Conda, at least, detecting the CUDA installation seems possible, but this is quite the rabbit hole and tricky. We might look into that later.

Anyways, when compiling from source you can pass CLI args to CMake and specify just the CUDA version and compute capability that you need for your installation and GPU model. This gives you a very reasonably sized binary.
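
For illustration, a narrowed-down source build could look roughly like the following. This is only a sketch: the option names COMPUTE_BACKEND and COMPUTE_CAPABILITY, and the CUDA 12.1 / sm_86 choices, are assumptions that should be checked against the project's compile-from-source docs.

git clone https://github.com/TimDettmers/bitsandbytes.git && cd bitsandbytes
export CUDA_HOME=/usr/local/cuda-12.1                          # build against a single CUDA toolkit version
cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="86" -S .    # target one compute capability only
make -j
pip install .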

Another factor is that higher optimization levels when compiling produce larger binaries, partly due to inlining. As a trade-off, we already settled on only the second-highest optimization setting.

#1103, which @matthewdouglas mentioned, tries to simplify things by ensuring we only need to compile one binary per major CUDA version, which would potentially slim things down to just two binaries. This still needs thorough review and testing, though. Hopefully it will be ready for the next release or the one thereafter.

Anyone reading this, please let us know if you have any info that's not already mentioned here that could help us improve the status quo.

@Titus-von-Koeller
Collaborator

Hmm, I wonder if this really needs to remain an open issue, or if we could move this discussion to #1032 or to a thread in the GitHub Discussions dev corner (I can convert the issue to that). Wdyt?

@poedator
Contributor Author

As a temporary measure, is it safe for a user to manually delete all irrelevant versions from /site-packages/bitsandbytes?
For example, like this (keeping only the CUDA 12.1 binaries):

cd ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes
find . -type f | grep -e libbitsandbytes_cuda | grep -v 121 | xargs rm
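
(One way to see which CUDA version the installed torch binary targets, and therefore which bitsandbytes binaries to keep, assuming torch was installed from a binary distribution:)

python -c "import torch; print(torch.version.cuda)"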

@matthewdouglas
Member

@poedator Yes, that should be safe to do.

As @Titus-von-Koeller mentions, each included compute capability adds weight, since we're shipping fat binaries compiled for everything >= Maxwell. Each target seems to add ~2-3 MB to the overall size.

Here's what was shipped in v0.43.0:

CUDA    Targets
11.7.1  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, compute_86
11.8.0  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, compute_89
12.0.1  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90
12.1.1  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90
12.2.2  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90
12.3.2  sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90
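
As a side note, the compute targets actually embedded in any of the shipped binaries can be inspected with cuobjdump from the CUDA toolkit. A small sketch (the library filename here is just an example):

cuobjdump --list-elf libbitsandbytes_cuda121.so    # cubins embedded for each sm_* target
cuobjdump --list-ptx libbitsandbytes_cuda121.so    # PTX embedded for the compute_* target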

In #1032 (comment) I had proposed that we drop CUDA < 11.7 and try to align better with PyTorch's binary distributions. CUDA 11.7, 11.8, and 12.1 match their distributions for torch>=1.13.0. The other versions are still there, for now, to keep parity with prior bitsandbytes releases.

bitsandbytes is shipped such that the end user shouldn't actually need the whole NVIDIA CUDA Toolkit and compiler toolchain in order to install and run it. PyTorch's binary distributions come with all of the CUDA libraries we need at runtime. The problem has generally been with locating these at runtime: we tend to end up searching for CUDA Toolkit installations instead, and that's part of why we have 12.0, 12.2, and 12.3 in the distribution. Colab, for example, has CUDA 12.2 installed now.
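
As a rough sketch of where those bundled libraries live (assuming a pip-installed torch binary build; note that newer wheels pull the CUDA runtime libraries in via separate nvidia-* packages under site-packages/nvidia rather than shipping everything inside torch/lib):

ls "$(python -c "import torch, os; print(os.path.dirname(torch.__file__))")/lib"                 # libraries bundled with torch
ls "$(python -c "import torch, os; print(os.path.dirname(torch.__file__))")/../nvidia" 2>/dev/null  # CUDA libs from nvidia-* pip packages, if present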

I'm looking at #1126 to make sure we try to load the libraries that come with PyTorch first, before falling back to searching for CUDA libraries elsewhere. The point there is again that we want installation to be much easier, with broad compatibility across platforms and hardware, so there's a balancing act. But if we get that right, it should mean we can drop down to just the CUDA versions shipped with PyTorch and require that others be built from source. That would potentially shave away half of the binaries.

Moving forward there are more options to explore, including:

  • Remove the separate NO_CUBLASLT build: #1103 would drop all of the _nocublaslt variants and rely on runtime dispatch instead.
  • Since CUDA 11.1, there is supposed to be binary compatibility across minor toolkit versions. We need to test this out, but if everything works as expected, we could actually ship just one binary for 11.x and one for 12.x. The minimum required driver version now stays constant across the minor releases within a major toolkit version.
  • We could consider slimming down the number of architectures we compile cubins for. For example, both sm_50 and sm_52 are Maxwell; strictly speaking, an sm_50 cubin will still run on an sm_52 device. The same goes for Pascal with sm_60 and sm_61. For Turing and newer we definitely want to build optimized cubins for each target, but it may be worth considering a change for Maxwell. In fact, sm_50 support was marked deprecated in the CUDA 11.0 release, nearly 4 years ago. (Checking which architecture your own device actually needs is sketched just below.)
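
For anyone following along who wants to know which single architecture matters for their own machine, a quick sketch (the compute_cap query field requires a reasonably recent driver):

python -c "import torch; print(torch.cuda.get_device_capability())"
nvidia-smi --query-gpu=name,compute_cap --format=csv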

@Titus-von-Koeller
Collaborator

@poedator Yes, that should be safe to do.
Agreed.

Thanks @matthewdouglas for elaborating on all your current and upcoming work. Spelling out the details really helps our shared understanding and gets other knowledgeable people involved in fleshing out the tricky details! I'll engage on those topics with you more soon, once I have some other, more urgent stuff out of the way.

@deep-pipeline

@poedator on the basis of the above discussion, where it looks like folks are moving the codebase in the right direction and have clarified that manual deletion is fine, are you happy for this issue to be closed?

(I'm just trying to nudge down the total live-issue count on the basis that will improve contributor focus and bandwidth.)

@poedator
Contributor Author

poedator commented Apr 2, 2024

@poedator on the basis of the above discussion, where it looks like folks are moving the codebase in the right direction and have clarified that manual deletion is fine, are you happy for this issue to be closed?

(I'm just trying to nudge down the total live-issue count on the basis that will improve contributor focus and bandwidth.)

OK to close if this will be worked on in #1032.

@Titus-von-Koeller
Collaborator

Yes, I'll keep your feedback in mind when addressing these topics in the coming weeks/months, and we'll try to come up with a more space-saving solution.

Thanks everyone for your collaborative spirit. Really appreciated.
