An implementation of the Huffman code in C17.
The code should work on any system that supports C17 and the uint64_t
datatype, including MSb-first as bit numbering is expected. Big-endian and little-endian systems are cross-compatible, except when the decoded file contains only one byte type. The last requirement is a byte consisting of 8 bits.
To compile the project, run make
.
The Makefile needs to be modified for non linux systems, the resulting executable is in the same directory as the Makefile and is called huffman
, to install it system wide run make install
. To remove it, make uninstall
. Use make debug
for development , the compiled binary is slower, but includes address sanitiser and debug symbols.
$ huffman help
Encodes and decodes data using huffman algorithm.
Usage:
huffman (encode|decode) [INFILE [OUTFILE]]
huffman help
Note:
- may be used for stdin / stdout
When omitting INFILE/OUTFILE, stdin/stdout is used
Benchmark on a laptop with an nvme, data is randomly generated:
encoding | decoding | |
---|---|---|
1 KiB | 0.004s | 0.003s |
1 MiB | 0.020s | 0.031s |
100 MiB | 2.018s | 2.982s |
1 GiB | 10.767s | 15.782s |
10 GiB | 61.474s | 101.383s |
The maximum depth for the tree is hardcoded to 64. This paper shows that
To build a tree of depth 65 (which would fail), you'd need
There is another limit: the frequency per symbol. The frequency of each symbol is stored as uint64_t
, so each symbol can occur at most
The absolute limit that this Huffman implementation can handle is when each symbol frequency is maxed out, so
The file format has a simple structure, first the tree is decoded, then comes the encoded data, and the last three bits indicate the number of significant bits in the last byte.
The tree is stored as described here:
- Start at root
- go left, write 0
- If leaf, write 1, then symbol, continue
- If not leaf, repeat this step
- go right, write 0
- If leaf, write 1, then symbol, continue
- If not leaf, repeat this step
There is a special case when only one type of symbol is encoded, e.g. "AAA". Then the first bit is 1, the second byte is the symbol to decode and the following 8 bytes are the frequency of the symbol as uint64_t
. Only this scenario isn't cross-compatible between little-endian and big-endian.