How Huffman Coding Works
The goal of Huffman coding is to create a set of variable-length codes for characters, with shorter codes for more frequent characters. This allows us to represent data more efficiently and compress the overall size.
Variable-Length Codes
Huffman coding assigns variable-length codes to input characters based on the frequency of those characters. More common characters that appear more often are assigned shorter bit sequences, while less common characters get longer codes.
For example, given the string "hello world", we would construct a frequency table:
1h - 1
2e - 1
3l - 3
4o - 2
5w - 1
6r - 1
7d - 1
The character 'l' appears most often, so it would get the shortest code. 'd' appears least often, so it would get the longest code. This encoding allows us to represent common characters very concisely.
Prefix-Free Codes
Huffman codes are constructed in a special way to ensure that no code is a prefix of another code. This prefix-free property allows each bit sequence to be decoded unambiguously.
For example, we would not assign code '01' to 'h' and code '011' to 'e', because '01' is a prefix of '011'. With prefix-free codes, the decoder can follow the bits until a complete code is formed, without ambiguity about when one code ends and another begins.
The prefix-free property is critical for allowing Huffman coding to be lossless and reversible. Given the encoded bit stream, we can unambiguously derive the original data by decoding each complete bit sequence into its matching character.
