Huffman Coding

Data compression is common in computer applications. Information travels over a network as a bit-stream of 0s and 1s, and the goal is to transmit it using as few bits as possible, which saves both transmission time and bandwidth. Huffman coding is one such data compression technique. This blog discusses Huffman coding, how it works, its implementation, and some applications of the method.

What is Huffman Coding?

This data compression method assigns a binary code to each unique symbol (for example, each byte or character) in the data. Symbols that occur most often get shorter codes, and symbols that occur less often get longer codes. No code is a prefix of another code, so the encoded bit-stream can be decoded without ambiguity. In this way, the whole data is encoded into a sequence of bits.

The two variants of Huffman coding are static Huffman coding and dynamic (adaptive) Huffman coding.

Static Huffman coding builds the code table once, from character frequencies that are known or counted in advance, and uses that fixed table for the whole input. Dynamic Huffman coding, on the other hand, updates the frequencies and the tree while the data is being processed, so it does not need a separate counting pass. Both are forms of variable-length encoding.

Huffman Coding Implementation:

Huffman coding is done in two steps. The first step is to build a binary tree of the data characters based on their frequencies. The second step is to traverse this tree and assign each character a code made of 0s and 1s.

Building a Huffman Tree:

The binary tree consists of two types of nodes: leaf nodes and parent nodes. Each leaf node holds a character and its number of occurrences. Each parent node, or internal node, has two children and holds the sum of their frequencies. The left child is indicated by bit 0 and the right child by bit 1. The algorithm for building a Huffman binary tree using a priority queue (ordered by lowest frequency first) includes the following steps:

1- Each character is added to the queue as a leaf node, with its frequency as the priority.

2- While the queue holds more than one node, remove the two nodes with the lowest frequencies and create a parent node with these two as child nodes. The frequency of the parent node is the sum of the two. Add the parent node back to the priority queue.

3- The single node left in the queue is the root of the Huffman tree.

Consider the following example for a better understanding. Suppose we have characters p, q, r, s, and t with frequencies 4, 6, 7, 7, and 16.

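Now, build the binary tree: first merge p (4) and q (6) into a parent of 10. Next, merge r (7) and s (7) into a parent of 14. Then merge the nodes 10 and 14 into a parent of 24. Finally, merge t (16) and the node 24 into the root, whose frequency is 40.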
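The merge order for these frequencies can be reproduced with a short Python sketch. This is only an illustration, not a reference implementation: the function name `build_huffman_tree`, the nested-tuple tree representation, and the tie-breaking counter are all assumptions made for this example.

```python
import heapq
from itertools import count


def build_huffman_tree(freqs):
    """Build a Huffman tree from a {character: frequency} mapping.

    Leaves are plain characters; internal nodes are (left, right) tuples.
    Returns (root, total_frequency, merges), where merges records each
    (first, second, parent) frequency triple in the order it happened.
    Assumes freqs is not empty.
    """
    tiebreak = count()  # unique sequence number so tied frequencies never compare nodes
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)

    merges = []
    while len(heap) > 1:
        # Remove the two lowest-frequency nodes.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merges.append((f1, f2, f1 + f2))
        # Their parent carries the summed frequency and goes back into the queue.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    total, _, root = heap[0]
    return root, total, merges


freqs = {"p": 4, "q": 6, "r": 7, "s": 7, "t": 16}
root, total, merges = build_huffman_tree(freqs)
for f1, f2, parent in merges:
    print(f"{f1} + {f2} -> {parent}")
```

With this particular tie-breaking, t ends up as the left child of the root. A different but equally valid implementation could place it on the right, which changes the individual bits of the codes but not their lengths.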

Encoding the Huffman Binary Tree:

As discussed above, each left branch is assigned bit 0 and each right branch bit 1. The code for a character is the sequence of bits on the path from the root node to that character's leaf node.
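As a rough sketch of this traversal in Python, the function below walks a tree and collects one code per leaf. The tree layout for p, q, r, s, and t is an assumption (t as the left child of the root, written as nested tuples), and `assign_codes` is a name chosen for this example.

```python
def assign_codes(node, prefix="", codes=None):
    """Collect {character: code} by walking the tree from the root.

    A leaf is a single character; an internal node is a (left, right) tuple.
    Going left appends "0" to the path, going right appends "1".
    """
    if codes is None:
        codes = {}
    if isinstance(node, str):
        # Leaf reached: the path taken so far is this character's code.
        # A tree with only one symbol still needs one bit, hence the fallback.
        codes[node] = prefix or "0"
    else:
        left, right = node
        assign_codes(left, prefix + "0", codes)
        assign_codes(right, prefix + "1", codes)
    return codes


# Assumed layout for the example: t directly under the root, on the left.
tree = ("t", (("p", "q"), ("r", "s")))
codes = assign_codes(tree)
for ch, code in codes.items():
    print(ch, code)
```

Under this assumed layout, the most frequent character t gets a 1-bit code, while p, q, r, and s each get a 3-bit code.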

Huffman Compression Technique:

Building the Huffman tree and reading the codes from it is the core of Huffman compression, which is also called Huffman encoding. Each data character is replaced by its code of 0s and 1s. Because frequent characters get short codes, this reduces the overall bit size of the data, which is why it is called a compression technique.

Let’s see how our above data characters p, q, r, s, and t are compressed.

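Assuming each character is represented by 8 bits (1 byte), the total bit size of the data characters (4 p, 6 q, 7 r, 7 s, and 16 t) is 40 * 8 = 320 bits. After Huffman compression, the encoded message takes 16 * 1 + (4 + 6 + 7 + 7) * 3 = 88 bits, because t gets a 1-bit code and each of the other four characters gets a 3-bit code. Counting the table that accompanies the message as well (40 bits for the five 8-bit characters plus 40 for their frequencies), the bit size of the data reduces to 40 + 40 + 88 = 168 bits.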
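The 320-bit and 88-bit figures can be checked with a few lines of Python. The code lengths below follow from the Huffman tree for these frequencies (t one level below the root, the other four characters three levels below); the variable names are chosen just for this sketch.

```python
freqs = {"p": 4, "q": 6, "r": 7, "s": 7, "t": 16}

# Code length of each character = depth of its leaf in the Huffman tree.
code_lengths = {"p": 3, "q": 3, "r": 3, "s": 3, "t": 1}

# Uncompressed size: every occurrence costs a fixed 8 bits.
fixed_bits = sum(freqs.values()) * 8

# Encoded size: every occurrence costs the length of that character's code.
encoded_bits = sum(freqs[ch] * code_lengths[ch] for ch in freqs)

print("fixed-width bits:", fixed_bits)
print("Huffman-encoded bits:", encoded_bits)
```

This covers only the encoded message itself. Any table of characters and frequencies sent alongside it adds to the total.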

Huffman Decompression Technique:

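For decompression, we need the codes, or equivalently the Huffman tree they were read from. Using them, we traverse the Huffman tree and decode the data.

We start from the root node and read the encoded bits one at a time, moving to the left child on 0 and to the right child on 1. When a leaf node is reached, we output its character and return to the root for the next bit.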

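In our example, t is the most frequent character and sits directly below the root, so a single bit is enough to reach it. A less frequent character such as p sits three levels down, so decoding it means following a 3-bit path from the root node until we reach its leaf node.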
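A minimal decoding sketch in Python might look like the following. It assumes the tree is stored as nested tuples with t as the left child of the root (one possible layout for the example frequencies), that the tree has at least two leaves, and that the input is a string of "0" and "1" characters. The name `decode` is hypothetical.

```python
def decode(bits, tree):
    """Decode a string of "0"/"1" characters using a Huffman tree.

    A leaf is a single character; an internal node is a (left, right) tuple.
    Trailing bits that do not complete a code are ignored.
    """
    decoded = []
    node = tree
    for bit in bits:
        # 0 moves to the left child, 1 moves to the right child.
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):
            # Leaf reached: emit the character and restart from the root.
            decoded.append(node)
            node = tree
    return "".join(decoded)


# Assumed layout for the example: t directly under the root, on the left.
tree = ("t", (("p", "q"), ("r", "s")))
print(decode("0100111", tree))
```

No separators are needed between codes in the bit-stream, because no code is a prefix of another: the decoder always knows a character is complete the moment it reaches a leaf.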

Time Complexity of Huffman Coding Algorithm:

Building the Huffman tree with a priority queue takes O(n log n) time, where n is the number of unique characters. The queue performs O(n) insertions and removals, and each one costs O(log n).

Ending Note:

In this blog on Huffman coding, we have discussed an algorithm that forms the basis of many compression techniques used in software development. Compression formats like GZIP and ZIP (the format used by tools such as WinZip) use Huffman coding as part of the DEFLATE algorithm. Image formats like JPEG and PNG also rely on Huffman coding in their compression pipelines. Although it is sometimes deemed a slow technique, especially for digital media compression, the algorithm is still widely used due to its storage efficiency and straightforward implementation.