
User:C-processor/Polar Tree


Polar Tree is a set of binary codes assigned to a symbol set in the process known in information theory as entropy encoding. The goal is to reduce the length of the encoded fragment. Other binary trees created for the same purpose are Shannon-Fano coding and Huffman coding. All three methods of constructing codes are known as prefix coding, because no code is a prefix of another code, which allows the sequence of symbols to be restored unambiguously from the encoded binary stream. Historically, Shannon-Fano coding was introduced earlier than Huffman coding, and it was proven that Huffman coding provides the shortest possible encoded message for a given set of frequencies of occurrence. For that reason Huffman coding replaced Shannon-Fano coding for a very long time, until the adaptive entropy encoding technique was introduced and completely displaced static encoding.

In adaptive encoding the tree is not output into the encoded stream but is updated synchronously during both encoding and decoding. The tree is created from the symbols already seen and applied to the new symbols that appear in the message. For that reason the difference between the codes is not as important as the precision of the frequency estimates for symbols that have not yet occurred. For example, presume that in adaptive encoding we start from a default tree and update it after every 1000 symbols. In both encoding and decoding we may use the statistics collected on the first 1000 symbols and recreate the tree. The new tree is then applied to the encoding of the next 1000 symbols, which have not yet occurred, and the choice between Shannon-Fano coding and Huffman coding makes only a fraction of a percent of difference in the length of the encoded fragment, while the adaptation interval, the strategy for updating frequencies, or another mechanism of tree adjustment may contribute significantly more to the compression ratio. For that reason, in adaptive algorithms the most important role is played by the speed and convenience of recreating the tree rather than by its optimality, because optimality exists only for the provided frequencies, while the actual frequencies in adaptive encoding always differ from the presumed ones.

A Polar tree is, in the general case, different from a Huffman tree, although the two may coincide in many cases; however, it can be created or updated with a very small number of operations, which makes it attractive for use in various adaptive algorithms where trees have to be updated frequently. In the table below the codes for all three trees are presented for the same example as considered in Shannon-Fano coding.

Symbol Frequency Huffman codes Shannon-Fano codes Polar codes
A 15 0 00 0
B 7 100 01 100
C 6 101 10 101
D 6 110 110 110
E 5 111 111 111

In this particular case the Polar codes are identical to the Huffman codes, but in the general case there may be some difference. The same is true for the Shannon-Fano codes: they, too, may coincide with the Huffman codes.

The algorithm for constructing Polar codes proceeds through the generation of code lengths, as shown in the next table.

Symbol Original frequency First run Second run Third run Code length Code
A 15 8 16 32 1 0
B 7 4 8 8 3 100
C 6 4 8 8 3 101
D 6 4 8 8 3 110
E 5 4 8 8 3 111

The starting data for building the tree is a sorted list of all frequencies. The sum of all frequencies, 39, is rounded up to the next power of two, which is 64, and all frequencies are rounded down to the nearest power of two. On the next step we move from top to bottom and double every frequency that can be doubled without exceeding the total of 64, which we call the target total. The result, 16, 8, 8, 8, 8, is shown in the fourth column. We keep repeating the previous step until the sum of all frequencies becomes equal to the target total. As we can see, we need to repeat this step only once and obtain the final list 32, 8, 8, 8, 8. The difference in bit lengths between the modified frequency and the target total is the code length for each symbol in the tree. Having the code lengths, we build the actual codes by adding 1 to the previous code and shifting it to the left so that each code has its corresponding length. This last operation of building actual codes from code lengths is already known and used in building Huffman trees, and is also known as a compact way of saving trees, so only the generation of code lengths constitutes the novelty of the algorithm. The operation of doubling a frequency is a binary shift, and the intermediate frequencies being added have only one set bit; the number of passes is rarely more than 10 even for trees built for 256 symbols.
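
The procedure above maps directly onto a few shift and add operations. The following C program is a minimal illustrative sketch, not part of the original description: it assumes the frequencies are already sorted in descending order, allows at most 256 symbols, and the names polar_code_lengths and codes_from_lengths are chosen here for illustration only.

#include <stdio.h>

#define MAX_SYMBOLS 256

/* Sketch of the code-length generation described above.  Frequencies are
   assumed to be sorted in descending order, as in the table. */
static void polar_code_lengths(const unsigned freq[], int n, int len[])
{
    unsigned rounded[MAX_SYMBOLS];
    unsigned total = 0, target = 1;
    int i;

    /* Sum the original frequencies, then round the sum up to a power of two
       (the "target total"): 39 becomes 64 in the example above. */
    for (i = 0; i < n; i++)
        total += freq[i];
    while (target < total)
        target <<= 1;

    /* Round every frequency down to a power of two: 15,7,6,6,5 -> 8,4,4,4,4. */
    total = 0;
    for (i = 0; i < n; i++) {
        unsigned p = 1;
        while ((p << 1) <= freq[i])
            p <<= 1;
        rounded[i] = p;
        total += p;
    }

    /* Walk top to bottom, doubling every frequency that can be doubled without
       exceeding the target total; repeat until the sum reaches the target. */
    while (total < target) {
        for (i = 0; i < n; i++) {
            if (total + rounded[i] <= target) {
                total += rounded[i];      /* doubling adds the old value once more */
                rounded[i] <<= 1;
            }
        }
    }

    /* The code length is the difference in bit lengths between the adjusted
       frequency and the target total: 32 vs 64 gives length 1, 8 vs 64 gives 3. */
    for (i = 0; i < n; i++) {
        len[i] = 0;
        while ((rounded[i] << len[i]) < target)
            len[i]++;
    }
}

/* Build the actual codes from the lengths in the canonical way mentioned above:
   start from 0, then for each next symbol add 1 and shift left by the growth in
   code length.  Lengths are non-decreasing because the frequencies are sorted. */
static void codes_from_lengths(const int len[], int n, unsigned code[])
{
    int i;
    code[0] = 0;
    for (i = 1; i < n; i++)
        code[i] = (code[i - 1] + 1) << (len[i] - len[i - 1]);
}

int main(void)
{
    unsigned freq[] = { 15, 7, 6, 6, 5 };   /* A, B, C, D, E from the table */
    int n = 5, len[5], i, b;
    unsigned code[5];

    polar_code_lengths(freq, n, len);
    codes_from_lengths(len, n, code);
    for (i = 0; i < n; i++) {
        printf("%c: length %d, code ", 'A' + i, len[i]);
        for (b = len[i] - 1; b >= 0; b--)
            putchar('0' + (int)((code[i] >> b) & 1));
        putchar('\n');
    }
    return 0;
}

On the example data this prints the lengths 1, 3, 3, 3, 3 and the codes 0, 100, 101, 110, 111, matching the table above.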