User:C-processor/Polar Tree

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by C-processor (talk | contribs) at 04:06, 25 August 2010 (Additional optimization by adaptive sorting). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The concept

A Polar tree is a set of binary codes assigned to a set of symbols in the process known in information theory as entropy encoding. The goal is to reduce the length of the encoded fragment. Other binary trees created for the same purpose are Shannon-Fano coding and Huffman coding. All three methods produce prefix codes: no code is a prefix of another code, which allows the sequence of symbols to be restored unambiguously from the encoded binary stream. Historically, Shannon-Fano coding was introduced before Huffman coding, and it was later proven that Huffman coding yields the shortest possible encoded message for a given set of symbol frequencies. For that reason Huffman coding displaced Shannon-Fano coding for a long time, until adaptive entropy encoding techniques appeared and largely replaced static encoding.

In adaptive encoding the tree is not written into the encoded stream; instead it is updated in an identical way during both encoding and decoding. The tree is built from symbols already seen and applied to the new symbols that appear next in the message. Consequently, small differences between code sets matter less than the accuracy of the frequency estimates for symbols that have not yet occurred. For example, suppose an adaptive coder starts from a default tree and updates it after every 1000 symbols. Both encoder and decoder can use the statistics collected on the first 1000 symbols to recreate the tree, which is then applied to the next 1000 symbols, which have not yet occurred. In this setting the choice between Shannon-Fano and Huffman coding changes the length of the encoded fragment by a fraction of a percent, while the adaptation interval, the strategy for updating frequencies, and other tree-adjustment mechanisms can contribute far more to the compression ratio.
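The lockstep update that makes adaptive coding decodable can be sketched as follows. This is a minimal illustration, not the author's implementation; the interval of 1000 symbols comes from the example above, and the function name is invented here:

```python
from collections import Counter

ADAPT_INTERVAL = 1000  # the update interval used in the example above

def block_statistics(stream, interval=ADAPT_INTERVAL):
    """Yield, block by block, the symbol counts a coder would use for
    that block.  Encoder and decoder run this identical loop, so the
    counts -- and any tree rebuilt from them -- stay in lockstep."""
    counts = Counter()                 # statistics behind the default tree
    for start in range(0, len(stream), interval):
        yield dict(counts)             # only symbols seen BEFORE this block
        counts.update(stream[start:start + interval])
```

The encoder codes block k with the counts gathered from blocks 0 through k-1; the decoder reproduces exactly the same counts from the symbols it has already decoded, so both sides rebuild the same tree without the tree ever being transmitted.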
For that reason, in adaptive algorithms the speed and convenience of recreating the tree matter more than its optimality, because optimality holds only for the supplied frequencies, and in adaptive encoding the actual frequencies always differ from the presumed ones. A Polar tree in the general case differs from a Huffman tree, although the two often coincide, but it can be created or updated with a very small number of operations, which makes it attractive for adaptive algorithms in which trees must be updated frequently. The table below shows the codes produced by all three methods for the same example considered in Shannon-Fano coding.

Symbol   Frequency   Huffman code   Shannon-Fano code   Polar code
A        15          0              00                  0
B         7          100            01                  100
C         6          101            10                  101
D         6          110            110                 110
E         5          111            111                 111

In this particular case the Polar codes are identical to the Huffman codes, but in the general case they may differ. The same is true of Shannon-Fano codes, which may also coincide with Huffman codes.

The algorithm for constructing Polar codes proceeds through code-length generation, as shown in the next table.

Symbol   Original frequency   First run   Second run   Third run   Code length   Code
A        15                   8           16           32          1             0
B         7                   4            8            8          3             100
C         6                   4            8            8          3             101
D         6                   4            8            8          3             110
E         5                   4            8            8          3             111

The starting data for building the tree is a sorted list of all frequencies. The sum of the frequencies, 39, is rounded up to the next power of two, 64, which we call the target total, and each frequency is rounded down to the nearest power of two, giving the list 8,4,4,4,4 shown in the third column. On the next step we move from top to bottom and double every frequency that can be doubled without the sum exceeding the target total of 64; the result, 16,8,8,8,8, is shown in the fourth column. We keep repeating this step until the sum of all frequencies equals the target total. In this example one more pass suffices and yields the final list 32,8,8,8,8. The code length for each symbol is the difference in bit length between the target total and the symbol's modified frequency. Given the code lengths, the actual codes are built by incrementing and left-shifting each code to the corresponding length; this last operation is well known from building Huffman trees and is also known as a compact way of storing trees, so only the generation of the code lengths constitutes the novelty of the algorithm. Doubling a frequency is a binary shift, the intermediate frequencies each have only one set bit, and the number of passes rarely exceeds 10 even for trees built over 256 symbols. All of this makes code-length generation extremely fast and allows frequent tree updates in adaptive encoding, which more than compensates for the slight non-optimality of Polar coding compared to Huffman coding.
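The steps above can be sketched in a few lines. This is an illustration written from the description, not the author's reference code; the function names are invented here, and the canonical-code construction is the standard one used with Huffman code lengths:

```python
def polar_code_lengths(freqs):
    """Generate Polar code lengths for frequencies sorted in
    descending order, following the steps described above."""
    total = sum(freqs)
    target = 1
    while target < total:          # round the sum up to a power of two
        target <<= 1               # 39 -> 64, the "target total"
    work = [1 << (f.bit_length() - 1) for f in freqs]  # round down: 15 -> 8
    s = sum(work)
    while s < target:
        # one top-to-bottom pass: double whatever still fits
        for i in range(len(work)):
            if s + work[i] <= target:
                s += work[i]
                work[i] <<= 1
    # code length = bit-length difference between target and frequency
    return [target.bit_length() - f.bit_length() for f in work]

def canonical_codes(lengths):
    """Build canonical prefix codes (as bit strings) from code
    lengths sorted in non-decreasing order: increment the previous
    code and left-shift it to the new length."""
    codes, code, prev = [], 0, lengths[0]
    for n in lengths:
        code <<= (n - prev)        # extend to the new length
        codes.append(format(code, '0{}b'.format(n)))
        code += 1
        prev = n
    return codes
```

Running `polar_code_lengths([15, 7, 6, 6, 5])` reproduces the table above: the rounded list 8,4,4,4,4 becomes 16,8,8,8,8 after the first doubling pass and 32,8,8,8,8 after the second, giving the lengths 1,3,3,3,3 and the codes 0, 100, 101, 110, 111.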

History

The algorithm was published in July 2010 and is backed by a DEMO program, which offers an additional optimization by adaptive resorting of symbols at every step.

Independent implementation

Although the method was introduced only recently, it has already been beta tested in the LZHAM algorithm.

Additional optimization by adaptive sorting

The advantage of the algorithm is the quick generation of code lengths. The trees, however, are replaced only periodically, after each relatively large fragment. To improve compression, the author's original DEMO program introduces the concept of floating leaves on the tree: adaptive sorting of symbols at every step according to their accumulated frequencies. Suppose that in the middle of an adaptation interval symbol B becomes more frequent than symbol A. The two symbols simply swap positions while the tree stays the same, so the most frequent symbol is already encoded with the shortest code in the middle of the adaptation interval. The same swap is reproduced during decoding, which keeps decoding exactly reversible. An additional advantage of this adaptive sorting, or floating leaves, is that at the point where the tree is recreated the symbols are already sorted, so no time is spent sorting the alphabet by frequency.
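The floating-leaf swap can be sketched as follows, under the assumption that symbols are kept in a list ordered by running count; the function name and data layout are illustrative, not taken from the DEMO program:

```python
def bump(order, counts, sym):
    """Record one occurrence of sym and let its leaf 'float': swap
    it toward the front while it has become strictly more frequent
    than its predecessor.  The code tree itself is untouched; only
    the mapping from symbols to leaves changes."""
    counts[sym] += 1
    i = order.index(sym)
    while i > 0 and counts[order[i]] > counts[order[i - 1]]:
        order[i], order[i - 1] = order[i - 1], order[i]
        i -= 1
```

Because both encoder and decoder call the same update after every symbol, they always agree on which symbol occupies which leaf, and the list is already sorted when the tree is next rebuilt.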

Limitations of the method

A Polar tree is subject to the same limitations in data compression as the other two binary prefix codings. It cannot possibly compress anything to less than 1 bit per symbol, and it may be far from the statistical limit established by entropy for data that could be compressed to a low entropy. As with Huffman coding and Shannon-Fano coding, applying Polar coding is justified for data with large alphabets whose entropy is at least above 4 bits/symbol. According to the description of LZHAM, Polar coding was applied to the literals left after application of the LZ algorithm. As is known, literals are not compressed very well and so represent good ground for applying any of these binary coding techniques.