jueves, 11 de abril de 2013

Huffman coding

For this week we need to implement the tree-based Huffman coding for Unicode strings, implementing the tree structure, then we need to do a report with a design of an experiment to determine the worst-case and typical-case complexity and compression ratio.

Determining the worst-case and typical-case

As we know, fibonacci sequence are the sum of the previous two, knowing at first that Fo = 0 and F1 = 1, this sequence produces a tree that are the most unbalanced, because if we know that huffman are base in the last two frequencies which produce another node that are the sum of both, so, if we have a fibonacci sequence as a frequency of the letters that we would like to test, we can have this type of tree.

For example, let say that we have these case, when we have only 7 letter, just for know how works.

1, 1, 2, 3
a, b, cc, ddd

Alphabet and probability:
a : 0.142857 = 1/7
b : 0.142857 = 1/7
c : 0.285714 = 2/7
d : 0.428571 = 3/7


                  [1] d
[] cabd
                                 [1] b
                      [1] ab
                                 [0] a
           [0] cab
                      [0] c

The same:


             /        \
           d        cab
                     /    \
                    c   ab
                        /  \   
                        a  b

We can see that the tree have only new nodes just for one side, because we made the tree having the sum of the probabilities of frequency, and have a pattern that builds the tree in that way, also we can say that if the have in all the probabilities the same number, would have the same pattern.

A typical case is when we have a frequencies that are not the same for all the text or that don't have the pattern of fibonacci, so I made an experiment for the type that I think is interesting, I made a different graphics that can shows that these type of cases are bad compare to the average case.

We have in this graphic the sum of bits that need to handle with a entire alphabet, using ascii, without compress the bits in the line yellow, the average using huffman compression and then using fibonacci sequence frequency and compression with huffman, with a sample of 35 strings with different size.

In this graphic, we can see the compress ratio between the average-case and the worst-case, we can see that clearly are much difference between them, having a difference about 62% more compress than the strings that have a fibonacci sequence.


1 comentario:

  1. Está chido lo de Fibonacci. Para lo de caso típico, textos de verdad tendrían sentido. 7+7 pts.