Let’s use a little arithmetic to start this discussion. The song “Blackbird” by The Beatles is almost exactly 4 minutes long. This is 240 seconds, and if it was converted into digital form, it would be sampled at a rate of 44,100 samples each second. This means that the song has 240*44100 = 10.6 million samples. But wait—it’s stereo, so double that number to 21.2 million samples. A typical sample is 16 bits, so this works out to 42.4 million bytes: 42 megabytes. The MP3 file for this song is typically 1.9 megabytes. How is that possible? By using a compression algorithm.
Data compression is all about ways to take, for example, 100 bytes of information and turn it into 10 bytes while losing none of the essential message. Of course, compressed data is incomprehensible just to look at and must be decompressed in order for it to be used. Data is often compressed before storing it in a file to reduce its footprint on the storage device, or before transmitting it along a communications channel to take better advantage of limited bandwidth.
The question of how a string of data bytes can be made shorter while losing no important information remains, and a simple example may be in order. Consider a cartoon image. These have a relatively small number of distinct but vivid colors, usually less than 10 colors and the color variation within any region is small. The example image in Figure 10.1 is in PNG form and is 23.2 Kbytes in size at 400 x 456 (= 182400) pixels. As raw data it would be a little over 182 Kbytes in size, and it would be 547 Kbytes if RGB color was used.
A simple compression technique that works is called run-length encoding. In its simplest form, data bytes are preceded by a count indicating how many repetitions of that value were encountered in the data. If there was a section of data as follows:
1 1 0 0 0 0 0 0 2 2 2 1 2 1 2 0 0 0 0 0 0 2 2 2 2 2
This section of data would be encoded as
In this case, the original data required 26 bytes and the compressed data required 17 bytes. The new data takes 65% of the space that the original does. This is not a large savings, but is probably worth the effort. It does depend heavily on the nature of the data.
Consider the image of Figure 10.1. The color areas are uniform and rather large, so this image would be an ideal candidate for run-length encoding. When writing the program, it is important to use a binary file and convert the value and count into unsigned bytes before writing them to the file. This is a new data type called an unsigned byte that was not discussed in Chapter 8, and it has the code “B.” Writing the count and value could be done in the following way:
s = pack(“BB”, n, v[1])
f.write(s)
The entire program run-length encodes the image, reads the image file, and collects identical pixels, counting them as they are collected, until a change in the pixel value occurs. Then the (count, value) pair is written to the file. The pair is written if 255 pixels have been collected, since that is the biggest number that can be counted in 8 bits. The result is a binary file of pairs of numbers (count, value) that represent the pixels in the image. As there are only two colors, the value can be 0 or 1, 0 being white and 1 being green; in general, there can be 256 distinct values. The encoding program looks like this:
from struct import * import pygame
def emit(v, n, f):
s = pack(“BB”, n, v[1])
f.write(s)
b1 = pygame.image.load (“bl.png”)
sz = bl.get size()
width = sz[0]
height = sz[1]
screen = pygame.display.set mode((width, height))
clock = pygame.time.Clock()
pygame.init()
FPS = 10
outf = open (“b1.txt”, “wb”)
count = 0
value = b1.get at((0,0))
for j in range (0, height):
for i in range (0, width):
if count ==255: # Largest possible count.
emit (value, count, outf)
count = 0
c = b1.get at((i, j))
if c == value: # Same as before
count = count + 1
else:
emit(value, count, outf)
count = 1
value = c
if count>0:
emit (value, count, outf)
outf.close()
while True:
clock.tick(FPS)
for event in pygame.event.get():
if event.type == pygame.QUIT:
quit()
screen.fill((180, 180, 180))
screen.blit(b1, (0,0))
pygame.display.update()
The decoding program reads pairs of unsigned bytes from the binary file, and creates pixels. A pair (12, 0) would be 12 white pixels, for instance. A pair (12, 1) could be 12 pixels of some other color, and this program writes the pixels so it decides what color that will be. It will read pairs and draw pixels, into an image of 400 columns and 456 rows, until all are accounted for. A program that does this (not the only one possible) is
inf = open (“b1.txt”, “rb”)
i = 0
j = 0
cols = 400
rows = 456
screen = pygame.display.set mode((cols, rows))
clock = pygame.time.Clock()
pygame.init()
FPS = 10
while True:
s = inf.read(2)
if len(s) <= 0:
break
c,v = unpack(“BB”, s)
print (c, v)
if v == 255:
clr = (255, 255, 255)
else:
clr = (123, 210, 0)
for k in range (0, c):
if i >= cols:
i = 0
j = j + 1
screen.set at ((i,j), clr)
i = i + 1
if j>= 456:
break
while True:
clock.tick(FPS)
for event in pygame.event.get():
if event.type == pygame.QUIT:
quit()
pygame.display.update()
The more complex the data is, meaning the more distinct values the data can take, the less useful this encoding method is. In some cases, it can make the file size larger that the raw data would have been. In the case of this particular image, the run-length encoded file is about 6 K bytes, as opposed to half a million bytes that would have been needed for the raw image, saved as pixels. Still, this serves as proof that it is possible to compress a data file without losing any information. There are, of course, many more algorithms that compress data to a greater extent and with fewer constraints.
1. Huffman Encoding
If a typical text file is examined carefully, the vast majority of the file consists of relatively few characters. As a general estimate, over 95% of the characters can be accounted for by between 25-30 distinct values. A coding scheme that took this into account would reduce the size of a text file, and perhaps it would generalize to other kinds of file. For example, in many files, the value 0 is the most common, and giving it a smaller representation than, say, 9 may reduce the overall file size.
This is not really a novel idea. The international Morse code is based on this idea and has been around for a long time, beginning in 1836. The most commonly used letters in English are shown in Table 10.1. In the Morse code, the letter “E” is represented by a single dot, the letter “T” is a single dash, and “A” is a dot followed by a dash. The most common letters have the smallest code representation, as a general rule. This is how the Huffman code is organized, too.
A Huffman code is constructed from the ground up, like a wall. The lower levels of the wall represent the least frequently used symbols, and have the greatest
number of bricks above them. The final code includes binary numbers, and the length of the code in bits for a symbol is related to the number of bricks above it. The wall is shaped like a pyramid, and is called a binary tree. It’s a very useful structure in general, but the description is restricted here to its use in Huffman codes.
As an example, consider the English text:
I think that at that time none of us quite believed in the Time Machine11 The characters occur in this particular text with the following frequencies:
The “leaves” (or nodes) at the bottom of the tree (it is drawn upside-down) contain the lowest frequency items, and so are placed first. Each two nodes in the tree have one node above them, straddling them, containing the sum of the frequencies of all nodes below. All characters are turned into nodes, and each also contains the number of occurrences of that letter. This collection of nodes is called a heap. Initially, all have only one character, but this changes.
The rule in building the tree is to pick the pair of nodes (initially characters) that sum to the smallest number and connect them using another node, one above them that has a left and right node. The first bricks, alphabetically, are “b” and “c,” both with a frequency of 1.
The bottom nodes have characters and counts. The one above has only a count, and it is the sum of the counts of the two nodes it is connected to. This new node, with a count of 2, is placed back in the heap and the nodes for B and C are removed. The heap always gets smaller.
Repeating this process with the others, the smallest pair we can make is with “d” and “f,” then “k” and “l,” and then “q” and “s.” At that point, the smallest node is “v” with a count of 1, but there are no more nodes with a count of 1. The smallest sum is 2, which uses “v” and “o”
All of these are in the heap, and a search is done for the smallest sum of nodes. The character “u” has a count of 2 and so do any of the nodes above that link to two other characters. These are nodes too, so link “u” with the leftmost node above to get a bigger grouping—this is called a subtree, because it is a tree, but it is also part of a bigger tree.
The “u” node and the others give a sum of 4.
The tree that is being built has the least commonly used characters placed at a greater distance from the top of the tree than are the frequently used characters. This distance is used to construct the codes, smaller for common characters.
The smallest sum of two nodes in the tree is 4, the nodes connecting “d,” “f,” “k,” and “l.”
The method takes the smallest two nodes, which are going to create the smallest sum, and connects them, removing the original nodes and replacing them with the new one. The smallest nodes now are the node connecting “q” and “2” (value 2), the node with “m” (value 3), and the node connecting “v” and “o” (value 3). The node with “m” will be selected to link to the 2-valued node. The tree is a disconnected collection of nodes.
So, what’s next? The smallest valued character remaining is “a” at 4. That would make the smallest sum 7 after connecting it with the subtree on the right (“v” and “o”). Next in the heap are the two 4-nodes above to create an 8, and linking “h” (5) and “n” (also 5) to get a 10.
The pattern should be clear by now. Notice that the nodes with nothing below them always consist of characters, and the nodes above have only numbers. However, the space characters were not counted, and they must be for the message to make any sense. There are 14 spaces in the message. The final sum is 14+9 for the space. A node for a space has to be added to the heap.
The last two steps don’t involve any new characters, but they will link all of the nodes together and make them accessible from one single node at the top. The final (top) node should have a value that is the length of the original string.
The tree that has been constructed will be used to construct the codes for each letter, and the length of each code is the number of nodes between the characters and the top (root) of the tree. The path to each left node is labeled with a digit, in this case a 0, and the path to the right nodes is labelled with a 1, as in the tree above. The code for any character is read off of the links that were followed to get from the top of the tree to the node containing the character. The space character, the most common one, is reached by going left two times; its code is 00. The “t” is the second most frequent character, and is reached from the top node by going right, then left, then right; the code is 101. The complete set of codes is as follows:
The coded message is the concatenation of all of the codes for the characters in the order they appear in the message. The encoded message is as follows:
This amounts to 259 bits = 33 bytes. The original string is 71 bytes long, so the compressed data is 46% of the size of the original data. The Huffman coded string is broken into 8-bit bytes and transmitted that way:
Decoding requires the table or the tree. If a known table is used, such as the natural frequencies of English letters, then it would not have to be transmitted along with the message. The use of a Python dictionary type makes the program for decoding very elegant indeed. Given the table and the message, bits are removed from the beginning of the message and placed into code string until they match one of the codes in the table. The Huffman code has the property that the bit sequences are unique when appended as a long message. The first bit sequence that matches a code is the code for the first letter in the message.
# Huffman decode
# This is the coded message:
bitstring = “01000101110001011010111100010111001111110100111111”+\
“0100101110011111101001010101110110000110111110111011000011110”+\
“1011101000110011100100111000011000101011000001101010001111101”+\
“0100111100100011100000101101001011100100001010101110110000111”+\
“011111101101111000101101100”
table = {} # This is the table of codes
table[’00’] = ” “
table[“11111”] = “A”
table[“011100”] = “D”
table[“111100”] = “V”
table[“101”] = “T”
table[“11101”] = “M”
table[“011101”] = “F”
table[“100”] = “E”
table[“01100”] = “U”
table[“011110”] = “K”
table[“010”] = “I”
table[“111101”] = “O”
table[“011111”] = “L”
table[“1100”] = “H”
table[“011010”] = “B”
table[“111000”] = “Q”
table[“1101”] = “N”
table[“011011”] = “C”
table[“111001”] = “S”
# Pull bits from the string making a substring until the
# substring is found in the dictionary. Then emit the
# character indexed.
# Loop until all bits are used while len(bitstring) > 0:
code = “” # Clear the current code
# While code NOT in the dictionary … while not (code in table):
# Add the next bit from the message code = code + bitstring[0]
# Remove that bit from the message bitstring = bitstring[1:]
# When the code matches, print the character corresponding to the code
print (table[code], end=””)
2. LZW Compression
Like many algorithms, LZW compression is named after the people who devised it: A. Lempel, J. Ziv, and Terry Welch. It has been the standard for data compression for many years, it was the method used in the GIF file format, and it was used in many versions of PDF. It is not the most effective method of compression, but it is lossless and efficient. Like the Huffman code, LZW creates a table from the original text and uses the codes in the table to perform the compression. Unlike the Huffman code, the decompression stage does not require that the table be known in advance; it builds the table as it decompresses the file. The LZW algorithm also replaces multiple characters with single codes, thus increasing the compression rate.
LZW compression usually begins with a known code table, most often the 256 ASCII characters, but any table known by the compressor and decompressor will work. As an example, another short section of text from The Time Machine will be compressed.
The Time Traveller for so it will be convenient to speak of him was expounding a recondite matter to us His grey eyes shone and twinkled and his usually pale face was flushed and animated The fire burned brightly and the soft radiance of the incandescent lights in the lilies of silver caught the bubbles that flashed and passed in our glasses.
Punctuation has been removed for simplicity. The algorithm begins with a table of characters, in this instance, the ones that appear in the quote, but in general, the table can contain any starting set of symbols. This is called the code table, and associates a numerical code with a string. The code table in this case consists of the letters (uppercase) and their values starting with 0: A=0, B=1, and so on. The space has to be included as well. The code sequence 024 is the string “ACE” using this scheme.
Naturally, there has to be more to this if it is to be a viable compression method. When encoding, the characters are examined one at a time and appended to an input string, and looked up in the table. If the string is found in the table, then the next character is read and appended to the string and it is looked up again. This repeats until the string is not found, at which point a few things happen: the code for the last string that was found is written to the output, the new string that was encountered in the string but not found in the table is added to the tables, and the process continues using the last character read in. This means that not only characters, but also short strings that occur in the text have numeric codes, and that the table will be created from the text that was given.
Consider the text in the example: The first character seen is “T”
- “T” exists in the table already, so a new character is read in and appended to the “T” to create the pair “TH.”
- “TH” is not in the table. The character “T” has the code 19, so 19 is written to the output file.
- The string “TH” is added to the table. It will be code 27.
- The input string is now “H.”
- The character “H” is in the table and has code 7. The next character is read in and appended to “H,” creating “HE.”
- “HE” is not in the table, so the code for character “H,” which is 7, is written to the output file.
- The string “HE” is added to the table, code 28.
- The input string is now “E”
The process repeats. If a multiple-character string is found in the table, then the steps are basically the same. Hypothetically,
- The character “T” is next and is in the table. Read the next character “H” and append to “T” to get “TH.”
- “TH” is in the table. Read the next character “E” and append to “T” to get “THE.”
- “THE” is not in the table to emit the code for “TH,” which is 27.
- Input string is now “E.”
Step 1 repeats until a string is obtained that has not been seen before. In the example here, the first 27 codes are letters and the space character. The next few codes are as follows:
The first 3-character string (trigram) in the table is “E T”
Python’s dictionary type is especially valuable for coding the LZW algorithm. The facility for looking up a string in a table is exactly what is required here. The critical part of the program could be written as follows:
When decoding the LZW file, the initial table is known. Again, this is often just the ASCII characters, but can be something else, and in this case is the letters plus the space. The file contains codes, not characters, but the codes are in the table, right? No, only the starting codes are in the table. Decoding the message in the example starts easily. The first few codes in the message are as follows:
19 7 4 26 19 8 12 29 19 17 0 21 …
The first code is read in and is the code for the letter “T.” This is followed by 7 (H) and 4 (E) and so on until the code 29 is reached. There is no entry for the code 29 in the table. This is where the really clever part of the LZW algorithm happens.
When decoding, the program builds the table again. After all, the characters are in the same order in the encoded data, so it should be possible to reproduce the process that was used to build the code table in the first place. When the first code is read in, the code is expected to be in the table, and the corresponding letter “T” is written and placed into a string. The next code is read and corresponds to “H.” Now “TH” is added to the dictionary, and “H” is written and becomes the current string. Now “E” is seen, “HE” is added to the table, and “E” is written, and so on. Again, a dictionary can be used to store the codes, but a list is more efficient. The indices are codes, which are numbers, so a list is fine here. The central part of the process is as follows:
A pseudocode summary of both the encoding and decoding processes is given in Figure 10.9, and working programs are provided on the disk (lzwe.py and lzwd.py). If punctuation is to be added, then a different conversion to uppercase would have to be done. For practical applications, the entire ASCII character set would be used at the outset.
Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.