Standard File Types in Python

Everyone’s computer has files on it that the owner did not create. Some have been downloaded; some merely came with the machine. It is common practice to associate specific kinds of files, as indicated initially by some letters at the end of the file name, with certain applications. A file that ends in “.doc,” for example, is usually a file created by Microsoft Word, and a file ending in “.mp3” is usually a sound file, often music. Such files have a format that is understood by existing software packages, and some of them (.gif) have been around for thirty years.

Each file type has been designed to make certain operations easy, and to pass certain information to the application. A set of de facto standards has evolved for how these files are laid out and for what data each kind of file provides.

1. Image Files

Images have been processed using computers since the 1960s, when NASA started processing images at the Jet Propulsion Laboratory. Scientists decided that having standards for computer images would be useful. The first formats were ad hoc, and based essentially on raw pixel data. Reading raw data means knowing the image size in advance, so headers were introduced providing at least that information, leading to the TARGA format (.tga) and TIFF (Tagged Image File Format) in the mid-1980s. As online services and then the World Wide Web became popular, GIF was introduced, which compressed the image data. This was followed by JPEG and other formats that could be used by Web designers and rendered by browsers, and each had a specific advantage. After all, reducing the file size meant reducing the time it took to download an image.

Many of the image file formats created in the 1980s are still being used. Some formats, like PNG (Portable Network Graphics), have been specifically designed for the Internet. Older ones (like JPEG) have found common uses in new technologies, like digital cameras.

2. GIF

The Graphics Interchange Format (GIF) is interesting. First, it uses compression to reduce the size of the file, but the compression method is not lossy, meaning that the image does not change after being compressed and then decompressed. The compression algorithm used is called LZW, which is discussed in Chapter 10. GIF uses a color map representation, so an element in the image is not a color, but instead is an index into an array that holds the color. That is, if v = image[row][column], then the color of that pixel is (red[v], green[v], blue[v]). The color itself could be a full 24 bits, but the value v is a byte, and so in a GIF there can only be 256 distinct colors. GIF uses a little-endian representation, meaning that the least significant byte of multi-byte objects comes first in the file.
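The color map lookup and the little-endian byte order can be sketched in a few lines. The palette and image values below are made up for illustration and do not come from a real GIF; the function name pixel_color is likewise hypothetical.

```python
# A hypothetical 4-entry color map; a real GIF table can hold up to 256 entries.
red   = [0, 255,   0, 255]
green = [0,   0, 255, 255]
blue  = [0,   0,   0, 255]

# A tiny 2x3 indexed image: each element is an index into the color map.
image = [[0, 1, 2],
         [3, 3, 1]]

def pixel_color(image, row, column):
    """Return the (R, G, B) color of one pixel by looking up its index."""
    v = image[row][column]
    return (red[v], green[v], blue[v])

# Little-endian: the least significant byte (0x20) comes first, so the
# two bytes 0x20 0x03 represent the number 0x0320.
width = int.from_bytes(b"\x20\x03", "little")

print(pixel_color(image, 1, 0))   # index 3 in the color map
print(width)
```

Because each pixel stores only a one-byte index, changing one entry in the color map recolors every pixel that refers to it.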

One advantage of the GIF is that one of the colors can be made transparent. This means that when this color is drawn over another, the color below shows through. It is essentially a “do not draw this pixel” value. It is important for things like sprites in computer games. Another advantage of GIF is that multiple images can be stored in a single file, allowing an animation to be saved in a single file. GIF animations have been common on the Internet for many years, and while they usually represent small, brief animations such as Christmas trees with flashing lights, they can be as long and complex as television programs. Still, the fact that there can only be 256 different colors can be a problem.

A GIF is a binary file, but the first six characters are a header block containing what is called a magic number, or an identifying label. For a GIF file the first three characters are always “GIF” and the next three represent the version; for the 1989 standard the first six characters are “GIF89a.” Magic numbers are common in binary files, and are used to identify the file type. The file name suffix does not always tell the truth.

Following the header is the logical screen descriptor, which explains how much screen space the image requires. This is seven bytes:

Canvas width              2 bytes
Canvas height             2 bytes
Packed byte               1 byte    A set of flags and small values
Background color index    1 byte
Pixel aspect ratio        1 byte

This is followed by the global color table, other descriptors, and the image data. The details can be found in manuals and online. The information in the first few bytes is critical, though, and the knowledge that LZW compression is used means that the pixels are not immediately available. Decompression is done to the image as a whole.

from struct import unpack

# Read the 6-byte header and the 7-byte logical screen descriptor
with open("test.gif", "rb") as f:
    s = f.read(13)

# "<" selects little-endian with no padding: 6s H H B B B
# The canvas width is stored before the canvas height.
sig, wd, ht, flags, bci, par = unpack("<6sHHBBB", s)
sig = sig.decode("ascii")
print(sig)
print("Height", ht, "Width", wd)
print("Flags:", flags)
print("Background color index:", bci)
print("Pixel aspect ratio:", par)
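The flags value read from the logical screen descriptor packs several small fields into one byte. The sketch below splits it apart using the bit layout given in the GIF89a specification; the function name decode_packed_byte and the dictionary keys are hypothetical, chosen only for this example.

```python
def decode_packed_byte(flags):
    """Split the packed byte of a GIF logical screen descriptor into its fields."""
    n = flags & 0b00000111                  # bits 0-2: size exponent of the table
    return {
        "has_global_color_table": bool(flags & 0b10000000),     # bit 7
        "color_resolution_bits": ((flags >> 4) & 0b111) + 1,    # bits 4-6, stored minus 1
        "sorted": bool(flags & 0b00001000),                     # bit 3
        "global_color_table_entries": 2 ** (n + 1),
    }

# 0xF7 is a value often seen in 256-color GIFs; it is used here only as sample input.
info = decode_packed_byte(0xF7)
print(info)
```

The number of table entries matters because the global color table follows the descriptor directly and occupies three bytes (red, green, blue) per entry.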

3. JPEG

A JPEG image uses a lossy compression scheme, and so the image is not the same after compression as it was before compression. For this reason, it should never be used for scientific or forensic purposes when measurements will be made using the image. It should never be used for astronomy, for example, although it is perfectly fine for portraits and landscape photographs.
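What “lossy” means can be seen with a toy example. The sketch below is not the JPEG algorithm; it only shows how a quantization step, which JPEG applies to frequency coefficients rather than to raw samples, discards information that cannot be recovered. The sample values and the step size of 8 are arbitrary.

```python
def quantize(values, step):
    """Lossy step: keep only the nearest multiple of `step` for each value."""
    return [round(v / step) for v in values]

def dequantize(codes, step):
    """Rebuild approximate values from the stored codes."""
    return [c * step for c in codes]

original = [11, 13, 200, 201, 202, 90]
restored = dequantize(quantize(original, 8), 8)

print(original)
print(restored)
print([a - b for a, b in zip(original, restored)])   # per-sample error
```

The restored values are close to the originals but not equal, which is why measurements taken from a lossy image cannot be trusted.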

The name JPEG is an acronym for the Joint Photographic Experts Group, the committee that defined the standard, and it refers to the compression algorithm. The file format is an envelope that contains the image, and it is referred to as JFIF (JPEG File Interchange Format). The file header contains 20 bytes. The first 4 bytes are hex FF, D8, FF, and E0. Bytes 6-10 (counting from 0) are “JFIF\0,” and this is followed by a revision number. A short program that decodes the header is as follows:

from struct import unpack

with open("test.jpg", "rb") as f:
    s = f.read(20)                 # Read the header

# JPEG integers are big-endian, so ">" is needed: B B B B H 5s B B B H H B B
b1, b2, a1, a2, sz, ident, v1, v2, unit, xd, yd, xt, yt = unpack(">BBBBH5sBBBHHBB", s)
ident = ident.rstrip(b"\0").decode("ascii")    # Drop the trailing zero byte
print(ident, "revision", v1, v2)
if b1 == 0xff and b2 == 0xd8:
    print("SOI checks.")
else:
    print("SOI fails.")
if a1 == 0xff and a2 == 0xe0:
    print("Application marker checks.")
else:
    print("Application marker fails.")
print("App 0 segment is", sz, "bytes long.")
if unit == 0:
    print("No units given.")
elif unit == 1:
    print("Units are dots per inch.")
elif unit == 2:
    print("Units are dots per centimeter.")
if unit == 0:
    print("Aspect ratio is", xd, ":", yd)
else:
    print("Xdensity:", xd, "Ydensity:", yd)
if xt == 0 and yt == 0:
    print("No thumbnail")
else:
    print("Thumbnail image is", xt, "x", yt)

The compression scheme used in JPEG is very involved, but it does cause certain identifiable artifacts in an image. In particular, pixels near edges and boundaries are smeared, essentially averaging the values across small regions (Figure 8.1). This can cause problems if a JPEG image is to be edited in Photoshop or Paint.

4. TIFF

The Tagged Image File Format has a potentially large amount of metadata associated with it, much of it stored as text in the file. The device used to capture the image, the focal length of the lens, the time, the subject, and scores of other items can accompany the image. In fact, TIFF has been co-opted for use with numeric non-image data as well. The other reason it is popular is that it can be used with uncompressed (raw) data.

The word Tagged comes from the fact that information is stored in the file using tags, such as might be found in an HTML file—except that the tags in a TIFF are not in text form. A tag has four components: an ID (2 bytes, what tag is this?), a data type (2 bytes, what type are the items in this tag?), a data count (4 bytes, how many items?), and a byte offset (4 bytes, where are these items?). Tags are identified by number, and each tag has a specific meaning. Tag 257 means Image Height and 256 is Image Width; 315 is the code meaning Artist, 306 means Date/Time, and 270 is the Image Description. They can be in any order. In fact, the whole file structure is flexible because all components are referenced using a byte offset into the file.
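The four components of a tag map directly onto a 12-byte struct format. The sketch below decodes one tag and attaches a name for the tag numbers mentioned above; the sample tag is made up, and parse_tag and TAG_NAMES are hypothetical names used only for this illustration.

```python
from struct import pack, unpack

# Names for the tag numbers mentioned in the text.
TAG_NAMES = {256: "Image Width", 257: "Image Height",
             270: "Image Description", 306: "Date/Time", 315: "Artist"}

def parse_tag(raw, order="<"):
    """Decode one 12-byte tag; `order` is "<" for little-endian, ">" for big-endian."""
    tag_id, data_type, count, offset = unpack(order + "HHII", raw)
    return {"id": tag_id, "name": TAG_NAMES.get(tag_id, "unknown"),
            "type": data_type, "count": count, "offset": offset}

# A made-up little-endian tag: ID 256, data type 3, one item, offset field 640.
sample = pack("<HHII", 256, 3, 1, 640)
tag = parse_tag(sample)
print(tag)
```

Passing the byte order as a parameter keeps the same function usable for both “II” and “MM” files.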

A TIFF begins with an 8-byte Image File Header (IFH):

Byte order:                         2 bytes, “II” if the data is in little-endian form and “MM” if it is big-endian.
Version number:                     2 bytes, always 42.
First Image File Directory offset:  4 bytes, the offset in the file of the first IFD.

The other important part of a TIFF is the Image File Directory (IFD), which contains information about the specific image, including the descriptive tags and data. The IFH is always 8 bytes long and is at the beginning of the file. An IFD can be almost any size and can be anywhere in the file; there can be more than one, as well. The first IFD is found by positioning the file to the offset found in the IFH. Subsequent ones are indicated in the IFD. The IFD structure is as follows:

Number of tags:   2 bytes
Tags:             Array of 12-byte tags; the size depends on the number of tags
Next IFD offset:  4 bytes. File offset of the next IFD. If there are no more, then this is 0.

The structure of a tag was given previously, so a TIFF file is now defined. The image data can be, and frequently is, raw pixels, but can also be compressed in many ways as defined by the tags.

The program below reads the IFH and the first IFD, dumping the information to the screen:

# TIFF
from struct import unpack

with open("test.tif", "rb") as f:
    s = f.read(8)                          # Read the IFH
    ident = s[:2].decode("ascii", "replace")
    print("TIFF ID is", ident, end=" ")
    if ident == "II":
        print("which means little-endian.")
        order = "<"
    elif ident == "MM":
        print("which means big-endian.")
        order = ">"
    else:
        print("which means this is not a TIFF.")
        raise SystemExit(1)
    # The byte order prefix also disables padding: H I
    ver, off = unpack(order + "HI", s[2:])
    print("Version", ver)
    print("Offset", off)
    f.seek(off)                            # Get the first IFD
    (n,) = unpack(order + "H", f.read(2))  # Number of tags
    for i in range(n):
        s = f.read(12)                     # Read a tag
        tag, dt, dc, do = unpack(order + "HHII", s)
        print("Tag", tag, "type", dt, "count", dc, "Offset", do)

When this program executes using “test.tif” as the input file, the first two tags in the IFD are 256 and 257 (width and height), which are correct.
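A file can hold more than one IFD, and each IFD ends with the offset of the next. The following sketch follows that chain and collects the tag numbers of every IFD. To keep it self-contained it builds a tiny, made-up TIFF in memory with two IFDs and no pixel data; list_ifds is a hypothetical helper name.

```python
import io
from struct import pack, unpack

def list_ifds(f):
    """Return a list of tag-ID lists, one per IFD, following the next-IFD offsets."""
    f.seek(0)
    header = f.read(8)
    order = {b"II": "<", b"MM": ">"}.get(header[:2])
    if order is None:
        raise ValueError("not a TIFF")
    version, offset = unpack(order + "HI", header[2:])
    if version != 42:
        raise ValueError("not a TIFF")
    ifds, seen = [], set()
    while offset != 0 and offset not in seen:    # offset 0 ends the chain
        seen.add(offset)                         # guard against looping files
        f.seek(offset)
        (count,) = unpack(order + "H", f.read(2))
        tags = [unpack(order + "HHII", f.read(12))[0] for _ in range(count)]
        (offset,) = unpack(order + "I", f.read(4))
        ifds.append(tags)
    return ifds

# Header (8 bytes), then an IFD with two tags at offset 8 (30 bytes long),
# then an IFD with one tag at offset 38 whose next-IFD offset is 0.
ifd1 = (pack("<H", 2) + pack("<HHII", 256, 3, 1, 4)
        + pack("<HHII", 257, 3, 1, 2) + pack("<I", 38))
ifd2 = pack("<H", 1) + pack("<HHII", 256, 3, 1, 8) + pack("<I", 0)
fake = io.BytesIO(pack("<2sHI", b"II", 42, 8) + ifd1 + ifd2)

print(list_ifds(fake))
```

The same function should work on a real file opened with open(name, "rb"), since it only relies on seek and read.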

5. PNG

A PNG (Portable Network Graphics) file consists of an 8-byte signature and a collection of chunks, which resemble TIFF tags. There are 18 different kinds of chunk, the first of which is an image header. The signature is always 137 80 78 71 13 10 26 10. The bytes 80 78 71 are the letters “PNG.”

A chunk has either 3 or 4 fields: a length field, a chunk type, an optional chunk data field, and a check code, called a cyclic redundancy check or CRC, that is computed from the chunk type and chunk data and is used to detect errors.
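The chunk layout and its CRC can be tried out without a PNG file at hand. In the PNG specification the CRC is calculated over the chunk type and data fields, and it is the same CRC-32 that Python's zlib.crc32 computes. The sketch below assembles a made-up IHDR chunk and then checks it; make_chunk and chunk_is_valid are hypothetical helper names.

```python
import zlib
from struct import pack, unpack

def make_chunk(ctype, data):
    """Assemble a chunk: length, type, data, then a CRC over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return pack(">I", len(data)) + ctype + data + pack(">I", crc)

def chunk_is_valid(chunk):
    """Recompute the CRC of a complete chunk and compare it with the stored one."""
    (length,) = unpack(">I", chunk[:4])
    body = chunk[4:8 + length]                        # chunk type + chunk data
    (stored,) = unpack(">I", chunk[8 + length:12 + length])
    return zlib.crc32(body) & 0xFFFFFFFF == stored

# A made-up IHDR for a 2x2 image with 8 bits per sample and RGB color.
ihdr = make_chunk(b"IHDR", pack(">II5B", 2, 2, 8, 2, 0, 0, 0))
damaged = ihdr[:10] + b"\xff" + ihdr[11:]             # corrupt one data byte

print(chunk_is_valid(ihdr), chunk_is_valid(damaged))
```

A reader that finds a CRC mismatch knows the chunk was damaged, but not which byte is wrong.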

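The image header chunk (IHDR) has the following structure:

Width                 4 bytes
Height                4 bytes
Bit depth             1 byte
Color type            1 byte
Compression method    1 byte
Filter method         1 byte
Interlace method      1 byte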

This file has compression, but it is non-lossy. It also, like GIF, allows transparency, but also allows full RGB color. It does not have an option for animations, though. Reading the signature and the first (IHDR) chunk is done in the following way:

# PNG
from struct import unpack

signature = (137, 80, 78, 71, 13, 10, 26, 10)     # Correct header
types = ("Grey", "", "RGB", "Color map", "Grey with alpha", "", "RGBA")   # Color types

with open("test.png", "rb") as f:
    s = f.read(8)                                 # Read the header
    if unpack("8B", s) == signature:
        print("Header OK")
    else:
        print("Bad header")
    s = f.read(8)                                 # The next chunk must be the IHDR
    length, name = unpack(">I4s", s)              # Unpack the first 8 bytes
    print("First chunk: Length is", length, "Type:", name.decode("ascii"))
    s = f.read(length)                            # We know the length, read the chunk

# PNG integers are big-endian: I I B B B B B
wd, ht, dep, ctype, compress, filt, interlace = unpack(">II5B", s)
print("PNG Image width=", wd, "Height=", ht)
print("Image has", dep, "bits per sample.")
print("Color type is", types[ctype])
if compress == 0:
    print("Compression OK")
else:
    print("Compression should be 0 but is", compress)
if filt == 0:
    print("Filter is OK")
else:
    print("Filter should be 0 but is", filt)
if interlace == 0:
    print("No interlace")
elif interlace == 1:
    print("Adam7 interlace")
else:
    print("Bad interlace specified:", interlace)

6. Sound Files

A sound file can be more complex than an image file and substantially larger. To properly play back a sound, it is critical to know how it was sampled: how many bits per sample, how many channels, how many samples per second, compression schemes, and so on. The file must be readable in real time or the sound cannot be played without a separate decoding step. By comparison, all that is really needed to display an image is its size, pixel format, and compression.

There are, once again, many existing audio file formats. MP3 is quite complex, too much so to discuss here. The usual option on a PC would be “.wav” and, as it happens, that format is not especially complicated.

7. WAV

A WAV file has three parts: the initial header, used to identify the file type; the format sub-chunk, which specifies the parameters of the sound file; and the data sub-chunk, which holds the sound data.

The initial header should contain the string “RIFF” followed by the size of the file minus 8 bytes (i.e., the size from this point forward), and the string “WAVE.” This is 12 bytes in size.

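The next sub-chunk has the following form:

Sub-chunk ID       4 bytes   The string “fmt ” (with a trailing space)
Sub-chunk size     4 bytes   Size of the rest of this sub-chunk (16 for PCM)
Audio format       2 bytes   1 means PCM (uncompressed)
Channels           2 bytes
Sample rate        4 bytes   Samples per second
Byte rate          4 bytes   Sample rate × channels × bits per sample / 8
Block alignment    2 bytes
Bits per sample    2 bytes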

A program that reads the first two sub-chunks is as follows:

# WAV
from struct import unpack

with open("test.wav", "rb") as f:
    s = f.read(12)                  # Initial header
    riff, sz, wave_id = unpack("<4sI4s", s)
    print(riff.decode("ascii"), sz, "bytes", wave_id.decode("ascii"))
    s = f.read(24)                  # Format sub-chunk

# WAV data is little-endian: 4s I H H I I H H
ident, sz1, fmt, nchan, rate, byte_rate, algn, bps = unpack("<4sIHHIIHH", s)
print("ID is", ident.decode("ascii"), "Channels", nchan, "Sample rate is", rate)
print("Bits per sample is", bps)
if fmt == 1:
    print("File is PCM")
else:
    print("File is compressed", fmt)
print("Byte rate was", byte_rate, "should be", rate * nchan * bps // 8)
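If no WAV file is at hand, one can be produced with the wave module from Python's standard library and then examined with struct. The sketch below writes one second of silence into memory and decodes the two headers; the sample parameters (mono, 16 bits, 8000 samples per second) are arbitrary choices for the example.

```python
import io
import wave
from struct import unpack

# Write one second of silence: mono, 16 bits per sample, 8000 samples per second.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)               # bytes per sample
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)

data = buffer.getvalue()

# Decode the 12-byte initial header and the 24-byte format sub-chunk.
riff, size, kind = unpack("<4sI4s", data[:12])
(chunk_id, chunk_size, audio_format, channels,
 rate, byte_rate, align, bits) = unpack("<4sIHHIIHH", data[12:36])

print(riff, size, kind)
print(chunk_id, channels, rate, bits)
print(byte_rate == rate * channels * bits // 8)
```

For everyday work the wave module can also read these parameters directly through wave.open(name, "rb"), which avoids parsing the headers by hand.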

8. Other Files

Every type of file has a specific purpose and a format that is appropriate for that purpose. For that reason, the nature of the headers and the file contents differ. When a program is asked to open a file, there should be some way to confirm that the contents of the file can be read by the program. The code that has been presented so far is only sufficient to determine the file type and some of its basic parameters. The code needed to read and display a GIF, for example, would likely be over 1,000 lines long. It is important to see how to construct a file so that it can be used effectively by others and so that other programmers can create code that can identify that file and use it.
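One way to confirm what a file contains is to compare its first bytes with a table of known magic numbers. The sketch below does this for the formats described here; it is a minimal illustration rather than a complete detector, and the names SIGNATURES, identify, and identify_file are hypothetical.

```python
# Leading bytes of the formats described here.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),           # little-endian, version 42
    (b"MM\x00*", "TIFF"),           # big-endian, version 42
    (bytes([137, 80, 78, 71, 13, 10, 26, 10]), "PNG"),
    (b"MZ", "EXE"),                 # 0x4D 0x5A
]

def identify(data):
    """Guess a file type from its first bytes; return None if nothing matches."""
    # WAV needs two checks because the file size sits between the two strings.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAV"
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None

def identify_file(path):
    """Read just enough of a file to test every signature above."""
    with open(path, "rb") as f:
        return identify(f.read(12))

print(identify(b"GIF89a" + bytes(7)))
print(identify(b"plain text"))
```

A match only says what the file claims to be; the rest of the header still has to be checked before the contents are trusted.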

9. HTML

An HTML (HyperText Markup Language) file is one that is recognized by a browser and can be displayed as a Web page. It is a text file, and can be edited, saved, and redisplayed using simple tools.

The first line of text in an HTML file should be either

<!DOCTYPE html>

or

<html>

The problem is that these are text files, so spaces and tabs and newlines can appear without affecting the meaning. Browsers are also supposed to be somewhat forgiving about errors, displaying the page if at all possible. A simple example that shows some of the problems while being largely correct is as follows:

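import webbrowser

# Assumed check, since the start of this listing is missing: the file is
# treated as HTML if it begins with one of the two lines shown above.
with open("other.html") as f:
    first = f.read(100).lstrip().lower()
html = first.startswith(("<!doctype html", "<html"))
if html:
    webbrowser.open_new_tab("other.html")
else:
    print("This is not an HTML file.")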

This program uses the webbrowser module of Python to display the Web page if it is one. The call webbrowser.open_new_tab("other.html") opens the page in a new tab, if the browser is open. This module is not a browser itself. It simply opens an existing installed browser to do the work of displaying the page.

10. EXE

This is a Microsoft executable file. The details of the format are challenging to understand, and require a knowledge of computers and formats beyond a first-year level, but detecting one is relatively simple. The first two bytes that identify an EXE file are as follows:

Byte 0: 0x4D

Byte 1: 0x5A

It is always possible that the first two bytes of a file will be these two by accident, but it is unlikely. If the file being examined is, in fact, an EXE file, then a Python program can execute it. This uses the operating system interface module os:

import os

os.system("program.exe")

 

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.
