Advanced Data Files in Python

File operations were discussed in Chapter 5, but the discussion was limited to files containing text. Text is crucial because it is how humans communicate with the computer. However, text files take up more space than is needed to hold the information they contain. Each character requires at least one byte. The number 3.1415926535 thus takes up 12 bytes as text, but if it is stored as a floating point number, it needs only 4 or 8, depending on the precision.
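The byte counts above can be checked directly. The following is a small sketch, not part of the original text, that uses the struct module (covered in Section 2) to compare the text form of the number with its single and double precision forms; the variable names are chosen only for illustration.

```python
import struct

text = "3.1415926535"

as_text = text.encode("ascii")              # one byte per character
as_float = struct.pack("f", float(text))    # single precision
as_double = struct.pack("d", float(text))   # double precision

print(len(as_text), len(as_float), len(as_double))   # prints: 12 4 8
```

The single precision form is smaller, but it cannot hold all of the digits shown; the double precision form can.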

The file system on most computers also permits a variety of operations that have not been discussed. This includes reading from any point in a file, appending data to files, and modifying data. The need for processing data effectively is a main reason for computers to exist at all, so it is important to know as much as possible about how to program a computer for these purposes.

1. Binary Files

A binary file is one that does not contain text, but instead holds the raw, internal representation of its data. Of course, all files on a computer disk are binary in a sense, because they all contain numbers in binary form, but a binary file in this discussion does not contain information that can be read by a human. Binary files can be more efficient than other kinds, both in file size (smaller) and the time it takes to read and write them (less). Many standard file types, such as MP3, exist as binary files, so it is important to understand how to manipulate them.

Example: Create a File of Integers

The array type holds data in a form that is more natural for most computers than a list, and also has the tofile() method built in. If a collection of integers is to be written as a binary file, the first step is to place them into an array. If a set of 10,000 consecutive integers is to be written to a file named “ints,” begin by importing the array class and opening the output file. Notice that the file is opened in “wb” mode, which means “write binary”:

from array import array

out = open('ints', 'wb')

Now create an array to hold the elements and fill the array with the consecutive integers:

arr = array('i')            # 'i' means signed integer elements
for k in range(10000, 20000):
    arr.append(k)

Finally, write the data in the array to the file:

arr.tofile(out)

out.close()

This file has a size of 40 KB on a PC (10,000 integers at 4 bytes each). A file with the same integers written as text is 49 KB. This is not exactly a large space saving, but it does add up.
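The binary file size can be confirmed from within Python. The sketch below is an illustration rather than part of the original program: it writes the same 10,000 integers to a file inside a temporary directory (an assumption made so that no existing file is overwritten) and compares the size on disk with the number of elements multiplied by the size of one element.

```python
import os
import tempfile
from array import array

# Build the array directly from the range of consecutive integers.
arr = array('i', range(10000, 20000))

# A temporary directory is used so the sketch does not overwrite
# any existing file named "ints".
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'ints')
    with open(path, 'wb') as out:
        arr.tofile(out)
    binary_size = os.path.getsize(path)

# itemsize is the number of bytes used by one array element.
expected_size = len(arr) * arr.itemsize
print(binary_size, expected_size)
```

On a typical PC, where an 'i' element occupies 4 bytes, both values are 40000.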

Reading these values back is simple:

inf = open('ints', 'rb')
arrin = array('i')
for k in range(0, 10001):
    try:
        arrin.fromfile(inf, 1)   # read one integer from the file
    except EOFError:             # raised when no more data remains
        break
    print(arrin[k])
inf.close()

The try statement catches the end-of-file error (EOFError) that fromfile() raises when the file runs out of data, which is useful in cases where the number of items in the file is not known in advance.
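When the number of items is not known, another option is to work it out from the file size. The following sketch is one possible alternative to the loop above, not the method used in the text: it divides the file size by the size of one element and reads everything with a single fromfile() call. The temporary directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'ints')

    # Create the file of integers described earlier.
    with open(path, 'wb') as out:
        array('i', range(10000, 20000)).tofile(out)

    # Work out how many integers the file holds, then read them all.
    arrin = array('i')
    count = os.path.getsize(path) // arrin.itemsize
    with open(path, 'rb') as inf:
        arrin.fromfile(inf, count)

print(len(arrin), arrin[0], arrin[-1])   # prints: 10000 10000 19999
```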

Sometimes a binary file contains data that is all of the same type, but that situation is not very common. It is more likely that the file has strings, integers, and floats intermixed. Imagine a file of data for bank accounts or magazine subscriptions; the information includes names and addresses, dates, financial values, and optional data, depending on the situation (some customers have multiple accounts). By using structs, we can create binary files that contain more than one kind of information.

2. The Struct Module

The struct module permits variables and objects of various types to be converted into what amounts to a sequence of bytes. It is a common claim that this is in order to convert between Python forms and C forms, because C has a struct type (short for structure). However, many files exist that consist of mixed-type data in raw (i.e., machine compatible) form that have been created by many programs in many languages. It is possible that C is singled out because the name struct was used.

Example: A Video Game High Score File

Video game players need little incentive to try hard to win a game, but for many years, a special reward used to be given to the better players. The game “remembered” the best players and listed them at the beginning and end of the game. This kind of ego boost is a part of the reward system of the game. The game program stored the information in a file in descending order of score. The data that was saved was usually the player’s name or initials, the score, and the date. This mixes string data with numeric data.

Consider that the player’s name is held in a variable name, the score is an integer score, and the date is a set of three strings year, month, and day. In this situation, the size of each value needs to be fixed, so allow 32 characters for the name, 4 for year, 2 for month, and 2 for day. The file was created with the name first, then the score, then the year, month, and day. The order matters because it will be read in the same order that it was written. In the file, the data appears as follows:

cccccccccccccccccccccccccccccccc iiii  cccc cc    cc
Player's name                    Score Year Month Day

Each letter in the first string represents a byte in the data for this entry. A “c” represents a character; an “i” represents a byte that is part of an integer. There are 44 bytes in all, which is the size of one data record. A record is what one set of related data is generally called. A file contains the records for all of the elements in the data set, and in this case, a record is the data for one player, or at least one time that the player played the game. There can be multiple entries for a player.

One way to convert mixed data like this into a struct is to use the pack() function. It takes a format parameter first, which indicates what the struct will consist of in terms of bytes. Then the values are passed that will be converted into components of the final struct. For the example here, the call to pack() is as follows:

s = pack("32si4s2s2s", name, score, year, month, day)

The format string is 32si4s2s2s; there are 5 parts to this, one for each of the values to be packed:

32s   is a 32-character long string. It should be of type bytes.

i     is one integer. However, 2i would be two integers, and 12i is 12 integers.

4s    is a 4-character long string.

2s    is a 2-character long string (used twice, for the month and the day).

Other important format items are as follows:

c     is a character

f     is a float

d     is a double precision float

The value returned from pack() has the type bytes, and in this case, it is 44 bytes long. The high score file consists of many of these records, all of which are the same size. A record can be written to a file using write(). A program that writes just one such record is as follows:
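The record size can be checked without writing a file. The sketch below is an illustration under the same assumptions as the text (the format string and the sample values are the ones used in this section): calcsize() reports how many bytes a format string describes, and slicing the packed result shows the zero bytes that pack() uses to fill out a string that is shorter than its field.

```python
import struct

FORMAT = "32si4s2s2s"

name = "Jim Parker".encode("UTF-8")
record = struct.pack(FORMAT, name, 109800, b"2015", b"12", b"26")

print(struct.calcsize(FORMAT))   # size described by the format string
print(len(record))               # size of the packed record
print(record[:12])               # the name, then the start of the padding
```

On common platforms both sizes are 44. Because the format string uses the machine's native layout, it is safer for a program to call calcsize() than to assume a particular number.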

from struct import *

f = open("hiscores", "wb")
name = bytes("Jim Parker", 'UTF-8')
score = 109800
year = b"2015"
month = b"12"
day = b"26"
s = pack("32si4s2s2s", name, score, year, month, day)   # one 44-byte record
f.write(s)
f.close()

Reading this file involves first reading the string of bytes that represents a data record. Then it is unpacked, which is the reverse of what pack() does. The unpack() function takes a format string as the first parameter, the same kind of format string as pack() uses, and the bytes to be unpacked as the second. It returns a tuple containing one value for each item in the format string, and these can be assigned to variables. An example that reads the record in the above code is as follows:

from struct import *

f = open("hiscores", "rb")
s = f.read(44)                                  # one 44-byte record
f.close()
name, score, year, month, day = unpack("32si4s2s2s", s)
name = name.rstrip(b"\x00").decode("UTF-8")     # drop the zero-byte padding
year = year.decode("UTF-8")
month = month.decode("UTF-8")
day = day.decode("UTF-8")

The string fields returned by unpack() are of type bytes, and need to be converted into strings before being used in most cases; the score is returned as an ordinary integer. Note the input mode on the open() call is “rb,” or “read binary.”
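A high score file normally holds many records, so the read is usually placed in a loop. The following is a minimal sketch of one way to do that; the function name read_scores, the second sample player's score and date, and the use of a temporary directory are all assumptions made for illustration.

```python
import os
import struct
import tempfile

FORMAT = "32si4s2s2s"
SIZE = struct.calcsize(FORMAT)

def read_scores(path):
    """Return a list of (name, score, date) tuples from a high score file."""
    results = []
    with open(path, "rb") as f:
        while True:
            data = f.read(SIZE)
            if len(data) < SIZE:          # end of file or a truncated record
                break
            name, score, year, month, day = struct.unpack(FORMAT, data)
            name = name.rstrip(b"\x00").decode("UTF-8")
            date = b"-".join((year, month, day)).decode("UTF-8")
            results.append((name, score, date))
    return results

# Sample records; the second one is invented for this example.
sample = [(b"Jim Parker", 109800, b"2015", b"12", b"26"),
          (b"Arlen Franks", 98000, b"2016", b"01", b"03")]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hiscores")
    with open(path, "wb") as f:
        for rec in sample:
            f.write(struct.pack(FORMAT, *rec))
    scores = read_scores(path)

print(scores)
```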

A file in this format has been provided, named “hiscores.” When a player plays the game, they will enter their name; the computer knows their score and the date. A new entry must be made in the “hiscores” file with this new score in it. How is that done?

Start with the new player data for Karl Holter, with a score of 100,000. To update the file, it is opened and records are read and written to a new temporary file (named “tmp”) until one is found that has a smaller score than the 100,000 that Karl achieved. Then Karl’s record is written to the temporary file, and the remainder of “hiscores” is copied there. This creates a new file named “tmp” that has Karl’s data added to it in the correct place. Now that file can be copied to “hiscores” replacing the old file, or the file named “tmp” can be renamed as “hiscores.” This is called a sequential file update.

Renaming the file requires access to some of the operating system functions in the module os, in particular,

os.rename("tmp", "hiscores")
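The sequential update described above might be sketched as follows. This is an illustration under stated assumptions, not the book's own program: the function names insert_score and scores_in are hypothetical, the dates and Arlen Franks' score are invented, and the temporary file is named by adding ".tmp" to the path rather than being called "tmp". The sketch uses os.replace() instead of os.rename() because os.replace() overwrites an existing destination file on every platform, whereas os.rename() raises an error on Windows if the destination exists.

```python
import os
import struct
import tempfile

FORMAT = "32si4s2s2s"
SIZE = struct.calcsize(FORMAT)

def insert_score(path, new_record):
    """Insert a packed record into a file kept in descending score order."""
    new_score = struct.unpack(FORMAT, new_record)[1]
    tmp_path = path + ".tmp"
    written = False
    with open(path, "rb") as old, open(tmp_path, "wb") as tmp:
        while True:
            data = old.read(SIZE)
            if len(data) < SIZE:
                break
            # The first record with a smaller score marks the insertion point.
            if not written and struct.unpack(FORMAT, data)[1] < new_score:
                tmp.write(new_record)
                written = True
            tmp.write(data)
        if not written:
            tmp.write(new_record)     # lowest score so far: goes at the end
    os.replace(tmp_path, path)        # the new file takes the old file's name

def scores_in(path):
    """Return just the scores, in file order."""
    with open(path, "rb") as f:
        return [rec[1] for rec in struct.iter_unpack(FORMAT, f.read())]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hiscores")
    with open(path, "wb") as f:
        f.write(struct.pack(FORMAT, b"Jim Parker", 109800, b"2015", b"12", b"26"))
        f.write(struct.pack(FORMAT, b"Arlen Franks", 98000, b"2016", b"01", b"03"))
    karl = struct.pack(FORMAT, b"Karl Holter", 100000, b"2016", b"02", b"14")
    insert_score(path, karl)
    order = scores_in(path)

print(order)   # prints: [109800, 100000, 98000]
```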

3. Random Access

It seems natural to begin reading a file from the beginning, but that is not always necessary. If the data that is desired is located at a known place in the file, then the location being read from can be set to that point. This is a natural consequence of the fact that disk devices can be positioned at any location at any time. Why not files too?

The function that positions the file at a specific byte location is seek():

f.seek(44)       # Position the file at byte 44, which is the
                 # second record in the hiscores file

It’s also possible to position the file relative to the current location:

f.seek(44, 1)    # Position the file 44 bytes from this location,
                 # which skips over the next record in the file

A file can be rewound so that it can be read over again by calling f.seek(0), which positions the file at the beginning. It is otherwise difficult to make use of this feature, unless the records in the file are of a fixed size, as they are in the file “hiscores,” or the information on record sizes is saved in the file. Some files are intended from the outset to be used as random access files. Those files have an index that allows specific records to be read on demand. This is very much like a dictionary, but on a file. Assuming that the score for player Arlen Franks is needed, the name is searched for in the index. The result is the byte offset for Arlen’s high score entry in the file.
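The idea of an index can be sketched with an ordinary dictionary. In the example below, the index is built in memory by scanning the file once, which is an assumption: the text does not say how or where the index is stored. The function names build_index and score_for and the placeholder players are hypothetical.

```python
import os
import struct
import tempfile

FORMAT = "32si4s2s2s"
SIZE = struct.calcsize(FORMAT)

def build_index(path):
    """Map each player's name to the byte offset of that player's record."""
    index = {}
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(SIZE)
            if len(data) < SIZE:
                break
            raw_name = struct.unpack(FORMAT, data)[0]
            index[raw_name.rstrip(b"\x00").decode("UTF-8")] = offset
            offset += SIZE
    return index

def score_for(path, index, name):
    """Seek directly to one player's record and return the score."""
    with open(path, "rb") as f:
        f.seek(index[name])
        return struct.unpack(FORMAT, f.read(SIZE))[1]

# Invented sample data: eight placeholder players, then Arlen Franks.
names = ["Player %d" % k for k in range(8)] + ["Arlen Franks"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hiscores")
    with open(path, "wb") as f:
        for k, player in enumerate(names):
            f.write(struct.pack(FORMAT, player.encode("UTF-8"),
                                200000 - 1000 * k, b"2016", b"01", b"03"))
    index = build_index(path)
    arlen_offset = index["Arlen Franks"]
    arlen_score = score_for(path, index, "Arlen Franks")

print(arlen_offset, arlen_score)
```

With 44-byte records, the offset printed for Arlen is 352, because eight records come before his.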

Arlen’s record starts at byte 352 (record number 8, counting from 0, multiplied by 44 bytes). He just played the game again and improved his score. Why not update his record on the file? The file needs to be open for input and output, so we use mode “rb+,” meaning open a binary file for input and output. Then we position the file to Arlen’s record, create a new record, and write that one record. This is a new approach, being able to both read and write the same file. However, if the data being written is exactly the same size as the record on the file, then no harm should come from it. The program is as follows:
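A minimal sketch of such a program, assuming the 44-byte record layout used throughout this section, might look like the following. The function name update_record, the placeholder players, and all of the scores and dates are invented for illustration.

```python
import os
import struct
import tempfile

FORMAT = "32si4s2s2s"
SIZE = struct.calcsize(FORMAT)

def update_record(path, offset, record):
    """Overwrite the record that starts at byte `offset` with `record`."""
    if len(record) != SIZE:
        raise ValueError("record must be exactly %d bytes" % SIZE)
    with open(path, "rb+") as f:      # binary file, input and output
        f.seek(offset)                # position at the start of the record
        f.write(record)               # replace exactly one record

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hiscores")

    # Eight placeholder records, then Arlen's record at offset 8 * SIZE.
    with open(path, "wb") as f:
        for k in range(8):
            f.write(struct.pack(FORMAT, b"Player", 200000 - 1000 * k,
                                b"2016", b"01", b"03"))
        f.write(struct.pack(FORMAT, b"Arlen Franks", 98000,
                            b"2016", b"01", b"03"))
    size_before = os.path.getsize(path)

    improved = struct.pack(FORMAT, b"Arlen Franks", 99500,
                           b"2016", b"03", b"09")
    update_record(path, 8 * SIZE, improved)
    size_after = os.path.getsize(path)

    with open(path, "rb") as f:
        f.seek(8 * SIZE)
        updated = struct.unpack(FORMAT, f.read(SIZE))

print(size_before, size_after, updated[1])
```

The file size is the same before and after the update, because one record was overwritten by another of identical length.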

This works fine, provided that the position of Arlen’s data in the file is known. It does not maintain the file in descending order, though.

Example: Maintaining the High Score File in Order

The circumstances of the new problem are that a player only appears in the high score file once and the file is maintained in descending order of score. If a player improves their score, then their entry should move closer to the beginning of the file. This is a more difficult problem than before, but one that is still practical. Let’s presume that a player has achieved a new score. The entire process should be as follows:

The process is like moving a playing card closer to the top of the deck while leaving the other cards in the same order. It’s probably more efficient to move the record while searching for the correct position, though. Each time the previous record is examined, if it does not have a larger score, then the record being placed is moved ahead one position and the previous record takes its old place. This results in a pretty compact program, given the nature of the problem, but it is a bit tricky to get right. For example, what if the new score is the highest? What if the current high score holder gets a higher score?
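The move-while-searching idea might be sketched as follows. This is one possible interpretation rather than the book's program: the function name move_up is hypothetical, the player's current record number is assumed to be known already, the sample players and scores are invented, and a record with an equal score is left ahead of the record being placed.

```python
import os
import struct
import tempfile

FORMAT = "32si4s2s2s"
SIZE = struct.calcsize(FORMAT)

def move_up(path, pos, new_record):
    """Place new_record, currently at record number pos, in score order.

    Records with smaller scores are shifted one position toward the end
    of the file. Returns the record number where new_record was written.
    """
    new_score = struct.unpack(FORMAT, new_record)[1]
    with open(path, "rb+") as f:
        while pos > 0:
            f.seek((pos - 1) * SIZE)
            previous = f.read(SIZE)
            if struct.unpack(FORMAT, previous)[1] >= new_score:
                break                     # correct position found
            f.seek(pos * SIZE)
            f.write(previous)             # shift the previous record back
            pos -= 1
        f.seek(pos * SIZE)
        f.write(new_record)
    return pos

def names_in(path):
    """Return the player names in file order."""
    with open(path, "rb") as f:
        return [rec[0].rstrip(b"\x00").decode("UTF-8")
                for rec in struct.iter_unpack(FORMAT, f.read())]

# Invented sample data, already in descending order of score.
players = [(b"A", 500), (b"B", 400), (b"C", 300), (b"D", 200), (b"E", 100)]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hiscores")
    with open(path, "wb") as f:
        for name, score in players:
            f.write(struct.pack(FORMAT, name, score, b"2016", b"01", b"03"))

    # Player D, at record number 3, improves to 450.
    better = struct.pack(FORMAT, b"D", 450, b"2016", b"03", b"09")
    new_pos = move_up(path, 3, better)
    final_order = names_in(path)

print(new_pos, final_order)   # prints: 1 ['A', 'D', 'B', 'C', 'E']
```

Both special cases mentioned above fall out of the loop condition. If the new score is the highest, the loop runs until pos reaches 0 and the record is written at the start of the file. If the current high score holder improves, pos is already 0, the loop is skipped, and the record is overwritten in place.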

 

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.
