Using Files in Python: Less Theory, More Practice

The general paradigm for reading and writing files is the same in Python as it is in most other languages. The steps for reading or writing a file are these:

  1. Open the file. This involves calling a function, usually named open, and passing the name of the file to be used. Sometimes the mode for opening is passed; that is, a file can be opened for input, output, update (both input and output), and in binary modes. The function locates the file using the name and returns a variable that keeps track of the current state of input from the file. A special case exists if there is no file having the given name.
  2. Read data from the file. Using the variable returned by open, a function is called to read the data. The function might read a character, number, line, or the whole file. The function is often called read, and can be called multiple times. The next call to read will read from where the last call ended. A special case exists when all of the data has been read from the file (called the end of file condition).

     OR

     Write data to the file. Using the variable returned by open, a function is called to write data to the file. The function might write a character, number, line, or many lines. The function is often called write, and can be called multiple times. The next call to write will continue writing data from where the last call ended. Writing data most frequently appends data to the end of the file.

  3. Close the file. Closing a file is also accomplished using a call to a function (usually named close). This function frees storage associated with the input or output process and in some cases unlocks the file so it can be used by other programs. A variable returned by open is passed to close, and afterwards that variable cannot be used for input or output anymore. The file is no longer open.
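The open, read or write, and close steps can be sketched in Python as follows. This is only an illustration, not code from the original text: the file name steps_demo.txt and the variable names are assumptions, and the file is written to a temporary directory first so that the example can run on its own.

```python
import os
import tempfile

# Hypothetical file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "steps_demo.txt")

# Open, write, close.
outfile = open(path, "w")
outfile.write("first line\n")
outfile.write("second line\n")   # continues where the last write ended
outfile.close()

# Open, read, close.
infile = open(path, "r")
first = infile.readline()        # reads up to and including the newline
rest = infile.read()             # continues from where readline() stopped
at_end = infile.read()           # nothing left: the end of file condition
infile.close()

print(repr(first), repr(rest), repr(at_end))
```

Each read picks up where the previous one ended, and a read at the end of the file returns an empty string rather than raising an error.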

1. Open a File

Python provides a function named open that opens a file and returns a value that can be used to read from or write to the file. That value refers to a collection of information describing the state of the file and is called a handle or a file object (this text also calls it a file descriptor). It can be thought of as having the type file, and must be assigned to a variable or the file cannot be accessed. The open function is given the name of the file to be opened and a flag that indicates whether the file is to be read from or written to. Both of these are strings. A simple example of a call to open is as follows:

infile = open("datafile.txt", "r")

This opens a file named “datafile.txt” that resides in the same directory as the Python program (more precisely, in the current working directory, which is usually the directory the program is run from), and opens it for input: the “r” flag means read. The handle that open returns is assigned to the variable infile, which can now be used to read data from the file.

There are some details that are crucial. The name of the file on most computer systems can be a path name, which is to say, the name including all directory names that are used to find it on your computer. For example, on some computers, the name “datafile.txt” might have the complete path name C:/parker/introProgramming/chapter05/datafile.txt. If path names are used, the file can be opened from any directory on the computer. This is handy for large data sets that are used by multiple programs, such as names of customers or suppliers.

The read flag “r” that is the second parameter is what was called the mode in the previous discussion. The “r” flag means that the file will be open for reading only, and starts reading at the beginning of the file. The default is to read characters from the file, which is presumed to be a text file. Opening with the mode “rb” opens the file in binary format and allows reading non-text files, such as MP3 and video files.

Passing the mode “w” means that the file is to be written to. If the file exists, then it will be overwritten; if not, the file will be created. Using “wb” means that a binary file is to be written.

Append mode is indicated by the mode parameter “a”, and it means that the file will be opened for writing and, if the file exists, writing will begin at the end of the existing file. In other words, the file will not start over as being empty, but will be added to at the end. The mode “ab” appends data to a binary file.
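The difference between the modes can be seen in a short experiment. This is a sketch under assumptions: the file name modes_demo.txt and the helper read_all() are invented for the illustration and do not come from the original text.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

def read_all(name, mode="r"):
    """Helper for this sketch: open, read everything, close."""
    f = open(name, mode)
    data = f.read()
    f.close()
    return data

f = open(path, "w")      # "w": create the file (or empty an existing one)
f.write("one\n")
f.close()

f = open(path, "a")      # "a": keep the contents, write at the end
f.write("two\n")
f.close()
after_append = read_all(path)

f = open(path, "w")      # "w" again: the old contents are discarded
f.write("three\n")
f.close()
after_rewrite = read_all(path)

as_bytes = read_all(path, "rb")   # "rb": a bytes object instead of a string
print(repr(after_append), repr(after_rewrite), as_bytes)
```

After the append the file holds both lines; after the second open with “w” only the new line remains.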

If the file does not exist and it is being opened for input, there is a problem. It’s an error, of course; a non-existent file cannot be read from. There are ways to tell whether a file exists, and the error caused by a non-existent file can be caught and handled from within Python. This involves an exception. It is always a bad idea to assume that everything works properly, and when dealing with files it is especially important to check for all likely problems.

File Not Found Exceptions

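The proper way to open a file is within a try-except pair of statements. This ensures that a nonexistent file is caught rather than causing the program to terminate. Other problems raise different exceptions (a permission error raises PermissionError, for example) and need their own except clauses. The basic scheme is simple: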

try:
    infile = open("datafile.txt", "r")
except FileNotFoundError:
    print("There is no file named 'datafile.txt'. Please try again")
    return            # end program or abort this section of code
                      # (return is only valid inside a function)

The exception FileNotFoundError occurs if the file name cannot be found. What to do in that case depends on the program: if the file name was typed in by the user, then perhaps they should get another chance. In any case, the file is not open and data cannot be read.

There are multiple versions of Python on computers around the world, and some versions have different names for things. The examples here all use Python 3.4. In other versions, a missing file may be reported under another name: in Python 2 it is IOError. Since Python 3.3, FileNotFoundError is a subclass of OSError and IOError is simply another name for OSError, which is why all three work with a missing file in version 3.4. The documentation for the version being used should be consulted if an error occurs when using exceptions and some built-in functions.

All attempts to open a file should take place while catching the FileNotFoundError exception.
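One way to follow this advice is to wrap the call to open in a small function. The sketch below is an assumption about how that might be done, not the author's code: the function name open_for_reading() and the file name missing.txt are invented, and a second except clause for PermissionError is added for illustration.

```python
import os
import tempfile

def open_for_reading(name):
    """Return an open file, or None if the file cannot be opened."""
    try:
        return open(name, "r")
    except FileNotFoundError:
        print("There is no file named '" + name + "'.")
    except PermissionError:
        print("Permission denied for '" + name + "'.")
    return None

# A path that does not exist: the temporary directory is new and empty.
missing = os.path.join(tempfile.mkdtemp(), "missing.txt")
infile = open_for_reading(missing)
if infile is None:
    print("Nothing was opened, so nothing can be read.")
else:
    infile.close()
```

Returning None leaves the decision to the caller, who can ask the user for another name or end the program.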

2. Reading from Files

After a file is opened with a read mode, the file descriptor returned can be used to read data from the file. Using the variable infile returned from the call to open() above, a call to the method read() can get a character from the file:

s = infile.read(1)

Reading one character at a time always works, but it is inefficient. If a block on disk is 512 characters (bytes), then 512, or a multiple of it, is a good number of bytes to read at one time. Reading more data than you need and saving it is called buffering, and buffers are used in many instances: live video and audio streaming, audio players, and even in programming language compilers. The idea is to read a larger block of data than is needed at the moment and to hand it out as needed. Reading a buffer could be done as follows:

s = infile.read(512)

and then dealing characters from the string one at a time as needed. A buffer is a collection of memory locations that is temporary storage for data that was recently on secondary storage.
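The idea of reading a block and handing out characters one at a time can be sketched with a generator. The function name characters(), the constant BLOCK_SIZE, and the demonstration file are assumptions made for this illustration. Python's file objects already buffer reads internally, so this is a teaching sketch rather than a needed optimization.

```python
import os
import tempfile

BLOCK_SIZE = 512   # block size taken from the text; real disks vary

def characters(infile, block_size=BLOCK_SIZE):
    """Yield characters one at a time, reading the file a block at a time."""
    while True:
        buffer = infile.read(block_size)   # one larger read
        if buffer == "":                   # empty string: end of file
            return
        for ch in buffer:                  # handed out one character at a time
            yield ch

# Build a hypothetical 1300-character file to read back.
path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")
outfile = open(path, "w")
outfile.write("abcde" * 260)
outfile.close()

infile = open(path, "r")
count = 0
for ch in characters(infile):
    count += 1
infile.close()
print(count)   # number of characters handed out
```

The file is read in three calls to read(512) that return data, plus one more that returns the empty string, while the loop that uses the characters never sees the blocks.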

Text files, those that contain printable characters that humans can read, are normally arranged as lines separated by a carriage return or a linefeed character, called a newline. An entire line can be read using the readline() function:

s = infile.readline()

A line is not usually a sentence, so many lines might be needed to read one sentence, or perhaps only half of a line. Computer text files are structured so that humans can read them, but the structure of human language and convention is not understood by the computer, nor is it built into the file structure. However, it is normal for people to make data files that contain data for a particular item or event on one line, followed by data for the next item. If this is true, then one call to readline() will return all of the information for a particular thing.

End of File

When there are no more characters in the file, read() will return the empty string: “”. This is called the end of file condition, and it is important that it be detected. There are many ways to open and read files, but for reading characters in this way, the end of file is checked as follows:

infile = open("data.txt", "r")
while True:
    c = infile.read(1)
    if c == "":
        print("End of file")
        break
    else:
        print(c, end="")    # process the character just read
infile.close()

When reading a file in a for statement, the end of file is handled automatically. In this case, the loop runs from the first line to the final line and then stops.

for c in f:
    print("'", c, "'")

An exception cannot be used in an obvious way for handling the end of file on file input. However, when reading from the console using the input() function, the exception EOFError can be caught:

while True:
    try:
        c = input()
        print(c)
    except EOFError:
        print("Endfile")
        break

There are many errors that could occur for any set of statements. It is possible to determine what specific exception has occurred in the following manner:

while True:
    try:
        c = input()
        print(c)
    except Exception as x:
        print(x)
        break

This code prints “EOF when reading a line” when the end of file is encountered.

Common File Input Operations

There are a few common ways to use files that should be mentioned as patterns. Although one should never use a pattern if it is not understood, it’s sometimes handy to have a few simple snippets of code that are known to perform basic tasks correctly. For example, one common operation to use with files is to read each line from a file, followed by some processing step. This looks like:

f = open("data.txt", "r")
for c in f:
    print("'", c, "'")
f.close()

The expression c in f results in consecutive lines being read from the file into a string variable c, and this stops when no more data can be read from the file.

Another way to do the same thing would be to use the readline() function:

f = open("data.txt", "r")
c = f.readline()
while c != "":
    print("'", c, "'")
    c = f.readline()
f.close()

In this case, the end of file has to be determined explicitly by checking the string value that was read to see if it is empty.

Another common file operation is to copy a file to another, character by character. A file is opened for input and another for output. The basic “read a file” pattern is used, with the addition of a file output after each character is read:

f = open("data.txt", "r")
g = open("copy.txt", "w")
c = f.read(1)
while c != "":
    g.write(c)
    c = f.read(1)
f.close()
g.close()

A filter is a program that reads data from a file and converts it to some other form, then writes it out. This is often done from standard input and output, but can be done in the middle of a file copy. For example, to convert a text file to all lower case, the pattern above is used with a small modification:

f = open("data.txt", "r")
g = open("copy.txt", "w")
c = f.read(1)
while c != "":
    g.write(c.lower())
    c = f.read(1)
f.close()
g.close()

This filter can be done using less code if the entire file can be read in at once. The read() function can read all data into a string.

f = open("data.txt", "r")
g = open("copy.txt", "w")
c = f.read()
g.write(c.lower())
f.close()
g.close()

Two files can be merged into a single file in many ways: one file after another, a line from one file followed by a line from another, or character by character. A simple merging of two files where one is copied first followed by the other is as follows:

f = open("data1.txt", "r")
outfile = open("copy.txt", "w")
c = f.read()
outfile.write(c)
f.close()
g = open("data2.txt", "r")
c = g.read()
outfile.write(c)
g.close()
outfile.close()

A more complex problem occurs when both files are sorted and are to remain sorted after the merge. If the lines in each file are in alphabetical order, then merging them means comparing the current line from each file, writing the smaller one, and reading a new line from the file it came from. When one file is complete, the remainder of the other file is written and all files are closed.

f = open("data1.txt", "r")
g = open("data2.txt", "r")
outfile = open("copy.txt", "w")
cf = f.readline()
cg = g.readline()
while cf != "" and cg != "":
    if cf < cg:
        outfile.write(cf)
        cf = f.readline()
    else:
        outfile.write(cg)
        cg = g.readline()
if cf == "":
    outfile.write(cg)
    cg = g.read()
    outfile.write(cg)
else:
    outfile.write(cf)
    cf = f.read()
    outfile.write(cf)
f.close()
g.close()
outfile.close()

Copying the input from the console to a file means reading each line using input() and writing it to the file. This code assumes that a line consisting of the single character “!” means that the copying is complete.

outfile = open("copy.txt", "w")
line = input("! ")
while line != "!":          # a lone "!" ends the copy; empty lines are copied
    outfile.write(line)
    outfile.write("\n")
    line = input("! ")
outfile.close()

The end of the line is indicated by a character, which is represented by the string “\n”. Reading characters from a file will read the end of line character also, and detecting it can be very important.

f = open("data.txt", "r")
c = f.read(1)
while c != "":
    print("'", c, "'")
    c = f.read(1)
    if c == "\n":
        print("Newline")
f.close()

CSV Files

A very common format for storing data is called Comma Separated Values (CSV) format (the name is sometimes given as Comma Separated Variable), named for the fact that each pair of adjacent data items has a comma between them. CSV files can be used directly by spreadsheets such as Excel and by a large collection of data analysis tools, so it is important to be able to read them correctly.

A simple CSV file named planets.txt is provided for experimenting with reading CSV files. It contains some basic data for the planets in Earth’s solar system, and while there is no actual standard for how CSV files must look, this one is typical of what is usually seen. The first line in the file contains headings for each of the variables or columns, separated by commas. This is followed by nine lines of data, one for each planet. It’s a small data file, as these things are counted, but illustrative for the purpose.

This is not a very profound problem, and it uses the raw data as it appears in the file. The file must be opened and then each line of data is read, and the value of the 11th data element (i.e., index 10) is retrieved and compared against 10. If it is smaller, the name of the planet (index 0) is printed. The plan is as follows:

Open the file
Read (skip over) the header line
For each planet
    Read a line as string s
    Break s into components based on commas, giving list P
    If P[10] < 10, print the planet name, which is P[0]

All of this has been done before except for breaking the string into parts based on the comma. Fortunately, the designers of Python anticipated this kind of problem and have provided a very useful function: split(). This function breaks up a string into parts using a specified delimiter character or string and returns a list in which each component is one section of the fractured string. For example,

s = "This is a string"
z = s.split(" ")

yields the list z = [“This”, “is”, “a”, “string”]. It splits the string s into substrings at each space character. A call like s.split(“,”) should give substrings that are separated by a comma. Given the above outline and the split() function, the code is as follows.

Almost the entire program resides within a try statement, so that if the file does not exist, then a message is printed and the program ends normally. Note that P[10] has to be converted into an integer, because all components of the list P are strings. Strings are what has been read from the file.
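The listing itself is not reproduced in this text. A minimal sketch that follows the outline might look like the code below. It is not the author's program: the column layout of planets.txt is not shown here, so the sample data is invented (placeholder names, zeros in the unused columns) and is written to a hypothetical file named planets_demo.txt.

```python
import os
import tempfile

# Invented stand-in for planets.txt: a header line, then rows whose
# first field is a name and whose 11th field (index 10) is an integer.
SAMPLE = (
    "name,c1,c2,c3,c4,c5,c6,c7,c8,c9,count\n"
    "Alpha,0,0,0,0,0,0,0,0,0,2\n"
    "Beta,0,0,0,0,0,0,0,0,0,63\n"
    "Gamma,0,0,0,0,0,0,0,0,0,0\n"
)
path = os.path.join(tempfile.mkdtemp(), "planets_demo.txt")
out = open(path, "w")
out.write(SAMPLE)
out.close()

selected = []
try:
    infile = open(path, "r")
    infile.readline()                 # read (skip over) the header line
    for s in infile:                  # one line per row
        P = s.strip().split(",")      # strip() drops the trailing newline
        if int(P[10]) < 10:           # fields are strings, so convert first
            selected.append(P[0])
            print(P[0])
    infile.close()
except FileNotFoundError:
    print("There is no file named", path)
```

With the invented data, the rows named Alpha and Gamma pass the test and Beta does not.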

CSV files are common enough so that Python provides a module for manipulating them. The module contains quite a large collection of material, and for the purposes of the planets.py program, only the basics are needed. To avoid the details of a general package, a simpler version is included with this book: simpleCSV has the essentials needed to read most CSV files while being written in such a way that a beginning programmer should be able to read and understand it.

To use it, the simpleCSV module is first imported. This makes two important functions available: nextRecord() and getData(). The nextRecord() function reads one entire line of CSV data. It allows skipping lines without examining them in detail (like headers). The function getData() will parse one line of data, the last one read, into a tuple, each element of which is one of the comma-separated fields.

The simpleCSV library needs to be in the same directory as the program that uses it or be in the standard Python directory for installed modules. The source code resides on the accompanying disk and is called simpleCSV.py. The program can be re-written to use the simpleCSV module as follows:

Problem: Play Jeopardy using a CSV data set.

The television game show Jeopardy has been on the air for 35 years in one of its two incarnations, and is perhaps the best known such program on television. Players select a topic and a point value and are asked a trivia question that they must answer in the form of a question. There are sets of questions that have been used in Jeopardy over the years, some in CSV form, and so it should be possible to stage a simulated game using Python as the moderator.

A simple version of the game could work like this: read the questions and answers, and select the questions at random. Questions that have single-word unambiguous answers would be best. The player types in an answer, and wins if they answer ten correctly before getting three wrong.

A single line of data from the file might look like this:

5957,2010-07-06,Jeopardy!,"LET'S BOUNCE","$600","In this kid's game, you bounce a small rubber ball while picking up 6-pronged metal objects","jacks"

There are 7 different data fields here separated by commas. They are: Show Number, Air Date, Round, Category, Value, Question, and Answer; all are strings, but some questions may contain commas. The CSV module can manage that.
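The difference between a plain split() and the standard csv module can be checked on the sample line. This is a small sketch using Python's built-in csv module (not simpleCSV); the variable names are chosen for the illustration.

```python
import csv

line = ('5957,2010-07-06,Jeopardy!,"LET\'S BOUNCE","$600",'
        '"In this kid\'s game, you bounce a small rubber ball while '
        'picking up 6-pronged metal objects","jacks"')

naive = line.split(",")              # also splits inside the quoted question
fields = next(csv.reader([line]))    # csv.reader honors the double quotes

print(len(naive), len(fields))
print(fields[5])                     # the question, with its comma intact
print(fields[6])                     # the answer
```

The split() call breaks the question in two at its internal comma, while csv.reader returns the seven fields and removes the surrounding double quotes.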

There are many ways that a random question can be chosen. One would be to read all of the data into a list, but that would require a lot of memory. Another way would be to randomly read a question from the file, but that would be difficult to do because each line has a different length. What could be done relatively easily would be to pick a random number of questions to skip over before reading one to use. We therefore select a random number K between N and M, read K questions, and then read the next one and ask the user that question. When the end of the file is reached, it can be read again from the beginning. If the file is large enough, it would be unlikely to ask the same question twice in a short time period.

Here is an outline of how this might work:

Open infile as the file of questions to be used
While game continues:
    Select a random number K between N and M
    For I = 1 to K:
        Read a line from the file
        If no more lines:
            Close infile and reopen
    Read a question and print it, ask the user for an answer
    Read the user's answer from the keyboard
    If the user's answer is correct:
        Count right answers
    Else:
        Count wrong answers

If the CSV module is used, then the parsing of the input file is dealt with. What is new about this? When all of the data in the file has been used, the program may not be complete. What is done then is new: close the file, reopen it, and start again from the beginning. This is an unusual action for a Python program but illustrates the flexibility of the file system. There is a nested try-except pair: the outer one checks the existence of the file of questions and the inner one checks for the end of the file. When the file is re-opened, a new reader has to be created, because the old one is connected to a closed file. The file on the disk is the same, but when it is opened again, a new handle is built; the old CSV reader is linked to the old handle.

The program counts the number of right answers (CORRECT) and the number of wrong ones (INCORRECT). When there are 10 correct answers or 3 incorrect ones, the game is over; a variable again is set to False and the main while loop exits. A break could have been used, but having the condition become False is the polite way to exit from a while loop.

The entire program looks like this:
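The original listing is not reproduced here. The following is a sketch of how such a program might look, built on the standard csv module; it is not the author's code. The bounds N and M, the function name play(), the get_answer parameter (which lets the demonstration run without a keyboard), and the three placeholder records are all assumptions. The sketch also assumes the question file has no header line and holds at least one record with seven fields.

```python
import csv
import os
import random
import tempfile

N, M = 1, 5   # assumed bounds on how many questions to skip

def play(filename, get_answer=input):
    """Play one game; return the counts (CORRECT, INCORRECT)."""
    CORRECT = 0
    INCORRECT = 0
    try:                                   # outer: does the file exist?
        infile = open(filename, "r", newline="")
        reader = csv.reader(infile)
        again = True
        while again:
            K = random.randint(N, M)
            for i in range(K + 1):         # skip K records, keep the next one
                try:                       # inner: end of file?
                    record = next(reader)
                except StopIteration:
                    infile.close()         # start over from the beginning
                    infile = open(filename, "r", newline="")
                    reader = csv.reader(infile)   # new reader, new handle
                    record = next(reader)
            question, answer = record[5], record[6]
            print(question)
            reply = get_answer("Your answer: ")
            if reply.strip().lower() == answer.strip().lower():
                CORRECT += 1
            else:
                INCORRECT += 1
                print("Sorry, the answer was:", answer)
            again = CORRECT < 10 and INCORRECT < 3
        infile.close()
    except FileNotFoundError:
        print("There is no file named", filename)
    return (CORRECT, INCORRECT)

# Demonstration: three invented placeholder records and scripted replies.
rows = [
    ["1", "2000-01-01", "Jeopardy!", "DEMO", "$100", "Placeholder, first", "four"],
    ["2", "2000-01-02", "Jeopardy!", "DEMO", "$200", "Placeholder, second", "four"],
    ["3", "2000-01-03", "Jeopardy!", "DEMO", "$300", "Placeholder, third", "four"],
]
path = os.path.join(tempfile.mkdtemp(), "questions_demo.csv")
outfile = open(path, "w", newline="")
csv.writer(outfile).writerows(rows)
outfile.close()

always_right = play(path, get_answer=lambda prompt: "Four")
always_wrong = play(path, get_answer=lambda prompt: "no idea")
print(always_right, always_wrong)
```

Calling play("questions.csv") with no second argument would read answers from the keyboard using input(); the scripted replies are there only so the sketch can be run unattended.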

The With Statement

A difficulty with the code presented so far is that it does not clean up after itself. A file should be closed after input from it or output to it is finished; none of the programs written so far do that, at least not after the file operations are complete. There has been no significant discussion of the close() operation, but what it does has been described. Normally, when a program terminates, its resources are returned to the system, including the closing of any open files. Intentionally closing a file is important for three reasons: first, if the program aborts for some reason, open files should be closed by the system but may not be, and file problems can be the result. Second, as in the Jeopardy program, closing a file can be used as a step in re-using it. Opening it again starts reading it at the beginning. Third, closing a file frees its resources. Programs that use many files and/or many resources will profit from freeing them when they are no longer needed.

The Python with statement, in its simplest form, takes care of many of the details surrounding file access. An example of its use is as follows:

Once the file is open, the with statement guarantees that certain errors will be dealt with and the file will be closed. The problem is that the file has to be open first, so the FileNotFoundError exception should still be caught.
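A minimal sketch of the with statement combined with the exception handling described earlier is shown below. The file name with_demo.txt and the variable names are assumptions made for the illustration; the file is created first so that the example can run on its own.

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as outfile:
    outfile.write("alpha\nbeta\n")
# No call to close() is needed: leaving the with block closed outfile.

lines = []
try:
    with open(path, "r") as infile:
        for line in infile:
            lines.append(line.rstrip("\n"))
    # The file is closed here, even if an error had occurred in the block.
    closed_after = infile.closed
except FileNotFoundError:
    print("There is no file named", path)

print(lines, closed_after)
```

The try-except pair surrounds the with statement because the failure to find the file happens in the call to open, before the with block has anything to manage.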

 

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.
