Procedural Programming - Word Processing

In the early days of desktop publishing, the programs that writers used did not display the results on the screen in “what-you-see-is-what-you-get” form. Formatting commands were embedded within the text and were implemented by the program, which would create a printable version that was properly formatted.

Programs like roff, nroff, and tex are still used, but most writing tools now look like Word or PageMaker with commands being given through a graphical user interface.

There is a limit to what kind of text processing can be done using simple text files, but when you think about it, that’s really what a typewriter produces: simple text on paper with fixed-size fonts.

The program developed here accepts text from a file and formats it according to a set of commands that have a specific format and are predefined by the system. The input resembles that accepted by nroff, an old Unix utility, but is a subset for simplicity. Since it uses standard text input and output, measurements are made in characters, not inches or points. Commands begin on a new line with a “.” character and are alphabetic. A line beginning with “.br”, for instance, results in a forced line break. Some commands take a parameter: the command “.ll 55” sets the line length to 55 characters.

Here is a list of all of the commands that the system recognizes:

.pl n Sets the page length to n lines

.bp n Begin page n

.br Break

.fi Fill output lines (e.g., justify)

.nf Don’t fill output lines

.na No justification

.ce n Center the next n input lines

.ls n Output n-1 line spaces after each line

.ll n Line length is n characters

.in n Indent n characters

.ti n Temporarily indent n characters

.nh Do not hyphenate

.hy Hyphenation on

.sp n Generate n lines
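The command format described above (a period at the start of a line, a two-letter name, and an optional numeric parameter) can be illustrated with a small sketch. The helper name parse_command() and the tuple COMMANDS are assumptions made for this illustration only; they are not part of the program developed in the text.

```python
# Hypothetical helper: split one input line into (command name, parameter).
COMMANDS = (".pl", ".bp", ".br", ".fi", ".nf", ".na", ".ce",
            ".ls", ".ll", ".in", ".ti", ".nh", ".hy", ".sp")

def parse_command(line):
    """Return (name, n) for a recognized command line, otherwise None.

    n is an int when a numeric parameter follows the name, else None.
    """
    parts = line.split()
    # A command must start in the first column and be in the list.
    if not line.startswith(".") or not parts or parts[0] not in COMMANDS:
        return None
    n = None
    if len(parts) > 1:
        try:
            n = int(parts[1])
        except ValueError:
            n = None          # non-numeric parameter: treated as missing
    return parts[0], n

print(parse_command(".ll 55"))
print(parse_command(".br"))
print(parse_command(".xx 3"))
```

A line such as “.xx 3” looks like a command but is not in the list, so the sketch reports it as ordinary text by returning None.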

The program reads a text file and identifies the words and the commands. The words are written to an output file formatted as described by the commands. The default is to right justify the text and to use empty lines as paragraph breaks. The questions to be answered here are as follows:

How does one begin creating such a program?
Can the process of program creation be described?
Is the process systematic or casual?
Is there only one process?

Beginning with the last question first, there is no single process. What is presented here is only one, but it should be understood that there are others, and that some processes probably work better than others for some kinds of program. The program we create here does not use classes, and it involves a classical or traditional methodology generally referred to as top-down. Some people only use object-oriented code, but a problem with teaching that way is that a class contains traditional, procedure-oriented code. To make a class, one must first know how to write a program.

1. Top-Down

In top-down programming, the higher levels of abstraction are described first. A description of what the entire program is to do is written in a kind of English/computer hybrid language (pseudocode), and this description involves making calls to functions that have not yet been written but whose function is known. When the highest level description is acceptable, then the functions used are described. In this way, the high-level decisions are described in terms of the lower levels, whose implementation is postponed until the details are appropriate. The process repeats until all parts have been described, at which time the translation of the pseudocode into a real programming language can proceed, and should be straightforward. This can result in many distinct programs, but they all should do basically the same thing.

For the task at hand, the first step is to sketch the actions of the program as a whole. The program begins by opening the text file and opening an output file. The basic action is to copy from input to output, with certain additions to the output text. The data file is read in as characters or words, but output as lines and pages. The following is an example:

Open input file inf
Open output file outf
Read a word w from inf
While there is more text on inf:
    If w is a command:
        Process the command w
    Else:
        The next word is w. Process it
    Read a word from inf
Close inf
Close outf

This represents the entire program, although the code lacks much detail. As Python code, this would look almost the same:

filename = input("PYROFF: Enter the name of the input file: ")
inf = open(filename, "r")
outf = open("pyroff.txt", "w")
w = getword(inf)
while w != "":
    if iscommand(w):
        process_command(w)
    else:
        process_word(w)
    w = getword(inf)
inf.close()
outf.close()

The functions must exist for the program to run. They should initially be stubs, relatively non-functional, but resulting in output:

from random import random

def getword(f):
    print("Getword ")

def iscommand(w):
    print("ISCOMMAND given ", w)
    if random() < 0.5:
        return False
    return True

def process_command(w):
    print("Processing command ", w)

def process_word(w):
    print("Processing the word ", w)

This program will run, but it never ends because it never reads the file. Still, we have a structure.

Now the functions need to be defined, and in the process, further design decisions are made. Consider getword(): what comprises a word and how does it differ from a command? A command starts at the beginning of a line with a “.” character. It is followed by two alphabetic characters that are defined by the system. If the two characters do not match any combinations in the list of commands, then it is not a command. A word begins and ends with a white space (blank, tab, or end of line) and contains all of the characters between those white spaces. It may not be a word in the traditional sense, in that it may not be an English word; it could be a number or other sequence of characters. Those may cause problems, but it will be left up to the user to figure it out (for example, a long URL may extend over a line). The program has to do something, and so will probably put an end of line when the count of characters exceeds a maximum and leave the problem to the user to fix.

Let’s figure out the getword() function. It constructs a word as a character string from individual characters that have been read from the input file. A first try could be as follows:

def getword(f):
    w = ""
    while whitespace(ch(f)):
        nextch(f)
    while not whitespace(ch(f)):
        w = w + ch(f)
        nextch(f)
    print("Getword is ", w)
    return w

The function whitespace() returns True if its parameter is a white space character. The function nextch() reads the next character from the specified file, and the function ch() returns the value of the current character. To effectively test getword(), we need to implement these three functions. Here’s a first attempt:

def whitespace(c):
    if c == " ": return True
    if c == "\t": return True
    if c == "\n": return True
    return False

def ch(f):
    global c
    return c

def nextch(f):
    global c
    c = f.read(1)

This way of handling input is unusual, but there is a reason for it. We are anticipating a need to buffer characters or to place them back on the input stream. It is similar to the input scheme used in Pascal, or the system found in early forms of UNIX, which used getchar, putchar, and ungetc. The necessity of extracting commands from the input stream, and the fact that commands must begin a new line, might make this particular scheme useful. The initial implementation of nextch() simply reads a new character from the file, but it could easily be modified to extract a character from a buffer, and refill the buffer if it is empty. Both would look the same to the programmer using them.
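To make the buffering idea concrete, here is a minimal sketch of how nextch() might take characters from a buffer before reading the file. The list pushback and the helper unread() are hypothetical names introduced only for this illustration, and io.StringIO stands in for the input file.

```python
import io

pushback = []      # hypothetical stack of characters returned to the input
c = ""             # the current character, as in the text

def ch(f):
    return c

def nextch(f):
    """Advance to the next character, preferring pushed-back characters."""
    global c
    if pushback:
        c = pushback.pop()
    else:
        c = f.read(1)

def unread(x):
    """Hypothetical helper: make x the character the next nextch() yields."""
    pushback.append(x)

f = io.StringIO("ab")
nextch(f)            # the current character is now "a"
first = ch(f)
unread("z")          # the next call to nextch() yields "z", not "b"
nextch(f)
second = ch(f)
nextch(f)            # the buffer is empty again, so the file is read
third = ch(f)
print(first, second, third)
```

Code that calls ch() and nextch() does not need to know whether a character came from the file or from the buffer, which is the point of the scheme.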

The program runs, but has a problem: it never terminates. After the text file has been read, the program seems to call nextch() repeatedly. After some thought the reason is clear: when the input request results in an empty string (“”), the current character is not a white space, and the loop in getword() that is building a word runs forever. This is a traditional end-of-file problem and can be solved in a few different ways: a special character can be used for EOF, a flag can be set, or the empty string can be tested for in the loop explicitly. The last solution was chosen, and fixes the infinite loop. The word construction loop in getword() becomes

    while not whitespace(ch(f)) and ch(f) != "":
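The end-of-file fix can be tried without a real file. The sketch below repeats the functions from the text with the corrected loop and reads from io.StringIO; the helper all_words() is an assumption added only to drive the example.

```python
import io

c = ""    # the current character

def whitespace(x):
    # Tuple membership, so the empty string at end of file is NOT white space.
    return x in (" ", "\t", "\n")

def ch(f):
    return c

def nextch(f):
    global c
    c = f.read(1)

def getword(f):
    w = ""
    while whitespace(ch(f)):                       # skip leading white space
        nextch(f)
    while not whitespace(ch(f)) and ch(f) != "":   # stop at white space or EOF
        w = w + ch(f)
        nextch(f)
    return w

def all_words(text):
    """Hypothetical driver: return every word in text as a list."""
    f = io.StringIO(text)
    nextch(f)                  # prime the current character
    words = []
    w = getword(f)
    while w != "":             # an empty word signals end of input
        words.append(w)
        w = getword(f)
    return words

print(all_words("This is\n  a\ttest.\n"))
```

Note that the current character must be primed with one call to nextch() before the first call to getword().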

A possible next step is to distinguish between commands and words. There are two things to do because a command starts a line and begins with a period (.): mark the beginning of a new line, and look up the input string in a table of commands. The command could be searched first, then if it matches a command name, we could back up the input to see if it was preceded by a newline character (“\n”). A newline counts as a white space, and another option would be to set a flag when a newline character is seen, clearing it when another character is read in. Now a string is a command if the flag was set before it was read in and it matches one of the commands. Timing is important in this method, but white space separates words, so it could work by simply remembering (saving) the last white space character seen before any word.

This code has a problem. When implemented, none of the commands are recognized. A table of names was implemented as a tuple:

table = (".pl", ".bp", ".br", ".fi", ".nf", ".na", ".ce", ".ls", ".ll",
         ".in", ".ti", ".nh", ".hy", ".sp")

The nextch() function was modified so

def nextch(f):
    global c, lastws
    c = f.read(1)
    if whitespace(c):
        lastws = c

and the function iscommand() is implemented by checking for the newline and the match of the string in the table:

def iscommand(w):
    global table, lastws
    if lastws == "\n":
        if w in table:
            return True
    return False

To discover the problem, some print statements were inserted that show the previous white space character and the match in the table for all calls to iscommand(). The problem, which should have been obvious, is that when the command is read in, the last white space seen will be the one that terminated it, not the one in front of it.

A solution: keeping the same theme of remembering white space characters, let’s save the previous two white space characters seen. The most recent white space is the one that terminated the word string, and the second most recent will always be the one before it. All of the others, if any, would have been skipped within getword(). The solution, as coded in the nextch() function, is as follows:

def nextch(f):
    global c, clast, c2last
    c = f.read(1)
    if whitespace(c):
        c2last = clast
        clast = c

There are two variables needed, clast being the previous white space and c2last being the one encountered before clast. Now iscommand() is modified slightly to look for c2last:

def iscommand(w):
    global table, c2last
    if c2last == "\n":
        if w in table:
            return True
    return False

This code identifies the commands in the source file and correctly rejects text that looks like a command but is not, such as “.xx”.
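The two-white-space scheme can be exercised on a short string. The following sketch assembles the functions from the text; the driver find_commands() is hypothetical, and initializing clast to a newline (so that the first word of the file counts as starting a line) is an assumption, since the text does not show the initial values.

```python
import io

table = (".pl", ".bp", ".br", ".fi", ".nf", ".na", ".ce", ".ls", ".ll",
         ".in", ".ti", ".nh", ".hy", ".sp")
c = ""
clast = "\n"     # assumption: the start of the file counts as a new line
c2last = ""

def whitespace(x):
    return x in (" ", "\t", "\n")

def ch(f):
    return c

def nextch(f):
    global c, clast, c2last
    c = f.read(1)
    if whitespace(c):        # remember the two most recent white spaces
        c2last = clast
        clast = c

def getword(f):
    w = ""
    while whitespace(ch(f)):
        nextch(f)
    while not whitespace(ch(f)) and ch(f) != "":
        w = w + ch(f)
        nextch(f)
    return w

def iscommand(w):
    return c2last == "\n" and w in table

def find_commands(text):
    """Hypothetical driver: list the words in text recognized as commands."""
    global c, clast, c2last
    c, clast, c2last = "", "\n", ""
    f = io.StringIO(text)
    nextch(f)
    found = []
    w = getword(f)
    while w != "":
        if iscommand(w):
            found.append(w)
        w = getword(f)
    return found

print(find_commands("Some text\n.br\nnot a .br here\n.xx\n.ll 30\n"))
```

The “.br” in the middle of the third line is not reported because the white space in front of it is a blank, and “.xx” is not reported because it is not in the table. One limitation of this sketch: a final word that is not followed by any white space is checked against stale values, so the test input ends with a newline.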

Notice that the development of the program consists of an initial sketch and then filling in the code as stubs and coding the stubs to be functional code, one at a time. Sometimes a stub requires further undefined functions to be used, and those could be coded as stubs too, or completed if they are small so as to allow testing to proceed. It’s a judgment call as to whether to complete the stubs down the chain for one part of the program or to proceed to the next one at the current level. For example, should we have completed the nextch() and ch() functions before trying to design process_command()? It does depend on how testing can proceed and what level we are at. The nextch() function looks like it won’t call other functions that have not been implemented, and it is difficult to test getword() without finishing nextch().

This discussion speaks to what the next step will be from here, and there could be many. Let’s look at commands next, because they will dictate the output, and then deal with formatting last. It is known that a string represents a command, and the function called as a consequence is process_command(). This function must determine which command string was seen and what to do about it. The way commands are handled and the way the output document is specified has to be sorted out before this function can be finished, but a set of stubs can hold the place of future decisions as before.

The string that was seen to be a command is one of those stored in the tuple named table. The index of the string within the tuple tells us which command was seen, although a string match could be done directly. Using a tuple is better because new commands can always be added to the end of the tuple during future modifications and it is easier to modify command names. The function, which used to be a stub, now dispatches on that index.
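A dispatch on the tuple index might look like the sketch below. This is not the listing from the book. The dictionary state, its default values, and the explicit parameter n are assumptions for illustration; the names of the flags and parameters follow those mentioned in the text.

```python
table = (".pl", ".bp", ".br", ".fi", ".nf", ".na", ".ce", ".ls", ".ll",
         ".in", ".ti", ".nh", ".hy", ".sp")

# Hypothetical formatter state; the default values are assumptions.
state = {"page_length": 66, "fill": True, "adjust": True, "center": False,
         "center_count": 0, "spacing": 1, "line_length": 65,
         "nindent": 0, "tempindent": 0, "hyphenate": False}

def process_command(w, n=None):
    """Update state for command w; n is its numeric parameter, if any."""
    k = table.index(w)          # position in the tuple identifies the command
    if k == 0 and n is not None:        # .pl n
        state["page_length"] = n
    elif k == 3:                        # .fi
        state["fill"] = True
    elif k == 4:                        # .nf
        state["fill"] = False
    elif k == 5:                        # .na
        state["adjust"] = False
    elif k == 6:                        # .ce n
        state["center"] = True
        state["center_count"] = n if n is not None else 1
    elif k == 7 and n is not None:      # .ls n
        state["spacing"] = n
    elif k == 8 and n is not None:      # .ll n
        state["line_length"] = n
    elif k == 9 and n is not None:      # .in n
        state["nindent"] = n
    elif k == 10 and n is not None:     # .ti n
        state["tempindent"] = n
    elif k == 11:                       # .nh
        state["hyphenate"] = False
    elif k == 12:                       # .hy
        state["hyphenate"] = True
    # .bp, .br and .sp (indexes 1, 2 and 13) produce output and are
    # left as stubs in this sketch.

process_command(".ll", 30)
print(state["line_length"])
```

Because only the index is tested, a command could be renamed by editing the tuple alone, which is the advantage the text describes.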

This completes iteration 5 of the system and generates quite a few new stubs and defines how some of the output functions will operate. There are some flags (hyphenate, center, fill, and adjust) and some parameters for the output process (line_length and spacing) that are set, and so will be used in sending output text to the file. These parameters being known, it is time to define the output process, which is implemented starting with the function process_word().

As mentioned earlier, the program reads data one character at a time and emits it as words. There is a specified line length, and words can be read and stored until that length is neared or exceeded. Words could be stored in a string. When the line length is reached, the string could be written to the file. If right justification is being done, spaces could be added to some other spaces in the string until the line length was met exactly, or the final word could be hyphenated to meet the line length. If right justification is not being done, then the line length only has to be approached, but not exceeded.

For text centering, input lines are padded with equal numbers of spaces on both sides. The page size is met by counting lines, and then by creating a new page when the page size is met, possibly by entering a form feed or perhaps by printing empty lines until a specified count is reached. Indenting is simple: the “.in” command results in a fixed number of spaces being placed at the beginning of each output line; the “.ti” command results in a specified number of spaces being placed at the beginning of the current line. Hyphenation is done by table lookup. Certain suffixes and prefixes and letter combinations are possible locations for a hyphen. The final word on a line can be hyphenated if a location within it is subject to a hyphen as indicated by the table.

The process is to read and build words and copy them to a string, the next output line. No action is taken until the string nears the line length, at which point insertion of spaces, hyphenation, or other actions may be taken to make the string fit the line, either closely or precisely. After a line meets the size needed, it is written, perhaps followed by others if the line spacing is larger than one. The basic action of the process_word() function is to copy the word to a string, the output buffer, under the control of a set of variables that are defined by the user through commands.

The simplest version of process_word() copies words to the buffer until the line is full and then writes that line to the output file.

def process_word(w):
    global buffer, line_length
    if len(buffer) + len(w) + 1 <= line_length:
        buffer = buffer + " " + w
    else:
        emit(buffer)
        buffer = w

The code above adds the given word plus a space to the buffer if there is room. Otherwise, it calls the emit() function to write the buffer to the output file and places the word at the beginning of a new line. This is nearly correct. Some of the output for the sample source is as follows:

This is sample text for testing Pyroff. The default is to right adjust continuously, but embedded commands can change this.

Now the line width should be 30 characters, and so the left margin is pulled back. This line is centered .xx not a command. Indented 4

Note that the command “.ll 30” was correctly handled, but that there is an extra space at the beginning of the first line. That’s due to the fact that process_word() adds a space between words, and if the buffer is empty that space gets placed at the beginning. The solution is to check for an empty buffer:

    if len(buffer) + len(w) + 1 <= line_length:
        if len(buffer) > 0:
            buffer = buffer + " " + w
        else:
            buffer = w

This was a successful fix, and completes iteration 6 of the system, which is now 150 lines long.
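The filling logic at this stage can be checked in isolation. The sketch below is a simplified stand-in for process_word() and emit(): the function fill_lines() is hypothetical, takes the words as a list, and returns the output lines instead of writing them to a file.

```python
def fill_lines(words, line_length):
    """Pack words into lines of at most line_length characters."""
    lines = []
    buffer = ""
    for w in words:
        if len(buffer) == 0:
            buffer = w                       # first word: no leading space
        elif len(buffer) + len(w) + 1 <= line_length:
            buffer = buffer + " " + w        # room for a space and the word
        else:
            lines.append(buffer)             # line is full: write it out
            buffer = w
    if buffer:
        lines.append(buffer)                 # flush the last, partial line
    return lines

for line in fill_lines("the quick brown fox jumps".split(), 10):
    print(line)
```

A word longer than the line length is simply placed on a line of its own here, which leaves the problem to the user as the text suggests.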

Within process_word(), there are multiple options for how words can be written to the output. What has been done so far amounts to filling but no right justification. Other options are no filling, centering, and justification. When the filling is turned off, an input line becomes an output line. This is true for centering as well. When justification is taking place, the program will make the output lines exactly line_length characters long by inserting spaces in the line to extend it and by hyphenation, where permitted, to shorten it. The rule is that the line must be the correct length and must not begin or end with a space. The implementation of this part of the program is at the heart of the overall system, but would not be possible without a sensible design up to this point.

2. Centering

First, a centered line is to be written to output when an end of line is seen on input. This means that the clast variable is used to identify the end of line and to emit the text. Next, the line has spaces added to the beginning and end to center it. The buffer holds the line to be written and has len(buffer) characters. The number of spaces to be added totals line_length - len(buffer), and half are added to the beginning of the line and half to the end. A function that centers a given string would be as follows:

def do_center(s):
    global line_length
    k = len(s)                     # How long is the string?
    b1 = line_length - k           # How much shorter than the line?
    b2 = b1 // 2                   # Split that amount in two
    b1 = line_length - k - b2
    s = " " * b1 + s + " " * b2    # Add spaces to center the text
    emit(s)                        # Write to file
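The arithmetic in do_center() is easy to check with a variant that returns the padded string instead of writing it. The function name center_text() is hypothetical, and the guard for strings longer than the line is an assumption not present in the original.

```python
def center_text(s, line_length):
    """Return s padded with spaces so that it is centered in line_length."""
    k = len(s)
    total = max(line_length - k, 0)   # assumption: never pad by a negative
    right = total // 2                # half of the spaces go on the right
    left = total - right              # the rest (one more if odd) go left
    return " " * left + s + " " * right

padded = center_text("abc", 10)
print("[" + padded + "]")
```

When the number of spaces is odd, the extra space goes to the left of the text, as in the original function.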

In the process_word() function, some code must be added to handle centering. This code has to detect the end of line and pass the buffer to do_center(). It also counts the lines, because the “.ce” command specifies a number of lines to be centered.

if center:                          # Text is being centered, no fill
    if len(buffer) > 0:             # Add this word to the line
        buffer = buffer + " " + w
    else:
        buffer = w
    if clast == "\n":               # An input line = an output line
        do_center(buffer)           # Emit the text
        center_count = center_count - 1    # Count lines
        if center_count <= 0:       # Done?
            center = False          # Yes. Stop centering.

This code is not quite enough. There are two problems observed. One problem is that the buffer could be partly full when the “.ce” command is seen, and must be emptied. This problem is serious, because filling may be taking place and the line might have to be justified. For the moment, a call to emit() happens when the “.ce” command is seen, but this will have to be expanded.

The other problem is simpler: the do_center() function does not empty the buffer, so the line being centered occurs twice in the output.

The solution is to clear the buffer after do_center() is called:

do_center(buffer)    # Emit the text
buffer = ""          # Clear the buffer

3. Right Justification

Centering text is a first step to understanding how to justify it. Right justified text has the sentences arranged so that the right margin is aligned to the line. When centering, spaces are added to the left and right ends of the string so as to place any text in the middle of the line. When justifying, any space in the line can be made into multiple spaces, thus extending the text until it reaches the right margin. Naturally it would not be acceptable to place all of the needed spaces in one spot. It looks best if they are distributed as evenly as possible. However, no matter what is done, there will be some situations that cause ugly spacing.

The number of spaces needed to fill up a line is line_length - len(buffer), just as it was when centering. As words are added to the line, this value becomes smaller. When it is smaller than the length of the next word to be added, then the extra spaces must be added and a new line started. That is, when

k = line_length - len(buffer)
if k < len(word):

then adjusting is performed. First, count the spaces in the buffer and call this nspaces. If k>nspaces, then widen each single space by k//nspaces additional space characters and set k = k%nspaces. This will rarely happen. Now, we need to change some of the spaces in the buffer into double spaces. Which ones? In an attempt to spread them around, set xk = k + k//2. This will be used as an increment to find consecutive spots to put spaces. So for example, let k = 5, in which case xk = 7, and suppose the line has five spaces, numbered 0 to 4. The first space could be placed in the middle, or at space number 2. Now count xk positions from 2, starting over at zero when you hit the end. This will give 4 as the next position, followed by 1, then 3, and then 0. This process seems to spread them out. Now the buffer is written out and the new word is placed in an empty buffer.
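The stepping scheme can be worked through in code. The sketch below is one reading of the description, not the program's actual implementation: the function names insertion_order() and justify() are hypothetical, starting at the middle space (nspaces//2) is an assumption, and the input line is assumed to contain single spaces only.

```python
def insertion_order(k, nspaces):
    """Return the k space positions (0-based) that receive an extra space."""
    xk = k + k // 2                  # the increment described in the text
    pos = nspaces // 2               # assumption: start at the middle space
    order = []
    for _ in range(k):
        order.append(pos)
        pos = (pos + xk) % nspaces   # wrap around at the end of the line
    return order

def justify(line, line_length):
    """Pad a single-spaced line with extra spaces to exactly line_length."""
    words = line.split(" ")
    nspaces = len(words) - 1
    k = line_length - len(line)          # spaces still needed
    if nspaces == 0 or k <= 0:
        return line
    gaps = [1 + k // nspaces] * nspaces  # widen every space equally first
    k = k % nspaces                      # what is left to distribute
    for g in insertion_order(k, nspaces):
        gaps[g] += 1
    out = words[0]
    for gap, w in zip(gaps, words[1:]):
        out += " " * gap + w
    return out

print(insertion_order(5, 5))
print("[" + justify("one two three four five", 26) + "]")
```

The first call reproduces the sequence 2, 4, 1, 3, 0 from the example. The second shows a weakness of the plain increment: with k = 3 and four spaces, xk is 4, so every step lands on the same position and all of the extra spaces pile up in one place, even though the line does reach the right margin.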

This sounds tricky, so let’s work through it. Never enter code that is not likely to work! Inside of the process_word() function, check to see if adjusting is going on. If so, check to see if the current word fits in the current line. If so, put it there and move on.

The function nth_space(s, n) locates the nth space character in the string s, wrapping around modulo the string length; here it is called as nth_space(buffer, xk). The spaces were not well distributed with this code in some cases. There was a suspicion that it depended on whether the number of remaining spaces was even or odd, so the code was modified to read

…

xk = k + (k+1)//2    # Space insert increment
if k%2 == 0:
    xk = xk + 1

…

which worked better. The output for the first part of the test data was as follows:

The short lines are right justified, but the distribution of the spaces could still be better.

The function nth_space() is important, and looks like this:

4. Other Commands

The rest of the commands have to do with hyphenation, pagination, and indentation, except for the “.br” command. Dealing with the indentation first, the command “.in” specifies a number of characters to indent, as does “.ti”. The “.in” command begins indenting lines from the current point on, whereas “.ti” only indents the next line. Since the “.ti” command only indents the next line of text, perhaps initializing the buffer to the correct number of spaces will be the right approach. The rest of the text for the line will be concatenated to the spaces, resulting in an indented line.

The “.in” command currently results in the setting of a variable named nindent to the number of spaces to be indented. Following the suggestion for a temporary indent, why not replace all initializations of the buffer with indented ones? There are multiple locations within the process_word() function where the buffer is set to the next word: buffer = w

These could be changed to

buffer = " " * nindent + w

This sounds clean and simple, but it fails miserably. Here is what it looks like. For the input text, we have

Indented 4 characters.

.in 2

The idea behind top-down programming is that the higher levels of abstraction are described first. A description of what he entire program is to do is written in a kind-of English/computer hybrid language (pseudocode), and this description involves making calls to functions that have not yet been written but whose function is known.

We get the following results:

Indented 4 characters. The idea behind top-down programming is that the higher levels of abstraction are described first. A description of what he entire program is to do is written in a kind-of English/computer hybrid language (pseudocode), and this description involves making calls to functions that have not yet been

Can you figure out where the problem is by looking at the output? This is a skill that develops as you read more code, write more code, and design more code. There is a place in the program that will add spaces to the text, and clearly that has been done here. It is how the text is right adjusted. The spaces are counted and sometimes replaced with double spaces. This happened here to some of the spaces used to implement the indent.

Possible solutions include the use of special characters instead of leading blanks, to be replaced when printed; finding another way to implement indenting; modifying the way right adjusting is done. Because the number of spaces at the beginning of the line is known, the last of these should be possible: when counting spaces in the adjustment process, skip the nindent characters at the beginning of the line. This is a modification to the function nth_space() to position the count after the indent:

def nth_space(s, n):
    global nindent
    nn = 0
    i = 0
    while True:
        print("nn=", nn)
        if s[i] == " ":
            nn = nn + 1
            # print (…. )
            if nn >= n:
                return i
        i = (i + 1) % len(s)
        if i < nindent + tempindent:
            i = nindent + tempindent

A second problem in the indentation code is that there should be a line break when the command is seen. This is a matter of writing the buffer and then clearing it. This should also occur when a temporary indent occurs, but before it inserts the spaces. The temporary indent will have the same problem as the indent with respect to the right adjustment, and we have not dealt with that.

The line break can be handled with a new function:

def lbreak():
    global buffer, tempindent, nindent
    if len(buffer) > 0:
        emit(buffer)
    buffer = " " * (nindent + tempindent)
    tempindent = 0

The break involves writing the buffer and clearing it. Clearing it also means setting the indentation. Because this sequence of operations happens elsewhere in the program, those sequences can be replaced by a call to lbreak(). Note that a new variable tempindent has been added; it holds the number of spaces for a temporary indentation, and it is added to the regular nindent value everywhere that a variable is used to obtain the total indentation for a line. Now the right adjustment of a temporarily indented line should work.

The lbreak() function is used directly to implement the “.br” command. A stub previously named genline() can be removed and replaced by a call to lbreak().

Line spacing can be handled in emit(), which is where lines are written to the output file. After the current buffer is written, a number of newline characters are written to equal the correct line spacing. The new emit() function is

def emit(s):
    global outf, lines, tempindent, spacing, page_length
    outf.write(s + "\n")
    lines = (lines + 1) % page_length
    for i in range(1, spacing):
        outf.write("\n")
        lines = (lines + 1) % page_length
    tempindent = 0

What about pages? There is a command that deals with pages directly, and that is “.bp”, which starts a new page. The page length is known in terms of the number of lines, and emit() counts the lines as it writes them. Implementing the “.bp” command should be a matter of emitting the number of lines needed to complete the current page. The code looks like this:

def genpage():
    global page_length, lines
    lbreak()
    for i in range(lines, page_length):
        emit("")
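Line spacing and page filling can be tried together using io.StringIO in place of the output file. In this sketch emit() follows the version above without the indentation bookkeeping, while finish_page() is a hypothetical variant of genpage() that pads one line at a time; that choice is an assumption made here because, when spacing is greater than one, each call to emit("") writes more than one line and a loop with a fixed count could run past the end of the page.

```python
import io

outf = io.StringIO()   # stands in for the output file
lines = 0              # lines written so far on the current page
spacing = 2            # as set by ".ls 2"
page_length = 10       # as set by ".pl 10"

def emit(s):
    """Write s, then spacing-1 empty lines, counting lines modulo the page."""
    global lines
    outf.write(s + "\n")
    lines = (lines + 1) % page_length
    for i in range(1, spacing):
        outf.write("\n")
        lines = (lines + 1) % page_length

def finish_page():
    """Hypothetical variant of genpage(): pad the page one line at a time."""
    global lines
    while lines != 0:
        outf.write("\n")
        lines = (lines + 1) % page_length

emit("first line")
emit("second line")
finish_page()
print(outf.getvalue().count("\n"))   # total number of lines written
```

Two double-spaced lines occupy four lines of the page, and the padding brings the total up to the page length.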

All that is missing is the ability to hyphenate, which is left as one of the exercises. The system appears to do what is needed using the small test file, so the time has come to construct more thorough tests. The file “preface.txt” holds the text for the preface of a book named Practical Computer Vision Using C. This book was written using Nroff, and the commands not available in Pyroff were removed from the source text so that it could be used as test data. It consists of over 500 lines of text. The result of the first try with this program was interesting.

Pyroff appeared to run using this input file, but never terminated. No output file was created. The first step was to try to see where it was having trouble, so a print statement was added to show what word had been processed last. That word was “spectrograms,” and it appears in the first paragraph of text, after headings and such. Now the data that caused the problem is known. What is the program doing? There must be an unterminated loop someplace. Putting prints in likely spots identifies the culprit as the loop in the nth_space() function. Tracing through that loop finds an odd thing: the value of nindent becomes negative, and that causes the loop never to terminate. The test data contained a situation that caused the program to fail, and that situation resulted from a difference between Nroff and pyroff: in Nroff the command ‘.in -4’ subtracts 4 from the current indentation, whereas in pyroff it sets the current indent to -4.

This kind of error is very common. All values entered by a user must be tested against the legal bounds for that variable. This was not done here, and the fix is simple. However, it reminds us to do that for all other user input values. These are processed in the function process_command(), so locating those values is easy. Once this was done things worked pretty well. There was one problem observed, and that was an indentation error. Consider the input text:
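A bounds check for the indentation value might be sketched as follows. The function name parse_indent() is hypothetical, and treating a signed parameter as relative to the current indent mirrors the Nroff behavior described above rather than anything the text says Pyroff does; the legal range of 0 up to one less than the line length is likewise an assumption.

```python
def parse_indent(arg, current, line_length):
    """Return the new indent for '.in arg', or current if arg is not legal."""
    try:
        n = int(arg)
    except ValueError:
        return current                 # not a number: keep the old indent
    if arg.startswith(("+", "-")):     # signed value: relative, as in Nroff
        n = current + n
    if n < 0 or n >= line_length:      # outside the legal bounds: ignore
        return current
    return n

print(parse_indent("-4", 8, 65))   # relative: 8 - 4
print(parse_indent("-4", 0, 65))   # would be negative, so it is rejected
print(parse_indent("3", 0, 65))    # absolute
```

The same pattern (convert, check against bounds, fall back to the old value) applies to the other numeric parameters handled in process_command().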

.nf
1. Introduction
.in 3
1.1 Images as digital objects
1.2 Image storage and display
1.3 Image acquisition
1.4 Image types and applications

The program formats this text as follows:

1. Introduction
1.1 Images as digital objects
1.2 Image storage and display
1.3 Image acquisition
1.4 Image types and applications

There is an extra space in the first line after the indent. This seems like it should be easy to find where the problem is, but the function that implements the command, indent(), looks fine. However, on careful examination (and printing some buffer values), it can be seen that it should not call lbreak() because that function sets the buffer to the properly indented number of space characters. This means that when the later tests for an empty buffer occur, the buffer is not empty and text is appended to it rather than being simply assigned to it. That is, for an empty buffer the first word is placed into it:

buffer = word

whereas, if text is present, the word is appended after adding a space:

buffer = buffer + " " + word

The indent function now looks like this:

def indent(n):
    global nindent, buffer
    nindent = n
    emit(buffer)
    buffer = ""

The preface is now formatted well. Other problems may exist, and these should be reported to the author and publisher when discovered. (The book’s wiki is the place for such discussions.)

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.
