Language Symbols and Scanning in Python

Many symbols in a language consist of multiple characters. Some don’t, like “>” and “<”. However, “>=” means greater than or equal to and is two characters. An identifier, like a variable name, can be many characters, as can numeric constants. Some identifiers are special, like key words: “if” and “while” have special meanings and cannot be used as identifiers.

A scanner is a module that reads input characters and replaces them with symbols. In a parser, symbols are constants that take the place of more complex character sequences, so the less than symbol “<” could be represented by a name lessSy that had the numerical value 102. The name and value are arbitrary, but the idea is to simplify the input.

Consider the PyJ statement:

for control = (pi*pi):maxangle

As symbols, this could be represented as: forsy ident eqsy lparen ident mult ident rparen colon ident where each of those names was a number. The result could be

6 21 9 24 21 23 21 25 32 21

This is easier to handle in a parser, and the symbolic names can make the code easier to read. The complete list of symbols in PyJ is shown in Table 14.1.
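The mapping from symbol names to numbers can be sketched with a small dictionary. This is only an illustration: the numeric values are the ones used in the example above, but the dictionary name SYMBOLS and the helper encode are assumptions, not part of PyJ itself.

```python
# Symbol codes taken from the worked example above; the names SYMBOLS and
# encode are hypothetical and exist only for this illustration.
SYMBOLS = {
    "forsy": 6,
    "eqsy": 9,
    "ident": 21,
    "mult": 23,
    "lparen": 24,
    "rparen": 25,
    "colon": 32,
}

def encode(names):
    """Turn a space-separated list of symbol names into their numeric codes."""
    return [SYMBOLS[name] for name in names.split()]

codes = encode("forsy ident eqsy lparen ident mult ident rparen colon ident")
print(codes)
```

A parser can then compare small integers instead of re-examining strings of characters.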

The scanner opens the input file and reads characters, either one at a time or into a buffer. In either case, characters are examined one at a time and are combined into symbols. At all times, there is a variable that contains the last symbol that was encountered: the current symbol. It is named sy. The parser makes its decisions based on the value of sy.

An essential function in the scanner is the one that gets the next character. It is named nextCh, and it looks something like this:

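def nextCh():
    global ch, eof
    if len(ch) == 0:        # End of file means no more characters
        return eof
    try:
        ch = fp.read(1)     # Here we read characters one at a time.
    except OSError:         # Treat a failed read as end of file
        ch = ""
    if len(ch) == 0:        # read() returns "" at end of file
        return eof
    return ch               # Global var ch is set; return it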

If all characters are read into a buffer, then this code might look like this:

def nextCh():
    global ch, eof, indx
    if len(buffer) <= indx:   # End of file means no more characters
        return eof
    ch = buffer[indx]         # Here we read characters one at a time.
    indx = indx + 1
    return ch                 # Set global var ch and return it

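Using a buffer is usually faster, because indexing into a string that is already in memory avoids a read call for every character.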
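Both approaches deliver the same stream of characters. The following sketch checks that on the PyJ statement from earlier, using io.StringIO as a stand-in for an open file. The function names and the empty-string EOF sentinel are assumptions made for this illustration.

```python
import io

EOF = ""   # assumed sentinel: read(1) returns "" at end of file

def chars_from_file(fp):
    """Collect characters by calling read(1) until end of file."""
    out = []
    while True:
        c = fp.read(1)
        if c == EOF:
            break
        out.append(c)
    return out

def chars_from_buffer(buffer):
    """Collect characters by indexing into a string held in memory."""
    out = []
    indx = 0
    while indx < len(buffer):
        out.append(buffer[indx])
        indx = indx + 1
    return out

source = "for control = (pi*pi):maxangle"
from_file = chars_from_file(io.StringIO(source))
from_buffer = chars_from_buffer(source)
print(from_file == from_buffer)
```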

The next part of the scanner consists of some code that builds numbers and identifiers from characters. Building a number from characters has two parts, though. The first is “is this a legal real number?” and the second is “what number is it?” Both can be answered in the same pass over the digits.

def scanNumber():
    global numberVal
    numberVal = 0
    fracVal = 0
    while digit(ch):                              # Integer part
        numberVal = numberVal*10 + digitVal(ch)   # collect value
        nextCh()                                  # next digit?
    if ch == ".":                                 # Fractional part
        nextCh()
        fracVal = 0.0
        pten = 10.0
        while digit(ch):                          # Each fractional digit has its value
            fracVal = fracVal + digitVal(ch)/pten # divided by a power of 10 and summed.
            nextCh()                              # Next digit
            pten = pten*10                        # next power of 10
    numberVal = numberVal + fracVal

The scanNumber function is called when the scanner sees a digit. It accepts digits and accumulates a numerical value by multiplying the value of the digit by its appropriate power of ten. At the end of this, we have an integer value. If that is followed by a decimal point, then each digit that follows, if any, is part of a fraction. A fraction is accumulated by multiplying the digit values by a negative power of ten, or dividing by a power of ten, and accumulating a sum in the variable fracVal. When no more digits are seen, the resulting number is numberVal + fracVal.
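The accumulation can be tried out on its own with a self-contained sketch. This version is not the book's scanNumber: it takes the text and a starting position as parameters instead of using the globals ch and numberVal, and the names scan_number and is_digit are assumptions for this illustration.

```python
def is_digit(c):
    """True for the ASCII digits 0 through 9 only."""
    return "0" <= c <= "9"

def scan_number(text, pos=0):
    """Return (value, new_pos) for the number that starts at text[pos]."""
    value = 0
    while pos < len(text) and is_digit(text[pos]):     # integer part
        value = value * 10 + int(text[pos])
        pos += 1
    if pos < len(text) and text[pos] == ".":           # fractional part
        pos += 1
        frac = 0.0
        pten = 10.0
        while pos < len(text) and is_digit(text[pos]):
            frac = frac + int(text[pos]) / pten        # digit times 10**-k
            pten = pten * 10                           # next power of ten
            pos += 1
        value = value + frac
    return value, pos

value, pos = scan_number("3.25+x")
print(value, pos)
```

For the input "3.25+x" the integer part is 3, the fraction is 2/10 + 5/100, and the scan stops at the "+", which is left for the next symbol.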

The process for identifiers, which is to say variable and function names, is similar.

def scanIdent():
    global ident
    ident = ""                   # Start with empty string
    while identChar(ch):         # A letter?
        ident = ident + ch       # Add to the identifier
        nextCh()                 # Get the next character

The global variable ident contains all of the characters in the identifier. Some identifiers are key words like “if.” We’ll work that out now. How do we know what an identifier means? We can look it up in a dictionary.

A global dictionary is created that stores symbols indexed by their identifier string. It’s probably the simplest and fastest way to see if an identifier is a key word.

If an identifier is found in this dictionary, then it represents the corresponding key word symbol. Otherwise, it is a variable or function name.
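A minimal sketch of such a dictionary follows. The key words listed and all of the numeric values are placeholders chosen for illustration (6 for the for symbol and 21 for an identifier echo the earlier example); the real list is the one in Table 14.1. The helper name classify is also an assumption.

```python
identSy = 21                      # placeholder code for "ordinary identifier"

# Placeholder key word table: identifier string -> symbol code.
keywords = {
    "for": 6,
    "if": 7,
    "while": 8,
}

def classify(ident):
    """Return the key word symbol for ident, or identSy if it is not a key word."""
    return keywords.get(ident, identSy)

print(classify("while"), classify("maxangle"))
```

Using dict.get with a default performs the lookup and the fallback in one step, which is equivalent to catching the KeyError from a failed index.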

We are now ready to build the main scanner function, called nextSy(). This function is 60 lines long and is not reproduced here completely, but can be found on the website and on the accompanying DVD. However, the main parts of it can be described without seeing all of it.

It uses the global variable ch to determine what the next symbol will be.

If the character ch is a letter, then nextSy calls scanIdent to build an identifier string. It looks that string up in the dictionary. If it is found, nextSy returns the key word symbol; otherwise it returns the identSy symbol:

    if letter(ch):
        scanIdent()
        try:
            k = keywords[ident]
            return k
        except KeyError:         # Not a key word
            return identSy

In a similar way, if ch is a digit, then it scans and creates a number by calling scanNumber and returns the generic symbol for a number, numberSy:

    if digit(ch):
        scanNumber()
        return numberSy

If ch is one of the single character symbols, then skip the character and return the symbol, like this:

    if ch == "+":
        nextCh()
        return plusSy
    elif ch == "-":
        nextCh()
        return minusSy

Finally, if the character could begin a symbol of two characters (called a digraph), then we read another character and see if it fits as the second part. If so, we read another and return the digraph symbol; otherwise we return the original single character symbol, like this:

    elif ch == "<":              # < can start one of three symbols:
                                 # < <= <>
        nextCh()                 # Look at the next character
        if ch == "=":            # Is it '='? Then we have '<='
            nextCh()             # read another
            return lesseqSy      # Return <=
        elif ch == ">":          # OK, not '='. Is it '>'?
            nextCh()             # Yup. Skip the character
            return noteqSy       # And return <> (noteqSy)
        else:
            return lessSy        # Nope. It was just < (lessSy)

Notice that in all three cases above, the value of ch is the next character in sequence, one that has not yet been used to build a symbol. This has to be true in all situations.

Now we have a scheme that will give us the next symbol in all cases. That’s what the parser needs.
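The pieces can be seen working together in a much reduced sketch. It is not the 60-line nextSy from the book: it handles only identifiers, blanks, and the three symbols that begin with “<”, it keeps its state in a class rather than in globals, and it returns symbol names as strings instead of numbers so the result is easy to read. The class name Scanner, the helper scan_all, the eofSy symbol, and the empty-string EOF sentinel are all assumptions.

```python
EOF = ""   # assumed end-of-input sentinel

class Scanner:
    def __init__(self, text):
        self.buffer = text
        self.indx = 0
        self.ch = EOF
        self.nextCh()                  # prime ch with the first character

    def nextCh(self):
        """Advance to the next character, or EOF when the buffer is used up."""
        if self.indx < len(self.buffer):
            self.ch = self.buffer[self.indx]
            self.indx += 1
        else:
            self.ch = EOF
        return self.ch

    def nextSy(self):
        """Return the next symbol; ch is left on the first unused character."""
        while self.ch == " ":          # skip blanks between symbols
            self.nextCh()
        if self.ch == EOF:
            return "eofSy"
        if self.ch.isalpha():          # identifier: letter, then letters/digits
            while self.ch.isalnum():
                self.nextCh()
            return "identSy"
        if self.ch == "<":             # could be <, <= or <>
            self.nextCh()
            if self.ch == "=":
                self.nextCh()
                return "lesseqSy"
            elif self.ch == ">":
                self.nextCh()
                return "noteqSy"
            else:
                return "lessSy"
        raise ValueError("unexpected character: " + repr(self.ch))

def scan_all(text):
    """Collect every symbol in text up to, but not including, eofSy."""
    scanner = Scanner(text)
    symbols = []
    sy = scanner.nextSy()
    while sy != "eofSy":
        symbols.append(sy)
        sy = scanner.nextSy()
    return symbols

symbols = scan_all("a <= b<>c<d")
print(symbols)
```

Each branch of nextSy returns with ch holding a character that has not yet been consumed, which is the invariant described above.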

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.
