A *grammar* is a formal specification of the structure of some data item or collection. For the purposes here, we’ll assume that data consists of characters, because in the context of parsing, that is frequently true. A grammar defines all possible forms that a dataset can take in a formal way. The structure of a CSV file has already been seen, so let’s start with something simple yet common – a number.

An integer or floating point number is a sequence of characters that has a specific structure. FA grammar can be specified in multiple ways, but here we’ll use a description called the *Backus-Naur form* (BNF). When using BNF, a symbol that is on the left side of the definition is defined by or can be replaced by a set of symbols on the right. This is called a rule. Here’s an example:

<integer> ::= <digit> | <integer><digit>

The symbol on the left, “<integer>,” is being defined. The symbol “::=” separates the symbol being defined from the definition. The definition here says that an “<integer>” can be a “<digit>” OR (the symbol “|”) an “<integer>” followed by a “<digit”>.

Any name enclosed in “<>” is called a *non-terminal* symbol, because it cannot appear in the final result. *Terminal* symbols are the actual components of what is being defined, which we call a *language*. Let’s continue by defining a “<digit>:”

<digit> ::= “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” |”8″ | “9”

This defines the abstract, non-terminal symbol “<digit>” as being any one of ten terminal symbols, which in this case are the characters “0” through “9.” A digit is one of the numerals that form our number system. If we now define

<ssnumber> ::= <digit><digit><digit><digit><digit><digit> <digit><digit><digit>

we have a social security number, a nine-digit sequence. They are concatenated to form a long number.

Note that “<integer>” appears on the left side and the right side. This definition is recursive, but all that means is that it is defined partly in terms of itself. An “<integer>” can be one digit. The rule above for “<integer>” is really two rules separated by the OR symbol. It could be

<integer> ::= <digit>

<integer> ::= <integer><digit>

Let’s label the rules. <integer> ::= <digit> is Rule 1a, and <integer> ::= <integer><digit> is Rule 1b, because they appear in the same rule. The rule for the digit could be Rule 2, and Rule 2a is <digit> ::= 0, and Rule 2b is <digit> ::= “1” and so on.

Now, consider the input “5.” By this grammar, we can use Rule 2f to change the input symbol “5” into “<digit>.” By Rule 1a, we can change this to “<integer>.” This process of using rules to change the terminal symbols into target nonterminal symbol is *parsing*. Defined this way, it’s pretty formal.

Now consider the input string “320.” If we are trying to read an integer, we use Rule 2d to get the following:

<digit>20

Rule 1a to get <integer>20

Then Rule 2c: <integer><digit>0

Rule 1b: <integer>0

Rule 2a: <integer><digit>

Rule 1b: <integer>

And we’re done, because there is no more input.

Extended BNF offers things we can use. Items within square brackets are optional; they may be there or not. So

<postal code> ::= <letter><digit><letter>

[-]<digit><letter><digit>

could define a Canadian postal code. A code could be T2N1N4 or T2N-1N4.

Items in braces can appear zero or more times,

<integer> ::= <digit> {<digit>}

Or they can be suffixed with an asterisk (“*”), as in a regular expression:

<integer> ::= <digit><digit>*

Items that occur ONE or more times are suffixed with a “+:”

<integer> ::= <digit>+

Let’s define a floating point number. It is a sequence of digits, or an “<integer>,” possibly followed by a decimal point which is possibly followed by another sequence of digits. Here’s one possible rule:

<float> ::= <digit>+ “.” {<digit>}

Which demands one digit at least to the left of the decimal point. In this case “1.0” is legal but not “.12”. Another option is

<float> ::= <digit>* “.” <digit>+

which requires one digit at least to the right of the decimal point. In this case “.12” is legal, but not “0.” What we need is one of the other. We need at least one digit and a decimal point. Simply combine these:

<float> ::= <digit>+ “.” {<digit>} | <digit>* “.” <digit>+

This discussion is very brief, and a full understanding of grammars and languages would take a great deal of time. What we need is an understanding of how to specify a language that we want to parse. The example language to parse is a simple programming language called PyJ.

Source: Parker James R. (2021), *Python: An Introduction to Programming*, Mercury Learning and Information; Second edition.