PEGs and the Structure of Languages

How Parsing Expression Grammars help us (and computers) understand the structure of programming languages.

Estimated reading time: 7 mins

Bruce Hill May 29, 2021

When I first began working on my programming language Nomsu, one of the first challenges I faced was how to get the computer to see code written in my language as more than just a long sequence of letters. Code has structure, and finding that structure is called parsing. After spending a long time digging through different approaches to the problem, one lesser-known (though increasingly popular) approach called Parsing Expression Grammars became my favorite, and allowed me to make my own programming language. This post will explain what Parsing Expression Grammars (PEGs) are, and a bit of background on the problem they solve. There are some very good PEG libraries available (I first found out about PEGs through the Lua library LPEG), and in a follow-up post, I’ll even explain how to implement a PEG parser from scratch. Hopefully, after reading this post, you’ll have enough understanding to try defining your own language (programming or otherwise) with a Parsing Expression Grammar.

A Note on Tokens

Parsing is the problem of understanding the structure and meaning of a series of tokens. “Tokens” can refer to discrete chunks of an input. For example, “bytes in a file” or multi-byte chunks like “words” or “whitespace”. If a parser does split input into multi-byte tokens before processing, this step is called “tokenization” or “lexing.” PEGs are very good at concisely expressing low-level concepts, so you often don’t need to do this preprocessing step. In this post, I’m going to assume that the input is just text without any preprocessing.

Some Background

Formal specifications of the structure of a language go back a long way to linguists like Noam Chomsky. Two of the more influential concepts are context free grammars and regular expressions. A full explanation of these topics is beyond the scope of this post, but I will say that these paradigms have been massively useful for describing the structure of a language or pattern. Unfortunately, these paradigms come from a generative perspective, in other words, they’re structured like a choose-your-own-adventure. Consider this grammar:

sentence <-
    subject (intransitive_verb | intransitive_verb object) "."
subject <- noun_phrase
object <- noun_phrase
noun_phrase <-
    article (noun | adjective noun) | pronoun | proper_noun
article <- "the" | "a" | "an"
adjective <- ...
noun <- ...
pronoun <- ...
proper_noun <- ...
intransitive_verb <- ...

If you read it as a choose-your-own-adventure, you can generate a great number of sentences with even this simple grammar like “The hungry cat ate the mouse.” or “I like turtles.” or “A raindrop fell.” From a linguistic perspective, this is a very useful paradigm for understanding the rules governing how humans produce language, but it runs into some problems if you use this approach for having a computer parse a language.

Namely, this approach can be very inefficient if it has to do a lot of backtracking. Grammars of this sort can have many possible parsings for each token in the sentence, and it may be the case that only one possible set of assignments will lead to a successful parsing, or worse: many possible sets of assignments will lead to many different successful parsings. As a trivial example, imagine parsing a sentence one word at a time with the grammar above:

Token	Guess	Commentary
The	`article`	This one’s pretty straightforward.
clever	`adjective`	This could be either a noun or an adjective, but let’s go with adjective for now.
trick	`noun`	If “clever” was an `adjective`, this must be a `noun` (as in the noun phrase “The clever trick”).
parsers	`verb`?!?!	Uh oh, “parsers” definitely isn’t a verb. That means that we have to go back and re-evaluate all the choices we’ve made so far.

Let’s try again:

Token	Guess	Commentary
The	`article`
clever	`noun`	Since labeling this as an adjective didn’t work last time, this must be a noun (i.e. “the clever”, meaning “clever people”)
trick	`verb`	Since “the clever” is a noun phrase, this must be the verb.
parsers	`noun`	Har har. “The clever trick parsers.” (i.e. “Clever people trick parsers.”)
.	`period`	That’s the end of the sentence, no more surprises.

Sentences like “The clever trick parsers.” are known in linguistics as garden path sentences. They don’t come up all that often in normal usage, because people instinctively avoid them, but they can be surprisingly common in computer languages. For example, consider scanning along some C code one character at a time and so far, you have seen:

y = x

At this point, you might assume that the variable x is being assigned to the variable y, but if you scan a little further and see:

y = x(z)

then suddenly you’ll have to re-evaluate your assumption that x is a variable, because now it appears to be function, called with the argument z, so y is being assigned the return value of a function call. But if the next two characters you see are:

y = x(z)+1

then you need to re-interpret the right hand side of the assignment as the sum of a function’s return value and a number literal.

All of that backtracking in such a simple expression!

Parsing Expression Grammars

The real strength of Parsing Expression Grammars is that they’re deterministic and unambiguous. Parsing Expression Grammars will never backtrack beyond the current rule, unless the language explicitly requires it, and the grammar specifies the order in which the rules will attempt to match. Here is a simple grammar that defines a lisp-like language:

File <- (expression / ws+)* !.
FunctionCall <- "(" ws* name ws*
  (expression ws* (expression ws*)*)? ")"
expression <- FunctionCall / Variable / Number
  / Text / "(" ws* expression ws* ")"
Variable <- name
Number <- [0-9]+
Text <- '"' ('\"' / '\\' / [^"])* '"'
name <- [a-zA-Z]+
ws <- [ \n\t]

The PEG above will handle parsing a file like:

(print
    "The golden ratio is:"
    (divide (add (sqrt 5) 1) 2))
(print "A random number:" (random))

Here, you can see an interactive demo of the parsing in action (click “Run” to give it a go):

PEGs have a lot of similarity to Regular Expressions and context-free grammars, but with some key differences. Instead of the | (pipe) operator in Regular Expressions and Backus-Naur form, PEGs use the / (slash) operator, which indicates an ordered choice. If one of the operands matches, none of the following operands will be attempted. This same principle applies to the * and + repetition operators. This means that although it might seem like a file containing (foo) could parse as either a function call or a parenthesized variable, it is actually guaranteed to parse as a function call. This is because the PEG rule for expression formally specifies that it will attempt to match FunctionCall before attempting to match a "(" ws* expression ws* ")". When FunctionCall attempts to match, it will either fail, or succeed and greedily gobble up as much input as it can. If the expression rule were instead written as expression <- "(" ws* expression ws* ")" / FunctionCall / Variable / Number / Text (with parenthesized expressions before FunctionCalls), then (foo) would parse as a parenthesized variable. Unlike regular expressions, PEGs will never backtrack and consider different possibilities for something that successfully matched. This makes them much more predictable and less likely to have degenerate performance. This limited backtracking might seem very restricting, but with proper caching it actually allows PEGs to guarantee that arbitrarily long strings can be parsed in linear time, using linear space, while still being extremely expressive.

Summing Up

The Parsing Expression Grammars described in this post are a great way to formalize the description of strictly-defined languages like programming languages or data storage formats. PEGs are a powerful tool that let you explain to a computer how to understand patterns and structure in text, and in a future blog post I’ll also explain how to turn a PEG grammar into a usable program that can actually parse text. Until then, you can check out Guido van Rossum’s posts on PEG parsing, which also explains some of the topics covered in this post.

← Load-Bearing C Projects Packrat Parsing from Scratch →