The lexical grammars of Python and Haskell are not regular. The what and the why.

In the context of programming language parsing, a lexical grammar defines the rules for breaking down raw source code into a stream of elementary units called tokens (e.g., keywords, identifiers, operators, literals). A grammar is considered regular if it can be described by a regular expression or, equivalently, recognized by a finite automaton (FA).

To say that the lexical grammars of Python and Haskell are not regular means that certain aspects of their lexical structure cannot be handled by a simple finite automaton. A finite automaton has strictly limited “memory”: it can be in only one of a fixed, finite number of states, so it cannot “count” or remember arbitrary amounts of information encountered earlier in the input.

Here’s why Python and Haskell’s lexical grammars are not regular:

Python: Significant Indentation (Off-side Rule)

Python’s primary non-regular feature at the lexical level is its significant indentation, often referred to as the “off-side rule.”

  • What it means: Instead of using explicit delimiters like curly braces ({}) or begin/end keywords to define code blocks (like in C, Java, or Pascal), Python uses indentation. The level of indentation determines the scope of a block of code.
  • Why it’s not regular: To correctly emit INDENT and DEDENT tokens (which mark the beginning and end of indented blocks), the lexer must track the current indentation level and compare it with the levels of all enclosing blocks. If the indentation increases, it emits an INDENT; if it decreases, it emits one or more DEDENTs, one for each enclosing level being closed. This requires a stack of previously seen indentation widths (a single counter is not enough, because a dedent must land exactly on some earlier level), which goes beyond the capabilities of a finite automaton: with only finitely many states, it cannot track arbitrarily many nested indentation levels.

For example:

if condition:
    # INDENT
    statement1
    if another_condition:
        # INDENT
        statement2
    # DEDENT (back to first if's level)
    statement3
# DEDENT (back to global level)

To tokenize this correctly, the lexer must remember the indentation column of every enclosing block: at statement3 it has to recognize that the code has returned to the first if’s level and emit exactly one DEDENT, and at the end of the input it must close whatever blocks are still open. Remembering an unbounded number of previously seen columns is more than a fixed, finite state machine can do.
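To make the stack concrete, here is a minimal sketch of the mechanism. It is a deliberately simplified, hypothetical stand-in for CPython’s real tokenizer (the name indent_tokens is illustrative, and error handling is omitted):

def indent_tokens(src):
    # Toy lexer: turn leading whitespace into INDENT/DEDENT tokens.
    # The unbounded stack of indentation widths is exactly the memory
    # a finite automaton does not have.
    stack = [0]                        # indentation levels currently open
    tokens = []
    for line in src.splitlines():
        if not line.strip():
            continue                   # blank lines don't affect layout
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)        # deeper: open a new block
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()                # shallower: one DEDENT per level left
            tokens.append("DEDENT")
        # (A real lexer would reject a width that matches no open level.)
        tokens.append(line.strip())
    while len(stack) > 1:
        stack.pop()
        tokens.append("DEDENT")        # end of input closes open blocks
    return tokens

src = """\
if condition:
    statement1
    if another_condition:
        statement2
    statement3
"""
print(indent_tokens(src))
# ['if condition:', 'INDENT', 'statement1', 'if another_condition:',
#  'INDENT', 'statement2', 'DEDENT', 'statement3', 'DEDENT']

The standard library makes the real tokens observable: tokenize.generate_tokens yields genuine INDENT and DEDENT entries at exactly such positions.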

Haskell: Layout Rule

Haskell also employs a layout rule (similar in spirit to Python’s off-side rule) under which indentation implicitly delimits blocks of code: the blocks introduced by the keywords where, let, do, and of (so, among others, the bodies of case expressions and the where blocks of class and instance declarations).

  • What it means: Similar to Python, Haskell uses indentation to infer block structure, inserting implicit open-brace, semicolon, and close-brace tokens ({, ;, }) on the programmer’s behalf. This allows explicit delimiters to be omitted for readability.
  • Why it’s not regular: The Haskell lexical analyzer needs to track the indentation level to determine when these implicit tokens should be inserted. This requires a stack-like mechanism to manage the nesting of layout blocks, which a regular grammar cannot handle.

For example, in a where clause:

product xs = prod xs 1
  where prod [] a = a          -- Indentation determines this belongs to 'where'
        prod (x:xs) a = prod xs (a*x) -- Aligned with the previous line

The lexer needs to know the starting column of the where block (the column of the first prod) to interpret the following lines: a line aligned with that column begins a new equation, so the lexer inserts an implicit ;, while a line indented less ends the block, so the lexer inserts an implicit }. This kind of contextual awareness, comparing each line against an unbounded record of previously opened columns, is beyond regular languages.
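The same stack discipline can be sketched in Python (kept in one language for consistency with the sketch above). This is a toy, hypothetical version of the layout algorithm: the real one, in the Haskell Report, handles many more cases, and this sketch only covers a layout keyword followed by a token on the same line:

import re

def layout(src):
    # Toy layout rule: a stack of columns tells the lexer where the
    # implicit '{', ';' and '}' tokens belong.
    stack, out = [], []
    for line in src.splitlines():
        if not line.strip():
            continue
        col = len(line) - len(line.lstrip())
        while stack and col < stack[-1]:
            out.append("}")            # dedented past a block: close it
            stack.pop()
        if stack and col == stack[-1]:
            out.append(";")            # same column: next item in the block
        m = re.search(r"\b(where|do|let|of)\b\s+\S", line)
        if m:
            tok_col = m.end() - 1      # column of the token after the keyword
            out.append(line[col:tok_col].strip())
            out.append("{")            # the new block opens at that column
            stack.append(tok_col)
            out.append(line[tok_col:].strip())
        else:
            out.append(line.strip())
    out.extend("}" for _ in stack)     # end of input closes open blocks
    return " ".join(out)

src = """\
product xs = prod xs 1
  where prod [] a = a
        prod (x:xs) a = prod xs (a*x)
"""
print(layout(src))
# product xs = prod xs 1 where { prod [] a = a ; prod (x:xs) a = prod xs (a*x) }

The output is the explicitly delimited form of the same program, itself legal Haskell; once the braces and semicolons are present, the rest of the grammar no longer depends on whitespace at all.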

In Summary:

  • Regular Grammars: Can only recognize patterns that require no counting and no memory of arbitrarily deep nesting. They are a good fit for simple token recognition: keywords, identifiers (sequences of letters, digits, and underscores), numeric literals, and basic operators.
  • Non-Regular Features (Python & Haskell): Indentation-based scoping requires the lexical analyzer to maintain a stack of indentation levels. That unbounded memory is what makes these lexical grammars non-regular: matching each DEDENT (or implicit }) to the INDENT (or implicit {) that opened it is essentially the language of balanced brackets, the textbook example of a non-regular language, recognizable by a pushdown automaton but not by a finite one. The individual tokens themselves (if, def, variable names, literals) are still regular, as the sketch below illustrates, but the process of producing INDENT and DEDENT tokens for arbitrary nesting makes the lexical analysis as a whole non-regular, pushing it toward context-free (and, for full parsing, more powerful) formalisms.
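The contrast is easy to see in code. A single regular expression, which is to say a finite automaton, happily recognizes the individual tokens of a toy language (the token names and the tiny keyword set below are illustrative), yet no such expression could be extended to emit INDENT and DEDENT:

import re

TOKEN_RE = re.compile(r"""
      (?P<KEYWORD>\b(?:if|else|def|return)\b)
    | (?P<NAME>[A-Za-z_]\w*)
    | (?P<NUMBER>\d+)
    | (?P<OP>[+\-*/=:()])
    | (?P<SKIP>\s+)
""", re.VERBOSE)

def scan(code):
    # Pure regex scanning: each token is recognized by a finite automaton.
    for m in TOKEN_RE.finditer(code):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(scan("if x1: y = y + 42")))
# [('KEYWORD', 'if'), ('NAME', 'x1'), ('OP', ':'), ('NAME', 'y'),
#  ('OP', '='), ('NAME', 'y'), ('OP', '+'), ('NUMBER', '42')]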
