Skip to content

Latest commit

 

History

History
695 lines (517 loc) · 25.4 KB

specification.md

File metadata and controls

695 lines (517 loc) · 25.4 KB

Specification

The pe grammar specification extends the PEG syntactic specification with a description of the semantic effect of parsing. Any pe grammar without extended features or rule actions should parse exactly as a standard PEG grammar.

Contents

Quick Reference

The following is a brief description of the available operators and their behavior. More detailed information is available in the following sections.

Syntax Operator Type Description
. Dot Primary Match any single character
'abc' Literal Primary Match the string 'abc'
"abc" Literal Primary Match the string 'abc'
[abc] Class Primary Match any one character in 'abc'
A Nonterminal Primary Match the expression in definition A
(e) Group Primary Match the subexpression e
p (default) Quantified Match primary term p exactly once
p? Optional Quantified Match p zero or one times
p* Star Quantified Match p zero or more times
p+ Plus Quantified Match p one or more times
p{n} Repeat Quantified Match p exactly n times
p{m,n} Repeat Quantified Match p from m to n times
p{,n} Repeat Quantified Match p up to n times
p{m,} Repeat Quantified Match p at least m times
q (default) Valued Match quantified term q; consume input; pass up values
&q And Valued Succeed if q matches; consume no input; suppress values
!q Not Valued Fail if q matches; consume no input; suppress values
~q Capture Valued Match q; consume input; emit substring matched by q
name:q Bind Valued Match q; consume input; bind value to 'name'
v (default) Sequential Match valued term v
v s Sequence Sequential Match sequential term s only if v succeeded
s (default) Applicative Match sequential term s
(none) Rule Applicative Match sequential term s, apply the defined action
a (default) Prioritized Match applicative term a
a / e Choice Prioritized Match prioritized term e only if a failed
A <- e Grammar Definitive Define A as the prioritized term e
A < e Grammar Definitive Define A as the prioritized term e with autoignore

Grammar Syntax

The following is a formal description of pe's PEG syntax describing itself. This PEG is based on Bryan Ford's original paper.

# Hierarchical syntax
Start      <- Spacing (Grammar / Expression) EndOfFile
Grammar    <- Definition+
Definition <- Identifier defop:Operator Expression
Operator   <- LEFTARROW / LEFTANGLE
Expression <- Sequence (SLASH Sequence)*
Sequence   <- Valued*
Valued     <- (prefix:Prefix)? Quantified
Prefix     <- AND / NOT / TILDE / Binding
Binding    <- Identifier COLON
Quantified <- Primary (quantifier:Quantifier)?
Quantifier <- QUESTION / STAR / PLUS / Repeat
Repeat     <- LEFTBRACE RepeatSpec RIGHTBRACE
RepeatSpec <- (min:Integer)? COMMA (max:Integer)?
            / count:Integer
Primary    <- Name / Group / Literal / Class / DOT
Name       <- Identifier !Operator
Group      <- OPEN Expression CLOSE

# Lexical syntax
Identifier <- ~(IdentStart IdentCont*) Spacing
IdentStart <- [a-zA-Z_]
IdentCont  <- IdentStart / [0-9]

Literal    <- ~(['] ( !['] Char )* [']) Spacing
            / ~(["] ( !["] Char )* ["]) Spacing

Class      <- ~('[' ( !']' Range )* ']') Spacing
Range      <- Char '-' Char / Char
Char       <- '\\' [tnvfr"'-\[\\\]]
            / '\\' Oct Oct? Oct?
            / '\\' 'x' Hex Hex
            / '\\' 'u' Hex Hex Hex Hex
            / '\\' 'U' Hex Hex Hex Hex Hex Hex Hex Hex
            / !'\\' .
Oct        <- [0-7]
Hex        <- [0-9a-fA-F]

Integer    <- ~( [0-9]+ ) Spacing

LEFTARROW  <- '<-' Spacing
LEFTANGLE  <- '<' Space Spacing
LEFTBRACE  <- '{' Spacing
RIGHTBRACE <- '}' Spacing
SLASH      <- '/' Spacing
AND        <- '&' Spacing
NOT        <- '!' Spacing
TILDE      <- '~' Spacing
COLON      <- ':' Spacing
QUESTION   <- '?' Spacing
STAR       <- '*' Spacing
PLUS       <- '+' Spacing
OPEN       <- '(' Spacing
CLOSE      <- ')' Spacing
COMMA      <- ',' Spacing
DOT        <- '.' Spacing

Spacing    <- (Space / Comment)*
Comment    <- '#' (!EndOfLine .)* EndOfLine
Space      <- ' ' / '\t' / EndOfLine
EndOfLine  <- '\r\n' / '\n' / '\r'
EndOfFile  <- !.

Expression Types and Precedence

Expression Types

Primary

Primary expressions include terminals (Dot, Literal, or Class), nonterminal symbols, and grouped expressions. They are the only expression type that may be quantified.

Quantified

Quantified expressions indicate how many times they must occur for the expression to match. The default (unannotated) quantified expression must occur exactly once. The Optional, Star, Plus, and Repeat operators change this number.

Valued

Valued expressions specify the semantic effects of the expression, such as whether it consumes input, emits values, or binds values to names. The default (unannotated) valued expression consumes the input of its match and passes up any emitted or bound values. The And, Not, Capture, and Bind operators all may change some of these effects, such as whether input is consumed or what happens with emitted or bound values.

Sequential

Sequential expressions match multiple valued expressions in sequence. Emitted and bound values are accumulated as each subexpression in the sequence is parsed.

Applicative

Applicative expressions match a single sequential expression and apply an action on its result. The default applicative expression (without an action) passes up emitted and bound values as they are. A Rule with an action transforms the emitted and bound values.

Prioritized

Prioritized expressions match multiple applicative expressions as a prioritized choice; only the first expression to match is used. The emitted and bound values of the first matching expression are passed up.

Definitive

Definitive expressions are associated with a nonterminal symbol in a Grammar. When a nonterminal symbol appears within an expression, it is as if the expression defined by that nonterminal appeared in-situ. That is, no special treatment is given to definitive expressions compared to the equivalent in-situ expression.

Operator Precedence

Syntax Operator Precedence Expression Type
. Dot 6 Primary
" " or ' ' Literal 6 Primary
[ ] Class 6 Primary
Abc [Name] 6 Primary
(e) Group 6 Primary
e? Optional 5 Quantified
e* Star 5 Quantified
e+ Plus 5 Quantified
e{n} Repeat 5 Quantified
e{m,n} Repeat 5 Quantified
&e And 4 Valued
!e Not 4 Valued
~e Capture 4 Valued
name:e Bind 4 Valued
e1 e2 Sequence 3 Sequential
(none) Rule 2 Applicative
e1 / e2 Choice 1 Prioritized
Abc <- e Grammar 0 Definitive
Abc < e Grammar 0 Definitive

Expression Values and Behavior

During parsing, pe accumulates and transforms values in two channels: the emitted-value sequence and the bound-value mapping. The Capture and Rule expressions emit values, i.e., introduce a value onto the emitted-value sequence. The Bind expression binds a determined value to a name in the bound-value mapping. The And, Not, Capture, and Rule expressions also suppress previously emitted or bound values, while Bind only suppresses the previously emitted values. All other expressions pass up any emitted or bound values.

Value Determination

When Bind associates a name with values or when getting the final value of a parsing expression match, the following routine selects, or determines, the actual value from the emitted-value sequence:

  1. If there are any emitted values, the first value is returned and the rest are suppressed
  2. Otherwise, None is returned

Thus, if one wants all emitted values to be returned, they must explicitly pack them into a data structure such as a list.

Emitted and Bound Values

An emitted value is one that becomes a positional argument for any function applied on a Rule, or is returned as the value of an expression, while bound values become keyword arguments in functions. This distinction allows pe to map the results of a parse to perhaps any Python function.

The following table gives parsing expressions, the input string parsed by the expressions, then the emitted and bound values that would result from such a parse:

Expression Input Emitted-Value Sequence Bound-Value Mapping
'a' a () {}
~'a' a ('a',) {}
~'a'* aaa ('aaa',) {}
(~'a')* aaa ('a', 'a', 'a') {}
'a' ~'b' ab ('b',) {}
~('a' 'b') ab ('ab',) {}
x:'a' 'b' ab () {}
x:'a' ~'b' ab ('b',) {}
x:(~'a') 'b' ab () {'x': 'a'}
x:(~'a' ~'b') ab () {'x': 'a'}
x:(~('a' 'b')) ab () {'x': 'ab'}
&(x:('a')) a () {}

Parsing Preliminaries

Identifiers

All identifiers (rule and binding names) are limited to the ASCII letters (a through z and A through Z), ASCII digits (0 through 9), and the ASCII underscore (_). In addition, the first character may not be a digit. Identifiers must be one or more of these characters.

Special Punctuation

The following ASCII punctuation characters, in addition to whitespace, have special meaning in expressions (but not necessarily inside string literals or character classes; for these see below):

! " # & ' ( ) * + . / : ? [ \ ] _ { } ~

Special characters inside string literals and character classes are different. For both, the \ character is used for escape sequences, and therefore it must be escaped for the literal character. In addition, the following must be escaped:

  • ' in single-quoted string literals
  • " in double-quoted string literals
  • [ and ] inside character classes

The other ASCII punctuation characters are currently unused but are reserved in expressions for potential future uses:

$ % , - ; < = > @ ` |

Whitespace

Whitespace, including the space, tab, and newline characters (which include \n, \r, and \r\n sequences) are only significant in in some contexts:

Characters

Characters in pe are unicode code-points and not bytes, so for instance the dot operator consumes a single unicode code-point and not a single byte. Escape sequences of two or more ASCII characters in an expression match a single unicode character in the input. In pe, any character that is not allowed by identifiers, special punctuation, or whitespace must occur in a string literals or character classes.

Comments

Comments may occur anywhere outside of string literals and character classes. They begin with the ASCII # character and continue until the end of the line.

Escape Sequences

The \ character is used inside string literals and character classes for escape sequences. These allow one to use special punctuation characters for their literal values or to use unicode code-point encodings. The following escape sequences or schemes are allowed:

  • \t: horizontal tab
  • \n: newline (line feed)
  • \v: vertical tab
  • \f: form feed
  • \r: carriage return
  • \": " character
  • \': ' character
  • \[: [ character
  • \]: ] character
  • \\: \ character
  • \N, \NN, \NNN: octal sequence of one to three octal digits; the maximum value is \777; leading zeros are permitted
  • \xNN: sequence of exactly 2 hexidecimal digits
  • \uNNNN: sequence of exactly 4 hexadecimal digits
  • \UNNNNNNNN: sequence of exactly 8 hexadecimal digits

All other escape sequences are invalid.

Note that for the octal and hexadecimal escapes, each sequence corresponds to a single character. That is, multiple UTF-8 or UTF-16 escapes are not combined into a single character.

Operators

This document defines the operators available in pe.

Dot

The Dot operator always succeeds if there is any remaining input and fails otherwise. It consumes a single character.

Literal

A Literal operator (also called a string literal or string) succeeds if the input at the current position matches the given string exactly and fails otherwise. It consumes input equal in amount to the length of the given string.

The ', ", and \ characters are special inside a string and must be escaped if they are used literally, however the ' character may be used unescaped inside a double-quoted string (e.g, "'") and the " character may be used unescaped inside a single-quoted string (e.g., '"').

Note that newlines do not need to be escaped but doing so can make an expression more explicit and, thus, more clear.

Class

The Class operator (also called a character class) succeeds if the next character of input is in a set of characters defined by the given character ranges. It fails if the next character is not in the set or if there is no remaining input. On success, it consumes one character of input.

Each range is either one (possibly escaped) character or two (possibly escaped) characters separated by -. In the former case, the range consists of the single character. In the latter case, the range consists of the character before the -, the character after the -, and all characters that occur between these two in the unicode tables. The set of characters used by the class is the union of all ranges.

It is invalid if a range separated by - has a first character with a value higher than the second character.

The [, ], and \ characters are special inside a character class and must be escaped if they are used literally. To use the - character literally, place it at one of the following locations:

  • at the beginning of the character class (e.g., [-a-z] matches - or a through z)
  • immediately after a range ([a-z-_] matches a through z or - or _, [a-z--/] matches a through z or - through /)
  • as the second character of a range ([*--/] matches * through - or /)

If the - is meant literally in a character class, it is recommended to be placed as the first character to avoid confusion.

Nonterminal

A Nonterminal operator is given by an identifier in an expression. It corresponds to a grammar rule of the same name. A nonterminal operator succeeds if, at the current position in the input, the expression of the named grammar rule succeeds at the same position. Similarly, it fails if the grammar rule fails at the same position.

In all cases, the behavior of the nonterminal is the same as if the corresponding rule's expression had been written in place of the nonterminal. Schematically, that means that A1 and A2 below are equivalent for any expression e.

A1 <- B
B  <- e

A2 <- e

Group

  • Notation: (e)
  • Function: (none)
  • Type: Primary

Groups do not actually refer to a construct in pe, but they are used to aid in the parsing of a grammar. This is helpful when one wants to apply a lower-precedence operator with a higher-precedence one, such as a sequence of choices:

[0-9] '+' / '-' [0-9]    # parses "1+" or "-2" but not "1+2" or "1-2"

[0-9] ('+' / '-') [0-9]  # parses "1+2", "1-2", etc.

Or repeated subexpressions:

[0-9] ('+' [0-9])*  # parses "1", "1+2", "3+5+8", etc.

In the API, the three expressions above would translate to the following function calls:

Choice(Sequence(Class('0-9'), Literal('+')), Sequence(Literal('-'), Class('0-9')))

Sequence(Class('0-9'), Choice(Literal('+'), Literal('-')), Class('0-9'))

Sequence(Class('0-9'), Star(Sequence(Literal('+'), Class('0-9'))))

Optional

The Optional operator succeeds whether the given expression succeeds or fails. If the given expression succeeds, the values and consumed input are the same as that of the given expression. If the given expression fails, no input is consumed and no values are passed up.

Note that the Optional operator also succeeds when there is no remaining input.

Star

The Star operator succeeds if the given expression succeeds zero or more times. Emitted values of the given expression are accumulated while bound values get overwritten (therefore only the value bound by the last match is passed up).

Note that the Star operator also succeeds when there is no remaining input.

Plus

The Plus operator succeeds if the given expression succeeds one or more times. Emitted values of the given expression are accumulated while bound values get overwritten (therefore only the value bound by the last match is passed up).

Repeat

  • Notation: e{n} or e{m,n}
  • Function: Repeat(expression, count=-1, min=0, max=-1)
  • Type: Quantified

The Repeat operator succeeds if the given expression succeeds exactly n times in the e{n} form, or between m and n times (inclusive) in the e{m,n} form. Emitted values of the given expression are accumulated while bound values get overwritten (therefore only the value bound by the last match is passed up).

It is a more general quantifier than those above as it can replace Optional (e{0,1}), Star (e{0,}), and Plus (e{1,}) while also allowing for bounded repetition with minimum and maximum counts greater than one as well as fixed count repetitions.

And

  • Notation: &e
  • Function: And(expression)
  • Type: Valued

The And operator (also called positive lookahead) succeeds when the given expression succeeds at the current position in the input, but no input is consumed (even if the given expression would otherwise consume input) and no values are passed up.

Not

  • Notation: !e
  • Function: Not(expression)
  • Type: Valued

The Not operator (also called negative lookahead) succeeds when the given expression fails at the current position in the input, but no input is consumed and no values are passed up.

Capture

The Capture operator succeeds when the given expression succeeds at the current position in the input. It suppresses any emitted or bound values and bindings from the given expression and then emits the substring matched by the given expression.

Bind

  • Notation: :e or name:e
  • Function: Bind(expression, name)
  • Type: Valued

The Bind operator succeeds when the given expression succeeds at the current position in the input. It associates the determined value of the given expression with the given name in the bound-variable mapping. Any other previously bound values are passed up, but any values emitted by the given expression are then suppressed.

Sequence

The Sequence operator succeeds when all of its given expressions succeed, in turn, from the current position in the input. After an expression succeeds, possibly consuming input, successive expressions start matching at the new position in the input, and so on. The Sequence operator passes up the accumulated values of its given expressions. If any given expression fails, the entire Sequence fails and no values are passed up and no input is consumed.

A Sequence with no given expressions is invalid. A Sequence with one given expression is reduced to the given expression only.

Rule

  • Notation: (none)
  • Function: Rule(expression, action, name='<anonymous>')
  • Type: Applicative

A Rule is often associated with a name in the grammar, but the rule itself is just a wrapper around a given expression that defines an action that transforms the emitted and bound values of the given expression. The rule thus succeeds when the given expression succeeds, and it consumes the input of the given expression. The emitted and bound values of a rule are those returned by the action, and all previously emitted or bound values are suppressed.

Note that, while Rules may only be defined in the PEG notation with a name in the grammar, anonymized rules may appear inside of expressions. This situation can occur by defining expressions using the API instead of the notation, or as the result of inlining optimizations. To help with debugging a grammar with inlined rules, the name of an associated nonterminal is kept with the rule, even though it has no other use.

Choice

The Choice operator succeeds when any of its given expressions succeeds. If a given expression fails, the next given expression is attempted at the current position in the input. If a given expression succeeds, the Choice operator succeeds and immediately passes up the given expression's values and consumes its input.

A Choice with no given expressions is invalid. A Choice with one given expression is reduced to the given expression only.

Grammar

  • Notation: Abc <- e
  • Function: Grammar(definitions=None, actions=None, start='Start')
  • Type: Definitive