Parser library for Kotlin consisting of a tokenizer and expression parser.
Tokenization is the process of splitting the input into a stream of tokens that is consumed by a parser.
In Parsek, this work is split between two classes, Lexer and Scanner.
The lexer (source, kdoc) is essentially an iterator over a stream of tokens, generated by splitting the input using regular expressions.
Regular expressions are mapped to token types using a function which typically just returns a fixed token type inline. The function can be used to implement a second layer of mapping, but this should be fairly uncommon. Input mapped to null (typically whitespace) will not be reported.
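To illustrate the idea of mapping regular expressions to token types, here is a minimal, self-contained sketch of a regex-driven lexer. It is not Parsek's Lexer: SketchType, SketchToken, SketchRule, SketchLexer and lexSketch are hypothetical names made up for this example, and the real constructor and signatures may well differ.

```kotlin
// Illustrative sketch only: all names here are hypothetical, not Parsek's actual API.
enum class SketchType { NUMBER, IDENTIFIER, SYMBOL }

data class SketchToken(val type: SketchType, val text: String, val pos: Int)

// Pairs a regular expression with a function that maps the matched text to a
// token type, or to null if the match should not be reported.
class SketchRule(val regex: Regex, val toType: (String) -> SketchType?)

class SketchLexer(
    private val input: String,
    private val rules: List<SketchRule>
) : Iterator<SketchToken> {
    private var pos = 0
    private var pending: SketchToken? = null

    override fun hasNext(): Boolean {
        while (pending == null && pos < input.length) {
            val start = pos
            for (rule in rules) {
                // Only accept non-empty matches that begin exactly at the current position.
                val match = rule.regex.find(input, start) ?: continue
                if (match.range.first != start || match.value.isEmpty()) continue
                pos = start + match.value.length
                val type = rule.toType(match.value)
                if (type != null) pending = SketchToken(type, match.value, start)
                break
            }
            // If no rule matched, the position did not advance.
            check(pos > start) { "Unrecognized input at position $start" }
        }
        return pending != null
    }

    override fun next(): SketchToken {
        if (!hasNext()) throw NoSuchElementException()
        val result = pending!!
        pending = null
        return result
    }
}

// Tokenizes the input with a small, fixed rule set; the first matching rule wins.
fun lexSketch(input: String): List<SketchToken> {
    val rules = listOf(
        SketchRule(Regex("\\s+")) { null }, // whitespace is mapped to null and skipped
        SketchRule(Regex("[0-9]+")) { SketchType.NUMBER },
        SketchRule(Regex("[A-Za-z_][A-Za-z0-9_]*")) { SketchType.IDENTIFIER },
        SketchRule(Regex("[-+*/()]")) { SketchType.SYMBOL }
    )
    return SketchLexer(input, rules).asSequence().toList()
}
```

With these rules, lexSketch("x1 + 42") yields an IDENTIFIER "x1" at position 0, a SYMBOL "+" at position 3 and a NUMBER "42" at position 5; the whitespace in between is not reported.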
The lexer is usually not used directly; instead, it is passed to the Scanner, which in turn is used by the parser.
The reason for the Lexer/Scanner split is to separate "raw" parsing from providing a nice and convenient API. The small API surface of the Lexer allows us to easily install additional processing between the Lexer and Scanner, for instance for context-sensitive newline filtering.
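As a rough illustration of such an intermediate processing step, the sketch below wraps a token iterator and drops newline tokens while inside parentheses. It assumes tokens are plain strings and that "\n" represents a newline token; NewlineFilter and filterNewlines are hypothetical names, and none of this is taken from Parsek's API.

```kotlin
// Illustrative sketch only: tokens are plain strings here, and "\n" stands for a
// newline token. NewlineFilter is a hypothetical name, not part of Parsek.
class NewlineFilter(private val source: Iterator<String>) : Iterator<String> {
    private var depth = 0 // current nesting depth of parentheses
    private var pending: String? = null

    override fun hasNext(): Boolean {
        while (pending == null && source.hasNext()) {
            val token = source.next()
            if (token == "(") depth++
            if (token == ")" && depth > 0) depth--
            // Newlines are only passed on when they occur outside of parentheses.
            if (token != "\n" || depth == 0) pending = token
        }
        return pending != null
    }

    override fun next(): String {
        if (!hasNext()) throw NoSuchElementException()
        val result = pending!!
        pending = null
        return result
    }
}

// Convenience wrapper that applies the filter to a complete token list.
fun filterNewlines(tokens: List<String>): List<String> =
    NewlineFilter(tokens.iterator()).asSequence().toList()
```

Because the filter is itself an iterator, it can sit between a token source and its consumer without either side needing to know about it.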
Typically, the Lexer is constructed directly inline where the Scanner is constructed.
The Token class (source, kdoc) stores the token type (typically a user-defined enum), the token text and the token position. Token instances are generated by the Lexer.
The RegularExpressions object (source, kdoc) contains a set of useful regular expressions for source code and data format tokenization.
The Scanner class (source, kdoc) provides a simple API for convenient access to the token stream generated by the Lexer.
- The scanner provides a notion of a "current" token that can be inspected multiple times, as opposed to iterator.next(), where the current token is "gone" after the call. This makes it easy to hand the scanner with the current token down in a recursive descent parser until the token is consumed and processed by the corresponding handler.
- It provides unlimited dynamic lookahead.
- It provides a tryConsume() convenience method that checks for a given token text; when the text matches, it consumes the token and returns true.
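The three properties listed above can be sketched with a small buffering wrapper around a token iterator. This is only an approximation of the concept under simplifying assumptions: tokens are plain strings, and PeekingScanner, lookAhead, consume, parseValue and scannerDemo are hypothetical names rather than Parsek's actual Scanner API (only tryConsume() is named in this document).

```kotlin
// Illustrative sketch only: names and signatures are assumptions, not Parsek's Scanner.
class PeekingScanner(private val source: Iterator<String>) {
    // Tokens that were read from the source but not consumed yet.
    private val buffer = mutableListOf<String>()

    // Reads from the source until the buffer holds `count` tokens or the source is exhausted.
    private fun fill(count: Int) {
        while (buffer.size < count && source.hasNext()) buffer.add(source.next())
    }

    // The token `distance` positions after the current one, or null past the end of input.
    fun lookAhead(distance: Int): String? {
        fill(distance + 1)
        return buffer.getOrNull(distance)
    }

    // The current token; it can be inspected any number of times without being consumed.
    val current: String?
        get() = lookAhead(0)

    // Returns the current token and advances to the next one.
    fun consume(): String {
        fill(1)
        check(buffer.isNotEmpty()) { "Unexpected end of input" }
        return buffer.removeAt(0)
    }

    // Consumes the current token and returns true if its text matches; otherwise returns false.
    fun tryConsume(text: String): Boolean {
        if (current != text) return false
        consume()
        return true
    }
}

// Recursive descent over nested lists such as [a, [b, c]]: the scanner is handed
// down until the handler responsible for the current token consumes it.
fun parseValue(scanner: PeekingScanner): Any {
    if (!scanner.tryConsume("[")) return scanner.consume()
    val items = mutableListOf<Any>()
    if (!scanner.tryConsume("]")) {
        do {
            items.add(parseValue(scanner))
        } while (scanner.tryConsume(","))
        check(scanner.tryConsume("]")) { "']' expected but found ${scanner.current}" }
    }
    return items
}

fun scannerDemo(): Any {
    val tokens = listOf("[", "a", ",", "[", "b", ",", "c", "]", "]")
    return parseValue(PeekingScanner(tokens.iterator()))
}
```

The buffer grows only as far as the largest lookahead requested, which is one simple way to obtain unlimited lookahead without reading the whole input up front.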
Typical use cases that only need a scanner and no expression parser are data formats such as JSON or CSV.
For a simple example, please refer to the JSON parser example.
The configurable expression parser (source, kdoc) operates on a tokenizer. It is stateless and should be shared and reused.
- For ternary expressions, create a suffix expression and use the supplied tokenizer to consume the rest of the ternary.
- Functions / "Apply" can be implemented in a similar way. Alternatively, this can be implemented in primary expression parsing by checking for an opening brace after the primary expression.
- "Grouping" brackets should be implemented where primary expressions are processed, too.
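The hints above can be illustrated with a generic precedence-climbing evaluator. This sketch does not use Parsek's expression parser and does not reflect its configuration API; TokenCursor, precedences, parseExpression, parsePrimary and evaluate are hypothetical names. It handles grouping brackets in primary parsing and treats "?" as a lowest-precedence suffix that consumes the rest of the ternary from the token source.

```kotlin
// Illustrative sketch only: a generic precedence-climbing evaluator with
// hypothetical names; it is not Parsek's expression parser API.
class TokenCursor(private val tokens: List<String>) {
    private var index = 0
    val current: String? get() = tokens.getOrNull(index)
    fun consume(): String = tokens[index++]
    fun tryConsume(text: String): Boolean {
        if (current != text) return false
        index++
        return true
    }
}

// Binary operators and their precedence; higher binds tighter. All are left-associative.
val precedences = mapOf("+" to 1, "-" to 1, "*" to 2, "/" to 2)

fun parseExpression(cursor: TokenCursor, minPrecedence: Int = 0): Double {
    var left = parsePrimary(cursor)
    while (true) {
        val op = cursor.current ?: break
        if (op == "?") {
            // The ternary is handled as a suffix with the lowest precedence: after "?",
            // the rest of the ternary is consumed directly from the cursor.
            if (minPrecedence > 0) break
            cursor.consume()
            val thenValue = parseExpression(cursor)
            check(cursor.tryConsume(":")) { "':' expected" }
            val elseValue = parseExpression(cursor)
            // Both branches are evaluated here; nonzero conditions count as true.
            left = if (left != 0.0) thenValue else elseValue
            continue
        }
        val precedence = precedences[op] ?: break
        if (precedence < minPrecedence) break
        cursor.consume()
        // Requiring a higher precedence on the right makes the operator left-associative.
        val right = parseExpression(cursor, precedence + 1)
        left = when (op) {
            "+" -> left + right
            "-" -> left - right
            "*" -> left * right
            else -> left / right
        }
    }
    return left
}

// Grouping brackets are handled where primary expressions are processed.
fun parsePrimary(cursor: TokenCursor): Double {
    if (cursor.tryConsume("(")) {
        val value = parseExpression(cursor)
        check(cursor.tryConsume(")")) { "')' expected" }
        return value
    }
    return cursor.consume().toDouble()
}

// Splits the text into number and operator tokens, then evaluates it directly.
fun evaluate(text: String): Double {
    val tokens = Regex("[0-9]+(\\.[0-9]+)?|[-+*/()?:]").findAll(text).map { it.value }.toList()
    return parseExpression(TokenCursor(tokens))
}
```

For example, evaluate("1 + 2 * 3") returns 7.0, evaluate("(1 + 2) * 3") returns 9.0, and evaluate("1 + 1 ? 10 : 20") returns 10.0. Function application could be added in parsePrimary in the same way as grouping, by checking for an opening bracket after a name.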
- A simple example evaluating mathematical expressions directly (as opposed to building an explicit parse tree) can be found in the tests.
- A complete PL/0 parser is included in the examples module to illustrate how to use the expression parser and tokenizer for a simple but computationally complete language: Parser.kt, Pl0Test.kt
- A parser for mathematical expressions: ExpressionParser.kt, ExpressionsTest.kt
- A simple example of using the scanner and expression parser to implement a simple indentation-based programming language: mython, MythonTest.kt
- A BASIC interpreter using Parsek: https://github.com/stefanhaustein/basik