What is the closest solution to non-greedy subrules, like in ANTLR4? I'd like to parse text files and extract pieces of information embedded in them (these pieces have a structure). The text file as a whole obviously has no complete grammar, so I'd like to fuzzy-parse only the relevant pieces. That would require non-greedy subrules.
Hmm, I suspect Token Categories could be used for this purpose. Basically if you define a Token category for ...
This is not actually non-greedy, but it achieves the same (?) behavior using different semantics, e.g.:
Antlr
// non-greedy ".*" to allow stopping at the constant
file : .*? (constant .*?)+ ;
Chevrotain
// here we are **greedily** consuming **everything** until we encounter a constant
file: (NOT_CONSTANT* constant NOT_CONSTANT*)+
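To make the semantics of the second rule concrete, here is a minimal library-free sketch in plain JavaScript. It deliberately does not use the Chevrotain API: tokens are plain objects, a "category" is just a label on a token, and the `tokenize` and `parseFile` helpers, as well as the choice of "numbers are constants", are assumptions made up for illustration.

```javascript
// Library-free sketch (NOT the Chevrotain API).
// A "category" is just a label that a token may carry.
const NOT_CONSTANT = "NOT_CONSTANT";

// Hypothetical tokenizer: whitespace-separated words; integers are "constants",
// everything else is noise and is tagged with the NOT_CONSTANT category.
function tokenize(text) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .map((image) => {
      const isConstant = /^\d+$/.test(image);
      return {
        image,
        type: isConstant ? "Constant" : "Word",
        categories: isConstant ? [] : [NOT_CONSTANT],
      };
    });
}

// Hand-written equivalent of: file: (NOT_CONSTANT* constant NOT_CONSTANT*)+
function parseFile(tokens) {
  const constants = [];
  let pos = 0;

  // Greedily consume every token that belongs to the NOT_CONSTANT category.
  const skipNoise = () => {
    while (pos < tokens.length && tokens[pos].categories.includes(NOT_CONSTANT)) {
      pos++;
    }
  };

  do {
    skipNoise();
    if (pos >= tokens.length || tokens[pos].type !== "Constant") {
      throw new Error("expected a constant at token " + pos);
    }
    constants.push(tokens[pos].image);
    pos++;
    skipNoise();
  } while (pos < tokens.length);

  return constants;
}

const found = parseFile(tokenize("foo bar 42 baz 7 qux"));
console.log(found); // [ '42', '7' ]
```

The skipping loop is greedy, yet it stops exactly where the non-greedy ANTLR rule would, because a constant is by construction not a member of the category being skipped.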
Thanks @bd82 for the quick reply. What you suggest does work, assuming there is a starting token that is never irrelevant. But I'm struggling to make it work when the starting token can be either relevant or irrelevant. Consider the following simplified input text:
So in the grammar I can use ...
const Header = createToken({ name: "Header", pattern: /account statement/i })
const DMY = createToken({
  name: "DMY",
  pattern: /\d{2}\.\d{2}\.\d{4}/,
  categories: [SKIP_TOKEN],
})
...
const Any = createToken({
  name: "Any",
  pattern: /[^\s]+/,
  group: chevrotain.Lexer.SKIPPED,
})
$.RULE("statement", () => {
  $.SUBRULE($.anyText)
  // "header" is a rule, not a token, so it is invoked as a subrule
  $.SUBRULE($.header)
  $.SUBRULE2($.anyText)
  $.MANY(() => $.SUBRULE($.transaction))
  $.SUBRULE3($.anyText)
})
$.RULE("header", () => {
  $.CONSUME(Header)
  $.CONSUME(Currency)
})
$.RULE("transaction", () => {
  $.CONSUME(DMY, { LABEL: "booking_date" })
  $.CONSUME(Transfer)
  $.CONSUME(NumericLiteral)
})
$.RULE("anyText", () => {
  $.MANY(() => $.CONSUME(SKIP_TOKEN))
})
As you might have guessed, it finds the header (skipping any date before the header), but when it should then find the start of the transactions, the date is skipped as well. I'm not sure what the best solution is. Would it be possible to have something like a ...
Probably not. I am not convinced this feature is needed beyond syntactic sugar, nor am I planning any significant new features at this time.
Back to your example:
I suspect you need multiple categories of "anyText"
Note that Chevrotain is a JavaScript DSL.
So you can likely creat…
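One possible reading of the "multiple categories" suggestion, sketched in plain JavaScript without the Chevrotain API: each parsing context gets its own skip category, and a date belongs only to the category used before the header. The category names `SKIP_BEFORE_HEADER` and `SKIP_IN_BODY`, the helper functions, and the sample input are hypothetical, and a transaction is simplified to a date followed by noise.

```javascript
// Library-free sketch (NOT the Chevrotain API). All names are hypothetical.
const SKIP_BEFORE_HEADER = "SKIP_BEFORE_HEADER"; // noise while looking for the header
const SKIP_IN_BODY = "SKIP_IN_BODY"; // noise while looking for transactions

// Token definitions, tried in order. A date is noise before the header but
// meaningful afterwards, so it only belongs to the first category.
const tokenTypes = [
  { name: "Header", pattern: /^account statement/i, categories: [] },
  { name: "DMY", pattern: /^\d{2}\.\d{2}\.\d{4}/, categories: [SKIP_BEFORE_HEADER] },
  { name: "Any", pattern: /^[^\s]+/, categories: [SKIP_BEFORE_HEADER, SKIP_IN_BODY] },
];

function tokenize(text) {
  const tokens = [];
  let rest = text.trimStart();
  while (rest.length > 0) {
    // "Any" matches every non-whitespace run, so a type is always found.
    const type = tokenTypes.find((t) => t.pattern.test(rest));
    const image = rest.match(type.pattern)[0];
    tokens.push({ type: type.name, image, categories: type.categories });
    rest = rest.slice(image.length).trimStart();
  }
  return tokens;
}

function parseStatement(tokens) {
  let pos = 0;
  const skip = (category) => {
    while (pos < tokens.length && tokens[pos].categories.includes(category)) {
      pos++;
    }
  };

  skip(SKIP_BEFORE_HEADER); // dates in front of the header are ignored
  if (pos >= tokens.length) throw new Error("no header found");
  pos++; // consume the Header token

  const bookingDates = [];
  skip(SKIP_IN_BODY); // dates are not in this category, so skipping stops at them
  while (pos < tokens.length && tokens[pos].type === "DMY") {
    bookingDates.push(tokens[pos].image);
    pos++;
    skip(SKIP_IN_BODY);
  }
  return bookingDates;
}

const sample =
  "printed 01.01.2024 Account Statement EUR " +
  "05.01.2024 transfer 100 06.01.2024 transfer 250 end";
const bookingDates = parseStatement(tokenize(sample));
console.log(bookingDates); // [ '05.01.2024', '06.01.2024' ]
```

The date printed before the header is dropped, while the two dates after it are kept, because the skip category changes once the header has been consumed.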