Question about best practice regarding the lexer and negative look aheads. #2039

khkiley · 2024-08-18T20:19:08Z

Is it advantageous or redundant to include a negative lookahead when defining tokens?

For example: /(?:0|[1-9]\d{0,8})(?=[^\da-zA-Z.])/

This regex represents a JavaScript-style integer. At the end, I have included a negative lookahead to ensure that nothing which could be construed as additional digits follows. I am doing something similar for all of my numeric definitions.

Is this best practice or a code smell? Does it add unnecessary logical overhead to the lexer that would normally be handled by the parser? Or does it make sense to catch this earlier, during lexing, before trying to parse the lexer output?

This is my first attempt at writing a parser, so any opinions on this are welcome.

Thanks!

ETA: I made minor corrections to the original regex, which did not allow a plain 0 and allowed too many digits for an integer.

Answered by msujew

msujew (Collaborator) · 2024-08-18T21:34:51Z

Hey @khkiley,

Generally, I'd say that if they're not absolutely necessary to facilitate correct lexing/parsing, you shouldn't use negative/positive lookaheads/lookbehinds. They add complexity and, most importantly, have a negative performance impact.

For your specific question: It depends. In most normal languages, a string such as 123abc will be lexed as [INT, ID] tokens. If you add the negative lookahead, the lexer will need to find another matching token, maybe just [ID]. This is fine if you allow variable names such as 123abc, but might otherwise lead to unexpected behavior. If the lexer doesn't find a matching token, it will yield a lexing error. Any characters included in that error will not be processed by the parser, which might also be unexpected.
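To make the three outcomes above concrete, here is a minimal hand-rolled lexer sketch. It is not any particular lexer library's API; `tokenize`, `INT_PLAIN`, `INT_GUARDED`, `ID` and `ID_LOOSE` are hypothetical names chosen for illustration, and the guarded pattern uses a true negative lookahead `(?!...)` rather than the `(?=[^...])` form from the question (which would additionally require some character to follow the number).

```javascript
// Hypothetical minimal lexer: at each offset, try the token patterns in order
// using sticky (`y`) regexes, which only match exactly at `lastIndex`.
function tokenize(input, tokenTypes) {
  const tokens = [];
  const errors = [];
  let offset = 0;
  while (offset < input.length) {
    let matched = false;
    for (const { name, pattern } of tokenTypes) {
      pattern.lastIndex = offset;
      const m = pattern.exec(input);
      if (m !== null && m[0].length > 0) {
        tokens.push({ type: name, image: m[0] });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // No token matches here: record a lexing error and skip one character.
      errors.push({ offset, char: input[offset] });
      offset += 1;
    }
  }
  return { tokens, errors };
}

const ID = { name: "ID", pattern: /[a-zA-Z_]\w*/y };  // may not start with a digit
const ID_LOOSE = { name: "ID", pattern: /\w+/y };      // may start with a digit
const INT_PLAIN = { name: "INT", pattern: /0|[1-9]\d*/y };
const INT_GUARDED = { name: "INT", pattern: /(?:0|[1-9]\d*)(?![\da-zA-Z.])/y };

// Without the lookahead: tokens INT "123" and ID "abc", no errors.
console.log(tokenize("123abc", [INT_PLAIN, ID]));

// With the lookahead and strict identifiers: "1", "2", "3" become lexing
// errors and only ID "abc" reaches the parser.
console.log(tokenize("123abc", [INT_GUARDED, ID]));

// With the lookahead and identifiers that may start with a digit:
// a single ID "123abc".
console.log(tokenize("123abc", [INT_GUARDED, ID_LOOSE]));
```

A real lexer will differ in how it groups and reports skipped characters, but the trade-off is the same: the lookahead moves a decision into the lexer that the parser or a later check could otherwise make with more context.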

In the end, it boils down to how you want certain strings to be interpreted: as an INT and/or ID (or nothing at all). It's either specced out (if you're building a parser for an existing language), or you can define the exact semantics yourself. If it's the latter, I would recommend against using the lookahead.

khkiley (Author) · 2024-08-19T14:04:10Z

Thank you @msujew.

What I'm trying to do with that (and I fixed the regex, as it did not match just a plain 0) is to prevent something like 9999999995 from being lexed as the int 999999999 followed by the int 5. But I guess the lookahead would instead leave 9999999995 down in the errors grouping, with little context as to why it wasn't lexed.
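A small sketch of that behavior, assuming a pattern capped at nine digits; `CAPPED_INT`, `CAPPED_GUARDED_INT` and `matchAt` are hypothetical names, and the guarded variant uses a true negative lookahead `(?!...)` as a stand-in for the one in the original regex:

```javascript
// Hypothetical patterns: an integer capped at nine digits, without and with
// a trailing negative lookahead. Sticky (`y`) so they match only at lastIndex.
const CAPPED_INT = /0|[1-9]\d{0,8}/y;
const CAPPED_GUARDED_INT = /(?:0|[1-9]\d{0,8})(?![\da-zA-Z.])/y;

// Try a sticky pattern at a given offset; return the matched text or null.
function matchAt(pattern, input, offset) {
  pattern.lastIndex = offset;
  const m = pattern.exec(input);
  return m === null ? null : m[0];
}

const overflowInput = "9".repeat(9) + "5"; // ten digits: 9999999995

// Without the lookahead, the ten digits split into two INT matches.
console.log(matchAt(CAPPED_INT, overflowInput, 0)); // "999999999"
console.log(matchAt(CAPPED_INT, overflowInput, 9)); // "5"

// With the lookahead, nothing matches at offset 0, so a lexer would report
// an error there instead of an INT.
console.log(matchAt(CAPPED_GUARDED_INT, overflowInput, 0)); // null

// If the lexer recovers by skipping a single character, the remaining nine
// digits match as an INT, which is arguably even more confusing.
console.log(matchAt(CAPPED_GUARDED_INT, overflowInput, 1)); // "999999995"
```

Either way the digit cap produces a surprising token stream, which is a reason to keep the size limit out of the token pattern altogether.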

It sounds like having implied logic during the lexing pass is a bad idea, unless absolutely necessary to disambiguate tokens.

I'm going to simplify these patterns.

Thanks for your help.

2 replies

bd82 Aug 22, 2024
Maintainer

> It sounds like having implied logic during the lexing pass is a bad idea, unless absolutely necessary to disambiguate tokens.

Yes exactly.

I would parse 999999999999999999 as an integer, and at a later phase (after parsing) I would perform a semantic check to detect that the integer is "out of bounds".
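A minimal sketch of such a post-parse check, assuming the integer token's text is available as a string. `checkIntLiteral` and `MAX_INT` are hypothetical names, and the bound of 999999999 is only an assumption mirroring the nine-digit cap from the original regex:

```javascript
// Assumed upper bound for this sketch; a real language would take it from its spec.
const MAX_INT = 999999999n;

// Semantic check run after lexing/parsing: the token pattern accepts any
// number of digits, and the range is validated here with a clear message.
function checkIntLiteral(image) {
  if (!/^(?:0|[1-9]\d*)$/.test(image)) {
    return { ok: false, message: `"${image}" is not an integer literal` };
  }
  // BigInt avoids the precision loss Number would have on very long literals.
  const value = BigInt(image);
  if (value > MAX_INT) {
    return {
      ok: false,
      message: `integer literal ${image} is out of bounds (max ${MAX_INT})`,
    };
  }
  return { ok: true, value: Number(value) };
}

console.log(checkIntLiteral("42"));                 // ok, value 42
console.log(checkIntLiteral("999999999999999999")); // not ok, out-of-bounds message
```

Because the check sees the whole literal, the error can say exactly what is wrong, rather than leaving unexplained characters in the lexer's error list.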

khkiley Aug 23, 2024
Author

Exactly the method I am taking, thank you for confirming.
