Is it advantageous or redundant to include a negative lookahead when defining tokens? For example:

`/(?:0|[1-9]\d{0,8})(?=[^\da-zA-Z.])/`

This regex represents a JavaScript-style integer. At the end, I have included a negative lookahead to ensure that nothing that could be construed as additional digits follows. I am doing something similar for all of my numeric definitions.

Is this best practice or a code smell? Does it add unnecessary logical overhead to the lexer that would normally be handled by the parser? Or does it make sense to catch this early, during lexing, before trying to parse the lexer output?

This is my first attempt at writing a parser, so any opinions on this are welcome. Thanks!

ETA: I made minor corrections to the original regex, which did not allow a plain `0` and allowed too many digits for an integer.
Thank you @msujew. What I'm trying to do with that lookahead (and I fixed the regex, as it did not match a plain `0`) is to prevent something like `9999999995` from being lexed as `999999999` (an int) followed by `5` (another int). But I guess that would leave `9999999995` down in the errors grouping, with little context as to why it wasn't lexed.

It sounds like having implied logic during the lexing pass is a bad idea unless it is absolutely necessary to disambiguate tokens. I'm going to simplify these patterns. Thanks for your help.
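The splitting behavior described here can be checked directly with sticky regexes. This is only a sketch: it assumes the intended integer pattern is `[1-9]\d{0,8}` (at most nine digits), and `matchAt` is a hypothetical helper, not part of any lexer library. It also shows one thing worth knowing about the pattern as written: `(?=[^...])` is a positive lookahead on a negated class, so it needs a following character and fails at the end of the input, whereas a true negative lookahead `(?!...)` does not.

```javascript
// Sticky (y) patterns only match starting exactly at lastIndex,
// which is how a lexer would try a token at the current offset.
const plainInt = /(?:0|[1-9]\d{0,8})/y;
// The pattern from the question: positive lookahead on a negated class.
const guardedInt = /(?:0|[1-9]\d{0,8})(?=[^\da-zA-Z.])/y;
// Assumed alternative: a true negative lookahead.
const negatedInt = /(?:0|[1-9]\d{0,8})(?![\da-zA-Z.])/y;

// Hypothetical helper: returns the matched text at `offset`, or null.
function matchAt(pattern, input, offset) {
  pattern.lastIndex = offset;
  const m = pattern.exec(input);
  return m === null ? null : m[0];
}

console.log(matchAt(plainInt, "9999999995", 0));   // "999999999"
console.log(matchAt(plainInt, "9999999995", 9));   // "5"
console.log(matchAt(guardedInt, "9999999995", 0)); // null (no INT here at all)
console.log(matchAt(guardedInt, "42;", 0));        // "42"
console.log(matchAt(guardedInt, "42", 0));         // null (nothing follows "42")
console.log(matchAt(negatedInt, "42", 0));         // "42"
```

Without the lookahead, the over-long number silently becomes two INT tokens. With it, no INT matches at that offset, so the input has to be picked up by another token or reported as a lexing error.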
Hey @khkiley,

Generally, I'd say that if it's not absolutely necessary to facilitate correct lexing/parsing, you shouldn't use negative/positive lookaheads/lookbehinds. They add complexity and, most importantly, have a negative performance impact.

For your specific question: it depends. In most normal languages, a string such as `123abc` will be lexed as `[INT, ID]` tokens. If you add the negative lookahead, the lexer will need to find another matching token, maybe just `[ID]`. This is fine if you allow variable names such as `123abc`, but might otherwise lead to unexpected behavior. If the lexer doesn't find a matching token, it will yield a lexing error. Any characters included in that error will…
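The three outcomes for `123abc` can be reproduced with a tiny hand-rolled lexer. This is a minimal sketch, not any particular lexer library's implementation: `lex`, the token objects, and the one-character error recovery are all assumptions made for illustration, and the INT pattern assumes the `[1-9]\d{0,8}` form of the regex.

```javascript
// Hypothetical mini-lexer: at each offset, try the token patterns in order.
function lex(input, tokenTypes) {
  const tokens = [];
  const errors = [];
  let offset = 0;
  while (offset < input.length) {
    let matched = false;
    for (const { name, pattern } of tokenTypes) {
      pattern.lastIndex = offset; // sticky patterns match only at lastIndex
      const m = pattern.exec(input);
      if (m !== null && m[0].length > 0) {
        tokens.push({ type: name, image: m[0] });
        offset += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Assumed recovery strategy: record the character and skip it.
      errors.push({ offset, char: input[offset] });
      offset += 1;
    }
  }
  return { tokens, errors };
}

const INT_PLAIN = { name: "INT", pattern: /(?:0|[1-9]\d{0,8})/y };
const INT_GUARDED = { name: "INT", pattern: /(?:0|[1-9]\d{0,8})(?=[^\da-zA-Z.])/y };
const ID_STRICT = { name: "ID", pattern: /[a-zA-Z_]\w*/y }; // no leading digit
const ID_LOOSE = { name: "ID", pattern: /\w+/y };           // leading digit allowed

const types = (result) => result.tokens.map((t) => t.type);

// No lookahead: INT takes "123", then ID takes "abc".
console.log(types(lex("123abc", [INT_PLAIN, ID_STRICT]))); // [ 'INT', 'ID' ]

// Lookahead plus identifiers that may start with a digit: one ID token.
console.log(types(lex("123abc", [INT_GUARDED, ID_LOOSE]))); // [ 'ID' ]

// Lookahead plus strict identifiers: "1", "2", "3" become lexing errors.
const strict = lex("123abc", [INT_GUARDED, ID_STRICT]);
console.log(types(strict), strict.errors.length); // [ 'ID' ] 3
```

In the last case the lookahead has turned input that would otherwise lex cleanly into lexing errors, which is the "unexpected behavior" to weigh against catching malformed numbers early.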