-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regexp patterns cannot end with backslash (even if backslash-escaped) #9701
Comments
(note: I believe our grammar behaves in similar ways with double- and single-quoted strings that end in backslashes, and that a similar fix may be non-breaking there too) |
I believe I am hitting this when trying to set The proposed fix works only when |
Did you mean forward-slash here? A forward-slash works just fine as a field splitter as evidenced below, but a lone backslash will fail in parsing as described above (regardless of the value of
@honzakral which docs? I haven't been able to find unambiguous documentation on this bit, and whether or not we have the |
Sorry, I meant backslash of course, it was a typo, I edited my comment and fixed it as not to confuse more people, my bad! It did understand it but then the docs for
which implies that without this setting it is not so. |
Correct, but there is more than one way for it to be "not so" 😩 Unfortunately, as with most escaping problems, this one is like an onion with all of the layers. For both strings and regexps, the parser has always allowed users to escape a close-quote character from the parser ( Our Regexp engine will gladly take the passthrough escape sequence, or it will ignore meaningless escapes, which combine to make it extra forgiving. On Strings though, emitting un-processed escape sequences from the parser was problematic. Many plugins either worked around the above behaviour (by implementing their own escape-sequence processing) or exploited it (e.g., by not escaping strings that are used to create plugin-internal regexp, like In order to specify a literal backslash in a quoted string or regexp in a pipeline config, you need to do one of two things:
While the former logically prevents strings from ending with unescaped backslashes, this ticket revolves around a bug with the latter, where an escaped backslash at the end of a string or regexp is being interpreted by the parser as In your case, once this bug is fixed, with
Yes, that is four backslashes:
Without
|
Thanks @yaauie for the very detailed analysis of the problem. What do you recommend moving forward? Your suggested changes to the regexp rule? We should probably also improve the documentation around that? @honzakral does the above explanation and solution makes sense with your issue? |
@colinsurprenant as long as I can do what I needed, I am ok. It would be great if it were documented in a way that would allow people to reason about what's going on - the way I undertsand it is that the escaping is done/interpretted twice, once (optionally) by the parser and once by the filter (I would also assume that different filters have different behaviors?) |
@honzakral yeah, this is what I understand and one of the specific problem is related to plugins that require a regex option like the kv filter. The config parser has no notion of a regex option for plugins so a regex must be passed as a string option. @yaauie our parser already has a notion of regex but for config expressions in conditionals, maybe we could look into introducing a regex option type for plugins? Do you think that could help? |
@colinsurprenant the option for The ultimate issue here is imho that both the parser and then the individual filters (sometimes) try and resolve the escaping. This leads to at best the need for double quotes at worst the inability to do something like specifying I believe the proper solution would be to unify the escaping rules by having the parser deal with it and hand over already unescaped string values to the plugins. It would also be possible to leave this to the plugins only but That seems less ideal to me as it would be implemented in many places instead of one. |
@honzakral ah yes sorry, my regex comment was for options like "field_split_pattern" and for some reason I forgot your original comment was with "field_split". |
@colinsurprenant this issue is around literal regexp in the pipelines; literal strings, which are interpreted by a plugin to be a regexp, fall into the
If you see no major flaw in my analysis of the regexp and why we fail to parse strings- or regexp-ending-in-escaped-backslashes, I can push forward and get to a PR with the updated treetop grammar and generated parser.
Your solution is that of |
@yaauie even with Of course I understand that that is a big change and this solution will fix the corner case which is causing problems. |
I agree that In this case, the kv plugin really shouldn't be injecting verbatim inputs into a regexp, and should perform the necessary escaping of any inputs it receives instead of requiring the user to add a layer of escaping into the current options because of this shortcoming. I've filed logstash-plugins/logstash-filter-kv#66 as a potential solution. Additionally, @colinsurprenant's suggestion of making regexp an option type for plugins could also be helpful. I'll open up a separate design discussion around that, since there may a bit of a chicken-vs-egg situation implementing in plugins that can be installed into Logstashes that don't yet have it implemented. |
Right. This is off-topic here but maybe for 7.0 we could discuss how to improve plugins compatibility vs logstash versions/features. That was the original intent of the |
@honzakral WDYT about the
Yes, I think we should move forward. |
@colinsurprenant sounds good, though I would consider it more of a bug fix so I wouldn't mind the |
I'm referencing tests I'm running on dissect: |
PS This issue does not apply to regex! |
@TheVastyDeep Not if you are using it for splitting, then there is a vast difference when you add |
An issue in the treetop grammar can cause the
regexp
to be too aggressive at capturing, which can ultimately cause the pipeline compilation to fail when a pattern ends with a backslash (even if that backslash is backslash-escaped).I believe I know how to fix this, but would like to discuss it to consider the risk before we tackle the compilation of the treetop grammar, since the tooling to compile the grammar does not immediately appear to be working for me and fixing it may take some effort.
The treetop grammar currently defines a regexp as:
In plain English that is:
'/'
)'\/'
); OR!'/' .
)'/'
)note: rule 2b uses a literal "Negative Lookahead Assertion" to prevent it from matching forward slashes
Using this definition, in a config file the sequence of characters
/foo\\bar/
will correctly generate a Regexp that matches the literal stringfoo\bar
, but the pattern/fubar\\/
, which attempts to match the literal stringfubar\
will capture the intended (3) closing forwardslash as (2a), fail to close, and continue capturing until it either hits EOL or an unescaped forward slash from another expression, typically resulting in a pipeline compilation error.The below table shows where the parser captures too eagerly; each parenthised group in the parse column indicates the rule that caused the pattern to match, followed by the match.
/foo/
/
)(2bf
)(2bo
)(2bo
)(3/
)/foo\\bar/
/
)(2bf
)(2bo
)(2bo
)(2b\
)(2b\
)(2bb
)(2ba
)(2br
)(3/
)/foo\/bar/
/
)(2bf
)(2bo
)(2bo
)(2a\/
)(2bb
)(2ba
)(2br
)(3/
)/fubar\\/
/
)(2bf
)(2bu
)(2bb
)(2ba
)(2br
)(2b\
)(2a\/
)...I believe that we can redefine the grammar for regexp to capture escape sequences better without significant risk:
In plain English that is:
'/'
)'\' .
); OR!'/' .
)'/'
)This will slightly change the meaning of how escape sequences get parsed (e.g., sequences that matched 2b twice before may match 2a now), but because backslashes and the character that followed them were always captured and this change does nothing to process the escape sequences, this fix poses no external risk.
/foo/
/
)(2bf
)(2bo
)(2bo
)(3/
)/foo\\bar/
/
)(2bf
)(2bo
)(2bo
)(2a\\
)(2bb
)(2ba
)(2br
)(3/
)/foo\/bar/
/
)(2bf
)(2bo
)(2bo
)(2a\/
)(2bb
)(2ba
)(2br
)(3/
)/fubar\\/
/
)(2bf
)(2bu
)(2bb
)(2ba
)(2br
)(2a\/
)(3\/
)...The text was updated successfully, but these errors were encountered: