Perf issue with regexes that start with repeating digits #1708
base: v3.3/dev
Conversation
If we are looking at a pattern such as `[0-9]+a`: for unanchored matching, this regex is equivalent to `[0-9]a`. The former will match if and only if the latter will match. So instead of the repeated `[0-9]+` we could write a single `[0-9]`.
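The equivalence Mirko describes can be sanity-checked mechanically. A minimal sketch in Python (the pattern `[0-9]+a` is the example from the PR description; the point is that, for unanchored search, dropping the `+` quantifier cannot change whether a match exists):

```python
import re

# For unanchored search, "[0-9]+a" finds a match in a string exactly
# when "[0-9]a" does: any match of the former contains a match of the
# latter (its last digit plus the "a"), and any match of the latter is
# already a match of the former.
with_quantifier = re.compile(r"[0-9]+a")
without_quantifier = re.compile(r"[0-9]a")

tests = ["11111a", "a", "1111", "x123a", "", "9a9a", "12x34a"]
for s in tests:
    assert bool(with_quantifier.search(s)) == bool(without_quantifier.search(s)), s
print("match/no-match agrees on all test strings")
```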
That makes sense to me.
I think that's a great suggestion, Mirko. Much simpler. I'll either update this PR, or if Airween is sending one I'll close this one. Thanks!
Let me make some performance checks, then we can decide.
I do not claim to understand the patterns, but after staring at the 3 changes for some minutes, they look fine to me. In all 3 cases we are changing a repeated element at the beginning of the regex (the start of a top-level alternative) to a single occurrence, which is semantically equivalent for our use case. A side remark: if you are looking into more detail here, there may be room for making these regexes smaller; for example, there is a repeating pattern which (written more readably) seems to be the same as a shorter equivalent. Another side remark: in a former company, we used macro-expansion to generate regexes from smaller snippets, which made the whole thing more readable.
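The macro-expansion idea Mirko mentions can be as simple as string templating. A hypothetical sketch (the snippet names `EVASION` and `QUOTE` are invented for illustration, not taken from any real ruleset):

```python
# Hypothetical macro-expansion: build larger regexes from named snippets,
# so a fragment that must always be identical is defined exactly once.
SNIPPETS = {
    "EVASION": r"(?:/\*.*?\*/|\s)*",  # invented example fragment
    "QUOTE":   r"['\"`]",
}

def expand(template: str) -> str:
    """Replace {NAME} placeholders with the corresponding snippet."""
    for name, fragment in SNIPPETS.items():
        template = template.replace("{%s}" % name, fragment)
    return template

print(expand(r"union{EVASION}select"))
# -> union(?:/\*.*?\*/|\s)*select
```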
Good suggestions for simplifications. Maybe let's keep them in a separate PR, so it's easier to understand the reason for the changes? We do not have a macro-expansion step for the regexes in this project, but we do as a matter of fact have a regex compilation step that combines smaller regexes into bigger ones. I had forgotten about it here until I saw your comment. Pushed a new iteration with the regexp data files updated. Somehow the regexp-assemble.pl script decided to move things around a bit, though. Don't know why. That regexp-assemble.pl script is a fairly black box to me.
Oddly, regexp-assemble.pl added an extra `|`.
Please don't hand-edit. @fgsch: do you have an idea what's the matter here? Also: there might be regex-snippet re-use across multiple rules. But even if it would be a simplification in some regards, I'm reluctant to welcome this, as it grows the complexity of the regex generation even more.
We did use it for stuff which should always be the same across rules. But you are right: it is a question of complexity. In our case back then, it reduced complexity.
A quick performance test result: very similar values in the case of a long-matches pattern. I think Mirko's suggestion is better. Any other ideas (related to tests, fixes, ...) are welcome.
Cool. Could you please share the code? Could we integrate this into utils?
Yea, I totally agree, don't use my initial suggestion. Removing code is much better than adding code to achieve the same. This PR is already updated to use Mirko's approach.
But the problem with the assembly script persists, doesn't it? We should not merge as long as we have not sorted that one out.
@airween do you have the timings for the original regex and the same input as a comparison?
Yes, I checked the original regex, but didn't list it here because it was the expected value - they all take a very loooong time... above 13 secs each. But you're right, here is the full and fresh list:
Sure, I have to review and clean the code - right now it's a piece of junk :).
I think we can, but we need to discuss how (I think there isn't any C source under utils).
I do not mind introducing C source there. And if we do not like it long-term, we can always replace it with Python code. Please start the cleanup process. :)
@dune73 @airween As far as I know, there are Python bindings for PCRE, so we may be able to write this in Python. This would mean that we can use the Python timeit module, which gives stable results for this kind of test. Maybe we can talk about this later. I have some ideas about how I can contribute here and would like to discuss them.
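As a rough illustration of the timeit idea: Python's own re module is also a backtracking engine, so it shows the same blow-up as non-JIT PCRE on this kind of input. A minimal sketch (the two patterns are the examples from the PR description, not the full CRS regexes):

```python
import re
import timeit

subject = "1" * 10000  # long run of digits with no trailing "a"

slow = re.compile(r"[0-9]+a")  # backtracks heavily when the match fails
fast = re.compile(r"[0-9]a")   # Mirko's approach: no quantifier

for name, pattern in [("original", slow), ("fixed", fast)]:
    t = timeit.timeit(lambda: pattern.search(subject), number=1)
    print("%-8s %.6f s per failing search" % (name, t))
```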
That sounds great. Want to contact me on Slack?
Python's PCRE binding is a bit old (I just found this one), but PCRE is still old too :), so it doesn't necessarily need more updates. Or we can use Python's FFI module; I've never used it before, so I'm not clear about its performance. The advantage of Python is that we can parse the whole rule set (with secrules_parsing or msc_pyparser) and collect the regex arguments.
Just a quick look at the Python PCRE binding: as can be seen, it has much worse performance, but it seems useful anyway. Any ideas?
I do not think the performance difference between the C and Python bindings is that big. The original has a 30-40% difference, but with the faster regexes, it's really quite low. I think that's perfectly acceptable, especially if we can avoid C that way. :)
Well, my curiosity won: I tried the CFFI module - this is something catastrophic... The results (Mirko's, Allan's):
Okay, they are just a bit worse than the previous ones.
(I had to send a kill signal to the process.) I think we should stay with the Python PCRE binding.
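For reference, a minimal sketch of what a CFFI call into libpcre can look like (not airween's actual tool; the library soname `libpcre.so.3` and the trimmed-down declarations are assumptions about a Debian-style system with classic PCRE1 installed):

```python
from cffi import FFI

ffi = FFI()
# Trimmed-down declarations for the classic PCRE (PCRE1) API.
ffi.cdef("""
    typedef struct real_pcre pcre;
    pcre *pcre_compile(const char *pattern, int options,
                       const char **errptr, int *erroffset,
                       const unsigned char *tableptr);
    int pcre_exec(const pcre *code, const void *extra,
                  const char *subject, int length, int startoffset,
                  int options, int *ovector, int ovecsize);
""")
pcre = ffi.dlopen("libpcre.so.3")  # assumption: Debian-style soname

err = ffi.new("const char **")
erroffset = ffi.new("int *")
code = pcre.pcre_compile(b"[0-9]a", 0, err, erroffset, ffi.NULL)
assert code != ffi.NULL

subject = b"1" * 10000
ovector = ffi.new("int[30]")
rc = pcre.pcre_exec(code, ffi.NULL, subject, len(subject), 0, 0, ovector, 30)
print("rc =", rc)  # negative means no match (PCRE_ERROR_NOMATCH is -1)
```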
Let me summarize the experience with this issue. For better analysis, I've made a small tool which helps to check the regex arguments of rules. The currently used rules take a long time to evaluate the regular expressions in the case of v2 - but only if the engine doesn't use the PCRE JIT feature:
With JIT (or with v3), the problem doesn't exist:
v3:
First @allanrbo modified these regexes, then @mirkodziadzka-avi helped him to clean them up. Now I have reviewed the commits and measured all scenarios. Here you can find the regexes. There are 9 files in 3 groups - the 3 affected rules, and 3 regexes for every rule, with a suffix: 0 means the original (currently used), 1 is the first commit from Allan, and 2 is Mirko's proposal.

Methodology
I ran each measurement 10 times. Then I removed the MIN and MAX values from the set, summed the remaining 8 values, and divided by 8 (see the sketch below). In the next lines, you can see these values. The bold marks the better result of the compared regexes (with suffixes 1 and 2).
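In other words, a trimmed mean. A small sketch of that aggregation step (the run count of 10 follows from the description above; the timing values are hypothetical):

```python
def trimmed_mean(times):
    """Drop the single MIN and MAX measurement, average the rest.

    With 10 measurements this averages the remaining 8 values,
    as described above.
    """
    assert len(times) >= 3
    ordered = sorted(times)
    trimmed = ordered[1:-1]  # discard one MIN and one MAX outlier
    return sum(trimmed) / len(trimmed)

# Hypothetical timings in seconds for one regex variant:
print(trimmed_mean([0.91, 0.88, 0.95, 0.90, 13.2, 0.89, 0.92, 0.87, 0.93, 0.04]))
```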
How can you check?
Summary: I think we should close this PR and open a new one which contains Mirko's fix for 941120 and 942260, and Allan's for 942210. And that PR should be cleaner (with fewer commits :)). @allanrbo, @mirkodziadzka-avi - and others: any opinion?
@airween Thanks for the work! I would like to understand the difference in the performance of 942210. I noticed that the order of the 2nd and 3rd groups between 942210_1.txt and 942210_2.txt is switched. I assume regexp-assemble.pl did this? As a wild guess (no measurement involved), I would assume that this reordering is responsible for the performance difference here. Will check this if I find time.
Maybe - I haven't checked the "raw" format of the regex yet, just the compiled one. Of course, we have to investigate it.
Thanks!
I didn't find any regex raw data for @allanrbo's first commit. I assume he just modified the argument of the rule directly.
Yea, I probably (accidentally) just modified the .conf file directly. I can see that in later commits I realized this mistake and modified the correct regex-assemble file, but only after I had moved to Mirko's approach. I still think Mirko's approach is much cleaner. I don't agree that a 25% perf increase is significant enough to go for a much more complex and hard-to-maintain regex in this context. If I remember right, these regexes are very much corner cases, and the perf only becomes an issue with very specifically crafted input (likely attack input). I think we don't need to optimize for high perf for attackers :-) - the happy path (non-attacks) is where perf matters.
On the Slack channel on 2020-03-02, Airween pointed out some PCRE performance issues with requests containing long consecutive strings of digits. He suggested a fix using the (*SKIP) verb. Since this verb is PCRE-specific, I'd like to avoid it, in order to pave the way for modsec alternatives that use regex libraries with guaranteed linear run time (RE2, Hyperscan, and Go's regexp, for example).
This PR is a suggestion to modify rules 941120, 942210, and 942260.
Here's my understanding of the issue: the regexes in these rules all have the option to begin matching with a repeated number of digits, such as `[0-9]+`. For example, the regex `[0-9]+a` would match `11111a`, `111111111a`, `11111111111111a`, etc. Now the problem arises with, for example, a string like `11111` (without an `a`). In this example, PCRE has consumed `11111` and then looks for `a`, doesn't find it, and therefore backtracks to try instead with just `1111`, then `111`, then `11`, and then finally with `1`.

To avoid this backtracking, we can instead anchor the regex before the `[0-9]+`, to either the beginning of the string or some character that is not a digit. The fixed example then becomes `(?:^|[^0-9])[0-9]+a`.

Mirko suggested below a much cleaner approach than my first approach: simply don't bother with the quantifier. Updated the PR with this cleaner approach.
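To see the blow-up concretely, here is a minimal sketch using Python's re module (also a backtracking engine); doubling the input length roughly quadruples the failing-search time for the original pattern, while the fixed pattern stays flat. The patterns are the toy examples from above, not the full rule regexes:

```python
import re
import time

original = re.compile(r"[0-9]+a")  # backtracks through every digit run
fixed = re.compile(r"[0-9]a")      # single digit: no backtracking blow-up

for n in (2000, 4000, 8000):
    subject = "1" * n  # all digits, no "a": worst case for the original
    for name, pat in (("original", original), ("fixed", fixed)):
        start = time.perf_counter()
        pat.search(subject)
        elapsed = time.perf_counter() - start
        print(f"n={n:5d} {name:8s} {elapsed:.4f}s")
```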
Tested manually with a large request as suggested by Airween, and verified that the rules now perform many orders of magnitude faster.
There is also a perf issue with 942130, but I will send a PR with a suggested fix for that separately, as I have (by chance) been working on some changes to this rule for some weeks.