Perf issue with regexes that start with repeating digits #1708
base: v3.3/dev
Conversation
If we are looking at a pattern such as `[0-9]+a`: for unanchored matching, this regex is equivalent to `[0-9]a`. The former will match if and only if the latter will match. So instead of the repeated `[0-9]+` we could write a single `[0-9]`.
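The equivalence Mirko describes can be sanity-checked mechanically. A minimal sketch in Python (the pattern `[0-9]+a` is the example from the PR description; the point is that, for unanchored search, dropping the `+` quantifier cannot change whether a match exists):

```python
import re

# For unanchored search, "[0-9]+a" finds a match in a string exactly
# when "[0-9]a" does: any match of the former contains a match of the
# latter (its last digit plus the "a"), and any match of the latter is
# already a match of the former.
with_quantifier = re.compile(r"[0-9]+a")
without_quantifier = re.compile(r"[0-9]a")

tests = ["11111a", "a", "1111", "x123a", "", "9a9a", "12x34a"]
for s in tests:
    assert bool(with_quantifier.search(s)) == bool(without_quantifier.search(s)), s
print("match/no-match agrees on all test strings")
```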
That makes sense to me.
I think that's a great suggestion, Mirko. Much simpler. I'll either update this PR, or if Airween is sending one I'll close this one. Thanks!
Let me make some performance checks, then we can decide.
I do not claim to understand the patterns, but after staring at the 3 changes for some minutes, they look fine to me. In all 3 cases we are changing a repeated element at the beginning of the regex (the start of a top-level alternative) to a single occurrence, which is semantically equivalent for our use case. A side remark: if you are looking into more detail here, there may be room for making these regexes smaller; for example, there is a repeating pattern which (written more readably) seems to be the same as a shorter equivalent. Another side remark: in a former company, we used macro-expansion to generate regexes from smaller snippets, which made the whole thing more readable.
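The macro-expansion idea Mirko mentions can be as simple as string templating. A hypothetical sketch (the snippet names `EVASION` and `QUOTE` are invented for illustration, not taken from any real ruleset):

```python
# Hypothetical macro-expansion: build larger regexes from named snippets,
# so a fragment that must always be identical is defined exactly once.
SNIPPETS = {
    "EVASION": r"(?:/\*.*?\*/|\s)*",  # invented example fragment
    "QUOTE":   r"['\"`]",
}

def expand(template: str) -> str:
    """Replace {NAME} placeholders with the corresponding snippet."""
    for name, fragment in SNIPPETS.items():
        template = template.replace("{%s}" % name, fragment)
    return template

print(expand(r"union{EVASION}select"))
# -> union(?:/\*.*?\*/|\s)*select
```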
Good suggestions for simplifications. Maybe let's keep them in a separate PR, so it's easier to understand the reason for the changes? We do not have a macro-expansion step for the regexes in this project, but we do as a matter of fact have a regex compilation step that combines smaller regexes into bigger ones. I had forgotten about it here until I saw your comment. Pushed a new iteration with the regexp data files updated. Somehow the regexp-assemble.pl script decided to move things around a bit, though. Don't know why. That regexp-assemble.pl script is a fairly black box to me.
Oddly, regexp-assemble.pl added an extra `|`.
Please don't hand-edit. @fgsch: do you have an idea what's the matter here? Also: there might be regex-snippet re-use across multiple rules. But even if it would be a simplification in some regards, I'm reluctant to welcome this, as it grows the complexity of the regex generation even more.
We did use it for stuff which should always be the same across rules. But you are right: it is a question of complexity. In our case back then, it reduced complexity.
A quick performance test result: very similar values in the case of a long-matches pattern. I think Mirko's suggestion is better. Any other ideas (related to tests, fixes, ...) are welcome.
Cool. Could you please share the code? Could we integrate this into utils?
Yea, I totally agree, don't use my initial suggestion. Removing code is much better than adding code to achieve the same. This PR is already updated to use Mirko's approach.
But the problem with the assembly script persists, doesn't it? We should not merge as long as we have not sorted that one out.
@airween do you have the timings for the original regex and the same input as a comparison?
Yes, I checked the original regex, but didn't list it here because it was the expected value - they all take a very loooong time... above 13 secs each. But you're right, here is the full and fresh list:
Sure, I have to review and clean the code - right now it's a piece of junk :).
I think we can, but we need to discuss how (I think there isn't any C source under utils).
I do not mind introducing C source there. And if we do not like it long-term, we can always replace it with Python code. Please start the cleanup process. :)
@dune73 @airween As far as I know, there are Python bindings for PCRE, so we may be able to write this in Python. This would mean that we can use the Python timeit module, which gives stable results for this kind of test. Maybe we can talk about this later. I have some ideas about how I can contribute here and would like to discuss them.
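As a rough illustration of the timeit idea: Python's own re module is also a backtracking engine, so it shows the same blow-up as non-JIT PCRE on this kind of input. A minimal sketch (the two patterns are the examples from the PR description, not the full CRS regexes):

```python
import re
import timeit

subject = "1" * 10000  # long run of digits with no trailing "a"

slow = re.compile(r"[0-9]+a")  # backtracks heavily when the match fails
fast = re.compile(r"[0-9]a")   # Mirko's approach: no quantifier

for name, pattern in [("original", slow), ("fixed", fast)]:
    t = timeit.timeit(lambda: pattern.search(subject), number=1)
    print("%-8s %.6f s per failing search" % (name, t))
```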
That sounds great. Want to contact me on Slack?
Python's PCRE binding is a bit old (I just found this one), but PCRE is still old too :), so it doesn't necessarily need more updates. Or we can use Python's FFI module; I've never used it before, so I'm not clear about its performance. The advantage of Python is that we can parse the whole rule set (with secrules_parsing or msc_pyparser) and collect the regex arguments.
Just a quick look at the Python PCRE binding: as can be seen, it has much worse performance, but it seems useful anyway. Any ideas?
I do not think the performance difference between the C and Python bindings is that big. The original has a 30-40% difference, but with the faster regexes, it's really quite low. I think that's perfectly acceptable, especially if we can avoid C that way. :)
Well, my curiosity won: I tried the CFFI module - this is something catastrophic... The results (Mirko's, Allan's):
Okay, they are just a bit worse than the previous ones.
(I had to send a kill signal to the process.) I think we should stay with the Python PCRE binding.
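For reference, a minimal sketch of what a CFFI call into libpcre can look like (not airween's actual tool; the library soname `libpcre.so.3` and the trimmed-down declarations are assumptions about a Debian-style system with classic PCRE1 installed):

```python
from cffi import FFI

ffi = FFI()
# Trimmed-down declarations for the classic PCRE (PCRE1) API.
ffi.cdef("""
    typedef struct real_pcre pcre;
    pcre *pcre_compile(const char *pattern, int options,
                       const char **errptr, int *erroffset,
                       const unsigned char *tableptr);
    int pcre_exec(const pcre *code, const void *extra,
                  const char *subject, int length, int startoffset,
                  int options, int *ovector, int ovecsize);
""")
pcre = ffi.dlopen("libpcre.so.3")  # assumption: Debian-style soname

err = ffi.new("const char **")
erroffset = ffi.new("int *")
code = pcre.pcre_compile(b"[0-9]a", 0, err, erroffset, ffi.NULL)
assert code != ffi.NULL

subject = b"1" * 10000
ovector = ffi.new("int[30]")
rc = pcre.pcre_exec(code, ffi.NULL, subject, len(subject), 0, 0, ovector, 30)
print("rc =", rc)  # negative means no match (PCRE_ERROR_NOMATCH is -1)
```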
Let me summarize the experience with this issue. For better analysis, I've made a small tool which helps to check the regex arguments of rules. The currently used rules take a long time to evaluate the regular expressions in the case of v2 - but only if the engine doesn't use the PCRE JIT feature:
With JIT (or with v3), the problem doesn't exist:
v3:
First @allanrbo modified these regexes, then @mirkodziadzka-avi helped him to clean them up. Now I have reviewed the commits and measured all scenarios. Here you can find the regexes. There are 9 files in 3 groups - the 3 affected rules, and 3 regexes for every rule, with a suffix: 0 means the original (currently used), 1 is the first commit from Allan, and 2 is Mirko's proposal.

Methodology
I ran each measurement 10 times. Then I removed the MIN and MAX values from the set, summed the remaining 8 values, and divided by 8 (see the sketch below). In the next lines, you can see these values. The bold marks the better result of the compared regexes (with suffixes 1 and 2).
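In other words, a trimmed mean. A small sketch of that aggregation step (the run count of 10 follows from the description above; the timing values are hypothetical):

```python
def trimmed_mean(times):
    """Drop the single MIN and MAX measurement, average the rest.

    With 10 measurements this averages the remaining 8 values,
    as described above.
    """
    assert len(times) >= 3
    ordered = sorted(times)
    trimmed = ordered[1:-1]  # discard one MIN and one MAX outlier
    return sum(trimmed) / len(trimmed)

# Hypothetical timings in seconds for one regex variant:
print(trimmed_mean([0.91, 0.88, 0.95, 0.90, 13.2, 0.89, 0.92, 0.87, 0.93, 0.04]))
```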
How can you check?
Summary: I think we should close this PR and open a new one which contains Mirko's fix for 941120 and 942260, and Allan's for 942210. And that PR should be cleaner (with fewer commits :)). @allanrbo, @mirkodziadzka-avi - and others: any opinion?
@airween Thanks for the work! I would like to understand the difference in the performance of 942210. I noticed that the order of the 2nd and 3rd groups between 942210_1.txt and 942210_2.txt is switched. I assume regexp-assemble.pl did this? As a wild guess (no measurement involved), I would assume that this reordering is responsible for the performance difference here. Will check this if I find time.
Maybe - I haven't checked the "raw" format of the regex yet, just the compiled one. Of course, we have to investigate it.
Thanks!
I didn't find any regex raw data for @allanrbo's first commit. I assume he just modified the argument of the rule directly.
Yea, I probably (accidentally) just modified the .conf file directly. I can see that in later commits I realized this mistake and modified the correct regex-assemble file, but only after I had moved to Mirko's approach. I still think Mirko's approach is much cleaner. I don't agree that a 25% perf increase is significant enough to go for a much more complex and hard-to-maintain regex in this context. If I remember right, these regexes are very much corner cases, and the perf only becomes an issue with very specifically crafted input (likely attack input). I think we don't need to optimize for high perf for attackers :-) - the happy path (non-attacks) is where perf matters.
On the Slack channel on 2020-03-02, Airween pointed out some PCRE performance issues with requests containing long consecutive strings of digits. He suggested a fix using the (*SKIP) verb. Since this verb is PCRE-specific, I'd like to avoid it, in order to pave the way for modsec alternatives that use regex libraries with guaranteed linear run time (RE2, Hyperscan, and Go's regexp, for example).
This PR is a suggestion to modify rules 941120, 942210, and 942260.
Here's my understanding of the issue: the regexes in these rules all have the option to begin matching with a repeated number of digits, such as `[0-9]+`. For example, the regex `[0-9]+a` would match `11111a`, `111111111a`, `11111111111111a`, etc. Now the problem arises with, for example, a string like `11111` (without an `a`). In this example, PCRE has consumed `11111` and then looks for `a`, doesn't find it, and therefore backtracks to try instead with just `1111`, then `111`, then `11`, and then finally with `1`.

To avoid this backtracking, we can instead anchor the regex before the `[0-9]+`, to either the beginning of the string or some character that is not a digit. The fixed example then becomes `(?:^|[^0-9])[0-9]+a`.

Mirko suggested below a much cleaner approach than my first approach: simply don't bother with the quantifier. Updated the PR with this cleaner approach.
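To see the blow-up concretely, here is a minimal sketch using Python's re module (also a backtracking engine); doubling the input length roughly quadruples the failing-search time for the original pattern, while the fixed pattern stays flat. The patterns are the toy examples from above, not the full rule regexes:

```python
import re
import time

original = re.compile(r"[0-9]+a")  # backtracks through every digit run
fixed = re.compile(r"[0-9]a")      # single digit: no backtracking blow-up

for n in (2000, 4000, 8000):
    subject = "1" * n  # all digits, no "a": worst case for the original
    for name, pat in (("original", original), ("fixed", fixed)):
        start = time.perf_counter()
        pat.search(subject)
        elapsed = time.perf_counter() - start
        print(f"n={n:5d} {name:8s} {elapsed:.4f}s")
```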
Tested manually with a large request as suggested by Airween, and verified that the rules now perform many orders of magnitude faster.
There is also a perf issue with 942130, but I will send a PR with a suggested fix for that separately, as I have (by chance) been working on some changes to this rule for some weeks.