-
-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(deps): bumping regex-syntax #320
Conversation
Funny enough: I also tried to update all dependencies in the logos eco system (not just regex-syntax). First, literals are not just one byte/char at a time, but instead a whole string/byte slice. Therefore the following line should be changed as following: logos/logos-codegen/src/mir.rs Line 75 in e2702f6
Second, regex-syntax does some crazy optimization, e.g. I'm not entirely sure, if they should be dropped or relaxed, but this works: --- a/logos-codegen/src/graph/regex.rs
+++ b/logos-codegen/src/graph/regex.rs
@@ -207,7 +206,7 @@ mod tests {
let mir = Mir::utf8("a|b").unwrap();
- assert_eq!(mir.priority(), 2);
+ assert!((1..=2).contains(&mir.priority()));
diff --git a/logos-codegen/src/mir.rs b/logos-codegen/src/mir.rs
index 18a6307..5f258e8 100644
--- a/logos-codegen/src/mir.rs
+++ b/logos-codegen/src/mir.rs
@@ -207,17 +204,17 @@ mod tests {
#[test]
fn priorities() {
let regexes = [
- ("[a-z]+", 1),
- ("a|b", 2),
- ("a|[b-z]", 1),
- ("(foo)+", 6),
- ("foobar", 12),
- ("(fooz|bar)+qux", 12),
+ ("[a-z]+", (1..=1)),
+ ("a|b", (1..=2)),
+ ("a|[b-z]", (1..=1)),
+ ("(foo)+", (6..=6)),
+ ("foobar", (12..=12)),
+ ("(fooz|bar)+qux", (12..=12)),
];
for (regex, expected) in regexes.iter() {
let mir = Mir::utf8(regex).unwrap();
- assert_eq!(mir.priority(), *expected);
+ assert!(expected.contains(&mir.priority()));
}
}
} |
logos-codegen/src/mir.rs
Outdated
Mir::Literal(lit) => { | ||
match std::str::from_utf8(&lit.0) { | ||
Ok(s) => 2 * s.len(), | ||
Err(_) => 2 * lit.0.len(), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're doing this because of unicode strings?
Is it possible to mix string and byte pattern in logos?
If so, can you provide a testcase where this is covered?
If not, I think doing the simple solution, e.g. 2 * lit.0.len()
is enough here. But I don't have the last word of course ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should come with a test for that :-)
But I did that because Mir::Literal
can be constructed either from a char
or from an u8
. But a single char
, which should have priority 2
, can be made of up to 4 bytes. So using 2 * lit.0.len()
for such a char
could make the priority much higher. I will try to write some tests for that :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh I made the dummy mistake, but str
's length should be s.chars().count()
:'-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we talk about grapheme clusters? I mean ẽ
could either be just ẽ
or e~
.
Please take a look at: https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talk about unicode chars, see regex#54.
I don't think regex-syntax
or logos
supports grapheme clusters, so I don't see why we should support them now?
Oh yeah... this optimization is quite annoying, and seems to be caused by the AST to HIR translator, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6e993bda9d1d375ea2dd0bd3927a38a6. If think there are three possible solutions to this:
It is also important to check whether regex-syntax's optimizations are actually impacting (or not) Logos' performances :'-) |
Ok so I did a small benchmark to compare equivalent patterns, and it seems that optimizing alternations into classes (at least a range) can improve performances, see this example: |
Yeah I thought so ;) Why should they do that, if not for performance :D I mean, honestly: There's no reason for us so stick to "old" behavoir. We just need to make sure, that priorities do not conflict with each other. And I think we can do this with more tests, but IMHO we shouldn't block the upgrade just because old tests are not working. |
True! But optimizations in the
I agree, but this is a major change, and hopefully we should be able to produce a stable result such that "equivalent regex patterns have the same priority". Maybe it is as simple as multiplicating the class priority by 2 ^^' Anyway, I'd like to have @maciejhirsz's opinion on that too. |
b055070
to
22a13f7
Compare
logos/src/lib.rs
Outdated
@@ -231,7 +231,7 @@ pub trait Logos<'source>: Sized { | |||
/// enum Token<'a> { | |||
/// // We will treat "abc" as if it was whitespace. | |||
/// // This is identical to using `logos::skip`. | |||
/// #[regex(" |abc", |_| Skip)] | |||
/// #[regex(" |abc", |_| Skip, priority = 3)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this is necessary? Is this because the alternate form gets shortened to a class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this is because the next pattern has seen its priority go from 1 to 2. So we increase this priority to have the same behaviour as before.
Your commits seem to be messed up. You have duplicate commits. Could you rebase your commits, or maybe squash them (I don't like squashing, but it's okay to use). |
@@ -71,8 +68,11 @@ impl Mir { | |||
Mir::Empty | Mir::Loop(_) | Mir::Maybe(_) => 0, | |||
Mir::Concat(concat) => concat.iter().map(Mir::priority).sum(), | |||
Mir::Alternation(alt) => alt.iter().map(Mir::priority).min().unwrap_or(0), | |||
Mir::Class(_) => 1, | |||
Mir::Literal(_) => 2, | |||
Mir::Class(_) => 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to adjust Class
to something different.
There is the iter
function on ClassUnicode
and those items have a len
method.
Maybe that is what we want instead of a hard coded class?! I'm not really sure tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it will have a different result: a non-empty class means that at it must match one element from the class. All "special" characters are escaped in a class, and you can only match one byte or one char → priority is 2
tests/tests/ignore_case.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not really sure what you need to adjust all the priorities.
Can you explain that please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because bumping Mir::Class
's priority from 1 to 2 broke a lot of tests or regexes, and we now need to specify the priority in such cases
585dc8e
to
854575c
Compare
f64555a
to
8440f01
Compare
8440f01
to
6d03544
Compare
The
regex-syntax
introduces breaking changes between 0.6 and 0.8.This PR tries to bump the regex-syntax dependency, to make Logos a bit more future-proof regarding that :-)
Major changes
Since
regex-syntax
seems to do some optimization on the HIR, some patterns see their priority be affected by this.E.g.:
"a|b"
(priority = 2) is optimized into"[a-b]" (priority = 2)
.Those optimizations also result in better performances for Logos, so I suggest to not go against them :-) See benchmark below.
As a result, this PR proposes to change the priority of classes, so that equivalent regex patterns have the same priority. This "rule" is tested in
logos-codegen/src/mir.rs
.This is a breaking change, but will hopefully never change in the future, as regex HIRs cannot be optimized in something else than an equivalent pattern.