Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Fix depth-first and leftmost-longest matching rules #5218

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Jan 2, 2025

Fixes four bugs in the implementation of the ECMAScript (depth-first) and POSIX (leftmost-longest) matching rules. Fixes #731. Fixes #996.

Bug in ECMAScript matching rule implementation

ECMAScript prescribes depth-first matching. This effectively means: In a disjunction, try left alternative first; for a greedy quantifier, try greatest possible number of repetitions first; for a non-greedy quantifier, try smallest possible number of repetitions first. The matcher is supposed to yield the first match it finds using this strategy.

So why do we get incorrect match results in #731's test cases? Because the matcher deviates from ECMAScript evaluation rules: An optimization has been implemented in _Matcher::_Do_rep0(), the main purpose of which seems to be to reduce stack usage. This optimization essentially tries smallest number of repetitions first for greedy quantifiers as well, so the first matches are discovered as if a non-greedy quantifier is used. Consider the test case from #731 (comment) where "AAA BBB" is to be matched with regular expression (A+)\s*(B+)?\s*B*. If quantifiers are processed smallest-first, then the first discovered full match is the following: (A+) matches "AAA", first \s* matches "" (smallest number of repetitions), (\B+)? matches "", second \s* matches " " and B* matches "BBB". This is the wrong result produced by regex_match() in the test case.

However, the implementation of _Matcher::_Do_rep0() is somewhat aware that it is matching using the wrong strategy for greedy quantifiers: It doesn't just immediately stop matching when it encounters a first match, but keeps trying to add more repetitions until matching fails. Thus, the matcher will find the correct result as well: (A+) matches "AAA", first \s* matches " " (maximal number of repetitions), (\B+)? matches "BBB" (maximal number of repetitions), second \s* matches "" and B* matches "".

But when a new match is finally recognized when processing the _N_end node in _Matcher::_Match_pat(), there is no clear way to tell whether this one should have been encountered before during depth-first search. The test in _Matcher::_Better_match() is certainly wrong for this purpose, since it's implementing leftmost-longest rule (but incorrectly, see below), not the depth-first rule. One needs information about the path taken through the NFA to make this determination correctly, but the required information is buried somewhere in the call stack.

This observation suggests the fix: In ECMAScript mode, all recursively called functions in the matcher except _Matcher::_Do_rep0() return immediately when a match has been encountered, so the first match is described by _Tgt_state. _Matcher::_Do_rep0() does not return immediately for greedy quantifiers, since it's processing in smallest-first order, but it keeps track of the value of _Tgt_state for the greatest number of successfully matching repetitions in the local variable _Final and finally assigns _Final to _Tgt_state before it returns. So this function also ensures that _Tgt_state describes the first match encountered under depth-first matching rules.

So to make this long story short: The fix is that _Matcher::_Match() must construct the final result from _Tgt_state in ECMAScript mode, not _Res.

Since _Res is effectively no longer used and _Matcher::_Better_match() is only relevant for matching according to the leftmost-longest rule, we can also skip the assignment to _Res and the call of _Matcher::_Better_match() in _Matcher::_Match_pat() when depth-first matching is applied.

Two bugs in _Matcher::_Better_match()'s implementation of leftmost-longest matching rule

Interestingly, POSIX regular expressions yield the same wrong result for the (equivalent) regular expression (A+)[[:space:]]*(B+)?[[:space:]]*B*: Matching "AAA BBB", the first capture group is "AAA" and the second is unmatched, even though the second capture group should have been "BBB". This is due to bugs in _Matcher::_Better_match() which were already observed in #1547 (comment):

  • According to Section 9.1 of the POSIX standard, even a null (empty) string is a better match for a subexpression (capture group) than it not being matched at all. This was not implemented in _Better_match(). After adding this rule, "AAA BBB" gets matched correctly by regular expression (A+)[[:space:]]*(B+)?[[:space:]]*B*.
  • For a subexpression (capture group), a match starting further to the left is better. But _Better_match() kind of implemented the opposite rule, "rightmost-longest", because < was used instead of >. Matching "aabaac" with (aa|aabaac|ba|b|c)* exposes the bug: The leftmost-longest match for the first capture group is "aabaac", but "c" is returned without this PR because it is the right-most choice of the first capture group when the whole string is matched. (First repetition matches "aa", second matches "b", third matches "aa", and finally the last one matches "c", which is assigned to the capture group.)

Bug in _Matcher::_Do_rep()'s implementation of leftmost-longest matching rule

Without this PR, _Matcher::_Do_rep() only performs greedy matching when searching for a match under the leftmost-longest rule: When it finds a match with some number of repetitions, it does not check whether a match with fewer repetitions is better (e.g., longer) under the leftmost-longest rule. After this PR, _Matcher::_Do_rep() now checks any possible number of repetitions.

Remarks

MSVC STL's matcher uses the same interpretation of leftmost-longest as Boost (leftmost-longest capture groups, where the capture groups refer to the last matched strings), but the implementation has been quite buggy. This PR fixes the bugs in the implementation, so that the match results now agree with Boost's on the added test cases.

When applying the leftmost-longest rule, we still don't agree with libstdc++ (which sometimes seems to produce results closer to depth-first matching) and libc++ (namely for regular expression (aa|aabaac|ba|b|c)* as well as ^[[:blank:]]*#([^\\n]*\\\\[[:space:]]+)*[^\\n]* in awk mode and its sibling in extended mode) on some added test cases. I haven't checked, but there is a good chance that they implement some different interpretation of the leftmost-longest rule. The POSIX standard is surprisingly terse in describing leftmost-longest matching, leaving much room for interpretation.

@muellerj2 muellerj2 requested a review from a team as a code owner January 2, 2025 19:55
@muellerj2
Copy link
Contributor Author

After this PR, the matcher even seems to be compatible with Boost down to bugs. As Berglund et al. (2018) argue (and I agree with their assessment), Boost's leftmost-longest rule as specified implies that (.?){2} should match "x" such that "x" is assigned to the first capture group ("marked subexpression"). However, both MSVC STL and Boost assign "" to the capture group.

I think the fix is relatively straightforward (although I haven't tested it sufficiently yet), but I'm undecided whether we should apply it as this would mean deviation from Boost's behavior again:

diff --git a/stl/inc/regex b/stl/inc/regex
index 98519f61..da509f6e 100644
--- a/stl/inc/regex
+++ b/stl/inc/regex
@@ -3262,8 +3262,14 @@ bool _Matcher<_BidIt, _Elem, _RxTraits, _It>::_Do_rep(_Node_rep* _Node, bool _Gr
     if (0 <= _Node->_Max && _Node->_Max <= _Init_idx) {
         _Matched0 = _Match_pat(_Node->_End_rep->_Next); // reps done, try tail
     } else if (_Init_idx < _Node->_Min) { // try a required rep
-        if (!_Progress) {
-            _Matched0 = _Match_pat(_Node->_End_rep->_Next); // empty, try tail
+        if (!_Progress) { // empty match^M
+            if (_Longest) { // try one last repetition^M
+                _Psav->_Loop_idx  = _Node->_Min;^M
+                _Psav->_Loop_iter = _STD addressof(_Cur_iter);^M
+                _Matched0         = _Match_pat(_Node->_Next);^M
+            } else { // try tail^M
+                _Matched0 = _Match_pat(_Node->_End_rep->_Next);^M
+            }^M
         } else { // try another required match
             _Psav->_Loop_idx  = _Init_idx + 1;
             _Psav->_Loop_iter = _STD addressof(_Cur_iter);

@StephanTLavavej StephanTLavavej added the bug Something isn't working label Jan 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 4, 2025
@StephanTLavavej StephanTLavavej added the regex Everyone's favorite header label Jan 8, 2025
@muellerj2
Copy link
Contributor Author

muellerj2 commented Jan 17, 2025

I just noticed that this PR fixes #996 as well. I will quickly add the test case to the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex Everyone's favorite header
Projects
Status: Initial Review
2 participants