Wednesday, 13 February 2019

Variable-Length Lookbehinds: actually possible in Perl/PCRE!

Get strapped in, because this is as close as you're going to get to breaking news in the world of regular expressions.



You know what the best thing is about being involved in regex circles? It's not the money or the fame or the women.. no no; it's that, on occasion, you get to witness incredible discoveries unfold before your very eyes. This is exactly what happened in a small StackExchange chatroom earlier, when fellow regex tinkerer Grimy cracked a problem that previously seemed insoluble. In doing so, he uncovered a technique for emulating a feature that has been coveted for years by Perl and PCRE users alike: variable-length lookbehinds.

Now, forget \K. That doesn't come close to having the expressive power we want. I'm talking about truly emulating variable-length lookbehinds; that is, doing so for all conceivable scenarios.

The Revelation


In response to a challenge to match an arbitrary string (in the domain of ^.*$) that contains no unique characters, a task I once dismissed as impossible in the general case, Grimy ingeniously came up with the following:

Wonderfully elegant and simple in principle. Let us begin by examining the core concept.

The Concept

((?<=(?= ... |(?3)).))
The trick is to remember that, while lookbehinds can only peek a fixed number of characters behind the current matching point, the lookaheads they contain are not subject to any such restriction. If \1 is a single character, (?<=(?=\1).) succeeds where (?<=\1) fails because in the former case you are telling the regex engine that it needs to go back precisely one character to look for \1.

Because recursive algorithms are tricky to follow, let us expand this subexpression a couple of levels deep to observe what's happening:
((?<=(?= ... |(?<=(?= ... |(?<=(?= ... |(?3)).)).)).))
Notice how we're moving back one character each step of the way. So if (?= ... ) fails to match, move back one character and try again. Thus, it is equivalent in essence to using (?<=(?= ...).*) if actual variable-length lookbehinds were possible.

The Nanny


If you've read and understood everything so far, you may have thought to look back at Grimy's original expression to try and understand it in full. What, then, you ask, is the purpose of \n in the lookahead? After all, \n, which matches a new-line character, should not exist in the domain of ^.*$ since `.` does not match a new-line by default. So why is it there?

Regex engines often have a bunch of nannies in place to prevent you from hurting yourself. One of these employed by PCRE is a compile-time check for recursive calls that appear, from a cursory static inspection, as though they may not terminate. The simplest example of this is the expression (?R), which yields the compile-time error:
40  recursive call could loop indefinitely
With Grimy's expression, the PCRE engine doesn't realize that \2 is guaranteed to match a single character. To work around this, we need to give the engine at least one non-empty path to follow. \n is a contradiction in the domain of ^.*$ and, more importantly, it functions as a non-empty alternative that convinces the nanny that we know what we're doing.

Note that in the more general domain of ^[\s\S]*$, which may include new-line characters, x^ serves as a non-empty contradiction.

The Emulation


Perspicacious readers may have observed that the method above involves an unbounded lookahead, one that may peer far past the position at which we wish to place our emulated lookbehind. This is fine for the problem it aimed to solve, but if we want to expand the method to work in the general case, this issue needs to be addressed.

And so, for a variable-length subexpression X, (?<=X) can be implemented in both Perl and PCRE as:
(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S]))
See it in action on regex101

To negate it and emulate (?<!X), simply enclose the whole thing in a negative lookahead:
(?!(?=(?'a'[\s\S]*))(?'b'X(?=\k'a'\z)|(?<=(?=x^|(?&b))[\s\S])))
Aside from the fact that I've used two capturing groups, that snippet is basically plug-and-play: you may insert it anywhere in your expression where a variable-length lookbehind is desired.

The main difference here is that a new group "a" has been added to capture the portion of the string after the current matching point, so that when we travel backwards and look ahead for X we know where to stop. This solves the issue mentioned above.

Next, X has been moved to the outer group "b" as it would fail to match if X were zero-width since the lookahead only begins at the position one character behind the current matching point, and only moves backwards thereafter.

Here is a breakdown of the various components:
(?=(?'a'[\s\S]*))  # Capture the rest of the string in "a"
(?'b'
  X(?=\k'a'\z)     # Match X followed by the contents of "a" to ensure
                   # the emulated lookbehind stops at the correct point.
  |                # OR
  (?<=             # Look behind (one character) match either:
    (?=
      x^           # A contradiction; non-empty to appease the nanny
      |            # OR
      (?&b)        # Recurse (match X OR look behind (one character)) etc..
    )
    [\s\S]         # How far we go back each step: one single character
  )
) 

Note that by emulating a positive lookbehind in this way, you lose the luxury of having capture groups in X that return their values to the caller. But are you complaining?

The End


There you have it: another impossibility made possible by the creative application of old tools. I hope you enjoy using this new toy as much as I certainly will. Big shout out once again to Grimy for his stroke of genius.

As usual, I welcome you to follow me on twitter to stay apprised of more regex wackiness in the future.

The Bonus


What about problems that require comparing the substring at the current matching point to multiple substrings at later positions in the string - can we now handle those? Turns out we can! By iterating forwards with a lookahead and using our new toy to look back to the original matching point, we can make short work of any problem of that nature. For example, "match the longest word in a line" can now be solved in the general case:
\b(\w++)(?=(.*))(?!(.*\W)\b((?<=(?=(?=\1\2$)(?:(?=\w*+\3(\5?+\w))\w)++\b|(?4)).)))
Figuring out how that works is left as an exercise to the reader :) Suffice it to say, there is now a whole new set of problems for which we can irresponsibly abuse regex.

Tuesday, 13 November 2018

For Regex Purists: Can Formal Regular Expressions Be Fun?

Pre Ramble


Can formal regular expressions be fun? This is a question I asked myself recently. Modern regex engines have strayed a ways away from regular expressions as they are formally defined, offering a vast playground of powerful new features. It’s almost as though they have departed from the Chomsky hierarchy and entered into a sort of Chomsky anarchy.

But can we still have fun within the shackles of Type-3 grammars? By "fun", I do of course mean "painful, difficult, and laborious exercise".

I believe the answer is yes!

Consider the exercise of negating a word, ie. matching a string that does not contain a particular word. It is true that most regex engines provide negative lookaheads for this sort of purpose, and regex implementations generally allow you to exclude matches or otherwise handle the result of a matching failure. However, if we restrict ourselves to the domain of formal regular expressions, in which the only objects are character symbols and the only permissible operations are concatenation, alternation, and quantification via the "Kleene star" (‘*’), the exercise becomes much more challenging and therefore interesting.

A ridiculous example


Enough foreplay. I know you came here to see a long and complicated regex, and so I scoured the dictionary (using regex, naturally) looking for words that would be most challenging to negate and found the perfect one: "phosphoribosylpyrophosphate". Technically two words, but that dictionary listed it as one. Here is its negation written in formal regular expressions (wherein start and end anchors are implied):
([^p]|p(p|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|lp(|y(|r(|op(|h(|o(|sp(|h(|a(|t))))))))))))))))))))p)*([^ph]|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*([^po]|o([^ps]|s(pho(ribosylpyrophospho)*s)*([^p]|p([^ph]|h([^po]|o(ribosylpyrophospho)*([^prs]|r([^pi]|i([^pb]|b([^po]|o([^ps]|s([^py]|y([^pl]|l([^p]|p([^pyh]|y([^pr]|r([^po]|o([^p]|p([^ph]|h([^po]|o([^ps]|s([^p]|p([^ph]|h([^pao]|a([^pt]|t[^pe]))))))))))))))))))))))))))*(|p(p|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|lp(|y(|r(|op(|h(|o(|sp(|h(|a(|t))))))))))))))))))))p)*(|h(os(pho(ribosylpyrophospho)*s)*pho(ribosylpyrophospho)*ribosylph)*(|o(|s(pho(ribosylpyrophospho)*s)*(|p(|h(|o(ribosylpyrophospho)*(|r(|i(|b(|o(|s(|y(|l(|p(|y(|r(|o(|p(|h(|o(|s(|p(|h(|a(|t))))))))))))))))))))))))))
I used negated character classes since we haven’t defined a formal alphabet, but if instead you’d like to visualize each one as a group of character alternations, you’re more than welcome.

Why is this difficult?


Obviously, the word is long. But this makes the task laborious, not necessarily difficult. What makes it difficult is the potential for many different "cycles"; that is, we can observe that partial matches of the word exist at several positions throughout the word itself: "phosph" appears twice; "pho" appears three times; etc.

To see why this sucks for us, let us consider a simpler example. Match a string that doesn’t contain "abc". How would you write a formal regular expression for this?

One good first attempt could be the following: ^([^a]|a([^b]|b([^c]|\$)|\$))*\$

(This is a shortened form of and equivalent to ^([^a]|a([^b]|\$)|ab([^c]|\$))*\$, another valiant attempt. But character classes aside, this is not technically formal since `\$` is not a component of formal regular expressions. We will ignore this for now in the interests of simplifying our investigation, but know that there is an elegant way to eliminate these end anchors that is too large to fit in the margin will be introduced later in the post)

At first glance, this may look passable. Clearly it matches all strings that don’t contain "abc". But does it only match those strings? Unfortunately not. It hiccups when either `[^b]` or `[^c]` matches ‘a’, such as in "aabc". Then the outer group quantified by `*` iterates, at which point `[^a]` is free to match ‘b’ and then ‘c’ after that. See here.

How can we fix this? If the problem is that we’re matching too much, perhaps we need to make the expression more restrictive. If `[^b]` and `[^c]` should not match ‘a’, let us simply add ‘a’ to those negated character classes:
 ^([^a]|a([^ab]|b([^ac]|\$)|\$))*\$
Have we fixed it? Not yet. Sorry. I’m afraid your regular expression is in another castle.

We’ve actually made it too restrictive: it successfully doesn’t match all strings that contain "abc", but now it also fails to match a few that don’t (such as "aba"). This is the opposite problem to before. The good news is that we’re getting closer to the correct solution.

Let’s think about this.

When we first entered the outer group, at the start of the string, no characters had been matched yet. More importantly, no partial occurrences of "abc" had been matched yet. If we enter the outer group after having matched "a" or "ab", then we leave ourselves open to matching the rest of "abc". This means that all partial matches of "abc" (ie. "a" and "ab") need to be fully matched by each iteration of the outer group.

This subtle yet important detail is the reason why our intuition falls short of being able to write these regular expressions quickly and easily. We must pay special attention to the potential for these partial matches. Here is what the correct solution looks like in this case:
 ^([^a]|a(a|ba)*([^b]|b([^c]|\$)|\$))*\$
`(a|ba)` is what I referred to earlier as a "cycle". In that position, "a" has been matched and "bc" is yet to be matched (or, indeed, not matched). If another "a" is matched, it is the same: "a" is matched and "bc" is yet to be matched. If "ba" is matched, the same still. In fact, we can continue in this fashion indefinitely, matching either "a" or "ba", because each time this happens we are simply cycling back to the very same position we were in when that first "a" was matched.

But not all cycles are this simple. A string may contain the potential for more than one cycle. Some cycles may take you back to positions past the first character. There may even be cycles within cycles. If we are to stand a chance of effectively tailoring these regular expressions to exclude any arbitrary string, we are going to need an algorithm.

JT’s Algorithm


I spent some time examining a wide variety of strings and building regular expressions to negate them, and I believe I’ve come up with the easiest and most direct method for doing so. I don’t know if I have any business labeling it an "algorithm" as I haven’t rigorously proven its effectiveness. However, I’m quite certain it works. Very scientific, I know.

The formal definition of the algorithm is presented below. It does a great job of making it look immensely complicated, but I assure you that even "phosphoribosylpyrophosphate" requires much less work to handle than it would seem.

If you are not interested in the technicalities and simply want to learn how to create your own string-negating regular expressions, skim the text here for technique then skip ahead to the examples.

JT’s Algorithm: Formal Definition


Let $s$ = $c_{1}c_{2}c_{3}...c_{n}$ be a string of characters, and let the negated character class $[$^$abc...]$ be a shorthand form of writing the group of alternations of all characters in the defined alphabet not contained in $abc...$. Then, using only the operations defined by formal regular expressions, you can follow the below procedure to write an expression that describes the language of all strings that do not contain $s$ at any position.

1. Create an upside-down word staircase by writing the full string on the first line and removing one character from the end on each subsequent line:
$l_{n}$ = $c_{1}c_{2} ... c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}$
...
$l_{3}$ =$c_{1}c_{2}c_{3}$
$l_{2}$ = $c_{1}c_{2}$
$l_{1}$ = $c_{1}$
2. Identify the cycles that may follow the substring on each line. For a given $l_{i}$, a cycle is any string of characters $o_{1}o_{2}...o_{m}$ such that:
  • For $k$ = 1 to $m-1$: $o_{k}$ = $c_{i+k}$
  • $o_{m}$ = $c_{i}$
  • $o_{m}$ != $c_{i+m}$
(Note that if $o_{m}$ = $c_{i+m}$ then you have not encountered a cycle, but rather a progression through $s$)

Look at the lines above $l_{i}$ and consider which of them, when followed by $c_{i}$ instead of the next character in the string $s$, would result in a partial match of $s$, wherein $c_{1}...c_{i}$ is matched and $c_{i+1}...c_{n}$ is yet to be matched. There could be multiple cycles on one line, or none at all. Write them out in the following way, combining them with the alternation operator '|' in a group quantified by the Kleene star `*`:
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}...c_{i}|c_{i+1}c_{i+2}c_{i+3}...c_{i}|...)*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}...c_{3}|c_{4}c_{5}c_{6}...c_{3}|...)*$
$l_{2}$ = $c_{1}c_{2}(c_{3}c_{4}...c_{2}|c_{3}c_{4}c_{5}...c_{2}|...)*$
$l_{1}$ = $c_{1}(c_{2}c_{3}...c_{1}|c_{2}c_{3}c_{4}...c_{1}|...)*$
(Note, save a copy of this formulation for step 5)

3. Combine the cycles in each group using the fact that $(c_{i+1}c_{i+2}...c_{i}|c_{i+1}c_{i+2}c_{i+3}...c_{i})$ = $(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})$. Aim to eliminate multiple occurrences of any particular character (except $c_{1}$) from each cycle group. This will be useful for the next step.
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*$
$l_{2}$ = $c_{1}c_{2}(c_{3}c_{4}(|c_{5})...c_{2})*$
$l_{1}$ = $c_{1}(c_{2}c_{3}(|c_{4})...c_{1})*$
4. Starting from the top, recursively copy the cycle group for each line $l_{i}$ (along with its Kleene star) to just after wherever $c_{i}$ appears in the cycles below it.
$l_{n}$ = $c_{1}c_{2}...c_{n-1}c_{n}$
$l_{n-1}$ = $c_{1}c_{2}...c_{n-1}(c_{n-1})*$
...
$l_{i}$ = $c_{1}c_{2}...c_{i}(c_{i+1}c_{i+2}(|c_{i+3})...c_{i})*$
...
$l_{3}$ = $c_{1}c_{2}c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*$
$l_{2}$ = $c_{1}c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*$
$l_{1}$ = $c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*$
5. Now we can start building our expression. Let $g_{i}$ be the cycle group with star on line $l_{i}$. Then:
  • For $i$ = 1 to $n-1$: concatenate "$([$^$c_{1}c_{i}]|c_{i}g_{i}$"
  • Then concatenate "$[$^$c_{1}c_{n}]$"
  • Then for $i$ = 1 to $n-1$: concatenate ")"
The expression may then look something like this:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$
6. Some of those negated character classes may need extra characters added to them to prevent cycles from being matched down that path in the expression. The easiest way to do this is to go back to the unsimplified cycles in step 2 and:
For $i$ = 2 to $n-1$: if $c_{i}$ != $c_{1}$ then: for each cycle $o$ with length $m$: add $c_{i}$ to the character class "$[$^$c_{1}c_{i+m}]$".
6b. Simplify the negated character classes, knowing that "$[$^$cc]$" = "$[$^$c]$"

7. Remember earlier in the article I mentioned that there is an elegant way to overcome our lack of the end anchor `\$`? Here it is. Concatenate a copy of the entire pattern and put a Kleene star in between to quantify the first group:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$$*$$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))$
Now go through the copy and delete all negated character classes from it:
$([$^$c_{1}c_{1}]|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*([$^$c_{1}c_{2}]|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*([$^$c_{1}c_{3}]|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... ([$^$c_{1}c_{n-1}]|c_{n-1}(c_{n-1})*[$^$c_{1}c_{n}]) ... )))*(|c_{1}(c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*(|c_{4})...c_{1})*(|c_{2}(c_{3}(c_{4}c_{5}(|c_{6})...c_{3})*c_{4}(|c_{5})...c_{2})*(|c_{3}(c_{4}c_{5}(|c_{6})...c_{3})* ... (|c_{n-1}(c_{n-1})*) ... )))$
And you're done.

This is considerably less complicated in practice, as the examples below show.

JT's Algorithm: Easy Example


Let s = "mimic". Let's start off nice and gently.

1. Create the upside-down word staircase:
mimic
mimi
mim
mi
m
2. Write out the cycles:
mimic
mimi
mim(im)*
mi
m(m|imm)*
We first observe that "mimim" cycles back to "mim". Likewise, "mm" and "mimm" cycle back to simply "m". However, "mimi" does not cycle back to "mi" because it is itself a partial match of "mimic".

3. Combine cycles to eliminate repeated characters in groups.

No significant combining to be done.  `((im)?m)*` is no more useful to us than `(m|imm)*`

4. Recursively insert cycle groups into lower-order cycles:
mimic
mimi
mim(im)*mi
m(m|im(im)*m)*
The cycle for "mim" simply needed to be inserted at the position where "mim" is matched in `m(m|imm)*`

5. Build the expression.

We loop and concatenate "([^mm]|m(m|im(im)*m)*" with "([^mi]|i" then with "([^mm]|m(im)*" then "([^mi]|i" then finally "[^mc]", and then add all the closing parentheses:
([^mm]|m(m|im(im)*m)*([^mi]|i([^mm]|m(im)*([^mi]|i[^mc]))))
6. Add characters to negated character classes that lead to cycles.

Looking back at step 2, both sets of cycles result from matching 'm', which is the first letter of "mimic". Therefore nothing needs to be added to the negated character classes.

6b. Simplify character classes.

Although we didn't add anything in the previous step, we can still simplify the one character class containing a superfluous 'm':
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))
7. And finally, a bit of magic.

Copy the expression and add it to itself with a `*` in between, then remove all the negated character classes in the copy:
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))*([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))
Becomes:
([^m]|m(m|im(im)*m)*([^mi]|i([^m]|m(im)*([^mi]|i[^mc]))))*(|m(m|im(im)*m)*(|i(|m(im)*(|i))))
And there you have it: a formal regular expression to match any string that does not contain "mimic".


JT's Algorithm: Average Example


Let s = "peppers". Getting a little trickier here.

1. Create the upside-down word staircase:
peppers
pepper
peppe
pepp
pep
pe
p
2. Write out the cycles:
peppers
pepper
peppe
pepp
pep(pep)*
pe(pe)*
p(p|eppp|epperp)*
As with most of the strings you're likely to encounter, the last line contains the majority of the cycles.

We observe that "peppep" cycles back to "pep"; "pepe" cycles back to "pe"; and matching a "p" after any of "p", "pepp", or "pepper" cycles back to simply "p".

3. Combine cycles to eliminate repeated characters in groups.

Here we can combine the cycles in the last group to help us with the next step:
peppers
pepper
peppe
pepp
pep(pep)*
pe(pe)*
p(p|epp(|er)p)*
4. Recursively insert cycle groups into lower-order cycles.

Let's do this one in steps. First, we insert the cycle for "pep" into the appropriate positions in the cycles for "pe" and "p". Recall that this is done exactly where "pep" is matched in those lines:
peppers
pepper
peppe
pepp
pep(pep)*
pe(p(pep)*e)*
p(p|ep(pep)*p(|er)p)*
Then, insert the cycle you just formed for "pe" into its position in the cycle for "p":
peppers
pepper
peppe
pepp
pep(pep)*
pe(p(pep)*e)*
p(p|e(p(pep)*e)*p(pep)*p(|er)p)*
5. Build the expression.

Tedious, but what to do.
([^pp]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^pp]|p(pep)*([^pp]|p([^pe]|e([^pr]|r[^ps]))))))
6. Add characters to negated character classes that lead to cycles.

In this example, there is only one line with cycles whose corresponding character is not equal to the first character ('p'), and that is line two, the cycle line for "pe". Its only cycle has length 2, so we need to add $c2$ ('e') to the negated character class "[^$c_{1}c_{2+2}$]". $c_{4}$ = 'p', which means we are adding to the second occurrence of "[^pp]":
([^pp]|p(p|e(p(pep)*e)*p(pep)*p(er)?p)*([^pe]|e(p(pep)*e)*([^pp]|p(pep)*([^ppe]|p([^pe]|e([^pr]|r[^ps]))))))
6b. Simplify those negated character classes:
([^p]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^p]|p(pep)*([^pe]|p([^pe]|e([^pr]|r[^ps]))))))
7. My favourite step.

Copy the expression and add `*` in between, then remove those negated character classes in the second half:
([^p]|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*([^pe]|e(p(pep)*e)*([^p]|p(pep)*([^pe]|p([^pe]|e([^pr]|r[^ps]))))))*(|p(p|e(p(pep)*e)*p(pep)*p(|er)p)*(|e(p(pep)*e)*(|p(pep)*(|p(|e(|r))))))
And we're done; you won't see any "peppers" getting through this formal regular expression.


JT's Algorithm: Difficult Example


Let s = "abaaabaababc". Grab a drink for this one.

1. Create the upside-down word staircase:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa
abaa
aba
ab
a
2. Write out the cycles.

This is tricky, but it requires nothing more than the same trusty logic that served us well earlier:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*
abaa(abaabaa)*
aba(aabaababa)*
ab(ab|aab|aaabab)*
a(a|baaaa)*
If, for example, I'm looking for the cycles of "abaaa", I look for all lines above it that end with "abaa" (that is, "abaaa" without the trailing 'a') then see if the next character is not 'a'. This is only the case with the line "abaaabaa": it ends with "abaa" and the next character in the line above is 'b'.

Apply a similar analysis to "abaa" and you'll find "abaaabaabaa" cycles back to it. Then you'll see "abaaabaababa" cycles back to "aba"; "abab", "abaab", and "abaaabab" all cycle back to "ab"; and, finally, both "aa" and "abaaaa" cycle back to simply "a".

The last line is surprisingly lacking in cycles since they are mostly dominated by the lines above it. Again, this is not typical; most words and phrases in practice do not contain significantly many partial matches of themselves.

3. Combine cycles to eliminate repeated characters in groups.

Here, we're only interested in the "ab" cycle group:
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*
abaa(abaabaa)*
aba(aabaababa)*
ab(a(|a(|aba))b)*
a(a|baaaa)*
You will be very thankful you did this.

4. Recursively insert cycle groups into lower-order cycles.

Let's do it one line at a time. First the cycle for "abaaa":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*
aba(aa(baaa)*baababa)*
ab(a(|a(|a(baaa)*ba))b)*
a(a|baaa(baaa)*a)*
Then the cycle for "abaa":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)* abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)* ab(a(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*
a(a|baa(a(baaa)*baabaa)*a(baaa)*a)*
Now "aba":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)*ab(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*
a(a|ba(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*
And, finally, "ab":
abaaabaababc
abaaabaabab
abaaabaaba
abaaabaab
abaaabaa
abaaaba
abaaab
abaaa(baaa)*abaa(a(baaa)*baabaa)*aba(a(a(baaa)*baabaa)*a(baaa)*baababa)*ab(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*
5. Now the fun part.. building the expression!

Eleven painstaking iterations. Have you built a script to assist you with this yet? I haven't.
([^aa]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^aa]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^aa]|a(a(baaa)*baabaa)*([^aa]|a(baaa)*([^ab]|b([^aa]|a([^aa]|a([^ab]|b([^aa]|a([^ab]|b[^ac])))))))))))
Check your work by scanning over the second characters of each of those negated character classes in sequence and making sure they spell "abaaabaababc".

6. Add characters to negated character classes that lead to cycles.

Look back at step 2 and identify the one line with cycles whose corresponding character is not 'a'. This is the line "ab(ab|aab|aaabab)*".

This means we need to add 'b' to the negated character classes associated with "abaa", "abaaa", and "abaaabaa".
([^aa]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^aa]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^aab]|a(a(baaa)*baabaa)*([^aab]|a(baaa)*([^ab]|b([^aa]|a([^aab]|a([^ab]|b([^aa]|a([^ab]|b[^ac])))))))))))
6b. Simplify occurrences of "aa" in the negated character classes:
([^a]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^a]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^ab]|a(a(baaa)*baabaa)*([^ab]|a(baaa)*([^ab]|b([^a]|a([^ab]|a([^ab]|b([^a]|a([^ab]|b[^ac])))))))))))
7. The final step.

Almost there. Now we do our little copy and paste trick. I don't need to tell you to delete the character classes from the second half of it by now:
([^a]|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*([^ab]|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*([^a]|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*([^ab]|a(a(baaa)*baabaa)*([^ab]|a(baaa)*([^ab]|b([^a]|a([^ab]|a([^ab]|b([^a]|a([^ab]|b[^ac])))))))))))*(|a(a|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*a(a(a(baaa)*baabaa)*a(baaa)*baababa)*a(a(baaa)*baabaa)*a(baaa)*a)*(|b(a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*ba))b)*(|a(a(a(baaa)*baabaa)*a(baaa)*baababa)*(|a(a(baaa)*baabaa)*(|a(baaa)*(|b(|a(|a(|b(|a(|b)))))))))))
And hey presto, we're done. Wasn't that easy? Well, a damn sight easier than doing it by trial and error, at least. If you ever to negate the word "abaaabaababc" using formal regular expressions, you now have a recipe for doing so.

The End


This was my attempt at filling in a gap of information that exists where formal regular expressions are concerned. I hope, dear reader, that you've found some of it interesting.

Don't think that I've run out of creative material in the domain of regex just yet. There will be more, so I invite you to stay tuned by following me on Twitter (@DrRegex).

Thank you for reading.

Thursday, 1 November 2018

How to match "A B C" where A+B=C: The Beast Reborn

A while ago, I created a regex to verify addition of two non-negative integers, and unleashed the unholy mess onto reddit. Reactions were fierce, heads exploded, etc. and so I immediately took to demystifying it by introducing line breaks, comments, and structured formatting. This follow-up was so effective that many commenters were lulled to a state of boredom by the result, underwhelmed by its simplicity. The loudest voices expressed dissatisfaction with the robustness of the solution, complaining about the lack of negative number support and, worse yet, the stark absence of decimal support.

Slowly, I came to realize that these people were right. These features needed to be implemented; the regex simply wasn’t long and complicated enough. So I returned to the drawing board, puked on it several times, and ended up with the following:

# A slow start
^
# Immediately step on the gas
# Parse expression as before, but with some extra pizzazz
(?=[-+]?(?:0\B)*+(\d*?)((?:(?=\d+(?:\.\d+)?\ [-+]?(?:0\B)*+(\d*?)(\d(?(4)\4))(?:\.\d+)?\ [-+]?(?:0\B)*+(\d*?)(\d(?(6)\6))(?:\.\d+)?$)\d)++)\b)

(?:

 # 0th case: handle 0 + 0 = 0 with any sign orientation
 ^(?:[-+]?0+(?:\.0+)?(?:\ |$)){3}\b$
 
|
 
 # 1st case: handle "A B C" and "-A -B -C", ie. check A+B=C
 (?=-\S*\ -\S*\ -|(?!.*-))

 [-+]?(?:0\B)*+

 # Check leading excess digits
 (?=
  (?(?!\1\.?(?!(?!(?&a))))

   # No carrying, match one excess part to the other
   (?=\S+\ \S+\ [-+]?0*\1\3\6(?:\.\d+)?$)
  |
   
   # Carrying, match one excess part to 1 + the other
   (?(?!.*+\3)\S+\ [-+]?(?:0\B)*)
   (?=\d*?(\2|\4)\b(.*?\ [-+]?(?:0\B)*)\S+$)
   
   # Check for carrying with all 9s
   (?(?=9*\7\b)
    # If so, match '9's to '0's and check for a leading '1'
    (?:9(?=\d*?\8[1](\g{-1}?+0)))*?
    \7\8[1]\g{-1}?+\6\b
   |
    # Otherwise, match equivalent pairs of digits until they differ by 1
    # Then pair any extra '0's with '9's
    (?:(\d)(?=\d*\8(\g{-1}?+\g{-2})))*+
    (?=\d(\d*\8\g{-2}?+))
    (?=0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)
    \d
    (?:9(?=\d*\8\g{-2}?+\d(\g{-1}?+0)))*?
    \7\8\g{-3}?+\d\g{-1}?+\6\b
   )
  )
 )

 \1

 # Check rest of digits
 (?:
  # Identify next group of digits to examine
  (?=\.?(\d)\S*\ [-+]?(?:0\B)*+\3((\g{-2}?+\.?)\d)\S*\ [-+]?(?:0\B)*+\5((\g{-2}?+\.?)\d))
  \.?(*SKIP)
  
  # Check for carrying and match digits accordingly
  (?(?=(?!
    \14\.?
    (?<a>
     # Match pairs of digits that total 9
     (?:
      (?=\d\.?\d(?<b>\S*?\ [-+]?(?:0\B)*\3\15?+)((\g{-2}?+\.?)\d))
      (?=\d(\S*?\g{-4}\g{-2}))
      (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
      \d\.?(?=\S*?(\g{-5}\g{-4}))
     )*+
     # Match a pair of digits that totals > 9
     (?=\d(\g{-2}\.?|(?&b)\.?))
     (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]|4\g{-1}[6-9]|6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
    ){0}
    
    # Backreferences were erroneously being set that carried over to subsequent iterations,
    # necessitating this subroutine call
    # PCRE really wanted me to fail, apparently
    (?&a)
   ))
  
   # No carrying, match pairs of digits and their sums
   # My goodness it's beautiful
   (?=\d(\S*\ [-+]?(?:0\B)*\3\16)\d(\S*\ [-+]?(?:0\B)*\5\18))
   (?=
     1\g{-2}(?:1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
    |2\g{-2}(?:1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
    |3\g{-2}(?:1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
    |4\g{-2}(?:1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
    |5\g{-2}(?:1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
    |6\g{-2}(?:1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
    |7\g{-2}(?:1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6)
    |8\g{-2}(?:1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
    |9\g{-2}(?:1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8)
    |0\g{-2}(\d)\g{-2}\g{-1}
    |(\d)\g{-4}0\g{-3}\g{-1}
   )
  |
  
   # Carrying, match pairs of digits and their sums + 1
   (?=\d((?-5))\d((?-5)))
   (?= 
     1\g{-2}(?:0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
    |2\g{-2}(?:0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
    |3\g{-2}(?:0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
    |4\g{-2}(?:0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
    |5\g{-2}(?:0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
    |6\g{-2}(?:0\g{-1}7|1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6) 
    |7\g{-2}(?:0\g{-1}8|1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
    |8\g{-2}(?:0\g{-1}9|1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8) 
    |9\g{-2}(?:0\g{-1}0|1\g{-1}1|2\g{-1}2|3\g{-1}3|4\g{-1}4|5\g{-1}5|6\g{-1}6|7\g{-1}7|8\g{-1}8|9\g{-1}9) 
    |0\g{-2}(?:0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
   )

  )
  \d
 )++

 # Paths to success
 (?:
  # A, B and C contain no more significant digits
  (?:\.0+)?0*\ [-+]?0*\3\15(?:\.0+)?0*\ [-+]?0*\5\17(?:\.0+)?0*$
 |
  # A contains no more significant digits, match B's excess to C's
  \ [-+]?0*\3\15(\.?\d*?)0*\ [-+]?0*\5\17\g{-1}(?:\.0+)?0*$
 |
  # B contains no more significant digits, match A's excess to C's
  (\.?\d*?)0*\ [-+]?0*\3\15\ [-+]?0*\5\17\g{-1}(?:\.0+)?0*$
 |
  # C contains no more significant digits but A and B both do
  # Match pairs of digits that total 9 until one that totals 10 is found
  \.?
  (?:
   (?=\d\d((?&b))((\g{-2}?+\.?)\d))
   (?=\d(\S*?\g{-4}\g{-2}))
   (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
   \d
   (?=\S*?(\g{-5}(\g{-5})))
  )*+
  # Match a pair that totals 10 then validate
  (?=\d(\g{-3}|(?&b)\.?))
  (?=1\g{-1}9|2\g{-1}8|3\g{-1}7|4\g{-1}6|5\g{-1}5|6\g{-1}4|7\g{-1}3|8\g{-1}2|9\g{-1}1)
  \d0*\ [-+]?0*\3\15\g{-2}?+\.?\d0*\ [-+]?0*\5\17\.?0*$
 )

|

 # Case 2: handle "-A B C" and "A -B -C", ie. check A+C=B
 (?=-(?!.*-)|[^-]\S*\ -\S*\ -)

 [-+]?(?:0\B)*+

 # Check leading excess digits
 (?=
  (?(?!\1\.?(?!(?!(?&c))))
  
   # No carrying, match one excess part to the other
   (?=\S+\ [-+]?0*\1\5\4\b)
   
  |
   
   # Carrying, match one excess part to 1 + the other
   # There are two paths here depending on whether |A| < |C| or |C| <= |A|
   (?|
   
    # |A| < |C|
    (?!.++\5)
    \S+\ [-+]?(?:0\B)*+(?=\S*(\ [-+]?(?:0\B)*+)())
    
    # Check for carrying with all '9's
    (?(?=\S*\41[9]*\6\b)
     # If so, match '9's to '0's and check for a leading '1'
     1(?:0(?=\S*\41(\g{-1}?+9)))*?
     \4\b\S*\41\g{-1}?+\6\b
    |
     # Otherwise, match equivalent pairs of digits until they differ by 1
     # Then pair any extra '0's with '9's
     (?:(\d)(?=\S*\41(\g{-1}?+\g{-2})))*+
     (?=\d(\S*\41\g{-2}?+))(?=1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8)
     \d
     (?:0(?=\S*\41\g{-2}?+\d(\g{-1}?+9)))*?
     \4\b\S*\41\g{-3}?+\d\g{-1}?+\6\b
    )
   |
    # |C| <= |A|, basically the same comments as above
    (?=.*+\5)
    (?=\d*?(\2)\b(\S*\ [-+]?(?:0\B)*+))
    
    (?(?=9*\41\b)
     (?:9(?=\d*?\42[1](\g{-1}?+0)))*?
     \41\42[1]\g{-1}?+\4\b
    |
     (?:(\d)(?=\d*\42(\g{-1}?+\g{-2})))*+
     (?=\d(\d*\42\g{-2}?+))
     (?=0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)
     \d
     (?:9(?=\S*\42\g{-2}?+\d(\g{-1}?+0)))*?
     \41\42\g{-3}?+\d\g{-1}?+\4\b
    )
   )
  )
 )

 \1

 # Check rest of digits
 (?:
  # Identify next group of digits to examine
  (?=\.?(\d)\S*\ [-+]?(?:0\B)*+\3((\g{-2}?+\.?)\d)\S*\ [-+]?(?:0\B)*+\5((\g{-2}?+\.?)\d))
  \.?(*SKIP)
  
  # Check for carrying and match digits accordingly
  (?(?=(?!
    \48\.?
    (?<c>
     # Match pairs of digits that total 9
     (?:
      (?=\d\.?\d(?<d>\S*?\ \S+\ [-+]?(?:0\B)*\5\51?+)((\g{-2}?+\.?)\d))
      (?=\d(\S*?\g{-4}\g{-2}))
      (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
      \d\.?(?=\S*?(\g{-5}\g{-4}))
     )*+
     # Match a pair of digits that totals > 9
     (?=\d(\g{-2}\.?|(?&d)\.?))
     (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]|4\g{-1}[6-9]|6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
    ){0}
    
    # Backreferences were erroneously being set that carried over to subsequent iterations,
    # necessitating this silly subroutine call 
    (?&c)
   ))
   
   # No carrying, match pairs of digits and their sums
   (?=\d(\S*\ [-+]?(?:0\B)*\3\50)\d(\S*\ [-+]?(?:0\B)*\5\52))
   (?=
    1\g{-2}(?:2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8|0\g{-1}9)
    |2\g{-2}(?:3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7|0\g{-1}8|1\g{-1}9)
    |3\g{-2}(?:4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6|0\g{-1}7|1\g{-1}8|2\g{-1}9)
    |4\g{-2}(?:5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5|0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9)
    |5\g{-2}(?:6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4|0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9)
    |6\g{-2}(?:7\g{-1}1|8\g{-1}2|9\g{-1}3|0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9)
    |7\g{-2}(?:8\g{-1}1|9\g{-1}2|0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9)
    |8\g{-2}(?:9\g{-1}1|0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9)
    |9\g{-2}(?:0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)
    |0\g{-2}(\d)\g{-2}\g{-1}
    |(\d)\g{-4}\g{-1}\g{-3}0
   )
   
  |
   # Carrying, match pairs of digits and their sums + 1
   (?=\d((?-5))\d((?-5)))
   (?= 
    1\g{-2}(?:2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7|0\g{-1}8|1\g{-1}9)
    |2\g{-2}(?:3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6|0\g{-1}7|1\g{-1}8|2\g{-1}9)
    |3\g{-2}(?:4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5|0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9)
    |4\g{-2}(?:5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4|0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9)
    |5\g{-2}(?:6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3|0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9)
    |6\g{-2}(?:7\g{-1}0|8\g{-1}1|9\g{-1}2|0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9) 
    |7\g{-2}(?:8\g{-1}0|9\g{-1}1|0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9)
    |8\g{-2}(?:9\g{-1}0|0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9) 
    |9\g{-2}(?:0\g{-1}0|1\g{-1}1|2\g{-1}2|3\g{-1}3|4\g{-1}4|5\g{-1}5|6\g{-1}6|7\g{-1}7|8\g{-1}8|9\g{-1}9) 
    |0\g{-2}(?:1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8|0\g{-1}9)
   )   
  )
 \d
 )++

 # Paths to success
 (?:

  # A, B, and C contain no more significant digits
  (?:\.0+)?0*\ [-+]?0*\3\49(?:\.0+)?0*\ [-+]?0*\5\51(?:\.0+)?0*$

 |

  # A contains no more significant digits, match B's excess to C's
  \ [-+]?0*\3\49(\.?\d*?)0*\ [-+]?0*\5\51\g{-1}(?:\.0+)?0*$

 |
  # C contains no more significant digits, match A's excess to B's
  (\.?\d*?)0*\ [-+]?0*\3\49\g{-1}(?:\.0+)?0*\ [-+]?0*\5\51$
  
 |
  
  # B contains no more significant digits but A and B both do
  # Match pairs of digits that total 9 until one that totals 10 is found
  \.?
  (?:
   (?=\d\d((?&d))((\g{-2}?+\.?)\d))
   (?=\d(\S*?\g{-4}\g{-2}))
   (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
   \d
   (?=\S*?(\g{-5}(\g{-5})))
  )*+
  # Match a pair that totals 10 then validate
  (?=\d(\g{-3}|(?&d)\.?))
  (?=1\g{-1}9|2\g{-1}8|3\g{-1}7|4\g{-1}6|5\g{-1}5|6\g{-1}4|7\g{-1}3|8\g{-1}2|9\g{-1}1)
  \d0*\ [-+]?0*\3\49\.?0*\ [-+]?0*\5\51\g{-2}?+\.?\d0*$
 )

|

 # And finally, handle "A -B C" and "-A B -C", ie. check B+C=A
 (?=[^-]\S*\ -\S*\ [^-]\S*|-\S*\ [^-]\S*\ -)

 [-+]?(?:0\B)*+

 # Check leading excess digits
 (?=
  (?(?!\S+\ [-+]?(?:0\B)*+\3\.?(?!(?!(?&e))))
   
   # No carrying, match one excess part to the other
   (?=\3\5\2\b)
  |
   
   # Carrying, match one excess part to 1 + the other
   (?=\d*(\S*\ (?(?=.*+\3)\S+\ )[-+]?(?:0\B)*+)\d*?(\4|\6)\b)
   
   # Check for carrying with a leading '1' and all '0's
   (?(?=10*\2\b)
    # If so, match '0's to '9's
    1(?:0(?=\d*\75(\g{-1}?+9)))*?
    \2\b\75\g{-1}?+\76\b
   |
    # Otherwise, match equivalent pairs of digits until they differ by 1
    # Then pair any extra '0's with '9's
    (?:(\d)(?=\d*\75(\g{-1}?+\g{-2})))*+
    (?=\d(\d*\75\g{-2}?+))
    (?=1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8)
    \d
    (?:0(?=\d*\75\g{-2}?+\d(\g{-1}?+9)))*?
    \2\b\75\g{-3}?+\d\g{-1}?+\76\b
   )
  )
 )

 \1

 # Check rest of digits
 (?:
  # Identify next group of digits to examine
  (?=\.?(\d)\S*\ [-+]?(?:0\B)*+\3((\g{-2}?+\.?)\d)\S*\ [-+]?(?:0\B)*+\5((\g{-2}?+\.?)\d))
  \.?(*SKIP)
  
  # Check for carrying and match digits accordingly
  (?(?=(?!
    \S*\ [-+]?(?:0\B)*+\3\83\.?
    (?<e>
     # Match pairs of digits that total 9
     (?:
      (?=\d\.?\d(?<f>\S*?\ [-+]?(?:0\B)*\5\85?+)((\g{-2}?+\.?)\d))
      (?=\d(\S*?\g{-4}\g{-2}))
      (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
      \d\.?
      (?=\S*?(\g{-5}\g{-4}))
     )*+
     # Match a pair of digits that totals > 9
     (?=\d(\g{-2}\.?|(?&f)\.?))
     (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]|4\g{-1}[6-9]|6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
    ){0}
    
    # Backreferences were erroneously being set that carried over to subsequent iterations,
    # necessitating this preposterous subroutine call 
    (?&e)
   ))
   
   # No carrying, match pairs of digits and their sums  
   (?=\d(\S*\ [-+]?(?:0\B)*\3\84)\d(\S*\ [-+]?(?:0\B)*\5\86))
   (?=
     (?:2\g{-2}1|3\g{-2}2|4\g{-2}3|5\g{-2}4|6\g{-2}5|7\g{-2}6|8\g{-2}7|9\g{-2}8|0\g{-2}9)\g{-1}1
    |(?:3\g{-2}1|4\g{-2}2|5\g{-2}3|6\g{-2}4|7\g{-2}5|8\g{-2}6|9\g{-2}7|0\g{-2}8|1\g{-2}9)\g{-1}2
    |(?:4\g{-2}1|5\g{-2}2|6\g{-2}3|7\g{-2}4|8\g{-2}5|9\g{-2}6|0\g{-2}7|1\g{-2}8|2\g{-2}9)\g{-1}3
    |(?:5\g{-2}1|6\g{-2}2|7\g{-2}3|8\g{-2}4|9\g{-2}5|0\g{-2}6|1\g{-2}7|2\g{-2}8|3\g{-2}9)\g{-1}4
    |(?:6\g{-2}1|7\g{-2}2|8\g{-2}3|9\g{-2}4|0\g{-2}5|1\g{-2}6|2\g{-2}7|3\g{-2}8|4\g{-2}9)\g{-1}5
    |(?:7\g{-2}1|8\g{-2}2|9\g{-2}3|0\g{-2}4|1\g{-2}5|2\g{-2}6|3\g{-2}7|4\g{-2}8|5\g{-2}9)\g{-1}6
    |(?:8\g{-2}1|9\g{-2}2|0\g{-2}3|1\g{-2}4|2\g{-2}5|3\g{-2}6|4\g{-2}7|5\g{-2}8|6\g{-2}9)\g{-1}7
    |(?:9\g{-2}1|0\g{-2}2|1\g{-2}3|2\g{-2}4|3\g{-2}5|4\g{-2}6|5\g{-2}7|6\g{-2}8|7\g{-2}9)\g{-1}8
    |(?:0\g{-2}1|1\g{-2}2|2\g{-2}3|3\g{-2}4|4\g{-2}5|5\g{-2}6|6\g{-2}7|7\g{-2}8|8\g{-2}9)\g{-1}9
    |(\d)\g{-3}0\g{-2}\g{-1}
    |(\d)\g{-4}\g{-1}\g{-3}0
   )
  |
   # Carrying, match pairs of digits and their sums + 1
   (?=\d((?-5))\d((?-5)))
   (?= 
     (?:2\g{-2}0|3\g{-2}1|4\g{-2}2|5\g{-2}3|6\g{-2}4|7\g{-2}5|8\g{-2}6|9\g{-2}7|0\g{-2}8|1\g{-2}9)\g{-1}1
    |(?:3\g{-2}0|4\g{-2}1|5\g{-2}2|6\g{-2}3|7\g{-2}4|8\g{-2}5|9\g{-2}6|0\g{-2}7|1\g{-2}8|2\g{-2}9)\g{-1}2
    |(?:4\g{-2}0|5\g{-2}1|6\g{-2}2|7\g{-2}3|8\g{-2}4|9\g{-2}5|0\g{-2}6|1\g{-2}7|2\g{-2}8|3\g{-2}9)\g{-1}3
    |(?:5\g{-2}0|6\g{-2}1|7\g{-2}2|8\g{-2}3|9\g{-2}4|0\g{-2}5|1\g{-2}6|2\g{-2}7|3\g{-2}8|4\g{-2}9)\g{-1}4
    |(?:6\g{-2}0|7\g{-2}1|8\g{-2}2|9\g{-2}3|0\g{-2}4|1\g{-2}5|2\g{-2}6|3\g{-2}7|4\g{-2}8|5\g{-2}9)\g{-1}5
    |(?:7\g{-2}0|8\g{-2}1|9\g{-2}2|0\g{-2}3|1\g{-2}4|2\g{-2}5|3\g{-2}6|4\g{-2}7|5\g{-2}8|6\g{-2}9)\g{-1}6 
    |(?:8\g{-2}0|9\g{-2}1|0\g{-2}2|1\g{-2}3|2\g{-2}4|3\g{-2}5|4\g{-2}6|5\g{-2}7|6\g{-2}8|7\g{-2}9)\g{-1}7
    |(?:9\g{-2}0|0\g{-2}1|1\g{-2}2|2\g{-2}3|3\g{-2}4|4\g{-2}5|5\g{-2}6|6\g{-2}7|7\g{-2}8|8\g{-2}9)\g{-1}8
    |(?:0\g{-2}0|1\g{-2}1|2\g{-2}2|3\g{-2}3|4\g{-2}4|5\g{-2}5|6\g{-2}6|7\g{-2}7|8\g{-2}8|9\g{-2}9)\g{-1}9
    |(?:1\g{-2}0|2\g{-2}1|3\g{-2}2|4\g{-2}3|5\g{-2}4|6\g{-2}5|7\g{-2}6|8\g{-2}7|9\g{-2}8|0\g{-2}9)\g{-1}0
   )
  )
  \d
 )++

 # Paths to success
 (?:
  # A, B, and C contain no more significant digits
  (?:\.0+)?0*\ [-+]?0*\3\83(?:\.0+)?0*\ [-+]?0*\5\85(?:\.0+)?0*$
 |
  # C contains no more significant digits, match A's excess to B's
  (\.?\d*?)0*\ [-+]?0*\3\83\g{-1}(?:\.0+)?0*\ [-+]?0*\5\85$
 |
  # B contains no more significant digits, match A's excess to C's
  (\.?\d*?)0*\ [-+]?0*\3\83\ [-+]?0*\5\85\g{-1}(?:\.0+)?0*$
 |
  # A contains no more significant digits but B and C both do
  # Match pairs of digits that total 9 until one that totals 10 is found
  \.?0*\ [-+]?(?:0\B)*+\3\83\.?
  (?:
   (?=\d\d((?&f))((\g{-2}?+\.?)\d))
   (?=\d(\S*?\g{-4}\g{-2}))
   (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5|5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
   \d
   (?=\S*?(\g{-5}(\g{-5})))
  )*+
  # Match a pair that totals 10 then validate
  (?=\d(\g{-3}|(?&f)\.?))
  (?=1\g{-1}9|2\g{-1}8|3\g{-1}7|4\g{-1}6|5\g{-1}5|6\g{-1}4|7\g{-1}3|8\g{-1}2|9\g{-1}1)
  \d0*\ [-+]?0*\5\85\g{-2}?+\.?\d0*$
 ) 
)

Demo on regex101 (comments removed)

This was a fantastic personal challenge that really pushed me to consider how every conceivable arrangement of digits would be matched by each piece of the regex. When, say, test case #2,490,511 erroneously failed to match and the only hint given consists solely of the three numbers that caused the failure, it’s a little tricky to know where to begin debugging. This happened over and over, prompting revision after revision, until - finally - it stopped. When I could safely conclude that the regex was fully functional, that was a very satisfying “Eureka” moment indeed.

The general method is similar to the simple, positive-integer regex that I commented exhaustively last time, and so I didn't feel the need to add as many comments this time. The individual algorithms are almost the same, but the general method has of course been expanded:



Step 1: Examine the signs of A, B, and C to determine the order of matching. There are eight possible combinations of signs: two complementary arrangements for when each of A, B, and C appear as the highest value (ie. the sum of the addition), and two invalid arrangements (“-A –B C” and “A B –C”).

In the above example, an arrangement of “A –B C” calls for verifying that B + C = A, ie. 12345.67 = 656.4321 + 11689.2379.

Step 2: Compare the excess part of the total with the excess part of the longer of the two operands, similar to last time. They will either be equal or differ by exactly 1. Note that this comparison cannot spill over into the decimal portion.

In our example, we find the excess part of A, “12”, should be compared to the excess part of C, “11”. After pairing ‘3’ and ‘6’, which total 9, we find ‘4’ and ‘8’, which yield a sum greater than 9. And so we can rightly match “12” against “11” since they differ by 1.

Step 3: Compare the equal-length portions of A, B, and C. Like before, this is done by matching pairs of digits in the two operands to their sums in the total. However, unlike before, this matching may continue into the decimal portions of the 3 numbers.

In the example, ‘3’ matches with ‘6’ and ‘6’ because of carrying (6+6=12, then + 1 = 3), and then ‘4’ matches with ‘5’ and ‘8’, and so on until A runs out of digits.

Step 4: At this stage, if the addition is correct then the only digits that could possibly remain are decimals. In fact, these residual decimal digits would exist in exactly two of A, B, or C, and there would be the same number of digits left in both. If one of these two is the total, then our final check is basic: make sure the two residual parts are equivalent (for example, 1.2 + 1.2345 = 2.4345). Otherwise, we need to match pairs of digits that total 9 until the last pair, which should total 10.

So in our example, we are left with “21” in B and “79” in C. ‘2’ and ‘7’ total 9, and then ‘9’ and ‘1’ give us 10 as needed.


Thursday, 6 September 2018

How to Match "A B C" where A+B=C: The Beast Tamed

A regex I submitted to Reddit recently climbed to the top of /r/programming and made quite a few heads explode in the process. As delightful as this was, I couldn't help but feel a little guilty for subjecting tens of thousands of people to this disgraceful pile of electronic fecal matter. Absolutely zero effort was put into making it something that even remotely resembled a useful, constructive demonstration. Instead, I lured you all into my own private lemon party and left you shocked, bewildered, and fearing for your lives. And for this I apologize.

At the risk of taking myself too seriously, I went ahead and added comments to the regex as well as tidied it up, re-wrote a couple of parts for greater clarity, and corrected a few glaring oversights. Some of you have levels of curiosity that far outweigh your better judgement, and so I invite you to read on.

First, it may be worth outlining the general method of handling addition when you're restricted to matching from left to right:

  1. First, compare the excess part of A or B with the excess part of C. They will either be equal, or C's will be greater by 1.
  2. Next, iterate through digits in A and match corresponding digits in B with their sums in C. Again, there may be differences of 1 depending upon the rest of the digits in A and B.
These potential differences of 1 ("carrying") are determined by moving through pairs of digits that sum to 9 until a pair is found whose sum exceeds 9.

# I wrapped the entire expression in (?!(?! )) just to do away with all
# captured substrings and cleanly match a verified line.
(?!(?!

^0*+

# Here we essentially right-align A, B, and C, ignoring leading zeros, 
# and populate backreferences with components that will be useful later.
# 
#   \1    \2             \3    \4               \5    \6
(?=(\d*?)((?:(?=\d+\ 0*+(\d*?)(\d(?(4)\4))\ 0*+(\d*?)(\d(?(6)\6))$)\d)++)\ )
#
# Taking "12345 678 13023" as an example:
#
# \1 = "12", ie. the extra digits in A if A is longer than B. Empty otherwise.
# \2 = "345", ie. the rest of the digits in A that match up with those in B and C.
# \3 = "", ie. the extra digits in B if B is longer than A. Empty otherwise.
# \4 = "678", ie. the rest of the digits in B that match up with those in A.
# \5 = "13", ie. the extra digits in C that match up with the longer of A and B.
# \6 = "023", ie. the rest of the digits in C.

# This next part checks the extra digit portions to make sure everything is in order.
#
# There are two main paths to take:
# Easy:  Adding \2 to \4 results in no "carrying"; the length stays the same.
#        \5 should then exactly equal either \1 or \3, whichever was longer.
#        An example of this is when matching "5123 456 5579", since 123+456=579.
#   Then \5 = \1 = "5".
# OR
# Hard:  Adding \2 to \4 results in "carrying"; the length increases by 1. In this case,
#        \5 should equal 1 more than either \1 or \3 (which is non-empty).
#        This is the case we need to handle for our example of "12345 678 13023".
#        Here, \5 = "13" and \1 = "12", and so we need to verify \5 = \1 + 1. 

(?=(?(?!
 
 # First thing to check is whether \2 + \4 results in carrying.
 # To do this, we must inspect \2 and \4 from the left and match
 # optional pairs of digits that sum to 9 until we find a pair that
 # sum to > 9. 
 #
 # In our example, "345" and "678", we find that '3' and '6' sum to 9,
 # then '4' and '7' sum to > 9. Therefore we have carrying.
 
  # Consume the extra digits in A; they're not important here.
  \1
 
  # Move through all pairs of digits that sum to 9.
  (?: 
   
   # Collect the next digit of interest in B.
   (?=\d+\ 0*+\3((\g{-2}?+)\d))
   
   # This lookahead is used to set up a backreference that goes from one digit
   # of interest to the next, in the interests of simplifying the long check ahead.
   (?=\d(\d*\ 0*+\3\g{-2}))
   
   # Now to use that backreference to match pairs of digits that sum to 9.   
   (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5
      |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
   
   # Consume a digit so we can move forward.
   \d
  )*+
   
  # Now that we've gone through all pairs that sum to 9, let's try to match one
  # that sums to > 9.
  
  # First set up our backreference of convenience.
  (?=\d(\d*\ 0*+\3\g{-3}?+)) 
 
  # Now find a pair that sums to > 9. 
  (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]
     |4\g{-1}[6-9]|6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
 )
 
 # The above was a negative lookahead, so if it matched successfully then there is no
 # carrying and it's smooth sailing ahead.
 
 # Since either \1 or \3 (or both) is empty, the concatenation of the two will produce
 # what we need to match at the front of C. Then, \6 is the rest of C. 
 (?=\d+\ \d+\ 0*+\1\3\6$)
  
 |
 
 # Carrying. This is where it gets complicated.

 # First let's move forward to the extra digits of interest.
 # ".*+" matches up to the end of the line with no backtracking. The only way
 # \3 can be found at that position is if \3 = "".
 # So if the negative lookahead succeeds, \3 isn't empty and B contains the
 # extra digits of interest, so we consume A and a space in that case. 
 (?(?!.*+\3)\d+\ )
 
 # More declarations for convenience.
 #       \11    \12
 (?=\d*?(\2|\4)(\ .*?0*+)\d+$)
 # \11 = the rest of the digits in A or B, \2 or \4, depending on where we're at.
 #  This anchor is important so we know where to stop matching the extra digits.
 # \12 = The part between the end of A/B and the beginning of C.
 
 # Another decision tree. Are the extra digits of interest composed solely of '9's,
 # such as in the example "999123 878 1000001"?
 # If so, the strategy is somewhat simplified.
 # This also handles zero '9's, when A and B are of equal length. 
 (?(?=9*\11\ )
 
  # If the extra digits of interest are composed solely of '9's, all we need
  # to do is pair '9's in A/B with '0's in C, and match a '1' at the start of C.

  # So, start pairing '9's and '0's.
  (?:9(?=\d*?\12[1](\g{-1}?+0)))*? 
  
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )
  
  # Now verify C actually starts with a '1', then match the '0's we've collected,
  # and also make sure all that follows is \6 (the rest of C).
  \11\12[1]\g{-1}?+\6$
  
  |
  
  # Now the trickier path. We need to add 1 to extra digits in A/B and match it to C.
  # Because we know these extra digits are not composed solely of '9's, we know the
  # extra digits in C will be the same length.
  #
  # How do you check if a number is 1 more than another given they're equal length?
  # First, iterate through the digits and match pairs of equivalent digits.
  # When you reach a position where they differ, it must be the case that C's
  # digit is 1 greater than A/B's. After this point, you need to pair '9's in A/B
  # with '0's in C until you exhaust the extra digits of interest. 
  #
  # To see why this last part is necessary, consider the example "4129990 10 4130000".
  # When we compare "41299" to "41300", we first match '4' in A to '4' in C, then '1'
  # in A to '1' in C. Then we find the point where the next digit in C is 1 greater 
  # than the next one in A, and pair '2' in A with '3' in C. There can only be exactly
  # one such point like this. Afterwards, the only thing that could possibly follow is
  # a series of '9's in A and '0's in C until we exhaust the extra digits of interest. 
  
  # The first part, consume all equivalent digits.
  (?:(\d)(?=\d*\12(\g{-1}?+\g{-2})))*+
  
  # Now we prepare for the next check by setting up a backreference that contains
  # everything in between the two digits of interest, for simplicity.
  (?=\d(\d*\12\g{-2}?+))

  # Match pairs that differ by 1 in favour of C.
  (?=0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)\d
  
  # Now consume any and all additional '9's, pairing them with '0's in C.
  (?:9(?=\d*\12\g{-2}?+\d(\g{-1}?+0)))*?
  
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )

  # Now verify C by checking it contains all extra digits shared with A/B, followed by
  # the lone digit that was found to be 1 greater than the corresponding one in A/B,
  # then any '0's that followed, and finally \6, the rest of its digits.
  \11\12\g{-3}?+\d\g{-1}?+\6$
  
 )
))

# At this point, we've managed to successfully verify the extra digits in A, B, and C.
# We have examined the "12" and "13" in our example of "12345 678 13023" and found them 
# to be sound. We would have rejected examples such as "11 1 22" and "92 8 110" by now,
# since their extra digits don't compute.
#
# The rest of the logic examines the equi-length portions of A and B (saved as \2 and \4
# respectively). This is actually simpler since we don't have to fuss around with things
# being different lengths; we took care of all that earlier.
#
# At this point, we can simply match pairs of digits in A and B to their sum in C.
# There is, however, still some considerations to be made as to carrying. We're iterating
# through digits from left to right, after all, and the sum of every pair of digits we
# sum in A and B may be found in C as either A+B(mod 10) or 1+A+B(mod 10) depending on
# whether carrying occurs to the right.

# Consume any extra digits in A; they're no longer important.
\1

# Iterate through A, B, and C one digit at a time, from the left.
(?:
 
 # Here we set up backreferences to useful portions of the subject, ignoring any
 # leading '0's, and also ignoring those extra digits from before, \3 and \5.
 #
 #   18           19  20                   21  22
 (?=(\d)\d*\ 0*+\3((\g{-2}?+)\d)\d*\ 0*+\5((\g{-2}?+)\d)) 
 #
 # These values update as we iterate through A, but on the first run,
 # using our example of "12345 678 13023":
 #
 # \18 = "3", ie. the next digit to inspect in A.
 # \19 = "6", ie. what we've examined in B including the next digit.
 # \20 = "", ie. what we've examined in B excluding the next digit.
 # \21 = "0", ie. what we've examined in C including the next digit.
 # \22 = "", ie. what we've examined in C excluding the next digit.
 
 # Like before, we must proceed in one of two directions based on whether or not we
 # encounter carrying. 
 #
 # Similar to the first part, in order to determine this, we need to look at the parts
 # of A and B that follow our current digits of interest. And, as before, we sift through
 # any pairs of digits that total 9 until we find a pair whose sum is > 9.  
 (?(?!
   # Consume the current digit of interest in A.
   \18
   
   # Then start matching pairs of digits in A and B whose sum is 9.
   (?:
    
    # Use nested references to remember how far into B we are.
    (?=\d+\ 0*+\3\19((\g{-2}?+)\d))
    
    # Set up a backreference for our simple pair matching.
    (?=\d(\d*\ 0*+\3\19\g{-2}))
    
    # Match pairs of digits that sum to 9.
    (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5
       |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
    
    # Consume a digit to move forward.
    \d
   )*+
  
   # All that's left is to check if the next pair of digits in A and B has
   # a sum exceeding 9. Set up our convenient back reference and check.
   (?=\d(\d*\ 0*+\3\19\g{-3}?+)) 
  
   # Now test for a combination of digits whose sum is > 9.
   (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]|4\g{-1}[6-9]
      |6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
  )
  
 # The above negative lookahead succeeded, so fortunately we don't have to contend
 # with carrying in the first branch. We need to match pairs of digits in A and B
 # and their sum in C. I don't think there's a more clever way to do this in PCRE
 # than tabulating all the combinations.
  
  # First set up convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))
 
  # Now the ugly part.  
  (?=
   1\g{-2}(?:1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
  |2\g{-2}(?:1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
  |3\g{-2}(?:1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
  |4\g{-2}(?:1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
  |5\g{-2}(?:1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
  |6\g{-2}(?:1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
  |7\g{-2}(?:1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6)
  |8\g{-2}(?:1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
  |9\g{-2}(?:1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8)
  |0\g{-2}(\d)\g{-2}\g{-1} # At least we can handle zeros
  |(\d)\g{-4}0\g{-3}\g{-1} # with a bit of intelligence.
  )
  
 |
  
 # And in the else branch, we have to deal with carrying. So we'll match pairs of
 # digits like up there, but this time we'll match pairs of digits in A and B with
 # their sum + 1 (mod 10) in C.

  # Convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))  
  
  # Almost done, let's just get through this last bit of ugliness.
  (?= 
  1\g{-2}(?:0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
 |2\g{-2}(?:0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
 |3\g{-2}(?:0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
 |4\g{-2}(?:0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
 |5\g{-2}(?:0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
 |6\g{-2}(?:0\g{-1}7|1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6) 
 |7\g{-2}(?:0\g{-1}8|1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
 |8\g{-2}(?:0\g{-1}9|1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8) 
 |9\g{-2}(?:0\g{-1}0|1\g{-1}1|2\g{-1}2|3\g{-1}3|4\g{-1}4|5\g{-1}5|6\g{-1}6|7\g{-1}7|8\g{-1}8|9\g{-1}9) 
 |0\g{-2}(?:0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
  ) 
 )
 
 # Whew. It's over. Consume a digit so we can move forward and restart the fun.
 \d

)+

# At the end of all this, if we can match a space then we have succeeded in matching all
# of A, and hence all of B and C. 
\s

|

# I tripped up on these edge cases involving zeros the first time I made this.
^0+\ 0*(\d+)\ 0*\g{-1}$

|

# I'm not going to make that mistake again!
^0*(\d+)\ 0+\ 0*\g{-1}$

)).+

Demo on regex101 (comments removed)

See, it's not so scary after all!

In case anyone is still wondering in earnest why I did this, the simple answer is: it was fun! Once I saw this challenge on StackOverflow's codegolf section and realized it may be possible, I just couldn't stop thinking about it until I had a working solution.

If anyone out there shares my zany passion for things of this nature, I invite you to subscribe to this blog and/or follow me on Twitter. I do have plenty more madness lined up to share with the world.

Also, please know that I do actually spend time creating and helping others create regular expressions that are useful and serve a practical purpose. You are welcome to fire up your favourite IRC client and pop by Freenode's #regex if you ever need any advice or just want to shoot the shit. We've got a great team there who are always happy to help you conquer your regex woes.

Thanks for reading.

Tuesday, 7 November 2017

Match Nested Brackets with Regex: A new approach

My first blog post was a bit of a snoozefest, so I feel I ought to make this one a little shorter and more to the point. I'm going to show you how to do something with regular expressions that's long been thought impossible. Get excited.

The problem


You want to match a full outer group of arbitrarily nested parentheses with regex but you're using a flavour such as Java's java.util.regex that supports neither recursion nor balancing groups.

The solution


(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Proof: Java Regex or PCRE on regex101 (look at the full matches on the right)

Et voila; there you go. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.

No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.

That's great and all, but I want to match inner groups too!


OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:

(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$))) 

Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.

So... how the hell does this actually work?


I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:

Component
Description




(?=\()
Make sure '(' follows before doing any hard work.

(?:
Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('
(?=
This lookahead deals with finding the next '('.
.*?\((?!.*?\1)
Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So "(?!.*?\1)" ensures we don't match the same '(' again.
(.*\)(?!.*\2).*)
Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid used to overcome this bug.
)

Handle ')'
(?=
This lookahead deals with finding the next ')'.
.*?\)(?!.*?\2)
Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)
Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so this simple expression is sufficient.
)


.
Consume a single character so that the group can continue matching. It is safe to consume a character here because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.

)+?
Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation
.*?(?=\1)
Match up to and including the last '(' found.
[^(]*(?=\2$)
Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).


Conclusion


So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regex features - no recursion or balancing groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.

If you share my excitement for things of this nature then I encourage you to follow this blog, as I have a few more pearls of regex wisdom to offer in good time. Also in the cards is a regex quiz adventure that will test your skills in a variety of interesting and challenging ways. So please do stay tuned.

I'd like to thank Me-me on Freenode for inspiring this discovery with a clever attempt at one of my #regex challenges. Thank you for being man enough to attempt them!

Bonus: Optimization!


Just for fun, I managed to improve the performance of this technique from "disgustingly, horrendously awful" to just "awful". Using the example in the proofs above:

Original (16,257 steps):

(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)

Optimized (4,205 steps):

(?=\()(?:(?=(?(1).*?(?=\1)).*?\((.*))(?=(?(2).*?(?=\2)).*?\)(.*)).)+?(?>.*?(?=\1))[^(]*?(?=\2$)

And just so you can get an idea of how poor this method truly is, here's how the conventional recursive method measures up:

Recursive (445 steps):

\((?:[^()]+|(?R))*+\)

A whole order of magnitude more efficient.