Dr. Regex: How to Match "A B C" where A+B=C: The Beast Tamed

A regex I submitted to Reddit recently climbed to the top of /r/programming and made quite a few heads explode in the process. As delightful as this was, I couldn't help but feel a little guilty for subjecting tens of thousands of people to this disgraceful pile of electronic fecal matter. Absolutely zero effort was put into making it something that even remotely resembled a useful, constructive demonstration. Instead, I lured you all into my own private lemon party and left you shocked, bewildered, and fearing for your lives. And for this I apologize.

At the risk of taking myself too seriously, I went ahead and added comments to the regex as well as tidied it up, re-wrote a couple of parts for greater clarity, and corrected a few glaring oversights. Some of you have levels of curiosity that far outweigh your better judgement, and so I invite you to read on.

First, it may be worth outlining the general method of handling addition when you're restricted to matching from left to right:

First, compare the excess part of A or B with the excess part of C. They will either be equal, or C's will be greater by 1.
Next, iterate through digits in A and match corresponding digits in B with their sums in C. Again, there may be differences of 1 depending upon the rest of the digits in A and B.

These potential differences of 1 ("carrying") are determined by moving through pairs of digits that sum to 9 until a pair is found whose sum exceeds 9.

# I wrapped the entire expression in (?!(?! )) just to do away with all
# captured substrings and cleanly match a verified line.
(?!(?!

^0*+

# Here we essentially right-align A, B, and C, ignoring leading zeros, 
# and populate backreferences with components that will be useful later.
# 
#   \1    \2             \3    \4               \5    \6
(?=(\d*?)((?:(?=\d+\ 0*+(\d*?)(\d(?(4)\4))\ 0*+(\d*?)(\d(?(6)\6))$)\d)++)\ )
#
# Taking "12345 678 13023" as an example:
#
# \1 = "12", ie. the extra digits in A if A is longer than B. Empty otherwise.
# \2 = "345", ie. the rest of the digits in A that match up with those in B and C.
# \3 = "", ie. the extra digits in B if B is longer than A. Empty otherwise.
# \4 = "678", ie. the rest of the digits in B that match up with those in A.
# \5 = "13", ie. the extra digits in C that match up with the longer of A and B.
# \6 = "023", ie. the rest of the digits in C.

# This next part checks the extra digit portions to make sure everything is in order.
#
# There are two main paths to take:
# Easy:  Adding \2 to \4 results in no "carrying"; the length stays the same.
#        \5 should then exactly equal either \1 or \3, whichever was longer.
#        An example of this is when matching "5123 456 5579", since 123+456=579.
#   Then \5 = \1 = "5".
# OR
# Hard:  Adding \2 to \4 results in "carrying"; the length increases by 1. In this case,
#        \5 should equal 1 more than either \1 or \3 (which is non-empty).
#        This is the case we need to handle for our example of "12345 678 13023".
#        Here, \5 = "13" and \1 = "12", and so we need to verify \5 = \1 + 1. 

(?=(?(?!
 
 # First thing to check is whether \2 + \4 results in carrying.
 # To do this, we must inspect \2 and \4 from the left and match
 # optional pairs of digits that sum to 9 until we find a pair that
 # sum to > 9. 
 #
 # In our example, "345" and "678", we find that '3' and '6' sum to 9,
 # then '4' and '7' sum to > 9. Therefore we have carrying.
 
  # Consume the extra digits in A; they're not important here.
  \1
 
  # Move through all pairs of digits that sum to 9.
  (?: 
   
   # Collect the next digit of interest in B.
   (?=\d+\ 0*+\3((\g{-2}?+)\d))
   
   # This lookahead is used to set up a backreference that goes from one digit
   # of interest to the next, in the interests of simplifying the long check ahead.
   (?=\d(\d*\ 0*+\3\g{-2}))
   
   # Now to use that backreference to match pairs of digits that sum to 9.   
   (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5
      |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
   
   # Consume a digit so we can move forward.
   \d
  )*+
   
  # Now that we've gone through all pairs that sum to 9, let's try to match one
  # that sums to > 9.
  
  # First set up our backreference of convenience.
  (?=\d(\d*\ 0*+\3\g{-3}?+)) 
 
  # Now find a pair that sums to > 9. 
  (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]
     |4\g{-1}[6-9]|6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
 )
 
 # The above was a negative lookahead, so if it matched successfully then there is no
 # carrying and it's smooth sailing ahead.
 
 # Since either \1 or \3 (or both) is empty, the concatenation of the two will produce
 # what we need to match at the front of C. Then, \6 is the rest of C. 
 (?=\d+\ \d+\ 0*+\1\3\6$)
  
 |
 
 # Carrying. This is where it gets complicated.

 # First let's move forward to the extra digits of interest.
 # ".*+" matches up to the end of the line with no backtracking. The only way
 # \3 can be found at that position is if \3 = "".
 # So if the negative lookahead succeeds, \3 isn't empty and B contains the
 # extra digits of interest, so we consume A and a space in that case. 
 (?(?!.*+\3)\d+\ )
 
 # More declarations for convenience.
 #       \11    \12
 (?=\d*?(\2|\4)(\ .*?0*+)\d+$)
 # \11 = the rest of the digits in A or B, \2 or \4, depending on where we're at.
 #  This anchor is important so we know where to stop matching the extra digits.
 # \12 = The part between the end of A/B and the beginning of C.
 
 # Another decision tree. Are the extra digits of interest composed solely of '9's,
 # such as in the example "999123 878 1000001"?
 # If so, the strategy is somewhat simplified.
 # This also handles zero '9's, when A and B are of equal length. 
 (?(?=9*\11\ )
 
  # If the extra digits of interest are composed solely of '9's, all we need
  # to do is pair '9's in A/B with '0's in C, and match a '1' at the start of C.

  # So, start pairing '9's and '0's.
  (?:9(?=\d*?\12[1](\g{-1}?+0)))*? 
  
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )
  
  # Now verify C actually starts with a '1', then match the '0's we've collected,
  # and also make sure all that follows is \6 (the rest of C).
  \11\12[1]\g{-1}?+\6$
  
  |
  
  # Now the trickier path. We need to add 1 to extra digits in A/B and match it to C.
  # Because we know these extra digits are not composed solely of '9's, we know the
  # extra digits in C will be the same length.
  #
  # How do you check if a number is 1 more than another given they're equal length?
  # First, iterate through the digits and match pairs of equivalent digits.
  # When you reach a position where they differ, it must be the case that C's
  # digit is 1 greater than A/B's. After this point, you need to pair '9's in A/B
  # with '0's in C until you exhaust the extra digits of interest. 
  #
  # To see why this last part is necessary, consider the example "4129990 10 4130000".
  # When we compare "41299" to "41300", we first match '4' in A to '4' in C, then '1'
  # in A to '1' in C. Then we find the point where the next digit in C is 1 greater 
  # than the next one in A, and pair '2' in A with '3' in C. There can only be exactly
  # one such point like this. Afterwards, the only thing that could possibly follow is
  # a series of '9's in A and '0's in C until we exhaust the extra digits of interest. 
  
  # The first part, consume all equivalent digits.
  (?:(\d)(?=\d*\12(\g{-1}?+\g{-2})))*+
  
  # Now we prepare for the next check by setting up a backreference that contains
  # everything in between the two digits of interest, for simplicity.
  (?=\d(\d*\12\g{-2}?+))

  # Match pairs that differ by 1 in favour of C.
  (?=0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9)\d
  
  # Now consume any and all additional '9's, pairing them with '0's in C.
  (?:9(?=\d*\12\g{-2}?+\d(\g{-1}?+0)))*?
  
  # Stop when we exhaust the extra digits of interest.
  (?=\11\ )

  # Now verify C by checking it contains all extra digits shared with A/B, followed by
  # the lone digit that was found to be 1 greater than the corresponding one in A/B,
  # then any '0's that followed, and finally \6, the rest of its digits.
  \11\12\g{-3}?+\d\g{-1}?+\6$
  
 )
))

# At this point, we've managed to successfully verify the extra digits in A, B, and C.
# We have examined the "12" and "13" in our example of "12345 678 13023" and found them 
# to be sound. We would have rejected examples such as "11 1 22" and "92 8 110" by now,
# since their extra digits don't compute.
#
# The rest of the logic examines the equi-length portions of A and B (saved as \2 and \4
# respectively). This is actually simpler since we don't have to fuss around with things
# being different lengths; we took care of all that earlier.
#
# At this point, we can simply match pairs of digits in A and B to their sum in C.
# There is, however, still some considerations to be made as to carrying. We're iterating
# through digits from left to right, after all, and the sum of every pair of digits we
# sum in A and B may be found in C as either A+B(mod 10) or 1+A+B(mod 10) depending on
# whether carrying occurs to the right.

# Consume any extra digits in A; they're no longer important.
\1

# Iterate through A, B, and C one digit at a time, from the left.
(?:
 
 # Here we set up backreferences to useful portions of the subject, ignoring any
 # leading '0's, and also ignoring those extra digits from before, \3 and \5.
 #
 #   18           19  20                   21  22
 (?=(\d)\d*\ 0*+\3((\g{-2}?+)\d)\d*\ 0*+\5((\g{-2}?+)\d)) 
 #
 # These values update as we iterate through A, but on the first run,
 # using our example of "12345 678 13023":
 #
 # \18 = "3", ie. the next digit to inspect in A.
 # \19 = "6", ie. what we've examined in B including the next digit.
 # \20 = "", ie. what we've examined in B excluding the next digit.
 # \21 = "0", ie. what we've examined in C including the next digit.
 # \22 = "", ie. what we've examined in C excluding the next digit.
 
 # Like before, we must proceed in one of two directions based on whether or not we
 # encounter carrying. 
 #
 # Similar to the first part, in order to determine this, we need to look at the parts
 # of A and B that follow our current digits of interest. And, as before, we sift through
 # any pairs of digits that total 9 until we find a pair whose sum is > 9.  
 (?(?!
   # Consume the current digit of interest in A.
   \18
   
   # Then start matching pairs of digits in A and B whose sum is 9.
   (?:
    
    # Use nested references to remember how far into B we are.
    (?=\d+\ 0*+\3\19((\g{-2}?+)\d))
    
    # Set up a backreference for our simple pair matching.
    (?=\d(\d*\ 0*+\3\19\g{-2}))
    
    # Match pairs of digits that sum to 9.
    (?=0\g{-1}9|1\g{-1}8|2\g{-1}7|3\g{-1}6|4\g{-1}5
       |5\g{-1}4|6\g{-1}3|7\g{-1}2|8\g{-1}1|9\g{-1}0)
    
    # Consume a digit to move forward.
    \d
   )*+
  
   # All that's left is to check if the next pair of digits in A and B has
   # a sum exceeding 9. Set up our convenient back reference and check.
   (?=\d(\d*\ 0*+\3\19\g{-3}?+)) 
  
   # Now test for a combination of digits whose sum is > 9.
   (?=[5-9]\g{-1}[5-9]|1\g{-1}9|2\g{-1}[89]|3\g{-1}[7-9]|4\g{-1}[6-9]
      |6\g{-1}4|7\g{-1}[34]|8\g{-1}[2-4]|9\g{-1}[1-4])
  )
  
 # The above negative lookahead succeeded, so fortunately we don't have to contend
 # with carrying in the first branch. We need to match pairs of digits in A and B
 # and their sum in C. I don't think there's a more clever way to do this in PCRE
 # than tabulating all the combinations.
  
  # First set up convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))
 
  # Now the ugly part.  
  (?=
   1\g{-2}(?:1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
  |2\g{-2}(?:1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
  |3\g{-2}(?:1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
  |4\g{-2}(?:1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
  |5\g{-2}(?:1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
  |6\g{-2}(?:1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
  |7\g{-2}(?:1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6)
  |8\g{-2}(?:1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
  |9\g{-2}(?:1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8)
  |0\g{-2}(\d)\g{-2}\g{-1} # At least we can handle zeros
  |(\d)\g{-4}0\g{-3}\g{-1} # with a bit of intelligence.
  )
  
 |
  
 # And in the else branch, we have to deal with carrying. So we'll match pairs of
 # digits like up there, but this time we'll match pairs of digits in A and B with
 # their sum + 1 (mod 10) in C.

  # Convenient backreferences.
  (?=\d(\d*\ 0*+\3\20)\d(\d*\ 0*+\5\22))  
  
  # Almost done, let's just get through this last bit of ugliness.
  (?= 
  1\g{-2}(?:0\g{-1}2|1\g{-1}3|2\g{-1}4|3\g{-1}5|4\g{-1}6|5\g{-1}7|6\g{-1}8|7\g{-1}9|8\g{-1}0|9\g{-1}1)
 |2\g{-2}(?:0\g{-1}3|1\g{-1}4|2\g{-1}5|3\g{-1}6|4\g{-1}7|5\g{-1}8|6\g{-1}9|7\g{-1}0|8\g{-1}1|9\g{-1}2)
 |3\g{-2}(?:0\g{-1}4|1\g{-1}5|2\g{-1}6|3\g{-1}7|4\g{-1}8|5\g{-1}9|6\g{-1}0|7\g{-1}1|8\g{-1}2|9\g{-1}3)
 |4\g{-2}(?:0\g{-1}5|1\g{-1}6|2\g{-1}7|3\g{-1}8|4\g{-1}9|5\g{-1}0|6\g{-1}1|7\g{-1}2|8\g{-1}3|9\g{-1}4)
 |5\g{-2}(?:0\g{-1}6|1\g{-1}7|2\g{-1}8|3\g{-1}9|4\g{-1}0|5\g{-1}1|6\g{-1}2|7\g{-1}3|8\g{-1}4|9\g{-1}5)
 |6\g{-2}(?:0\g{-1}7|1\g{-1}8|2\g{-1}9|3\g{-1}0|4\g{-1}1|5\g{-1}2|6\g{-1}3|7\g{-1}4|8\g{-1}5|9\g{-1}6) 
 |7\g{-2}(?:0\g{-1}8|1\g{-1}9|2\g{-1}0|3\g{-1}1|4\g{-1}2|5\g{-1}3|6\g{-1}4|7\g{-1}5|8\g{-1}6|9\g{-1}7)
 |8\g{-2}(?:0\g{-1}9|1\g{-1}0|2\g{-1}1|3\g{-1}2|4\g{-1}3|5\g{-1}4|6\g{-1}5|7\g{-1}6|8\g{-1}7|9\g{-1}8) 
 |9\g{-2}(?:0\g{-1}0|1\g{-1}1|2\g{-1}2|3\g{-1}3|4\g{-1}4|5\g{-1}5|6\g{-1}6|7\g{-1}7|8\g{-1}8|9\g{-1}9) 
 |0\g{-2}(?:0\g{-1}1|1\g{-1}2|2\g{-1}3|3\g{-1}4|4\g{-1}5|5\g{-1}6|6\g{-1}7|7\g{-1}8|8\g{-1}9|9\g{-1}0)
  ) 
 )
 
 # Whew. It's over. Consume a digit so we can move forward and restart the fun.
 \d

)+

# At the end of all this, if we can match a space then we have succeeded in matching all
# of A, and hence all of B and C. 
\s

|

# I tripped up on these edge cases involving zeros the first time I made this.
^0+\ 0*(\d+)\ 0*\g{-1}$

|

# I'm not going to make that mistake again!
^0*(\d+)\ 0+\ 0*\g{-1}$

)).+

Demo on regex101 (comments removed)

See, it's not so scary after all!

In case anyone is still wondering in earnest why I did this, the simple answer is: it was fun! Once I saw this challenge on StackOverflow's codegolf section and realized it may be possible, I just couldn't stop thinking about it until I had a working solution.

If anyone out there shares my zany passion for things of this nature, I invite you to subscribe to this blog and/or follow me on Twitter. I do have plenty more madness lined up to share with the world.

Also, please know that I do actually spend time creating and helping others create regular expressions that are useful and serve a practical purpose. You are welcome to fire up your favourite IRC client and pop by Freenode's #regex if you ever need any advice or just want to shoot the shit. We've got a great team there who are always happy to help you conquer your regex woes.

Thanks for reading.

Dr. Regex

Thursday, 6 September 2018

How to Match "A B C" where A+B=C: The Beast Tamed

3 comments: