Matching nested structures with Regexps in Ruby 1.9

NOTE: this was originally published on the older version of this blog, some content may be broken/outdated.

Following a discussion on hacker news I have found myself wondering about regular expressions in ruby 1.9.

In this major version ruby switched its regex engine to oniguruma (and, since a few days ago to a fork of it called Onigmo ).

This engine is widely more powerful than the original ruby one, supporting positive and negative lookbehind, named captures, better support for backreferences and most importantly for our topic, callable backreferences.

The latter means that, among other things, ruby has gained the ability to recognize arbitrarily nested structures.

Since I hadn’t played with this before, I spent some time to craft a Regexp that would match a nested tree like structure, and what better example is there than good ol’ lisp?

This is what I got

# this pattern has many issues, consider it an example
R = %r{
      (?<atom> [-\w]+){0}
      (?<quote> '\g<expr>){0}
      (?<expr>  
        \g<atom> | 
        \g<quote> | 
        \(  
          \g<expr>?
          (\s+  \g<expr>)* 
        \)
      ){0}
      ^\g<expr>$
    }imx

which has a rather nice feel to it, reading bottom up

a regex is an expression
an expression is
- an atom or
- a quotation or
- an expression (possibly empty) followed by other expressions
a quotation is an expression with a single quote in the front
an atom is a sequence of word characters or a dash

syntax wise, there are only a couple of things that are unusual: the “grammar rules” are defined like

 (?<foo> some pattern){0}

which is a clever trick (which I first saw in the oniguruma documentation): create a named capturing group, but say that it should match zero times. This way, the definition of the group stays around, but it is not required to actually match anywhere.

The other thing is how we refer to this rules:

\g<foo>

this simply means: apply the named pattern in this spot, via the awesome “callable backreferences” of oniguruma (also available in PCRE now). Notice that we also get forward declarations, which is rather cool.

Some quick tests show that this works (for an arguable value of “works”) in some simple cases:

t('()')
t('(ciao)')
t('(ciao miao)')
t('(ciao miao)')
t('(ciao (miao))')
t('(ciao (miao bau))')
t("(ciao '(miao bau))")

t('(sum-iter k 1)')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
SEXPR
t('ok')

t("nope '(miao bau))")
t('(nope')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1))
SEXPR

t(<<-SEXPR)
(define @  (lambda (k)
    (sum-iter k 0 1))
SEXPR

which correctly returns:

(): ok
(ciao): ok
(ciao miao): ok
(ciao miao): ok
(ciao (miao)): ok
(ciao (miao bau)): ok
(ciao '(miao bau)): ok
(sum-iter k 1): ok
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
: ok
ok: ok
nope '(miao bau)): no
(nope: no
(define sum1  (lambda (k)
    (sum-iter k 0 1))
: no
(define @  (lambda (k)
    (sum-iter k 0 1))
: no

So the pattern correctly matches recursive structures with balanced parenthesis.

Sadly, this approach appears to have one serious downside: it’s impossible to get at the matched elements.

From what I understand, when the regexp engine compiles the pattern, it reserves a certain number of slots for the capturing groups which is independent of the actual string being examinated.

This means that each named group only stores the value of the last match, e.g.

# no reference to "first"
>> R.match('(first last)') 
=>#<MatchData "(first last)" atom:"last" quote:nil expr:"(first last)">

This is actually true also for all the other regexes, without named groups or backreferences calls, e.g.

>> /((\w)*)/.match('abc')
=> #<MatchData "abc" 1:"abc" 2:"c">

except that, in my normal life as a regexp user, I’d solve that specific issue by using #scan and getting rid of the capturing group:

>> 'abc'.scan(/\w/)
=> ["a", "b", "c"]
>> 'abc'.scan(/(\w)/)
=> [["a"], ["b"], ["c"]]

but I can’t do it here, as that would mean unrolling the recursion to infinite depth.

I digged for quite a while trying to understand if this was something inherent in the ruby version of oniguruma, or in oniguruma itself, or if I was just dumb.

Turns out, people way smarter than me over at pcre-dev already had a discussion related to this, from which I quote

> Is there a workable approach if the repeating pattern itself has desired
> subpatterns? With the following I find the “b” captured only once:
> /(a(b)c)(\g1)?/ with data abcabc

That’s correct. Nobody has invented a scheme where the contents of capturing subpatterns are expressed as vectors. PCRE returns the value from the last time the parens captured something. So if you had /(a(b|d)c)(\g1)?/ with data abcadc you would get “d”.

Anyway, it seems to me that it should be possible to hook into the regex engine so that when a match happens it is passed to a user defined routine. I seem to recall perl has something like that, and PCRE’s callouts seem to be in the same league.

To be fair, I’d rather use a proper parser than regular expressions in real problems related to nested structures, as regular expressions have a tendency to grow exponentially in complexity. Still, since we can already check if a structure is recursive, it would be cool to be able to get the matched tree structure.

Even better, I’d like regular expressions in ruby to become something akin to pattern matching in perl6, and subsume the issue of basic regular expressions and full blown parsing in one go.

Leave a Reply Cancel reply