Matching nested structures with Regexps in Ruby 1.9

NOTE: this was originally published on the older version of this blog, some content may be broken/outdated.

Following a discussion on hacker news I have found myself wondering about regular expressions in ruby 1.9.

In this major version ruby switched its regex engine to oniguruma (and, since a few days ago to a fork of it called Onigmo ).

This engine is widely more powerful than the original ruby one, supporting positive and negative lookbehind, named captures, better support for backreferences and most importantly for our topic, callable backreferences.

The latter means that, among other things, ruby has gained the ability to recognize arbitrarily nested structures.

Since I hadn’t played with this before, I spent some time to craft a Regexp that would match a nested tree like structure, and what better example is there than good ol’ lisp?

This is what I got

# this pattern has many issues, consider it an example
R = %r{
      (?<atom> [-\w]+){0}
      (?<quote> '\g<expr>){0}
      (?<expr>  
        \g<atom> | 
        \g<quote> | 
        \(  
          \g<expr>?
          (\s+  \g<expr>)* 
        \)
      ){0}
      ^\g<expr>$
    }imx

which has a rather nice feel to it, reading bottom up

  • a regex is an expression
  • an expression is
    • an atom or
    • a quotation or
    • an expression (possibly empty) followed by other expressions
  • a quotation is an expression with a single quote in the front
  • an atom is a sequence of word characters or a dash

syntax wise, there are only a couple of things that are unusual: the “grammar rules” are defined like

 (?<foo> some pattern){0}

which is a clever trick (which I first saw in the oniguruma documentation): create a named capturing group, but say that it should match zero times. This way, the definition of the group stays around, but it is not required to actually match anywhere.

The other thing is how we refer to this rules:

\g<foo> 

this simply means: apply the named pattern in this spot, via the awesome “callable backreferences” of oniguruma (also available in PCRE now). Notice that we also get forward declarations, which is rather cool.

Some quick tests show that this works (for an arguable value of “works”) in some simple cases:

t('()')
t('(ciao)')
t('(ciao miao)')
t('(ciao miao)')
t('(ciao (miao))')
t('(ciao (miao bau))')
t("(ciao '(miao bau))")

t('(sum-iter k 1)')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
SEXPR
t('ok')

t("nope '(miao bau))")
t('(nope')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1))
SEXPR

t(<<-SEXPR)
(define @  (lambda (k)
    (sum-iter k 0 1))
SEXPR


which correctly returns:

(): ok
(ciao): ok
(ciao miao): ok
(ciao miao): ok
(ciao (miao)): ok
(ciao (miao bau)): ok
(ciao '(miao bau)): ok
(sum-iter k 1): ok
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
: ok
ok: ok
nope '(miao bau)): no
(nope: no
(define sum1  (lambda (k)
    (sum-iter k 0 1))
: no
(define @  (lambda (k)
    (sum-iter k 0 1))
: no

So the pattern correctly matches recursive structures with balanced parenthesis.

Sadly, this approach appears to have one serious downside: it’s impossible to get at the matched elements.

From what I understand, when the regexp engine compiles the pattern, it reserves a certain number of slots for the capturing groups which is independent of the actual string being examinated.

This means that each named group only stores the value of the last match, e.g.

# no reference to "first"
>> R.match('(first last)') 
=>#<MatchData "(first last)" atom:"last" quote:nil expr:"(first last)"> 

This is actually true also for all the other regexes, without named groups or backreferences calls, e.g.

>> /((\w)*)/.match('abc')
=> #<MatchData "abc" 1:"abc" 2:"c">

except that, in my normal life as a regexp user, I’d solve that specific issue by using and getting rid of the capturing group:

>> 'abc'.scan(/\w/)
=> ["a", "b", "c"]
>> 'abc'.scan(/(\w)/)
=> [["a"], ["b"], ["c"]]

but I can’t do it here, as that would mean unrolling the recursion to infinite depth.

I digged for quite a while trying to understand if this was something inherent in the ruby version of oniguruma, or in oniguruma itself, or if I was just dumb.

Turns out, people way smarter than me over at pcre-dev already had a discussion related to this, from which I quote

> Is there a workable approach if the repeating pattern itself has desired
> subpatterns? With the following I find the “b” captured only once:
> /(a(b)c)(\g1)?/ with data abcabc

That’s correct. Nobody has invented a scheme where the contents of capturing subpatterns are expressed as vectors. PCRE returns the value from the last time the parens captured something. So if you had /(a(b|d)c)(\g1)?/ with data abcadc you would get “d”.

Anyway, it seems to me that it should be possible to hook into the regex engine so that when a match happens it is passed to a user defined routine. I seem to recall perl has something like that, and PCRE’s callouts seem to be in the same league.

To be fair, I’d rather use a proper parser than regular expressions in real problems related to nested structures, as regular expressions have a tendency to grow exponentially in complexity. Still, since we can already check if a structure is recursive, it would be cool to be able to get the matched tree structure.

Even better, I’d like regular expressions in ruby to become something akin to pattern matching in perl6, and subsume the issue of basic regular expressions and full blown parsing in one go.

Both my languages suck

NOTE: this was written on a different version of this blog many years ago and may contain outdated links and references.

Over at desperately unenterprise there is a nice article about the things the author dislikes about haskell, and an invitation to share your own quirks about your favourite language. Well, I almost equally like ruby and python. so here is a double rant.

Why ruby sucks

First, i can reopen classes but so can the others. It’s all nice, and the usual idea that this will make development impossible is dumb, but it’s true that it can get annoying. Matz once used to say this could be fixed by namespaces in ruby2 but I haven’t heard anything about it in years and I’m not sure it will happen. Meanwhile, I still got the occasional oh, damn why isn’t this working? moments, and it sucks. People may remember that nice chainsaw infanticide logger maneuver.

Modules as namespaces also suck. They rely on a wrong assumption, that everyone else will use modules as namespaces, which of course is not always true, so if you happen to use something poorly designed you may end hupo with their User class conflicting with your. Python got it right, namespaces are implicitly defined by files/directories and you have to explicitly import stuff in the current one.

But not just that, you can also import a single name from a python module. Try that with ruby’s require if you can. You’re either forced to split every single method in it’s own file, like facets does or you impose on the end user namespace pollution.

And it’s not even that namespaces are cheap to define. The usual way is

  • create a foo.rbmain file
  • create a foo/ directory
  • create many foo/xyz.rb files
  • in every single file define the module And it gets much worse with nested namespaces. And yeah I know I can do class Module::ClassName except that then I can’t load that single file separately.

(I also dislike the thing about constants being lexically scoped to be fair, but probably it’s just my disliking of class variables and the fact that I’d like to use constants for that)

What is nice about ruby modules is their usage as mixins, and even though I’d like them a bit different they are cool. But so under used.

The core classes in ruby are so fat that they are almost impossible to subclass. Have you ever tried subclassing String so that you can, for instance have a TextileString where you can do substitution without considering the markup? You’ll never get it right enough, just String#[] has 7 different behaviors depending on th arguments, and String#[]= has 8 of them.

The Comparable mixin rocks, everybody loves Enumerable but go figure how to make an Hash-like or File-like object without subclassing

Why python sucks

The problems with python are, well, what makes it pythonesque. Yeah, I understand why you need to explicitly define self in a function, I just believe it’s dumb and too verbose. But why can’t I have a bit of syntax sugar ? Make .foo be a shortcut for self.foo ? At least that would also enforce the convention!

Python has such a big set of conventions that are not supported by the language. Do not use global variables, but look, there is a useful global keyword, even though you have nested namespaces.

On the other hand, it enforces some ugly conventions, suchs as the avoidance of assignments in conditionals by splitting statements and expressions. That distinction is so annoying that as time passes more and more things become both. Conditions are statements, so here is also the expression version. Functions are statements so here is the crippled lambda expression. Assignment is a statement so here are a handful of setters that avoid using the evil = symbol.

update: sorry, this got published by error, it was still largely incomplete, so I’ll juust add some more comments

A lot of things that I dislike are in the “kernel library” design, and not syntax. len, map and so do not make the language less OO, I know, but they are ugly and polluting the top namespace, why can’t they be made part of proper abstractions? And I hate not being able to use mutable values as key for dictionaries, even java manages to do that, it should not be that hard.

Something that I miss syntax wise instead is the lack of a case statemente, especially of a destructuring/pattern matching one. Python people will usually tell you that you just use a dictionary. Yeah, but why I have to? I can remove else-if from the language too, but it does not mean it makes sense. case/switch statements are actually a place where the language can provide very useful behaviours once someone get past the C/Java versions of it. Take a look at ruby, SML or Scala.

And, pet peeve: why I have to type “:” in definitions? the parser does not need that, and come on, it adds nothing readibility-wise.

A spell corrector in perl6 part 3

NOTE this was originally posted on an older version of this blog, in 2007, the language then known as Perl 6 has been renamed to Raku.

(See part 1 and part 2 for explanations of the following code).

Now, this code should work. If I copy and paste it into pugs it works as expected. If I run it it has some failures in the final tests for correct() that I can’t explain.

The part in the beginning shows how to count different words in a file, it depends on proper handling of unicode in pugs, so it may or may not work at the moment,


sub words($file) { slurp($file).lc.comb(/<alpha>+/) }

sub train(@words) {
  my %res;
  for @words -> $w { %res{$w}++ }
  %res
}


#my %NWORDS = train(words('/home/rff/Desktop/big.txt'));
my %NWORDS={'ciao'=>4,'c'=>3,'cibo'=>1,'ciaao'=>1,'ccc'=>1,'cia'=>1};

my @ALPHA = 'a'..'z';


# 'abc' -> 'ac'
sub deletion($word) {
  (^$word.chars).map: {substr(my $tmp = $word,$_,1)='';$tmp};
}

# 'abc' -> 'adc'
sub substitution($word) {
  gather {
    for (0..$word.chars-1) X @ALPHA {
      substr(my $tmp = $word,$_[0],1)=$_[1];
      take $tmp;
    }
  }
}

# 'abc' -> 'abbc'
sub insertion($word) {
  gather {
    for (0..$word.chars) X @ALPHA {
      substr(my $tmp = $word,$_[0],0)=$_[1];
      take $tmp;
     }
  } 
}

# 'abc' -> 'acb'
sub transposition($w) {
  gather for ^$w.chars {
    my $tmp=$w;
    my $removed =(substr($tmp,$_,1)='');
    substr($tmp,$_+1,0)=$removed;
    take $tmp;
  }
}

sub edits1($w) {
  # all these are different, no need to use a set
  transposition($w),insertion($w),substitution($w),deletion($w)
}

    
sub known_edits2($words) { 
  my @ary = gather {
    for edits1($words) -> $e1 {
      for edits1($e1) -> $e2 {
        take $e2 if %NWORDS{$e2} 
      }
    }
  }
  any(@ary).values
}

sub known(@words) { 
  gather for @words {take $_ if %NWORDS{$_}} ;
}

sub correct($w) {
  my @values = known([$w]) or known(edits1($w)) or known_edits2($w) or [$w];
  # single argument max() doesn't work yet
  say @values.perl;
  @values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}

}Code language: Perl (perl)