Matching nested structures with Regexps in Ruby 1.9

NOTE: this was originally published on the older version of this blog, some content may be broken/outdated.

Following a discussion on hacker news I have found myself wondering about regular expressions in ruby 1.9.

In this major version ruby switched its regex engine to oniguruma (and, since a few days ago to a fork of it called Onigmo ).

This engine is widely more powerful than the original ruby one, supporting positive and negative lookbehind, named captures, better support for backreferences and most importantly for our topic, callable backreferences.

The latter means that, among other things, ruby has gained the ability to recognize arbitrarily nested structures.

Since I hadn’t played with this before, I spent some time to craft a Regexp that would match a nested tree like structure, and what better example is there than good ol’ lisp?

This is what I got

# this pattern has many issues, consider it an example
R = %r{
      (?<atom> [-\w]+){0}
      (?<quote> '\g<expr>){0}
      (?<expr>  
        \g<atom> | 
        \g<quote> | 
        \(  
          \g<expr>?
          (\s+  \g<expr>)* 
        \)
      ){0}
      ^\g<expr>$
    }imx

which has a rather nice feel to it, reading bottom up

  • a regex is an expression
  • an expression is
    • an atom or
    • a quotation or
    • an expression (possibly empty) followed by other expressions
  • a quotation is an expression with a single quote in the front
  • an atom is a sequence of word characters or a dash

syntax wise, there are only a couple of things that are unusual: the “grammar rules” are defined like

 (?<foo> some pattern){0}

which is a clever trick (which I first saw in the oniguruma documentation): create a named capturing group, but say that it should match zero times. This way, the definition of the group stays around, but it is not required to actually match anywhere.

The other thing is how we refer to this rules:

\g<foo> 

this simply means: apply the named pattern in this spot, via the awesome “callable backreferences” of oniguruma (also available in PCRE now). Notice that we also get forward declarations, which is rather cool.

Some quick tests show that this works (for an arguable value of “works”) in some simple cases:

t('()')
t('(ciao)')
t('(ciao miao)')
t('(ciao miao)')
t('(ciao (miao))')
t('(ciao (miao bau))')
t("(ciao '(miao bau))")

t('(sum-iter k 1)')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
SEXPR
t('ok')

t("nope '(miao bau))")
t('(nope')
t(<<-SEXPR)
(define sum1  (lambda (k)
    (sum-iter k 0 1))
SEXPR

t(<<-SEXPR)
(define @  (lambda (k)
    (sum-iter k 0 1))
SEXPR


which correctly returns:

(): ok
(ciao): ok
(ciao miao): ok
(ciao miao): ok
(ciao (miao)): ok
(ciao (miao bau)): ok
(ciao '(miao bau)): ok
(sum-iter k 1): ok
(define sum1  (lambda (k)
    (sum-iter k 0 1)))
: ok
ok: ok
nope '(miao bau)): no
(nope: no
(define sum1  (lambda (k)
    (sum-iter k 0 1))
: no
(define @  (lambda (k)
    (sum-iter k 0 1))
: no

So the pattern correctly matches recursive structures with balanced parenthesis.

Sadly, this approach appears to have one serious downside: it’s impossible to get at the matched elements.

From what I understand, when the regexp engine compiles the pattern, it reserves a certain number of slots for the capturing groups which is independent of the actual string being examinated.

This means that each named group only stores the value of the last match, e.g.

# no reference to "first"
>> R.match('(first last)') 
=>#<MatchData "(first last)" atom:"last" quote:nil expr:"(first last)"> 

This is actually true also for all the other regexes, without named groups or backreferences calls, e.g.

>> /((\w)*)/.match('abc')
=> #<MatchData "abc" 1:"abc" 2:"c">

except that, in my normal life as a regexp user, I’d solve that specific issue by using and getting rid of the capturing group:

>> 'abc'.scan(/\w/)
=> ["a", "b", "c"]
>> 'abc'.scan(/(\w)/)
=> [["a"], ["b"], ["c"]]

but I can’t do it here, as that would mean unrolling the recursion to infinite depth.

I digged for quite a while trying to understand if this was something inherent in the ruby version of oniguruma, or in oniguruma itself, or if I was just dumb.

Turns out, people way smarter than me over at pcre-dev already had a discussion related to this, from which I quote

> Is there a workable approach if the repeating pattern itself has desired
> subpatterns? With the following I find the “b” captured only once:
> /(a(b)c)(\g1)?/ with data abcabc

That’s correct. Nobody has invented a scheme where the contents of capturing subpatterns are expressed as vectors. PCRE returns the value from the last time the parens captured something. So if you had /(a(b|d)c)(\g1)?/ with data abcadc you would get “d”.

Anyway, it seems to me that it should be possible to hook into the regex engine so that when a match happens it is passed to a user defined routine. I seem to recall perl has something like that, and PCRE’s callouts seem to be in the same league.

To be fair, I’d rather use a proper parser than regular expressions in real problems related to nested structures, as regular expressions have a tendency to grow exponentially in complexity. Still, since we can already check if a structure is recursive, it would be cool to be able to get the matched tree structure.

Even better, I’d like regular expressions in ruby to become something akin to pattern matching in perl6, and subsume the issue of basic regular expressions and full blown parsing in one go.

Step, a scala web picoframework

NOTE: this was written on an old version of this blog in 2009, some formatting, links etc.. may have rotted since.

So, after playing with scala on Google App Engine two days ago, I spent a bit of time yesterday (while waiting for food from girlfriend) and today (while waiting plane to go eat at my parents’ easter breakfast, unrelated) writing my hack for a web framework.

I mostly wanted to try scala’s cool abstraction features, so I thought I could try to produce something that had more or less the feeling of immediateness of sinatra.

To get ruby-like blocks in scala you only need to define functions with by-name arguments in this way:

def foo(x:String)(fun: =>Any) 
foo("something") {block returning something to be evaluated later }Code language: Scala (scala)

Of course, what was needed at this point was something that allowed me to match the url and the http method. This is the code to allow that, using a simple Map from path+method to anonymous functions

abstract class Step extends HttpServlet {
  val Map = new HashMap[(String,String),(Any=>String)]() 
  override def service(request:HttpServletRequest ,response: HttpServletResponse) {
        val method=request.getMethod
        val path=request.getRequestURI
        response.getWriter.println(Map(method,path)());
    }
  
  def get(path:String)(fun: =>Any) =    Map.put(("GET",path),x=>fun.toString)
  def post(path:String)(fun: =>Any) =    Map.put(("POST",path),x=>fun.toString)  
}Code language: Scala (scala)

I believe the types could be simplified, as I’d just need a ()=>Any function not Any=>String, but I could not come up with the proper definition yet 🙂

Anyway, this can be used in a pretty simple way, and thanks to Scala’s xml literal you can have code like this:

class ScalaHello extends Step {
  get("/") {
    <form action="/post" method="POST">
      <textarea name="key"/>
      <input type="submit"/>
    </form>
  }

  post("/post") {
    "received a post"
  }

  def hello() = "hello scala world!"
  
  get("/hello) {
    hello
  }

  get("/number") { 1 }
  
}Code language: Scala (scala)

Now, obviously we have the issue of passing arguments. The by-name trick does not allow us to pass parameters and we cannot inject a new variable into the closure scope (I believe), nor evaluate the block in a different scope a-la ruby’s instance_eval.

It is possible in scala to define implicit variables that have dynamic scope, but I could not find example on how that would be possible with byname blocks.

What we can do though, is define a method params to access such variable.

Yet, we are defining all the “get”s inside the same object, so even if we can access the method to access the parameters we just shifted the problem of having that shared object, and what if different threads access it at the same time?

Yet, scala already provides what we need, namely dynamic scope, in the form of the DynamicVariable class, which can be used like

  private val dynamic = new DynamicVariable[SomeType](init)
  dynamic.withValue(aValue) {
    block where dynamic contains aValue
  }Code language: Scala (scala)

Having defined the params method properly we have this code working nicely

class ScalaHello extends Step {
  get("/") {
    <form action="/post" method="POST">
      <textarea name="key"/>
      <input type="submit"/>
    </form>
  }
  post("/post") {
    "hello, this is your input "+params("key")
  }Code language: Scala (scala)

I did not implement the pattern matching in the URL as sinatra does, but that should also be simple.

The code is pretty ugly and having started really reading about scala only few days ago it can be probably improved in many ways. The first things that come to my mind

  • I am shuffling around parameters to keep the function closure, but probably cleaning up the types should avoid this
  • I am using java.util.Map and java arrays, which means even if the code is well typed it can still go wrong
  • I have a cast that could probably be avoided
  • there is no error handling
  • the code still needs to me mapped via the evil web.xml to the / path. It should be possible to avoid this, but I have no clue on how to do it 🙂

Anyway, as the code is 30 lines of which about 9 are “real” code, I am fairly impressed 🙂

The full code

package step
import javax.servlet._;
import javax.servlet.http._;
import scala.util.DynamicVariable
import scala.collection.mutable.HashMap

abstract class Step extends HttpServlet {
  type Params = java.util.Map[String,Array[String]]
  val Map = new HashMap[(String,String),(Any=>String)]()
  val paramsMap = new DynamicVariable[Params](null)

  def params(name:String):String = paramsMap.value.get(name)(0)
  override def service(request:HttpServletRequest ,response: HttpServletResponse) {
      try {
        response.setContentType("text/html")
        paramsMap.withValue(request.getParameterMap.asInstanceOf[Params]) {
          response.getWriter.println(Map(request.getMethod,request.getRequestURI)());
        }
      }
       catch {
         case ex:NoSuchElementException => response.getWriter.println("requesting "+request.getMethod+" "+ request.getRequestURI+" but only have "
                                                   +Map)
       }
    }  
  def get(path:String)(fun: =>Any) =    Map.put(("GET",path),x=>fun.toString)
  def post(path:String)(fun: =>Any) =    Map.put(("POST",path),x=>fun.toString)
}Code language: JavaScript (javascript)

Please post suggestions if you want, but remeber that for heavy lifting it would still better to use a lift ðŸ˜‰

PS this code does not work on google app engine as of now, but I am getting this message

Error: Server Error
The server encountered an error and could not complete your request.

If the problem persists, please report your problem and mention this error message and the query that caused it.

which may mean this is a bug on their side, probably related to some Thread behaviour as I believe DynamicVariable is defined in terms of ThreadLocal. I’ll let you know if I sort it out. UPDATE actually, it seems related to scala.collections.mutable.HashMap not being loaded, even thought the jar is in lib as the others. Mh..

UPDATE 2 well, it seems I had just a misconfigured path, everything works like a charm on Google App Engine too, yay!

Both my languages suck

NOTE: this was written on a different version of this blog many years ago and may contain outdated links and references.

Over at desperately unenterprise there is a nice article about the things the author dislikes about haskell, and an invitation to share your own quirks about your favourite language. Well, I almost equally like ruby and python. so here is a double rant.

Why ruby sucks

First, i can reopen classes but so can the others. It’s all nice, and the usual idea that this will make development impossible is dumb, but it’s true that it can get annoying. Matz once used to say this could be fixed by namespaces in ruby2 but I haven’t heard anything about it in years and I’m not sure it will happen. Meanwhile, I still got the occasional oh, damn why isn’t this working? moments, and it sucks. People may remember that nice chainsaw infanticide logger maneuver.

Modules as namespaces also suck. They rely on a wrong assumption, that everyone else will use modules as namespaces, which of course is not always true, so if you happen to use something poorly designed you may end hupo with their User class conflicting with your. Python got it right, namespaces are implicitly defined by files/directories and you have to explicitly import stuff in the current one.

But not just that, you can also import a single name from a python module. Try that with ruby’s require if you can. You’re either forced to split every single method in it’s own file, like facets does or you impose on the end user namespace pollution.

And it’s not even that namespaces are cheap to define. The usual way is

  • create a foo.rbmain file
  • create a foo/ directory
  • create many foo/xyz.rb files
  • in every single file define the module And it gets much worse with nested namespaces. And yeah I know I can do class Module::ClassName except that then I can’t load that single file separately.

(I also dislike the thing about constants being lexically scoped to be fair, but probably it’s just my disliking of class variables and the fact that I’d like to use constants for that)

What is nice about ruby modules is their usage as mixins, and even though I’d like them a bit different they are cool. But so under used.

The core classes in ruby are so fat that they are almost impossible to subclass. Have you ever tried subclassing String so that you can, for instance have a TextileString where you can do substitution without considering the markup? You’ll never get it right enough, just String#[] has 7 different behaviors depending on th arguments, and String#[]= has 8 of them.

The Comparable mixin rocks, everybody loves Enumerable but go figure how to make an Hash-like or File-like object without subclassing

Why python sucks

The problems with python are, well, what makes it pythonesque. Yeah, I understand why you need to explicitly define self in a function, I just believe it’s dumb and too verbose. But why can’t I have a bit of syntax sugar ? Make .foo be a shortcut for self.foo ? At least that would also enforce the convention!

Python has such a big set of conventions that are not supported by the language. Do not use global variables, but look, there is a useful global keyword, even though you have nested namespaces.

On the other hand, it enforces some ugly conventions, suchs as the avoidance of assignments in conditionals by splitting statements and expressions. That distinction is so annoying that as time passes more and more things become both. Conditions are statements, so here is also the expression version. Functions are statements so here is the crippled lambda expression. Assignment is a statement so here are a handful of setters that avoid using the evil = symbol.

update: sorry, this got published by error, it was still largely incomplete, so I’ll juust add some more comments

A lot of things that I dislike are in the “kernel library” design, and not syntax. len, map and so do not make the language less OO, I know, but they are ugly and polluting the top namespace, why can’t they be made part of proper abstractions? And I hate not being able to use mutable values as key for dictionaries, even java manages to do that, it should not be that hard.

Something that I miss syntax wise instead is the lack of a case statemente, especially of a destructuring/pattern matching one. Python people will usually tell you that you just use a dictionary. Yeah, but why I have to? I can remove else-if from the language too, but it does not mean it makes sense. case/switch statements are actually a place where the language can provide very useful behaviours once someone get past the C/Java versions of it. Take a look at ruby, SML or Scala.

And, pet peeve: why I have to type “:” in definitions? the parser does not need that, and come on, it adds nothing readibility-wise.

ECMAScript 4, the fourth system syndrome

NOTE: this was written in 2007 in an old version of this blog, some links or comments are lost to time.

The second system syndrome is that evil effect that often happens when redesign a small, working system so that it becomes a huge leviathan built by piling new features over new features.

Today I was reading the overview of ECMASCript 4 from ecmascript.org, and I got this very bad feeling that maybe the fourth version of ES, is suffering of an extremely strong case of this problem.

I believe that part of the success of ES in its current incarnation is that it is quite simple. Yeah, it has its shortcomings and oddities, but in general it is simple enough that everyone can learn it in no time.

ES4, on the other hand, seems to have included almost everything that came out from programming languages in the last 50 years, save Prolog and APL.

For example, it has prototypes, interfaces, classes, metaclasses, higher order types and generic functions, which for what I can tell entails all the work ever done on object-orientation (maybe predicate classes are missing, but they may be there too).

The argument list of a function may have named arguments, rest/variadic arguments, optional arguments, optional types including: non-nullable arguments, like arguments (this seems the same as in Eiffel, used to link the type of an object to another), wrap arguments that perform automatic conversion of an object of type X into another of type Y,

Whereas many languages I know provide a concept of Module or Package to handle namespacing issues, ES4 provides packages and namespaces and program units to support name hiding and modularity. Not only this, but you can introduce variables with var to have it in the object scope or with let to have it in a block, so even more namespace separation 🙂

Finally, ES4 incorporates some cool features such as list comprehensions, a kind of destructuring assignment/pattern matching, python-like slicing, triple-quotes and generators (though AFAICT this are only one way generators like in older python releases, where you can yield but you can’t receive values Thanks to Neil Mix for pointing out that they are complete coroutines, and support bidirectional value passing).

And operator overloading, program pragmas, required tail call elimination, a host of interesting keywords suchy as intrinsic to reify operators, dynamic to have classes that allow adding of new fields, static to have class-level functions, final to avoid inheritance and override to use it, internalpublic and so on. And of course, 6 different values of falsity, with NaN""0falsenull and undefined, but not {} and [].

ECMAScript 4 seems cool, I want to use a lot of those things (yay for multi method dispatch!), but maybe it is changing the language a little bit too much .

How many of the current user of ES in one of its incarnations have ever heard of product types? Will they grasp at once the logic behind the subtyping relation between functions, with all that covariant and contravariant stuff? Will they need it at all?

I hope the internet pipes will resist to the number of “metaclass vs meta namespace vs prototype vs superclass” questions on usenet.

A spell corrector in perl6 part 3

NOTE this was originally posted on an older version of this blog, in 2007, the language then known as Perl 6 has been renamed to Raku.

(See part 1 and part 2 for explanations of the following code).

Now, this code should work. If I copy and paste it into pugs it works as expected. If I run it it has some failures in the final tests for correct() that I can’t explain.

The part in the beginning shows how to count different words in a file, it depends on proper handling of unicode in pugs, so it may or may not work at the moment,


sub words($file) { slurp($file).lc.comb(/<alpha>+/) }

sub train(@words) {
  my %res;
  for @words -> $w { %res{$w}++ }
  %res
}


#my %NWORDS = train(words('/home/rff/Desktop/big.txt'));
my %NWORDS={'ciao'=>4,'c'=>3,'cibo'=>1,'ciaao'=>1,'ccc'=>1,'cia'=>1};

my @ALPHA = 'a'..'z';


# 'abc' -> 'ac'
sub deletion($word) {
  (^$word.chars).map: {substr(my $tmp = $word,$_,1)='';$tmp};
}

# 'abc' -> 'adc'
sub substitution($word) {
  gather {
    for (0..$word.chars-1) X @ALPHA {
      substr(my $tmp = $word,$_[0],1)=$_[1];
      take $tmp;
    }
  }
}

# 'abc' -> 'abbc'
sub insertion($word) {
  gather {
    for (0..$word.chars) X @ALPHA {
      substr(my $tmp = $word,$_[0],0)=$_[1];
      take $tmp;
     }
  } 
}

# 'abc' -> 'acb'
sub transposition($w) {
  gather for ^$w.chars {
    my $tmp=$w;
    my $removed =(substr($tmp,$_,1)='');
    substr($tmp,$_+1,0)=$removed;
    take $tmp;
  }
}

sub edits1($w) {
  # all these are different, no need to use a set
  transposition($w),insertion($w),substitution($w),deletion($w)
}

    
sub known_edits2($words) { 
  my @ary = gather {
    for edits1($words) -> $e1 {
      for edits1($e1) -> $e2 {
        take $e2 if %NWORDS{$e2} 
      }
    }
  }
  any(@ary).values
}

sub known(@words) { 
  gather for @words {take $_ if %NWORDS{$_}} ;
}

sub correct($w) {
  my @values = known([$w]) or known(edits1($w)) or known_edits2($w) or [$w];
  # single argument max() doesn't work yet
  say @values.perl;
  @values.max: {%NWORDS{$^a} <=> %NWORDS{$^b}}

}Code language: Perl (perl)