Tuesday, October 8, 2013

Beautiful Code Ported to Go

This week I've been learning Go - the programming language, not the game. I had studied its concurrency primitives for my Clojure library that would bring the CSP model to Clojure (this was before Rich Hickey and crew created core.async), but until a few days ago I hadn't formally studied the whole of Go with the intention of being proficient in it.

Go has pointers, but it does not have pointer arithmetic. Instead, it has slices - variable sized arrays on which you can use Python-like "slice" notation. I wanted a chance to try that out and found it recently when reading Chapter 1 of Beautiful Code, which is about a limited regular expression that Rob Pike (co-creator of Go) wrote for the Practice of Programming book he co-wrote with Brian Kernighan (who is also the author of Ch. 1 of Beautiful Code).

Pike's code is a limited (pedagogical) regex library that allows the following notation:

|------------+------------------------------------------------------------|
| Character  | Meaning                                                    |
|------------+------------------------------------------------------------|
| c          | Matches any literal character c                            |
| . (period) | Matches any single character                               |
| ^          | Matches the beginning of the input string.                 |
| $          | Matches the end of the input string                        |
| *          | Matches zero or more occurrences of the previous character |
|------------+------------------------------------------------------------|


/* ---[ The C version ]--- */

Here's Pike's code in C:


 #include <stdio.h>

 int matchstar(int c, char *regexp, char *text);

 /* matchhere: search for regexp at beginning of text */
 int matchhere(char *regexp, char *text) {
   if (regexp[0] == '\0')
     return 1;
   if (regexp[1] == '*')
     return matchstar(regexp[0], regexp+2, text);
   if (regexp[0] == '$' && regexp[1] == '\0')
     return *text == '\0';
   if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
     return matchhere(regexp+1, text+1);
   return 0;
 }


 /* matchstar: search for c*regexp at beginning of text */
 int matchstar(int c, char *regexp, char *text) {
   do {
     /* a * matches zero or more instances */
     if (matchhere(regexp, text))
       return 1;
   } while (*text != '\0' && (*text++ == c || c == '.'));
   return 0;
 }


 /* match: search for regexp anywhere in text */
 int match(char *regexp, char *text) {
   if (regexp[0] == '^')
     return matchhere(regexp+1, text);
   do {
     /* must look even if string is empty */
     if (matchhere(regexp, text))
       return 1;
   } while (*text++ != '\0');
   return 0;
 }

I'll let you read Ch. 1 of Beautiful Code for an analysis, but two things are noteworthy for my purposes:

  1. Pike uses pointer arithmetic throughout the code
  2. He uses the unusual do-while loop twice in only 30 or so lines of code

So I thought I'd port it to Pike's new language Go.


/* ---[ My Go version ]--- */


 package pikeregex

 // search for c*regex at beginning of text
 func matchstar(c rune, regex []rune, text []rune) bool {
     for {
         if matchhere(regex, text) {
             return true
         }
         if ! (len(text) > 0 && (text[0] == c || c == '.')) {
             return false
         }
         text = text[1:]
     }
 }

 // search for regex at beginning of text
 func matchhere(regex []rune, text []rune) bool {
     if len(regex) == 0 {
         return true
     }
     if len(regex) > 1 && regex[1] == '*' {
         return matchstar(regex[0], regex[2:], text)
     }
     if regex[0] == '$' && len(regex) == 1 {
         return len(text) == 0
     }
     if len(text) > 0  && (regex[0] == '.' || regex[0] == text[0]) {
         return matchhere(regex[1:], text[1:])
     }
     return false
 }

 // search for regex anywhere in the text
 func Match(regex string, text string) bool {
     runerx := compile(regex)
     runetxt := []rune(text)

     if len(runerx) > 0 && runerx[0] == '^' {
         return matchhere(runerx[1:], runetxt)
     }

     for {
         if matchhere(runerx, runetxt) {
             return true
         }
         if len(runetxt) == 0 {
             return false
         }
         runetxt = runetxt[1:]
     }
 }

 // one enhancement: allow + (1 or more) notation
 func compile(regex string) (regslc []rune) {
     regslc = make([]rune, 0, len(regex) + 10)

     for _, r := range regex {
         if r == '+' {
             regslc = append(regslc, regslc[len(regslc) - 1], '*')
         } else {
             regslc = append(regslc, r)
         }
     }   
     return regslc
 }

This is as straight a port as I could make it. And I think it translates well to Go. I've capitalized the Match method, as that is the public one to be exported to other libraries.

Instead of pointer arithmetic I used slice notation, as in this recursive call to matchhere:


 // C version
 return matchhere(regexp+1, text+1);

 // Go version
 return matchhere(regex[1:], text[1:])

Also in the C code you check whether you are at the end of the text string by looking for the NUL char: *text == '\0'. In Go, you can use the builtin len function: len(text) == 0. That statement is true if you keep recursively slicing text[1:] until you get to an empty string, or rather in my code, an empty slice of runes.


/* ---[ Runeology ]--- */

Runes are the Go 'char' type. A rune is an integer value identifying a Unicode code point. When you iterate over strings, you get runes, which are of variable size (number of bytes).

You have to be careful with strings in Go: text[2] returns the third byte, not the third rune in the string. If you want the third rune, you might try to use the utf8.DecodeRuneInString(text[2:]) function. But this would only work with ASCII, as you are slicing at the third byte and asking the utf8 library to parse the first rune from that point. But if the first rune in the string is two bytes long, you'll be getting the second rune in the string, not the third. If it's three bytes long, you're really in trouble.

The safest way is to do what I did in the code: convert the string to a slice of runes ([]rune) immediately and then work it that. Now when you index runeslice[2] you know you are getting the third rune.


/* ---[ No do-while ]--- */

Go doesn't have a do-while loop. It doesn't even have a while statement: just for. But, as Rob Pike reminded me in a critique of the first version of this blog entry, a do-while can be adequately mimicked with an infinite for loop:

 func matchstar(c rune, regex []rune, text []rune) bool {
     for {
         if matchhere(regex, text) {
             return true
         }
         if ! (len(text) > 0 && (text[0] == c || c == '.')) {
             return false
         }
         text = text[1:]
     }
 }

The stated intent of Go was to be as minimal as possible. Pike, in a recent podcast interview, said that the core team that created Go (which includes Ken Thompson) all had to agree that a feature was essential for it be included. Many candidate features were dropped, including the do-while loop. Of note, goto was not, which I find quite interesting. goto is only mentioned once (almost in passing) in the Effective Go guide, so I'm interested in what the essential use case for it was.


/* ---[ One addition ]--- */

Finally, in the Beautiful Code chapter, Kernighan suggests a number of enhancements the reader can make. I've only done one - allowing the + (1 or more) operator by mildly precompiling the regex, turning x+ into xx*, allowing me to use Pike's original (ported) code untouched.

The above code is available on GitHub: https://github.com/midpeter444/pikeregex

No comments:

Post a Comment