Wednesday, December 26, 2012

My CS/Programming Top 10 for 2012

As many do at year's end, and as I did last year, I looked through my notes, projects, tweets, blog entries and personal wiki to assemble the highlights of my year on all topics computer science and software engineering.

These are the top 10 "things" that added the most to my knowledge, impressed me as excellent tools, and added to the joy of being a software developer. Here's my list in no particular order:

  1. Security Now podcast
  2. Chrome Dev Tools
  3. Emacs 24
  4. Groovy
  5. Clojure
  6. z (rupa/z)
  7. Xubuntu
  8. Coursera
  9. Fiddler2 and Wireshark
  10. TrueCrypt and other encryption tools
  11. One that should be on the list: Datomic

/* ---[ 1. Security Now podcast ]--- */

In March I did a short blog entry on the podcasts I was listening to or had heard of and wanted to try out. Good technical podcasts are a gold mine of information that you can use to fill the interstices of your day - while commuting, cleaning the kitchen or taking a walk.

The podcast that had the biggest impact on me in 2012 is Security Now, done by Steve Gibson and Leo Laporte, one of the flagship podcasts of the TWiT network.

To date, they have done 384 episodes, starting in August 2005 and the vast majority are still relevant and worth listening to. You can download them from Steve's GRC Security Now website.

The focus, of course, is on computer security, mostly for the individual user, not at the corporate level. The podcast also covers networking theory and practice in great detail, since the network is the main vector by which malware spreads and presents a vast attack surface. For example, this year Steve did a deep dive into SPDY, the networking protocol developed by Google to speed up the web by reducing page load time, in part by overcoming the TCP slow-start problem.

And there's plenty of focus on cryptography and security. A highlight of the year was Steve's episode on Elliptic Curve Crypto, a crypto technology that will likely be used more heavily in days to come.

In addition, you learn a lot about how hard drives work, since Steve wrote SpinRite, a disk maintenance and recovery utility, which I use for maintenance on my spinning disks.

Also, starting with episode #233 ("Let's Design a Computer, part 1"), Steve does an 8+ episode series on the basics of how computers work, including what machine language is, how assembly language works, the role of stacks and registers, hardware interrupts and RISC vs. CISC architectures. You can learn (or be refreshed on) this material in surprising depth for an audio-only medium. Steve is very good at explaining this stuff.

This year, while keeping up with the weekly new broadcasts, I went back and started at episode 1. At this point I've listened to about half of the episodes, so this will continue to be my mainstay into 2013.

/* ---[ 2. Chrome Dev Tools ]--- */

This year I got back into JavaScript programming. I remember the horrible days of debugging by alert statements, which contributed to the general consensus that JavaScript was a toy language and a piece of ill-thought-out crap. Yet despite its warts, the result of its ridiculously short design period, Brendan Eich created a rather fascinating and powerful programming language. Even though we like to complain about its issues, I agree with Crockford that given the conditions under which it was developed, we got better than we deserved, and Mr. Eich is to thank for that.

So I was pleased to discover the awesome Chrome Dev Tools for browser based JavaScript development. JavaScript debugging can actually be pleasurable. Some resources to get you started if you aren't using it:

/* ---[ 3. Emacs24 ]--- */

Emacs is alive and well. In fact it is thriving more than ever. I've been a long-time user of emacs and I use it for everything except Java (which really needs a full IDE).

Emacs 24, released this year, is a great text editor. I use it on both Linux and Windows, with the exact same setup on both.

Most notably Emacs has package repositories now. Three, in fact, that I know of:
* ELPA, which is maintained/sponsored by GNU and has only the core emacs packages that adhere to the copyleft licensing model of the Free Software Foundation.
* Marmalade
* MELPA, which is where most of the bleeding edge work goes.

I use ELPA and MELPA by default, but I sometimes switch over to Marmalade if it has something not on the others. Generally MELPA and Marmalade seem to have the same packages, though MELPA often has the most recent. To make things confusing, MELPA moved to a date-based versioning system, like "20121206.1504", rather than the more traditional major.minor versioning system, such as "0.19".

There is still a big learning curve to emacs and some things are still pretty esoteric (I still have trouble getting themes to work), but when people ask me why I use emacs I say: "if programming is your main career and hobby, wouldn't you want to use the most powerful tool available? It's worth the few months of learning to enjoy the benefits for the rest of your life." But isn't emacs "old"??? (as if that's a bad thing) Seriously, when I use emacs I feel like I'm tapping into some of the collective wisdom of our programming culture from the last 30 years.

And no disrespect to vim. I like vim too. Pick one of those two and learn it. Stop using Notepad++ or worse.

A few emacs highlights from my year:

  • I love the nrepl package for Clojure. Now I can use those fancy keystrokes to auto-evaluate Clojure s-expressions. With the ac-nrepl package, it has code completion and will show you the argument signature for functions in the minibuffer! Some IDE-like goodness right there.

  • paredit. When I talk to people about Clojure (or Lisp in general), I sometimes get the story of how horrible it was balancing parentheses at 3 in the morning the day their CS class assignment was due. I am happy to announce to anyone that doesn't know: that problem is solved. Its name is paredit. Here is the slide deck I originally learned it from.

  • Learn to use emacs macros in two ways:

    • named macros you'll use a lot and save in your init.el (or macros.el if you want a separate file).
    • temporary unnamed macros to automate a one-time repetitive task, say, one you need to repeat 10 times in a file. This EmacsRocks video shows a great example of that.

/* ---[ 4. Groovy ]--- */

When I was first learning Ruby, many years ago, I remember experiencing Matz' principle of least surprise. Once you learned the basic gist of Ruby and its blocks and classes, you could often just guess how to do something or what a method would be called and it would work. It was a very satisfying experience.

This year I joined a new company and they have largely adopted Groovy as their primary scripting language. As I jumped in to learn it, I had that deja vu feeling of learning Ruby, this time wrappering the Java language we know and love.

For example, I started to write a Groovy script that would have to recursively traverse a directory structure, and I remembered the pain of doing this in Java with its FilenameFilters and other APIs you had to learn to get anything done. I said to myself "I hope Groovy has made this easier". Holy smokes, they wrapped File to have an eachFileRecurse method that takes a closure:

new File('.').eachFileRecurse {
  if (it =~ /.*\.txt/) println it
}

There is also an eachDirRecurse and variations where you can pass in a file type filter.

The more I learn about Groovy the more I like it. In fact, the "groovy-JDK" is one of my favorite things: The Groovy creators and contributors have wrapped a large number of the Java classes, using the Groovy metaclass concept, and given them additional useful methods. Such as:

  • String now has an eachLine method and versions of replaceAll and replaceFirst that take a closure, allowing arbitrarily complex logic to be executed to determine the replacement string.

  • Map now has an any method that takes a predicate closure to see if at least one entry passes the predicate test. It also now has map and reduce, though the authors unfortunately followed Ruby in calling them collect and inject respectively.

  • And thank the gods, they wrappered the horrible java.util.Date class and made it more useful.
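To make those GDK additions concrete, here is a small sketch; the method names come from the Groovy JDK docs, but the sample strings, maps and lists are my own invention:

```groovy
// replaceAll with a closure: arbitrary logic decides the replacement
def shouted = "foo123bar".replaceAll(/[a-z]+/) { it.toUpperCase() }
assert shouted == "FOO123BAR"

// eachLine iterates a multi-line String line by line
def lines = []
"one\ntwo".eachLine { lines << it }
assert lines == ["one", "two"]

// Map.any takes a predicate closure over the entries
assert [a: 1, b: 2].any { k, v -> v > 1 }

// collect and inject are Groovy's map and reduce
assert [1, 2, 3].collect { it * 2 } == [2, 4, 6]
assert [1, 2, 3].inject(0) { acc, n -> acc + n } == 6
```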

It provides many functional programming constructs, such as closures (the lambdas of Groovy), immutable data structures, higher order functions and very importantly: regex, list and map literals, akin to JavaScript or Clojure literals (though the map literal syntax is different in Groovy).

With GStrings you get string interpolation and multi-line strings. And Groovy gives you simpler syntax for accessing getters and setters - you grab them like properties.
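A quick sketch of both features; the Date here is just for illustration of property-style access:

```groovy
def name = "Groovy"
// GString interpolation in double quotes; triple quotes allow multi-line
def msg = """Hello, ${name}!
This spans two lines."""
assert msg.startsWith("Hello, Groovy!")

// getters and setters accessed like properties
def d = new Date(0)
assert d.time == 0   // calls getTime()
d.time = 1000        // calls setTime(1000)
assert d.time == 1000
```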

In short, when you are hacking out large swaths of boilerplate in Java, using tedious syntax to do stuff with Maps, Lists, Regular Expressions and a variety of other things, you constantly think to yourself, "man, I wish I could be doing this in Groovy". Groovy makes programming a pleasure.

I'm still learning it and look forward to using it for years to come.

/* ---[ 5. Clojure ]--- */

And speaking of bringing the joy back to programming, Clojure is a combination of elegance, joy and ... wait a minute, how do I do this in Clojure? I ran across someone who described himself as a "perennial Clojure beginner". I can identify with that. Since I don't come from a Lisp or functional programming background, the last year learning Clojure has been like learning to ride a bicycle again. Except this bicycle is tricked out and has gears, knobs and restrictions that are different from the other bicycles.

I've started proselytizing co-workers about Clojure. I get the "why Clojure?" question a lot, so here is my version:

  • Combines the best of Lisp, such as macros, and Java/JVM, such as its world class garbage collector (which a language built on immutable data structures needs)
  • Brilliant design for immutable data structures that is now being adopted by other languages (Scala for one)
  • Functional programming model, but with practical bent (not Haskell, but more pure than Common Lisp)
  • STM: software transaction memory -- brilliant solution to shared mutable state
  • Designed for concurrency (in a couple of different ways)
  • A fast dynamic language: faster than Ruby and Python, comparable to Java in many areas, and you can drop into Java easily when performance is the most important thing
  • ClojureScript: bring the power of Clojure macro writing, namespaces and better syntax to doing your JavaScript work
  • Data centric (like lisps), but even better by being abstraction centric
  • Clean design for solving the “expression problem”
  • Separation of concerns – an overall philosophy to tease things apart into simple (non-complected) pieces:
    • Example: polymorphism is not tied to inheritance
  • Simple and elegant syntax. For example, I find Scala to be powerful but overwhelming and confusing in its approach to syntax and expression
  • Community:
    • Small focused libraries (separation of concerns, non-complected)
    • Datomic => one of the greatest examples of separation of concerns there is
    • Core.logic => modern logic programming easily integrated into your program
  • Finally, an argument ad hominem: Rich Hickey. You need to watch the series of presentations he’s made over the past 5 years (perhaps one every week as Bodil suggests). Unquestionably the most impactful thinker in CS I’ve ever encountered. Even if you end up not agreeing with all of his views, you will learn a lot and think about things in a different way, possibly changing the way you think about our craft.

Finally, as a coda to this paean to Clojure: The O'Reilly Clojure Programming book came out this year. Chas Emerick, Brian Carper and Christophe Grand have written a fantastic book. It is a book you will learn from and come back to for its insights, examples and reference material for many years. Definitely belongs on my top 10 for 2012 list.

/* ---[ 6. rupa/z ]--- */

The z shell script (not zsh) is one of my favorite discoveries of 2012. To give it more press, I gave it its own blog entry, which you can read here:

Here's the short summary: z is a 200-line shell script compatible with bash and z-shell that is a clever LRU-type cache of your directory visitations - the cache weighting is based on both frequency and recentness, which the author dubs "frecency". As you navigate around to different directories, it keeps track of where you've been, how often you've been there and how recently.

To navigate somewhere you've been, pass a part of the path to the z command and it will take you to the highest weighted directory in your cache.

/* ---[ 7. Xubuntu ]--- */

I'm a Linux guy. I was on the Ubuntu bandwagon for many years. I played with Linux Mint a little. I've got Fedora and CentOS running in VirtualBox VMs. But when Unity came out on Ubuntu, I struggled to get used to its desktop model. It does not fit how I work. I tried it for a month and was considering what to switch to when I saw a Slashdot article saying that Linus Torvalds was adopting XFCE to get away from the strangeness of many modern Linux desktop environments.

So that prompted me to try Xubuntu, based on XFCE and also Lubuntu, based on the LXDE desktop. Lubuntu was a little too minimal for me, but Xubuntu clicked for me right away. I don't like the Dash of Unity and I really really hate the fact that when I try to open a new console shell it brings the current one to the forefront. That is not what I want. I'll use Alt-Tab for that.

Xubuntu behaves as you expect. Click the terminal icon and it opens a new terminal. Xubuntu puts shortcut icons on the bottom, similar to Apple's desktop, but without the annoying enlargement animations. I don't do a lot of customization of my desktop. I just want one that has sane defaults and Xubuntu is that for me.

Ubuntu also stirred up criticism for its integration of Amazon-affiliated advertisements, making the Dash a purchasing platform and creating data leaks in the process. Now you don't even have privacy when operating your desktop. The EFF write-up summarizes this nicely.

You can turn it off, but even among Linux users I suspect the "tyranny of the default" will mean that most users are leaking data and thus are at the mercy of Canonical, which people are starting to develop some mistrust for.

Well, Xubuntu doesn't have Dash. So you get the goodness of the Ubuntu ecosystem without the privacy violations. Its defaults are sane.

Try it out.

/* ---[ 8. Coursera and Online Education ]--- */

2012 is the year that online education skyrocketed. I've done a few CodeSchool courses and enjoyed those. But now there's Udacity and Coursera and Udemy and edX and probably 10 more I don't know about.

This year I took a Coursera course: Functional Programming Principles in Scala taught by Martin Odersky. It was a great experience. The format is excellent - each week there is about 2 to 3 hours of video lectures and a programming assignment that takes anywhere from 5 to 15 hours to complete. The examples were challenging enough to make the time investment worth it. And I got a nice certificate at the end for having a passing grade.

Uploading assignments was done via a command in Scala's sbt tool; it was easy and seamless. The assignments were graded automatically in about 10 minutes and gave good feedback, allowing you to fix problems and resubmit. The only part of the course I didn't enjoy was using the Scala Eclipse IDE, which is still quite painful compared to Java in Eclipse or Clojure in Emacs.

It's amazing what you can get online for free these days. I've signed up for two more courses and have my eye on a cryptography course there as well.

/* ---[ 9. Fiddler2 and Wireshark ]--- */

I spent a good deal of time this year maintaining and enhancing a large "legacy" web app that uses Ajax calls to communicate with its Java back-end. In many cases, the shortcut to figuring out what is going on is to watch the traffic between the browser and server. Fiddler2 is an invaluable tool for that.

I also tried Wireshark, but the output from Fiddler2 is just as intuitive and easy to follow as can be, since it focuses only on HTTP traffic.

Wireshark is more general purpose. I started learning it this year and want to get better at configuring and customizing it, so I can use it effectively (and efficiently) on Linux, since Fiddler2 is unfortunately a Windows-only product.

/* ---[ 10. TrueCrypt, GPG and other encryption tools ]--- */

If you aren't using encryption for your files, hard drives and passwords, make it your new year's resolution to learn the tools. Ever since Phil Zimmermann bravely pioneered encryption for the everyman, the suite of tools available to do this has gotten better and better.

I use GPG to encrypt individual files, TrueCrypt to encrypt thumb drives and external drives and Ubuntu's full disk encryption for my laptops. If you have a laptop and thumb drives, they should be encrypted.

A nice file encryption tool on Windows is AxCrypt.

For passwords, I use LastPass, which I believe does it all correctly and securely in a "trust no one" fashion.

Consider using an encrypted "Trust No One" backup and file syncing service. Dropbox is not encrypted, nor is SkyDrive or Google Drive or many other popular services. Do not upload anything to those systems that you wouldn't mind having broadcast on the internet or at least read by employees of those companies.

Steve Gibson (of the Security Now podcast) did a multi-episode analysis of backup and file syncing services from an encryption and "trust no one" perspective. Start with episode #349. There are a number of good solutions. I use SpiderOak on Linux.

If you already know and do this stuff, have a CryptoParty in your area. If you live in my area (Raleigh/Durham, North Carolina, USA), join the DC919 group.

/* ---[ Datomic: Mine goes to 11 ]--- */

While I did attend a Datomic training course this year and wrote a fairly long blog post about it, I just haven't made the time to really study it yet. I fully intend to, as I think it is one of the most profound and important things to have come out in 2012. I've queued it up to be on my top 10 list in 2013.

Monday, December 24, 2012

An irritation no longer: command line joy

/* ---[ pushd and dirs ]--- */

I have long been a command line person. I hate having to use the mouse. One thing that is a little cumbersome about the command line is jumping around into various deeply nested directory structures. I've long been a user of pushd, dirs and popd on Unix/Linux/Cygwin consoles. But if you are alternating between 3 or more directories with some regularity, those commands require some care to use correctly.
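For anyone who hasn't used the directory stack, a minimal session looks like this (the directory names are just examples):

```shell
mkdir -p /tmp/demo/a /tmp/demo/b
cd /tmp/demo
pushd a > /dev/null     # push /tmp/demo onto the stack, cd to a
pushd ../b > /dev/null  # push a onto the stack, cd to b
dirs -p                 # print the stack, current directory first
popd > /dev/null        # pop: back to a
pwd                     # /tmp/demo/a
```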

/* ---[ Improvement #1: pd ]--- */

An improvement on that is a small bash function that I found on stackexchange:

function pd() {
  if [ "$1" ]; then
    pushd "${1/#[0-9]*/+$1}";
  fi > /dev/null
}

which simplifies using pushd. A basic session of use would be:

midpeter444:~/lang/clojure/concurrency-fun$ pd .
midpeter444:~/lang/clojure/concurrency-fun$ dirs
 0  ~/lang/clojure/concurrency-fun
 1  ~/lang/clojure/concurrency-fun
midpeter444:~/lang/clojure/concurrency-fun$ cd ~/.mozilla/
midpeter444:~/.mozilla$ dirs
 0  ~/.mozilla
 1  ~/lang/clojure/concurrency-fun
midpeter444:~/.mozilla$ pd .
midpeter444:~/.mozilla$ cd /tmp
midpeter444:/tmp$ dirs
 0  /tmp
 1  ~/.mozilla
 2  ~/lang/clojure/concurrency-fun
midpeter444:/tmp$ pd 2

pd can take either a dot, which means "remember this directory", or a number, which refers to the position on the dirs history list. The random-access list metaphor is easier to work with than the pushd stack-based metaphor.

/* ---[ Vast Improvement #2: z ]--- */

But recently I discovered rupa/z or "z" and now I only occasionally use pd anymore. z is the biggest change and improvement to my command line life in years. I really love it. It works with cygwin as well for my time on Windows machines in my day job.

If you use the command line much, go get it now:

What is it?

First it is not z-shell (I'm a bash user), which is what I thought initially. (It doesn't help that the main GitHub page for it starts with "ZSH USERS BACKWARD COMPATIBILITY WARNING").

What it is, is a 200-line shell script compatible with bash and z-shell that is basically a clever LRU-type cache of your directory visitations - the cache weighting is based on both frequency and recentness, which the author dubs "frecency". As you navigate around to different directories, it keeps track of where you've been, how often you've been there and how recently.

To see your current cache in ascending order of 'frecency', just type z:

midpeter444:~$ z
0.313808   /home/midpeter444/apps/apache-ant-1.8.4/bin
0.313808   /home/midpeter444/lang/clojure/books/land-of-lisp/wizards-adventure/doc
0.313808   /home/midpeter444/lang/java/projects/mybatis-koans
0.392263   /tmp
0.429067   /home/midpeter444/lang/clojure/source-code
0.627622   /home/midpeter444/lang/lisp
0.784525   /home/midpeter444/.mozilla/firefox
0.83702    /home/midpeter444/lang/clojure/projects/clj-how-tos/clj-sockets
0.86298    /home/midpeter444/media/ebooks
1.62       /home/midpeter444/Dropbox/scripts-and-config-files
2.32335    /home/midpeter444/lang/clojure/sandbox/src/sandbox
5.6486     /home/midpeter444/lang/clojure/books/land-of-lisp/wizards-adventure
8.54205    /home/midpeter444/lang/clojure/books/land-of-lisp/orc-battle
10.7351    /home/midpeter444/lang/clojure/projects/clj-how-tos/clj-sockets
20.9559    /home/midpeter444/lang/clojure/books/land-of-lisp
30.7926    /home/midpeter444/lang/clojure/books/land-of-lisp/webserver
32.099     /home/midpeter444/Downloads
192.24     /home/midpeter444/lang/clojure/concurrency-fun

The number on the left indicates the frecency score. So ambiguous entries will resolve in favor of the one with the higher score.

To navigate somewhere you've been, pass a part of the path to the z command:

midpeter444:~$ z fun

midpeter444:~$ z lisp

midpeter444:~$ z moz

It also has tab-completion. If I hit tab for the example above where I typed moz it expands to:

$ z /home/midpeter444/.mozilla

Now the only part of the command line usage I once found irritating is pure joy.

Saturday, December 15, 2012

Programming Praxis Amazon Interview Question, part 2

In part 1 of this blog entry I covered two relatively efficient implementations of an algorithm to keep the top N entries in a stream of points, a challenge that came from the Programming Praxis web site:

Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).

I was happy with my ad-hoc solution since it performed twice as fast as using a sorted-set data structure.

The answer chosen by the author of the Programming Praxis site used a Heap data structure, where the max value is kept at the top of the heap. His implementation for this exercise was a mutable heap, so it wasn't a viable candidate for a Clojure implementation. However, he has links to other praxis challenges where he implemented heaps (in Scheme), some of which use immutable data structures. I chose to (manually) transpile his Scheme-based "Leftist heap":

This is only one implementation of a heap and there might be (probably is) one that is more efficient, but I chose this as a representative to see how it would compare to the other solutions.

Here is my implementation of the "top 100" challenge using the Leftist Priority Queue Heap:

The Priority Queue does not keep track of its size and I didn't want to modify the data structure to do that. To compensate, I used the Clojure split-at function to split the points vector into two lazy seqs: the first max-size entries, which are directly inserted into the heap, and the rest.
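A tiny illustration of split-at for anyone unfamiliar with it (the points and the max-size of 3 are made up for the example):

```clojure
;; split-at returns a pair of lazy seqs: the first n items, and the rest
(let [points [[0 1] [2 2] [1 0] [4 4] [0 5]]
      [seed remaining] (split-at 3 points)]
  [(vec seed) (vec remaining)])
;; => [[[0 1] [2 2] [1 0]] [[4 4] [0 5]]]
```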

Those remaining points then have to be sifted: if a point's distance from the origin is less than the first element on the heap, then that top entry on the heap needs to be pulled off and the new point is inserted. That is done with the code:

(q/pq-insert dist-lt? pt (q/pq-rest dist-lt? pq))

pq-rest is like Clojure's pop - it gives you the data structure minus its head and we insert the new point into that remaining data structure.
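If pop is unfamiliar, a quick illustration of the analogy (pq-rest plays the same role for the heap):

```clojure
;; pop removes the "top" element: the head of a list, the last item of a vector
(pop '(1 2 3)) ;; => (2 3)
(pop [1 2 3])  ;; => [1 2]
```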

The dist-lt? function is a comparator function required by the Leftist Heap algorithm.

I did this additional exercise because I suspected that a heap would be a more efficient implementation than a fully sorted set.

Here are some representative benchmark results again using the Clojure criterium tool. (This time I truncated some of the output to make it easier to read.)

user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
user=> (count points)

;; this is the "ad-hoc" solution from part 1
user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 525.526800 ms

;; this is the sorted-set solution from part 1
user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.250241 sec

;; this is the new heap-based implementation
user=> (bench (top-heap points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.063965 sec

I've only shown three results, but I ran them many times in varied order and got remarkably consistent results.

So without going into specific precision, the heap-based implementation is about 20% faster than the sorted-set implementation and my ad-hoc solution is about 50% faster than the heap-based implementation.

Which is about what I expected, though I thought the heap-based solution might even be a little faster than this. One problem with the heap implementation I'm using is that its central function (merge) uses true recursion. I didn't see an easy way to make it use loop-recur or a lazy-seq. If anyone sees a clever way to do that, I'd love to hear it.

Friday, December 14, 2012

Programming Praxis - Would Amazon hire me?

I was finishing up my work for the day and decided to check Hacker News. High on the list that day was Programming Praxis, which I'd never seen before and the featured post was the following Amazon interview question:

Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).

I had been planning to go home and start working on Dice of Doom (in Clojure) from Chapter 16 of Conrad Barski's Land of Lisp, but this problem sounded intriguing enough that I would take it up.

After sketching out a few ideas, I concluded that a strict O(n) solution isn't possible, but something near-O(n) would be feasible. Similar to how Clojure's persistent data structures often operate in "essentially constant" time -- O(log-32 n) being close enough to constant time to be considered basically a constant factor.

I decided to try an ad-hoc solution and then a sorted-set implementation. My guess was that the ad-hoc solution would be faster and that gave me a good excuse to try out Hugo Duncan's criterium benchmark tool to prove it (or prove me wrong).

/* ---[ The ad-hoc solution ]--- */

The approach I decided upon was to use an unsorted hash-map with mixed keys.

One key :ct would keep track of the count: how many entries ("points") were in the hashmap. Its max would be the max-size (100 in the challenge statement).

The second key :hd, short for highest distance, would keep track of the entry (or entries) with the farthest distance from the center point.

The rest of the keys are integers representing the distance to the origin. This distance key is mapped to a vector of points. Each point is a tuple (vector) of the x and y coordinate.

I decided to interpret distance from (0,0) not as the square root of the sum of the squares of x and y, but rather the absolute value of x plus the absolute value of y, but it wouldn't be too hard to imagine this with the other distance formula.
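A minimal sketch of that distance function as described (Manhattan distance rather than Euclidean); the name distance matches the one the later code calls:

```clojure
;; distance as described above: |x| + |y|, not the sqrt of the sum of squares
(defn distance [[x y]]
  (+ (Math/abs (long x)) (Math/abs (long y))))

(distance [4 3])  ;; => 7
(distance [-2 5]) ;; => 7
```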

So the data structure would look like this:

{:ct 8
 :hd 7
 0 [[0 0]]
 2 [[1 1] [0 2]]
 3 [[1 2] [0 3] [3 0]]
 6 [[5 1]]
 7 [[4 3]]}

In this example, the data structure has 8 points: 1 with distance 0, 2 with distance 2, and so on.

Based on showing you this data structure I probably don't need to describe the algorithm I'm going to use. Rob Pike's Rule 5 is:

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming

Or Brooks' famous statement:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

So now the fun is to implement this in Clojure's immutable data structures using nicely composed functions.

We know we have to iterate through all the input points - there's our O(n) cost without doing anything else - and we want to end up with one data structure. Therefore, we'll use reduce.

Even though the problem statement says the closest 100 points, for testing, I want to parameterize the max-size and what set of points I'll feed it, so the main function will look like this:

(def init-map {:hd Long/MIN_VALUE, :ct 0})

(defn topN [points max-size]
  (reduce (mk-sift-fn max-size) init-map points))

points will be a collection or seq of x-y tuples and max-size will indicate how many points the final data structure should retain. I pass max-size to a "make-sift-function", which is a higher-order function that will return the function that will "sift" through each point and determine whether it goes into the data structure and if so, where. A place for everything and everything in its place.

(defn mk-sift-fn [max-size]
  (fn [m pt]
    (let [dist (distance pt)]
      (if (< (:ct m) max-size)
        (add-point m dist pt)
        (if (< dist (:hd m))
          (-> m (remove-point (:hd m)) (add-point dist pt))
          m)))))

The flow of the function returned by mk-sift-fn is:

  • if you haven't seen max-size entries yet, add the point to the map (letting the add-point fn figure out where to stick it)

  • if the data structure is at max capacity, then if the distance from (0,0) of the point under consideration is less than the current highest distance in the map, we remove one point that maps to the highest distance and add the new point.

I use the -> threading macro to weave through the remove-point and add-point functions, allowing nice composition of the functions with immutable structures.
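For readers new to ->, here is the same shape of pipeline in miniature, with plain map operations standing in for remove-point and add-point:

```clojure
;; -> threads the map through each call as the first argument
(-> {:hd 7, :ct 3}
    (dissoc :hd)    ;; stands in for remove-point
    (assoc :hd 6))  ;; stands in for add-point
;; => {:ct 3, :hd 6}
```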

Here is the full implementation with all the helper methods:

Here's an example run with a small data set:

user=> (require '[sandbox.top100 :refer [topN]] :reload)
user=> (def points (shuffle (for [x (range 2 5) y (range 3)] [x y])))
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (topN points 5)
{2 [[2 0]], 3 [[3 0] [2 1]], 4 [[4 0] [2 2]], :hd 4, :ct 5}

/* ---[ The sorted-set solution ]--- */

Clojure's sorted-sets are binary (persistent) trees. The sort order I use for the sorted set is distance from (0,0) descending.

As before, we'll directly add the first max-size entries we see. After that, we have to remove entries from the set if one with a shorter distance is seen.

Due to our sort order, the point with the greatest distance would be at the top of the tree and is easily removed in constant time using disj when we find a point that is closer to the origin. However, we then have to add that new point to the sorted set and all of these additions average O(log-n) insertion time. I was pretty sure my ad-hoc solution would be more efficient overall because of this extra sort time for all elements that get added.

To define a sorting comparator in Clojure, you use the sorted-set-by fn which takes a comparator of your choosing.

I stated above that the sort order would be by distance descending, but since this is a sorted set, not a sorted list or vector, that won't actually work:

user=> (require '[sandbox.top100 :refer [distance]] :reload)
user=> (defn by-dist [pt1 pt2]
  #_=>   (> (distance pt1) (distance pt2)))
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (apply sorted-set-by by-dist points)
#{[4 2] [3 2] [3 1] [3 0] [2 0]}

We lost some points in the set. We have [3 1], but not [4 0]. Since these have the same "value" in the eyes of the set and a set can keep only one value, the other is dropped.

So I changed the comparator to take equal distances into account and do a secondary sort on the x coordinate, thus keeping all the points we are fed:

(defn dist-then-first [pt1 pt2]
  (let [dist1 (distance pt1)
        dist2 (distance pt2)]
    (if (= dist1 dist2)
      (> (first pt1) (first pt2))
      (> dist1 dist2))))

As before I used reduce to iterate over all the input points. The overall solution is nicely shorter than the ad-hoc one:
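That gist is likewise not shown; a hedged sketch of topNSS, using the dist-then-first comparator defined above (the name and structure are assumptions based on the prose):

```clojure
(defn topNSS [points n]
  (reduce (fn [ss pt]
            (cond
              ;; directly add the first n points
              (< (count ss) n)
              (conj ss pt)

              ;; the first element sorts highest (farthest from origin);
              ;; replace it if the new point is closer
              (dist-then-first (first ss) pt)
              (-> ss (disj (first ss)) (conj pt))

              :else ss))
          (sorted-set-by dist-then-first)
          points))
```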

/* ---[ Performance showdown ]--- */

At the Clojure Conj this year, I discovered Hugo Duncan's criterium benchmark tool, which gives you more robust benchmarks than simply using time.

I used it to compare the two solutions, after redefining the points vector to hold 100,000 points. I ran bench on each solution twice: first keeping the closest 6 points, then keeping the closest 100 (with the second round run in the opposite order).

user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
user=> (count points)
100000
user=> (use 'criterium.core)
user=> (bench (topNSS points 6))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.022929 sec
    Execution time std-deviation : 10.070356 ms
   Execution time lower quantile : 1.006426 sec ( 2.5%)
   Execution time upper quantile : 1.044943 sec (97.5%)

user=> (bench (topN points 6))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 545.035140 ms
    Execution time std-deviation : 4.335731 ms
   Execution time lower quantile : 538.861529 ms ( 2.5%)
   Execution time upper quantile : 554.198797 ms (97.5%)

user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 531.174287 ms
    Execution time std-deviation : 4.642063 ms
   Execution time lower quantile : 522.942875 ms ( 2.5%)
   Execution time upper quantile : 541.571260 ms (97.5%)

user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.260036 sec
    Execution time std-deviation : 15.670337 ms
   Execution time lower quantile : 1.240810 sec ( 2.5%)
   Execution time upper quantile : 1.292583 sec (97.5%)

Sweet! My ad-hoc solution runs twice as fast on this dataset. In the end it's a trade-off between the more elegant code and performance, as is often the case.


Monday, October 1, 2012

Reading user input from STDIN in Clojure

Recently I was working on a simple Clojure program where I needed to read input from STDIN. I hadn't actually done this before, so I searched online and found others with similar questions, but I had to cobble an answer together because of two issues:

  1. In most languages you would use a while loop and check for some condition (or input) to stop on. Clojure does have a while construct, but making it work here seems tricky and likely involves mutable state, and paying the overhead of managed references (atoms/refs and the STM) just to do a simple input loop seems like overkill.

  2. When you try to read from STDIN using lein run, it doesn't work. The input is ignored.

So, here's what I have working that is reasonably idiomatic and concise.

In my code example, I want to read a user's input until they type in a sentinel value, which I chose to be :done. Here I just echo the output back:
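The gist is not shown here; a minimal sketch using loop/recur and read-line, with no mutable state required (names are my own, not necessarily the original's):

```clojure
(ns example.stdin)

(defn -main [& args]
  (println "Enter text:")
  (loop [line (read-line)]
    ;; stop on EOF (nil) or the :done sentinel
    (when-not (or (nil? line) (= ":done" line))
      (println (str "You entered: >>" line "<<"))
      (recur (read-line)))))
```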

The trick to getting it to work with Leiningen (I'm using lein-2.0.0-preview10) is to use lein trampoline run, rather than lein run. The trampoline feature runs your app in a separate JVM process rather than as a child process of the Leiningen JVM. Running it as a child process somehow blocks STDIN input from being captured.

Example usage:

$ lein trampoline run -m example.stdin
Enter text:
First line
You entered: >>First line<<
Second line
You entered: >>Second line<<
Foo
You entered: >>Foo<<
:done

In all likelihood, you would want to capture and retain the input lines in a collection. To do that use an accumulator in your loop:
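Sketching that (again an assumption, not the original gist) - the same loop, with a vector accumulator returned when the sentinel is seen:

```clojure
(defn gather-input []
  (loop [line (read-line)
         acc  []]
    (if (or (nil? line) (= ":done" line))
      acc                                  ; return the collected lines
      (recur (read-line) (conj acc line)))))
```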

Ideas on more idiomatic ways to do this in Clojure are welcome.

Monday, September 24, 2012

Running Wireshark as non-root on Ubuntu Linux

/* ---[ Wireshark on Linux ]--- */

When you install the wireshark APT package on an Ubuntu system (in my case Xubuntu 11.10) and start it as yourself, it will start up fine, but you can't immediately start sniffing packets because nothing shows in the Capture > Interfaces list to click on. For this to work on Linux, wireshark's packet capture tool, dumpcap, has to have additional privileges (or capabilities) to run.

Naturally, you then think "I'll run it as root" and do sudo wireshark. Yes! Now I can see eth0 and lo and any other network interfaces on my system. But wait, there's this pop-up dialog telling me that running wireshark as root is dangerous and it refers me to /usr/share/doc/wireshark-common/README.Debian, which mostly reads as mystical uber-geek mutterings. Then you see this link at the bottom, which looks more promising.

It is a great article that shows you how to either give the setuid attribute to /usr/bin/dumpcap or, preferably, use Linux capabilties to grant just the specific capabilities that dumpcap needs, without needing to run it as root.

Unfortunately, the instructions given for the second option on that page are wrong. There are three errors:

Error 0: The groupadd command has a spurious -g

Error 1: It forgets to set wireshark as the group for /usr/bin/dumpcap

Error 2: It gets the setcap instruction wrong.

/* ---[ Make it so ]--- */

Here is the new improved instruction set that I just got to work on my Xubuntu system:

$ sudo su - root
# sudo apt-get install libcap2-bin
# groupadd wireshark
# usermod -a -G wireshark <your-user-name>
# chmod 750 /usr/bin/dumpcap
# chgrp wireshark /usr/bin/dumpcap
# setcap 'CAP_NET_RAW+eip CAP_NET_ADMIN+eip' /usr/bin/dumpcap

(Note: on modern Ubuntu's you probably won't need to install libcap2-bin. I already had it.)

I believe you will also need to log out and log back in. You might be able to avoid this by invoking newgrp wireshark, but I'm not sure your group setting is the only reason you would have to log out.

To be fair, the wireshark wiki does get it right on this page, but you have to look in the "Other Linux based methods" section, not the "Debian, Ubuntu and other Debian derivatives" section.

Sunday, September 9, 2012

Beautiful Clojure

I ran across some beautiful code today.

I have the book Beautiful Code, though I've only read the chapter by Douglas Crockford on writing a simplified JavaScript parser using Top Down Operator Precedence (say that three times fast). And I skimmed the MapReduce chapter once. It's on my TODO list. Along with a lot of other things. So much awesome CS to do, so little ...

So that's not the beautiful code I'm referring to.

I spent the last few days going back to do some problems at 4Clojure. I finished about half of them a number of months ago and now it's time to make some progress on them again.

So I picked an "easy" one (so they claim). And got stumped. It was Problem 28: Flatten a Sequence. In the end, after only coming up with solutions involving mutable state (for shame), I decided to learn from the masters and look at the source code to clojure.core/flatten.

/* ---[ Beautiful Clojure, version 1 ]--- */

If you haven't seen it before, it's quite interesting:
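For reference, clojure.core/flatten (as of Clojure 1.4, docstring elided) is roughly:

```clojure
(defn flatten
  [x]
  (filter (complement sequential?)
          (rest (tree-seq sequential? seq x))))
```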

It treats a nested collection, such as [1 2 [3 [4 5] 6]], like a tree, where the collection and its subcollections are branches (non-leaf nodes) and the values are the leaf-nodes.

It depends on tree-seq to do most of the work. I won't reproduce it here but take a look to understand it better.

After grokking the idea of treating the data structure like a tree, my next question was "why does flatten have to filter the results?". Here's why:

test=> (def ds [1 2 [3 4 [5 "a"]]])
test=> (tree-seq sequential? seq ds)
([1 2 [3 4 [5 "a"]]] 1 2 [3 4 [5 "a"]] 3 4 [5 "a"] 5 "a")

I see, tree-seq returns both branches (with their children) and leaf-nodes. To get just the leaf-nodes, I have to filter out the branches.

That is a fascinating idea, but I have to admit it wasn't what I expected the code to flatten to look like.

Then I solved the problem on 4clojure and was eager to see how others had solved it. I found two right away that immediately struck me as beautiful code.

/* ---[ Beautiful Clojure, version 2 ]--- */

The first, from nikkomega, uses mapcat, which works like map but calls (apply concat) on the result. I am in awe of its elegance:
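The gist isn't shown; reconstructed from the description, the solution is essentially a one-liner along these lines:

```clojure
(fn flat [s]
  (mapcat #(if (sequential? %) (flat %) (list %)) s))
```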

Use map to iterate over each element in the data structure. If it is a "value", like 3, then return '(3). If it is a collection (more precisely, if it implements Clojure's Sequential interface), then recursively dive into that collection.

In the end the outer map returns a sequence: '('(1) '(2) '(3) '(4) '(5) '("a")) (using the collection I listed above). Applying concat gives us the flattened sequence: '(1 2 3 4 5 "a").

Like clojure.core/flatten, this one also returns a LazySeq, but we get that for free - both map and concat return lazy-seqs.

/* ---[ Beautiful Clojure, version 3 ]--- */

The second solution that caught my eye is from unlogic. In the end it is pretty much the same algorithm, but uses reduce to iterate over all the elements and concats the results as it goes:
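Again the gist is missing; a reconstruction of that reduce-based approach (the shape follows the description, the details are assumptions):

```clojure
(fn flat [s]
  (reduce (fn [acc e]
            (if (sequential? e)
              (concat acc (flat e))      ; recurse into sub-collections
              (concat acc (list e))))    ; append the leaf value
          '() s))
```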

/* ---[ Beautiful Groovy? ]--- */

So I wondered, could I do this in Groovy, a language that provides some nice functional programming features?

Here is my best shot:
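The original Groovy gist is not shown; this is a hedged reconstruction consistent with the discussion that follows (flat2 builds a poor-man's mapcat on inject; flat3 - the name used below - uses an explicit accumulator with << and +):

```groovy
def flat2
flat2 = { coll ->
    coll.inject([]) { acc, e ->
        // recurse into nested collections, shovel leaf values onto acc
        e instanceof Collection ? acc + flat2(e) : acc << e
    }
}

def flat3
flat3 = { coll ->
    def x = []
    coll.each { e ->
        if (e instanceof Collection) x = x + flat3(e)  // must assign back
        else x << e
    }
    x
}

def ds = [1, 2, [3, 4, [5, "a"]]]
println flat2(ds)
println flat3(ds)
```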

It prints:

[1, 2, 3, 4, 5, a]
[1, 2, 3, 4, 5, a]

(Disclaimer: I'm still learning Groovy so if you see a better way, I'd be happy to hear it.)

inject is (unfortunately) the name for reduce in Groovy (and other languages, like Ruby). In Groovy, there is no mapcat (that I know of) so I wrote it using inject and +, the latter being the equivalent of Clojure's concat when applied to collections.

You'll notice that I am using mutable state here: the shovel operator, <<, pushes onto and modifies in place the data structure I'm building up, and the concat in flat3 has to be assigned back to x. I'd like to see a version that doesn't use mutable state, but perhaps even here it is justified? No less than Rich Hickey stated in his seminal "Are We There, Yet?" presentation that he's "not a fundamentalist, but rather a pragmatist": there are times, when working with local variables in a contained setting, that mutating state is a pragmatic and safe thing to do. Does that apply here?

Comments welcome.

Sunday, September 2, 2012

Before and after logic in the clojure.test framework

In Clojure's clojure.test framework, there are 3 facilities for setup and teardown, or "before" and "after" logic, to use RSpec terms. Overall, clojure.test is more similar to the xUnit style of tests, but the testing macro provides a bit of an RSpec feel.

/*---[ oneTimeSetUp and oneTimeTearDown ]---*/

In JUnit there used to be "oneTimeSetup" and "oneTimeTearDown" methods you could create. In JUnit 4 you now use the @BeforeClass and @AfterClass to mark methods to be used this way.

In clojure.test you define a :once fixture - a function that will be called once, before any tests are run, and passed a function whose invocation runs all the tests. From this fixture you can invoke any one-time initial setup and final teardown functionality you need. Then you register it with the framework using use-fixtures:
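A minimal sketch of a :once fixture (the println lines match the sample output further down):

```clojure
(defn one-time-fixture [f]
  (println "one time setup")
  (f)                          ; invokes the whole test suite
  (println "one time teardown"))

(use-fixtures :once one-time-fixture)
```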

/*---[ setUp and tearDown ]---*/

The logic for a setup and teardown method that will be called before and after each test is similar: you define an :each fixture and register it as a callback. The each fixture is also passed a function to invoke:
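A sketch of an :each fixture, which wraps every individual test:

```clojure
(defn each-fixture [f]
  ;; per-test setup goes here
  (f)                          ; invokes one test
  ;; per-test teardown goes here
  )

(use-fixtures :each each-fixture)
```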

Putting those together you have:

Here I've added a line to print out the type of the thing each fixture is passed, just so we can see that it is indeed a function. Here's the output from my run with three tests defined (not shown):

=> (clojure.test/run-tests)

Testing crypto-vault.core-test
one time setup
one time teardown

Ran 3 tests containing 10 assertions.
0 failures, 0 errors.
{:type :summary, :pass 10, :test 3, :error 0, :fail 0}

You can see that our fixture functions are being passed references to functions constructed inside clojure.test's test-all-vars. If you inspect the source code for that function, you'll see that you can also specify multiple :once and :each fixtures if you need to.

/*---[ The 'testing' macro ]---*/

The each and once fixtures basically provide what JUnit and most xUnit frameworks give you.

BDD "spec" frameworks like RSpec and Jasmine go a little further: they allow you to define test contexts with their own before and after callbacks, and you can nest contexts within contexts arbitrarily deep.

One value of that model is better documentation of each section of the tests. Another is finer control over before and after logic.

clojure.test does not provide nested before and after callbacks, but you can use the testing macro to define sub-contexts and then use let bindings to define any "setup" for that context.

If you really need a nested "teardown" within a testing subcontext, you can do it all in a try/finally structure:
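A sketch, where open-vault, get-entry and close-vault are hypothetical helpers standing in for your own subcontext setup and teardown:

```clojure
(deftest vault-behavior
  (testing "operations on a freshly opened vault"
    (let [vault (open-vault "/tmp/test.vault")]   ; subcontext "setup"
      (try
        (testing "an entry that was never stored"
          (is (nil? (get-entry vault :missing))))
        (finally
          (close-vault vault))))))                ; subcontext "teardown"
```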

testing contexts can be nested within each other if that makes sense for what you are doing.

Whether you put the teardown logic in a finally block or a teardown fixture is largely one of taste, but I prefer to keep the test focused on the test itself and leave the boilerplate to the fixtures.

Friday, August 17, 2012

Happiness is an emacs trifecta

Today was a good emacs day. Actually, any day I spend in emacs is good, but today was especially good, since I found three new little gems to stick in my init.el.

/* ---[ Show only one buffer on startup ]--- */

First, I wanted emacs to only open with a single buffer showing when I pass multiple files to it on the command line. A few minutes later, solution #1 found:
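The gist isn't shown here; one minimal way to do it (a sketch, not necessarily the original) is to collapse to a single window once startup finishes:

```elisp
;; after startup completes, keep only the window with focus
(add-hook 'emacs-startup-hook 'delete-other-windows t)
```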

[Update 20-Aug-2012]: gist updated with shorter version based on feedback from Steven Purcell.

/* ---[ Override a paredit keybinding ]--- */

I've come to love paredit when working in Clojure and emacs-lisp. Except for one problem: for years I've used Cntl-<left> and Cntl-<right> a lot to quickly navigate around. Those keystrokes are deeply ingrained in my motor memory. The paredit author's way of working differs from mine though, since he bound those to "barf" and "slurp" (mmmmm), respectively. When you then try to zip quickly around the screen using Cntl-<left> and Cntl-<right> with paredit-mode enabled, your cursor goes nowhere and instead you make a holy mess of your parens and brackets in the file.

So I wanted to rebind barf and slurp to Meta-<left> and Meta-<right> and unbind the existing mappings with the Control key. For a while I've just been editing the paredit.el file, but now that we have a package system in emacs24, that will get overridden when I upgrade. Thus, I need a clean way to do this that works for any future upgrades. So, a few minutes later, solution #2 found:
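The gist isn't shown; a sketch of such a rebinding with eval-after-load, using paredit's actual command names for barf and slurp:

```elisp
(eval-after-load 'paredit
  '(progn
     ;; free up C-<left>/C-<right> for navigation again
     (define-key paredit-mode-map (kbd "C-<left>") nil)
     (define-key paredit-mode-map (kbd "C-<right>") nil)
     ;; move barf and slurp to the Meta equivalents
     (define-key paredit-mode-map (kbd "M-<left>")  'paredit-forward-barf-sexp)
     (define-key paredit-mode-map (kbd "M-<right>") 'paredit-forward-slurp-sexp)))
```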

I hadn't seen the eval-after-load function before. It is a particularly satisfying way to override bindings in another package.

/* ---[ Toggle between multiple windows and a single window ]--- */

I've been teaching emacs to an intern at my job, and today, after showing him macros, how to toggle comment blocks and ace-jump-mode (which is awesome!), he asked: currently I've got 4 buffers showing, but what if I want to expand one window to fill the frame, work in it, and then restore emacs back to the 4 windows I had showing before?

My response: blink, blink.

Umm, that would be awesome, why didn't I think of that? So a few minutes later, thanks to Ignacio Paz Posse (about the coolest name I've ever seen), beautiful solution #3:

This is my favorite emacs thing now.

That is, until I find the next shiny bit of emacs-lisp to make my day happier than usual.

[Update 20-Aug-2012]: Steve Purcell in the comments pointed out that there is another way to restore your window configuration that has been built into emacs since version 20: Winner Mode.

So if you have trouble with the toggle-windows-split or would rather use something built-in, try this instead. In your .emacs or init.el add:

(winner-mode 1)

Then when you have multiple buffers showing and you want to maximize one of them enter C-x 1 as usual to close all other windows except the one with cursor focus. To restore your previous window configuration enter C-c <left>. Sweetness.

Saturday, August 11, 2012

Readability: Further improving your online reading experience

A while back I wrote a blog entry about how to overcome the lack of contrast and poor readability of many websites, even those that exist to provide content to be read online.

In that post I talked about some browser plugins (for Chrome and Firefox) where you can override those settings. In general these have improved my online reading experience, but they don't always work. There is also the problem of formatting. Some content sites clutter their web pages with too many advertisements, inserts and whirligigs so as to make reading downright unpleasant, even if they have good contrast for reading.

Case in point:

Or even worse from the same website:

This last one is so bad that I broke out my ruler: on my screen in Firefox, the text occupies only 29.167% of the screen width.

Maybe some people like that. It is reminiscent of an extremely busy old tyme newspaper layout.

If, like me, you don't like that style, then I have a suggestion: try Readability. You sign up for a free account and install it as a browser plugin. It adds some buttons to your toolbar - I usually hate anything that does that, but in this case I've made an exception.

When you get to a page where you want to read the content, just click the "Read Now" button. It will extract just the content, including images that are part of the content, format it into a nice readable layout, copy it to their servers, generate a URL and redirect you to that URL. You can also choose between 5 themes. I picked the "inverse" theme which gives a dark background with light text. Here is that second article above using the "inverse" theme:

That contrast is still a little too faint for my taste - they should get some advice from Contrast Rebellion. So I go one further and use No Squint on Firefox or Change Colors on Chrome to make an even darker background and lighter text:

One shortcoming: Readability can't extract multipage articles, but if the website offers a "print all" button you can use that and then press "Read Now".

Readability also offers a "Read Later" option. Clicking that puts the article on your reading list. You can also use Readability on Kindles, tablets and smart phones.

I'm sure there are many other services that do something similar. I haven't researched them since I've found Readability + No Squint (Firefox) or Change Colors (Chrome) works for me, but if you have one you like even better, let me know.

Saturday, July 7, 2012

JavaScript Method Overloading

I'm focusing on leveling-up my JavaScript skills this month. To start, I grabbed the "early" edition of an in-progress book by John Resig and Bear Bibeault: Secrets of the JavaScript Ninja. (I put "early" in quotes, since I now have v.10 of the MEAP and it has been in progress since 2008, but is still actively being worked on.)

Since it is still in progress, it is a bit rough around the edges, but so far (I'm on chapter 5 now) I've found it to be a solid presentation of the nuances of JavaScript. If you are an intermediate level JS developer and want to take solid steps to becoming more expert, I can recommend this as one of the books you should read. It focuses on JavaScript itself, not some of the myriad JavaScript frameworks and libraries that have come out in the JavaScript Cambrian explosion of the last few years - though it will analyze techniques used in Prototype and jQuery.

/* ---[ JavaScript method overloading ]--- */

Today I have a short note on JavaScript method overloading. Technically, there is no such thing as overloading in JavaScript - you can have only one function for a given function name. JavaScript, being flexible, allows you to pass in too few or too many arguments to a function. There are ways to handle that within your function with a series of if/else blocks, but you might want something more formal and less noisy.

The Secrets book suggests doing it this way:

function addMethod(object, name, fn) {
  var old = object[name];
  object[name] = function() {
    if (fn.length == arguments.length)
      return fn.apply(this, arguments);
    else if (typeof old == 'function')
      return old.apply(this, arguments);
    else throw "Wrong number of args";  // I added this part
  };
}

This method layers each version on top of the other, like a semi-recursive set of closures. The last one is the fallback for the previous one.

Let's try it out:

var ninja = {};

addMethod(ninja, 'attack', function() {
  console.log("Attack no args");
});
addMethod(ninja, 'attack', function(x) {
  console.log("Attack 1 person: " + x);
});
addMethod(ninja, 'attack', function(x, y) {
  console.log("Attack 2 people: " + x + ", " + y);
});

ninja.attack();
ninja.attack("Groucho");
ninja.attack("Groucho", "Harpo");
ninja.attack("Groucho");
ninja.attack();
try {
  ninja.attack("Groucho", "Harpo", "Chico");
} catch (e) {
  console.log("ERROR: attack: " + e);
}

If I run this on the command line with node.js, I get:

Attack no args
Attack 1 person: Groucho
Attack 2 people: Groucho, Harpo
Attack 1 person: Groucho
Attack no args
ERROR: attack: Wrong number of args

Pretty cool technique. It also doesn't matter the order in which you define the overloads. Try it out by changing the order of the addMethod calls.

/* ---[ Alternative version: my contribution ]--- */

To polish up my JS ninja skills, I came up with another way to do this:

function overload(object, name, fn) {
  object[name] = object[name] || function() {
    if (object[name + "." + arguments.length]) {
      return object[name + "." + arguments.length].apply(this, arguments);
    }
    else throw "Wrong number of args";
  };

  object[name + "." + fn.length] = fn;
}

Here I make the primary function name ("defend") a function that invokes "unpublished" functions that the overload function created.

So if you call ninja.defend("a", "b"), it gets routed to ninja["defend.2"]("a", "b"). I have to use bracket notation, since ninja.defend.2("a", "b") would try to call a method named "2" on the defend function, which of course doesn't exist (and isn't even valid syntax).

To prove it works the same, let's run this with node.js:

var ninja = {};

overload(ninja, 'defend', function(x, y) {
  console.log("Defense against 2 attackers: " + x + ", " + y);
});
overload(ninja, 'defend', function(x) {
  console.log("Defense against 1 attacker:  " + x);
});
overload(ninja, 'defend', function() {
  console.log("General Defense");
});

ninja.defend();
ninja.defend("Moe");
ninja.defend("Moe", "Larry");
ninja.defend("Moe");
ninja.defend();
try {
  ninja.defend("Moe", "Larry", "Curly");
} catch (e) {
  console.log("ERROR: defend: " + e);
}


General Defense
Defense against 1 attacker:  Moe
Defense against 2 attackers: Moe, Larry
Defense against 1 attacker:  Moe
General Defense
ERROR: defend: Wrong number of args

Lastly, let's call one of those unpublished methods directly to prove that is what it's doing under the hood:

var ninja = {};

overload(ninja, 'defend', function(x) {
  console.log("Defense against 1 attacker:  " + x);
});

ninja["defend.1"]("Moe");

Gives the output:


Defense against 1 attacker:  Moe

Saturday, June 30, 2012

Datomic: An Initial Comparative Analysis

/*---[ Day of Datomic: Initial Thoughts ]---*/

Yesterday I attended a "Day of Datomic" training course, taught by Stuart Halloway. It was well taught and had quite a bit of lively Q&A and discussion. If you have the opportunity to take the Day-of-Datomic course I highly recommend it.

At the bar at the end of class, one of my fellow students and I were discussing Rich Hickey, and he said "Rich has clarity". Rich has thought long, deeply and critically enough to see through to the essence of key issues in modern computer science, bypassing things so ingrained that they amount to familiarity blindness for the rest of us. Datomic is the latest brain-child of Rich Hickey, co-created by him, Stu and others at ThinkRelevance, here in Durham, where I live.

To be clear, I have not yet attained "clarity" on Datomic. The training course was an excellent deep dive that lays the groundwork for you to climb the hill of understanding, but that hill still lies ahead of me. As Fogus said, this is something very new, at least to most of us who are used to thinking in terms of relational algebra, SQL, and structured tables. Even for those of us now comfortable with document databases (Mongo and Couch) and key-value stores, this is still different in many important ways.

I won't have time to dive deeper into Datomic for a while (too many other things on my to-do/to-research list), but I wanted to capture some key notes and thoughts that I can come back to later and may help others when starting out with Datomic.

I am not going to try to give an overview of Datomic. There are lots of good resources on the web now, starting with the Datomic web site.

I reserve the right to retract any thoughts I lay out here, as I have not yet done my hammock time on Datomic.

/*---[ Compare it to what you know ]---*/

When you get something new, you want to start by comparing it to what you know. I think that's useful for finding the boundaries of what it's not and tuning the radio dial to find the right way to think about it. There is enough new here that comparisons alone are not sufficient to get a handle on Datomic, but it's one helpful place to start.

Here is a partial list of things that could be useful to compare Datomic to:

  1. Git
  2. Document dbs (MongoDB and CouchDB)
  3. Graph dbs (Neo4j)
  4. RDF Triple Stores
  5. Event Sourcing
  6. VoltDB (and other "NewSQL" databases)

/*---[ Datomic v. Git ]---*/

I've seen the "Datomic is like Git" analogy come up a couple of times now: first in the Relevance podcast with Stu, once while describing Datomic to a work colleague, and again yesterday in class. It seems to resonate with many, so use it as a starting point if it helps you, but the analogy falls short in a number of areas.

In fact, the analogy is useful really only in one way - for the most part we aren't used to thinking in terms of immutability. Stu emphasized this in the course - immutability changes everything. Rich Hickey, in his Clojure West talk on Datomic, said "you cannot correctly represent change without immutability ... and that needs to be recognized in our architectures".

Git has given us this idea that I have immutable trees (DAGs actually) of data in my history, wrapped in a SHA1 identifier that guarantees that tree has not changed since you committed it. These trees of history could have been created through sequential commits or a variety of branches and merges. Regardless, I can always go back to that point in time.

That idea does resonate with Datomic's design, but after that the analogy falls apart. Datomic doesn't have multiple branches of history - there is only one sequential trail of history. Datomic is not meant to be a source code repository. Another difference is that Datomic is intended to deal with much larger volumes of data than git. Git does not separate storage from query or transactional commits, as Datomic does, because it doesn't need to: you should never have a source code repo so big that you need to put it on a distributed storage service like Amazon S3 or DynamoDB.

/*---[ Datomic v. Document data stores ]---*/

"Schema" in Datomic is the thing I struggled with most in the class. I think it will be important to carefully and systematically spell out what is (and isn't) a schema in Datomic.

In my view, the Datomic schema model is non-obvious to a newcomer. On the one hand I got the impression from the earlier talks I'd seen on Datomic that Datomic is just a "database of facts". These facts, called "datoms", do have a structure: Entity, Attribute, Value and Time/Transaction (E/A/V/T for short). But "Datomic is a database of facts" sounds superficially like a set of records that have no other structure imposed on them.

On the other hand, my naive assumption coming into the training was that Datomic is actually quite close to the Mongo/Couch model. When written out, datomic records look like constrained JSON - just vectors of E/A/V/T entries and some of those can be references to other JSON-looking E/A/V/T facts, which seems like a way to link documents in a document datastore.

It turns out that neither of those presuppositions is right. Datoms are not JSON documents. Stu made the comment that of all the databases available out there, the Mongo/document-store model is probably the most different from the Datomic model (see Figure 1 below). That was actually quite surprising to me.

I have to go off and think about the differences of these models some more. It will take a bit of tuning of the radio dial back and forth to understand what the Datomic schema is.

/*---[ Datomic v. Graph databases (Neo4j) ]---*/

One area not covered much in the day-of-datomic training was a comparison of Datomic to graph databases (Neo4j in particular).

A significant part of the object-relational impedance mismatch is the translation of relational rectangles to object graphs. When people come to a graph database they get really excited about how this impedance falls away. Walking hierarchies and graphs seems beautifully natural and simple in a graph database, compared to the mind-bending experience of doing that with SQL (or extensions to SQL).

So can Datomic function as a graph database? I don't know. Maybe. This type of db was not on Stu's slides (see figure 1 below). When setting up attributes on an entity, you have to specify the type of an attribute. One type of attribute is a ref -- a reference to another entity. By setting up those refs in the schema of attributes, is that basically the same as setting up a Relationship between Nodes in a graph db? Or does Neo4j do things that Datomic is either not suited for or not as performant as?

[Update: 02-July-2012]: Stu Halloway sent me an email that gave some feedback on this question. First, he pointed me to an article by Davy Suvee in which he gives a nice set of basic examples of how Datomic keeps a graph of relations that can be queried using Datalog.

Stu pointed out that one difference between Datomic and a graph database like Neo4j is that the latter "reifies edges". In graph theory you have two basic concepts: the Node (Vertex) and the Relationship (Edge) between two Nodes. Both are first class (reified) entities of the graph model - they can have identities and attributes of their own.

Datomic does not make the Relationship a thing unto itself - it is just the "ref" between two entities (nodes). In Stu's words: "We believe many of the common business uses for reified edges are better handled by reified transactions". More food for thought! :-)

That being said, you can wrap Datomic in a full-blown graph API, and that is just what Davy Suvee has done. He implemented Datomic as a backend to the Blueprints graph API and data model, which is pretty good evidence that Datomic can function as a graph database for at least an important subset of use cases.

/*---[ Datomic v. RDF Triple Stores ]---*/

Interestingly, the part of the NoSQL world that seems to get the least attention (at least in the circles I mingle in, including sifting through Hacker News with some regularity) is the database whose data model comes closest to Datomic's - RDF Triple Stores.

Rich, in his Clojure West talk, says that RDF triple stores almost got it right. Triple stores use subject-predicate-object to represent facts. This maps to Datomic's entity-attribute-value concept. The piece missing is time, which Datomic models as, or wraps into, transactions. Transactions in Datomic are first-class creatures: they encapsulate the timestamp and the order of history, and can have additional attributes added to them (such as the user id associated with the transaction).
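The transaction-as-entity idea can be made concrete with a small Python sketch. Everything here - the attribute names like "db/txInstant" and "audit/user", and the shape of the functions - is illustrative, not Datomic's actual API:

```python
# Sketch: the transaction is itself an entity, so it can carry attributes of
# its own. Attribute names and function shapes here are invented.
facts = []  # each fact is (entity, attribute, value, transaction)

def transact(tx_id, user, instant, datoms):
    # first, assert facts about the transaction entity itself ...
    facts.append((tx_id, "db/txInstant", instant, tx_id))
    facts.append((tx_id, "audit/user", user, tx_id))
    # ... then the ordinary datoms, each stamped with the transaction
    for e, a, v in datoms:
        facts.append((e, a, v, tx_id))

transact("tx-1", "alice", "2012-07-01T12:00:00Z",
         [(100, "person/name", "Bob")])

# Which user asserted Bob's name? Walk from the datom to its transaction.
tx = next(t for e, a, v, t in facts if (a, v) == ("person/name", "Bob"))
who = next(v for e, a, v, t in facts if (e, a) == (tx, "audit/user"))
```

Because the transaction is just another entity in the same fact space, auditing queries like the one above need no special machinery.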

Frankly, I don't know a lot about RDF triple stores, so I'm just going to include this slide from Stu's training session for your consideration:

Fig. 1: Comparison of Datomic to other databases using Agility as a metric. (From Stuart Halloway's training slides)

/*---[ Datomic v. Event Sourcing ]---*/

Event sourcing is an intriguing and compelling idea that changes to state should be recorded as a sequence of events. Those events are immutable and often written to an append-only log of "facts". When you do event sourcing with a relational database, you have to decide how to log your facts and then what to represent in the relational schema. Will you keep only "current state"? Will you keep all of history? Will you set up a "write-only" or "write-predominant" database where up-to-date state is kept and then read-only databases that somehow get populated from this main datastore?

There are lots of ways to set it up and thus lots of decisions to make in your architecture. Read Fowler's write-up on it if you are new to the idea. The newness of this idea is something of a risk when trying to figure out the best way to implement it.

As I understand it so far, Datomic is an implementation of an event sourcing model that also answers for you many of these ancillary questions about how to implement it, such as: where and in what format will you write the transaction log? How will you structure your query tables? Will you replicate those tables? Will you project those tables into multiple formats for different types of queries and analytics? Will transactions be first-class entities?

Datomic's answers to those questions include:

  1. There is a component of the system called a transactor, and it writes immutable transactions in a variant of "append-only", since it writes transactional history to trees. As I'll talk about in the next section, all transactions are handled in serialized (single-threaded) mode.
  2. You structure your data into E/A/V/T facts using the Datomic schema model constraints. Queries are done on the Peers (app tier) and those Peers have read-only datasets.
  3. You don't need to make multiple projections of your data. Datomic already gives you four "projections" (indexes) out of the box - one index by transaction/time and three other indexes:
    1. An EAVT index, meaning an index ordered first by entity (id), then by attribute, then by value, and finally by transaction
      • This is roughly equivalent to row-level index access in a relational system
    2. An AEVT index order, which is the analog of a column store in an analytics database
    3. A VAET index order, for querying by known values first
    4. Lastly, you can also specify that Datomic keep an AVET index, which is useful for range queries on attribute values. It is not turned on by default, since it significantly slows down transactional write throughput
  4. Transactions are first-class and have their own attributes. One can query the database based on transactions/time and you can go back to a point in history and see what the state of the database was at that time.
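Conceptually, each of these indexes is just the same set of datoms maintained in a different sort order, which a toy Python sketch (with made-up datoms) can make concrete:

```python
# The same set of datoms maintained in different sort orders; the data and
# attribute names here are invented for illustration.
datoms = [
    (200, "person/name", "Bob",   5),
    (100, "person/name", "Alice", 1),
    (100, "person/age",  30,      2),
]

eavt = sorted(datoms)                                    # row-like access
aevt = sorted(datoms, key=lambda d: (d[1], d[0], d[3]))  # column-like access
avet = sorted(datoms, key=lambda d: (d[1], str(d[2])))   # value-range scans

# EAVT clusters every attribute of entity 100 together ...
print([d[0] for d in eavt])   # [100, 100, 200]
# ... while AEVT clusters every datom for a given attribute together
print([d[1] for d in aevt])   # ['person/age', 'person/name', 'person/name']
```

The point of the sketch is that no single ordering serves all access patterns, which is why Datomic maintains several covering orders of the same facts.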

I think one of the most exciting things about Datomic is that it gives you an event-sourced system with many of the design decisions already made for you.
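To make the event-sourcing connection concrete, here is a toy Python sketch - not Datomic's API, just the underlying idea - of an append-only log of facts, where a retraction is itself an appended fact, and the database value as of any transaction can be recovered by replay:

```python
# Toy event-sourced store: an append-only log of facts, where a retraction
# is a fact with added=False, replayable to any point in time.
log = []  # (entity, attribute, value, tx, added), appended in tx order

def commit(tx, facts):
    for e, a, v, added in facts:
        log.append((e, a, v, tx, added))

def as_of(tx):
    """The (entity, attribute) -> value view of the db as of transaction tx."""
    state = {}
    for e, a, v, t, added in log:
        if t > tx:
            break
        if added:
            state[(e, a)] = v
        else:
            state.pop((e, a), None)
    return state

commit(1, [(100, "person/name", "Alice", True)])
commit(2, [(100, "person/email", "a@old.example", True)])
commit(3, [(100, "person/email", "a@old.example", False),
           (100, "person/email", "a@new.example", True)])
```

Nothing is ever overwritten: the "update" in transaction 3 is a retraction plus a new assertion, and `as_of(2)` still sees the old email.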

/*---[ Datomic v. VoltDB ]---*/

I did a short write-up on VoltDB and the "new SQL" space a few months ago.

Like VoltDB, Datomic posits that we do not have to give up ACID transactions in order to improve scalability over traditional relational models. And VoltDB and Datomic share one key core premise:

all transactional writes should be serialized by using a single writer.

In VoltDB, there is one thread in one process on one server that handles all transactions. In Datomic, there is one transactor on one machine (either in its own thread or process, I'm not sure which) that handles all transactions. Unlike in most NoSQL systems, such as MongoDB, a transaction can involve any number of inserts/updates/deletes (VoltDB) or additions/retractions (Datomic).
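The single-writer idea itself is easy to sketch: funnel every transaction through one queue drained by one thread, so the authoritative state needs no locks. A minimal Python illustration (of the concept only, not either product's implementation):

```python
import queue
import threading

# Minimal sketch of the single-writer model: every transaction goes through
# one queue drained by one thread, so the state itself needs no locks.
tx_queue = queue.Queue()
db = {}        # authoritative state, touched only by the transactor thread
applied = []   # transaction ids, in the serial order they were applied

def transactor():
    while True:
        tx = tx_queue.get()
        if tx is None:          # shutdown sentinel
            break
        tx_id, writes = tx
        db.update(writes)       # a multi-key write, applied atomically in order
        applied.append(tx_id)

t = threading.Thread(target=transactor)
t.start()
tx_queue.put((1, {"a": 1, "b": 2}))
tx_queue.put((2, {"a": 10}))    # the later transaction wins on key "a"
tx_queue.put(None)
t.join()
```

Serializing through one consumer gives a total order over transactions for free - which is exactly the property both VoltDB and Datomic exploit.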

Another similarity is that both VoltDB and Datomic allow you to add embedded functions or stored procedures to the transactor that will run inside the transaction. In both cases, you add or compile the embedded transaction functions to the transactor itself. Those functions do not run in the app tier.

Michael Stonebraker, a key designer of VoltDB, based this design on the observation that traditional relational databases spend on the order of 80% of their cycles doing non-productive work: managing locks and latches, commit logs and buffer caches. By moving to a single-threaded model, the overhead of write contention goes away. VoltDB claims its transactional write throughput is on the order of 40x to 50x faster than traditional relational databases.

Where Datomic and VoltDB part ways is that VoltDB retains SQL and a fully relational model, using an in-memory database that is shared across many nodes in a cluster.

Another key difference is that VoltDB really focuses on the OLTP side, not on analytics. Datomic can target both areas.

/*---[ Datomic v. Stonebraker's "10 Rules" ]---*/

In the summer of 2011, Michael Stonebraker and Rick Cattell published an ACM article called "10 Rules for Scalable Performance in ‘Simple Operation’ Datastores". By "Simple Operation Datastore" they mean databases that are not intended to be data warehouses, particularly column store reporting databases. So traditional relational databases and most NoSQL databases, including Datomic, are "Simple Operation" by this definition.

How does Datomic fare against their 10 rules? As I said, I'm still too new to Datomic to give a definitive answer, but here's my first pass at assessing Datomic against them. I'd invite others to chime in with their own analysis of this. These axes will provide another way for us to analyze Datomic.

Rule 1: Look for Shared Nothing Scalability

A shared-nothing architecture is a distributed or clustered environment in which each node shares neither main memory nor disk. Instead, each node has its own private memory, disks and IO devices independent of other nodes, allowing it to be self-contained and possibly self-sufficient, depending on the usage conditions. Typically in a shared nothing environment the data is sharded or distributed across nodes in some fashion.

An alternative architecture is shared disk - a disk that is accessible from all cluster nodes. In a shared disk environment, write contention must be managed by the disk or database server.

So is Datomic a shared nothing architecture? One could argue that the Peers are not entirely self-sufficient, as they rely on the storage service to pull data. And the transactor is not self-sufficient as it has no storage of its own - it depends on the storage service.

However, I think Datomic is shared nothing in that there will never be write contention among nodes - there is only one (active) transactor.

I also don't quite know how the storage service is architected with Datomic. If it appears as a monolithic service to the Peers on the app tier, then if it goes down, none of the Peers can function unless they have cached all the data they need.

Rule 2: High-level languages are good and need not hurt performance.

This is a direct shot at the fact that many NoSQL databases have had to reinvent APIs for querying data, and many have given us low-level programming APIs, not DSLs or high-level languages. (To be fair, some have - Cassandra has CQL and Neo4j has the excellent Cypher language.)

Datomic fares very well on this one - it gives us Datalog, which, as Stu pointed out in the training, has a sound mathematical basis like the relational algebra and is in fact transformable into relational algebra.
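To give a flavor of what makes Datalog high-level, here is a toy pattern matcher over (entity, attribute, value) facts in Python. Strings starting with "?" are query variables that must unify across clauses; the data and code are invented sketches of the idea, not Datomic's engine:

```python
# Toy Datalog-style querying over (entity, attribute, value) facts.
facts = [
    (1, "person/name", "Alice"),
    (2, "person/name", "Bob"),
    (1, "person/likes", 2),
]

def match(pattern, fact, env):
    env = dict(env)
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            if p in env and env[p] != f:
                return None     # variable already bound to something else
            env[p] = f
        elif p != f:
            return None         # literal mismatch
    return env

def query(where, find):
    envs = [{}]
    for pattern in where:       # each clause narrows or extends the bindings
        envs = [e2 for env in envs for fact in facts
                if (e2 := match(pattern, fact, env)) is not None]
    return [tuple(env[v] for v in find) for env in envs]

# "What are the names of the people Alice likes?" - a join across
# three clauses, expressed declaratively rather than as nested loops.
result = query([("?e", "person/name", "Alice"),
                ("?e", "person/likes", "?x"),
                ("?x", "person/name", "?n")],
               ["?n"])
```

The query says *what* to find, and the shared variables express the joins - the same declarative property the relational algebra gives SQL.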

Rule 3. Plan to carefully leverage main memory databases.

Datomic allows you to pull arbitrary amounts of data from the datastore into memory. The amount is limited by (at least) two things:

  1. how much memory you can put on the app/Peer machine

  2. how much memory you can put into a JVM without hitting a size where GC pauses interfere with the application

On the GC front, you have a couple of options to ramp up. First, Datomic has a two-tier cache on the Peer - one in main memory (on-heap) and one that is "compressed" or treated as a byte array, which could be on or off-heap, but in either case puts less pressure on the garbage collector.
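That two-tier design can be sketched as a hot tier of live objects plus a cold tier of compressed bytes that is cheap for the garbage collector. Everything here (the keys, values, and the zlib/pickle choices) is illustrative, not Datomic's cache implementation:

```python
import pickle
import zlib

# Illustrative two-tier peer cache: a "hot" tier of live objects and a
# "cold" tier of compressed bytes that puts little pressure on the GC.
hot, cold = {}, {}

def put(key, value):
    hot[key] = value
    cold[key] = zlib.compress(pickle.dumps(value))

def get(key):
    if key in hot:
        return hot[key]
    if key in cold:             # hot-tier miss: decompress and promote
        value = pickle.loads(zlib.decompress(cold[key]))
        hot[key] = value
        return value
    return None                 # would fall through to the storage service

put("segment-1", ["datom-a", "datom-b"])
hot.clear()                     # simulate eviction of the hot tier
```

After the hot tier is evicted, a `get` still succeeds from the compressed tier without a trip to storage - the cold tier trades a little CPU for a lot of GC-friendly capacity.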

Second, if you want really big JVMs with on-heap memory and no GC pauses, you can leverage Azul's JVM, though it will cost you extra.

Datomic scores reasonably well here, but it is not architected to be a sharded in-memory database. However, with a very fast network and SSDs on AWS, Datomic's performance may not suffer significantly. We await some good case studies on this.

Rule 4. High availability and automatic recovery are essential for SO scalability.

From the Stonebraker paper:

Any DBMS acquired for SO applications should have built-in high availability, supporting nonstop operation.

With an AWS deployment, you inherit any weaknesses of the AWS model (e.g., a few prominent outages recently), but overall it's a highly redundant and highly available storage service. With Datomic you can have any number of Peers on the app tier, so replicate those as needed.

You are not required to use AWS. Currently Datomic can also be deployed on relational systems, and they hope to be able to use Riak in the future.

The single point of failure would appear to be the single transactor, but Datomic will provide a failover mode to an online backup transactor. We'll need to learn whether this failover will be provided as an automatic service or will require configuration/set-up by the user/administrator.

Rule 5. Online everything.

This overlaps rule 4 in terms of always being up, but adds to it the idea that schema changes, index changes, reprovisioning and software upgrades should all be possible without interruption of service.

I don't have hard data here from the Datomic guys, but my guess is that Datomic will pass this test with flying colors. Most indexes are already turned on by default and I suspect turning on the other one (see above) can be done without database downtime.

Schema changes ... hmm. Stu said in the training course that attributes cannot (currently) be removed from the schema, but you can add new ones. Would this require rebuilding the indexes? I doubt it, but I'd need more information on this one.

For reprovisioning and software upgrades, it is easy to envision how to add more storage and Peers, since you have redundancy built in from the start. For the transactor, reprovisioning would not mean spinning up more nodes to process transactions; it would mean getting a bigger/faster box. I suspect you could provision the new machine offline while your existing transactor stays in service, then swap to the improved system in real time using the failover procedure.

Rule 6: Avoid multi-node operations.

From the Stonebraker paper:

For achieving SO scalability over a cluster of servers .... Applications rarely perform operations spanning more than one server or shard.

Datomic does not shard its data sets. The only operation that spans more than one server occurs when a Peer needs data that has not yet been cached on the app tier; it then makes a network lookup to pull in those data. This is certainly slower than a complete main-memory database, but Stu said that AWS lookups are very fast - on the order of 1ms.

Rule 7: Don’t try to build ACID consistency yourself.

Unlike most NoSQL datastores, Datomic has not abandoned ACID transactions. It handles transactions of arbitrary complexity in a beautiful, fast serialized model. A+ on this one.

Rule 8: Look for administrative simplicity.

From the Stonebraker paper:

One of our favorite complaints about relational DBMSs is their poor out-of-the-box behavior. Most products include many tuning knobs that allow adjustment of DBMS behavior; moreover, our experience is that a DBA skilled in a particular vendor’s product, can make it go a factor of two or more faster than one unskilled in the given product.

As such, it is a daunting task to bring in a new DBMS, especially one distributed over many nodes; it requires installation, schema construction, application design, data distribution, tuning, and monitoring. Even getting a high-performance version of TPC-C running on a new engine takes weeks, though code and schema are readily available.

I don't have a lot of data one way or the other on this for Datomic. I doubt there will be many knobs to turn, however. And if you are deploying to AWS, there is a Datomic AMI, so deployment and setup of the storage side should be quite simple.

Rule 9: Pay attention to node performance.

Not every performance problem can or should be solved with linear scalability. Make sure your database can have excellent performance on individual nodes.

This one is a no-brainer: A+ for Datomic. You can run all the pieces on the same box, run each on a different box, or mix and match as desired, putting faster hardware where each piece needs it. You can also put your high-intensity analytics queries on one box and your run-of-the-mill queries on another, and they will not interfere. Performance of the system will be entirely dependent on individual node performance (and data lookups back into your data store).

Rule 10. Open source gives you more control over your future.

Datomic is a commercial product and is not currently open source or redistributable, as I understand it.

From the Stonebraker paper:

The landscape is littered with situations where a company acquired a vendor’s product, only to face expensive upgrades in the following years, large maintenance bills for often-inferior technical support, and the inability to avoid these fees because the cost of switching to a different product would require extensive recoding. The best way to avoid “vendor malpractice” is to use an open source product.

VoltDB, as an example, is open source and seeks commercial viability by using the support model that has worked well for Red Hat. Datomic is still young (less than 6 months old), so we will see how its pricing and support model evolve.

/*---[ Coda ]---*/

Remember that Datomic is more than the union of all these comparisons, even assuming I got all my facts right. Datomic not only combines the best of many existing solutions, but brings new stuff to the table. Like anything of significant intellectual merit, it will take an investment of your time to learn it and understand it, and for us, as a community of users, to figure out best practices with it.

Datomic is interesting and fascinating from an intellectual and architectural point of view, but it is also eminently practical and ready to use. I look forward to following its progress over the coming months and years.

Lastly, if it is of any use to anyone, here are my raw notes from the day-of-datomic training.