Wednesday, December 26, 2012

My CS/Programming Top 10 for 2012

As many will do and I did last year, I looked through my notes, projects, tweets, blog entries and personal wiki to assemble the highlights of my year on all topics computer science and software engineering.

These are the top 10 "things" that added the most to my knowledge, impressed me as excellent tools, and added to the joy of being a software developer. Here's my list in no particular order:

  1. Security Now podcast
  2. Chrome Dev Tools
  3. Emacs 24
  4. Groovy
  5. Clojure
  6. z (rupa/z)
  7. Xubuntu
  8. Coursera
  9. Fiddler2 and Wireshark
  10. TrueCrypt and other encryption tools
  11. One that should be on the list: Datomic

/* ---[ 1. Security Now podcast ]--- */

In March I did a short blog entry on the podcasts I was listening to or had heard of and wanted to try out. Good technical podcasts are a gold mine of information that you can use to fill the interstices of your day - while commuting, cleaning the kitchen or taking a walk.

The podcast that had the biggest impact on me in 2012 is Security Now, done by Steve Gibson and Leo Laporte, one of the flagship podcasts of the TWiT network.

To date, they have done 384 episodes, starting in August 2005 and the vast majority are still relevant and worth listening to. You can download them from Steve's GRC Security Now website.

The focus, of course, is on computer security, mostly for the individual user, not at the corporate level. The podcast also covers networking theory and practice in great detail, since the network is mainly how malware spreads and is a vast attack surface for it. For example, this year Steve did a deep dive into SPDY, the networking protocol developed by Google to speed up the web by reducing page load time by overcoming the TCP slow start problem.

And there's plenty of focus on cryptography and security. A highlight of the year was Steve's episode on Elliptic Curve Crypto, a crypto technology that will likely be used more heavily in days to come.

In addition, you learn a lot about how hard drives work, since Steve wrote SpinRite, a disk maintenance and recovery utility, which I use for maintenance on my spinning disks.

Also, starting with episode #233 ("Let's Design a Computer, part 1"), Steve does an 8+ episode series on the basics of how computers work, including what machine language is, how assembly language works, the role of stacks and registers, hardware interrupts and RISC vs. CISC architectures. You can learn (or be refreshed on) all of this in surprising levels of detail for an audio-only medium. Steve is very good at explaining this stuff.

This year, while keeping up with the weekly new broadcasts, I went back and started at episode 1. At this point I've listened to about half of the episodes, so this will continue to be my mainstay into 2013.

/* ---[ 2. Chrome Dev Tools ]--- */

This year I got back into JavaScript programming. I remember the horrible days of debugging by alert statements, which contributed to the general consensus that JavaScript was a toy language and a piece of ill-thought out crap. Despite its warts, a result of its ridiculously over-short allowed design period, Brendan Eich created a rather fascinating and powerful programming language. Even though we like to complain about its issues, I agree with Crockford that given the conditions under which it was developed, we got better than we deserved and Mr. Eich is to thank for that.

So I was pleased to discover the awesome Chrome Dev Tools for browser based JavaScript development. JavaScript debugging can actually be pleasurable. Some resources to get you started if you aren't using it:

/* ---[ 3. Emacs 24 ]--- */

Emacs is alive and well. In fact it is thriving more than ever. I've been a long time user of emacs and I use it for everything except Java (which really needs a full IDE).

Emacs 24, released this year, is a great text editor. I use it on both Linux and Windows, the exact same set up on both.

Most notably Emacs has package repositories now. Three, in fact, that I know of:
* ELPA, which is maintained/sponsored by GNU and has only the core emacs packages that adhere to the copyleft licensing model of the Free Software Foundation.
* Marmalade
* MELPA, which is where most of the bleeding edge work goes.

I used ELPA and MELPA by default, but I sometimes switch over to Marmalade if it has something not on the others. Generally MELPA and Marmalade seem to have the same packages, though MELPA often has the most recent. To make things confusing, MELPA moved to a date-based versioning system, like "20121206.1504", rather than the more traditional major.minor versioning system, such as "0.19".

There is still a big learning curve to emacs and some things are still pretty esoteric (I still have trouble getting themes to work), but when people ask me why I use emacs I say: "if programming is your main career and hobby wouldn't you want to use the most powerful tool available? It's worth the few months of learning to enjoy the benefits for the rest of your life." But isn't emacs "old"??? (as if that's a bad thing) Seriously, when I use emacs I feel like I'm tapping into some of the collective wisdom of our programming culture from the last 30 years.

And no disrespect to vim. I like vim too. Pick one of those two and learn it. Stop using Notepad++ or worse.

A few emacs highlights from my year:

  • I love the nrepl package for Clojure. Now I can use those fancy keystrokes to auto-evaluate Clojure s-expressions. With the ac-nrepl package, it has code completion and will show you the argument signature for functions in the minibuffer! Some IDE-like goodness right there.

  • paredit. When I talk to people about Clojure (or Lisp in general), I sometimes get the story of how horrible it was balancing parentheses at 3 in the morning the day their CS class assignment was due. I am happy to announce to anyone that doesn't know: that problem is solved. Its name is paredit. Here is the slide deck I originally learned it from.

  • Learn to use emacs macros in two ways:

    • named macros you'll use a lot and save in your init.el (or macros.el if you want a separate file).
    • temporary unnamed macros to automate a one-time repetitive task that you need to do, say, 10 times in a file. This EmacsRocks video shows a great example of that.

/* ---[ 4. Groovy ]--- */

When I was first learning Ruby, many years ago, I remember experiencing Matz' principle of least surprise. Once you learned the basic gist of Ruby and its blocks and classes, you could often just guess how to do something or what a method would be called and it would work. It was a very satisfying experience.

This year I joined a new company and they have largely adopted Groovy as their primary scripting language. As I jumped in to learn it, I had that deja vu feeling of learning Ruby, this time wrappering the Java language we know and love.

For example, I started to write a Groovy script that would have to recursively traverse a directory structure, and I remembered the pain of doing this in Java with its FilenameFilters and other APIs you had to learn to get anything done. I said to myself "I hope Groovy has made this easier". Holy smokes, they wrapped File to have an eachFileRecurse method that takes a closure:

new File('.').eachFileRecurse {
  if (it.name =~ /.*\.txt/) println it
}

There is also an eachDirRecurse and variations where you can pass in a file type filter.

The more I learn about Groovy the more I like it. In fact, the "groovy-JDK" is one of my favorite things: The Groovy creators and contributors have wrapped a large number of the Java classes, using the Groovy metaclass concept, and given them additional useful methods. Such as:

  • String now has an eachLine method and versions of replaceAll and replaceFirst that take a closure, allowing arbitrarily complex logic to be executed to determine the replacement string.

  • Map now has an any method that takes a predicate closure to see if at least one entry passes the predicate test. It also now has map and reduce, though the authors unfortunately followed Ruby in calling them collect and inject respectively.

  • And thank the gods, they wrappered the horrible java.util.Date class and made it more useful.

It provides many functional programming constructs, such as closures (the lambdas of Groovy), immutable data structures, higher order functions and very importantly: regex, list and map literals, akin to JavaScript or Clojure literals (though the map literal syntax is different in Groovy).

With GStrings you get string interpolation and multi-line strings. And Groovy gives you simpler syntax for accessing getters and setters - you grab them like properties.

In short, when you are hacking out large swaths of boilerplate in Java, using tedious syntax to do stuff with Maps, Lists, Regular Expressions and a variety of other things, you constantly think to yourself, "man I wish I could be doing this in Groovy". Groovy makes programming a pleasure.

I'm still learning it and look forward to using it for years to come.

/* ---[ 5. Clojure ]--- */

And speaking of bringing the joy back to programming, Clojure is a combination of elegance, joy and ... wait a minute, how do I do this in Clojure? I ran across someone who described himself as a "perennial Clojure beginner". I can identify with that. Since I don't come from a Lisp or functional programming background, the last year learning Clojure has been like learning to ride a bicycle again. Except this bicycle is tricked out and has gears, knobs and restrictions that are different from the other bicycles.

I've started proselytizing co-workers about Clojure. I get the "why Clojure?" question a lot, so here is my version:

  • Combines the best of Lisp, such as macros, and Java/JVM, such as its world class garbage collector (which a language built on immutable data structures needs)
  • Brilliant design for immutable data structures that is now being adopted by other languages (Scala for one)
  • Functional programming model, but with practical bent (not Haskell, but more pure than Common Lisp)
  • STM: software transactional memory -- brilliant solution to shared mutable state
  • Designed for concurrency (in a couple of different ways)
  • A fast dynamic language: faster than ruby and python, comparable to Java in many areas and can drop into Java easily when performance is the most important thing
  • ClojureScript: bring the power of Clojure macro writing, namespaces and better syntax to doing your JavaScript work
  • Data centric (like lisps), but even better by being abstraction centric
  • Clean design for solving the “expression problem”
  • Separation of concerns – an overall philosophy to tease things apart into simple (non-complected) pieces:
    • Example: polymorphism is not tied to inheritance
  • Simple and elegant syntax. For example, I find Scala to be powerful but overwhelming and confusing in its approach to syntax and expression
  • Community:
    • Small focused libraries (separation of concerns, non-complected)
    • Datomic => one of the greatest examples of separation of concerns there is
    • Core.logic => modern logic programming easily integrated into your program
  • Finally, an argument ad hominem: Rich Hickey. You need to watch the series of presentations he’s made over the past 5 years (perhaps one every week as Bodil suggests). Unquestionably the most impactful thinker in CS I’ve ever encountered. Even if you end up not agreeing with all of his views, you will learn a lot and think about things in a different way, possibly changing the way you think about our craft.

Finally, as a coda to this paean to Clojure: The O'Reilly Clojure Programming book came out this year. Chas Emerick, Brian Carper and Christophe Grand have written a fantastic book. It is a book you will learn from and come back to for its insights, examples and reference material for many years. Definitely belongs on my top 10 for 2012 list.

/* ---[ 6. rupa/z ]--- */

The z shell script (not zsh) is one of my favorite discoveries of 2012. To give it more press, I gave it its own blog entry, "An irritation no longer: command line joy", dated December 24, 2012.

Here's the short summary: z is a 200-line shell script compatible with bash and z-shell that is a clever LRU-type cache of your directory visitations - the cache weighting is based on both frequency and recentness, which the author dubs "frecency". As you navigate around to different directories, it keeps track of where you've been, how often you've been there and how recently.

To navigate somewhere you've been, pass a part of the path to the z command and it will take you to the highest weighted directory in your cache.

/* ---[ 7. Xubuntu ]--- */

I'm a Linux guy. I was on the Ubuntu bandwagon for many years. I played with Linux Mint a little. I've got Fedora and CentOS running in VirtualBox VMs. But when Unity came out on Ubuntu, I struggled to get used to its desktop model. It does not fit how I work. I tried it for a month and was considering what to switch to when I saw a Slashdot article that Linus Torvalds was adopting XFCE to get away from the strangeness of many modern Linux desktop environments.

So that prompted me to try Xubuntu, based on XFCE and also Lubuntu, based on the LXDE desktop. Lubuntu was a little too minimal for me, but Xubuntu clicked for me right away. I don't like the Dash of Unity and I really really hate the fact that when I try to open a new console shell it brings the current one to the forefront. That is not what I want. I'll use Alt-Tab for that.

Xubuntu behaves as you expect. Click the terminal icon and it opens a new terminal. Xubuntu puts shortcut icons on the bottom, similar to Apple's desktop, but without the annoying enlargement animations. I don't do a lot of customization of my desktop. I just want one that has sane defaults and Xubuntu is that for me.

Ubuntu also stirred up criticism for its integration with Amazon affiliated advertisements, making the Dash a purchasing platform, in the process creating data leaks. Now you don't even have privacy when operating your desktop. The EFF write-up summarizes this nicely.

You can turn it off, but even among Linux users I suspect the "tyranny of the default" will mean that most users are leaking data and thus are at the mercy of Canonical, which people are starting to develop some mistrust for.

Well, Xubuntu doesn't have Dash. So you get the goodness of the Ubuntu ecosystem without the privacy violations. Its defaults are sane.

Try it out.

/* ---[ 8. Coursera and Online Education ]--- */

2012 is the year that online education skyrocketed. I've done a few CodeSchool courses and enjoyed those. But now there's Udacity and Coursera and Udemy and edX and probably 10 more I don't know about.

This year I took a Coursera course: Functional Programming Principles in Scala taught by Martin Odersky. It was a great experience. The format is excellent - each week there is about 2 to 3 hours of video lectures and a programming assignment that takes anywhere from 5 to 15 hours to complete. The examples were challenging enough to make the time investment worth it. And I got a nice certificate at the end for having a passing grade.

Uploading assignments was done via a command in Scala's sbt tool; it was easy and seamless. The assignments were graded automatically in about 10 minutes and gave good feedback, allowing you to fix problems and resubmit. The only part of the course I didn't enjoy was using the Scala Eclipse IDE, which is still quite painful compared to Java in Eclipse or Clojure in Emacs.

It's amazing what you can get online for free these days. I've signed up for two more courses and have my eye on a cryptography course there as well.

/* ---[ 9. Fiddler2 and Wireshark ]--- */

I spent a good deal of time this year maintaining and enhancing a large "legacy" web app that uses Ajax calls to communicate with the Java back-end. In many cases, the shortcut to figuring out what is going on is to watch the traffic between the browser and server. Fiddler2 is an invaluable tool for that.

I also tried Wireshark, but the output from Fiddler2 is about as intuitive and easy to follow as can be, since it focuses only on HTTP traffic.

Wireshark is more general purpose. I started learning it this year and want to get better at configuring and customizing it, so I can use it effectively (and efficiently) on Linux, since Fiddler2 is unfortunately a Windows-only product.

/* ---[ 10. TrueCrypt, GPG and other encryption tools ]--- */

If you aren't using encryption for your files, hard drives and passwords, make it your New Year's resolution to learn the tools. Ever since Phil Zimmermann bravely pioneered encryption for the everyman, the suite of tools available to do this has gotten better and better.

I use GPG to encrypt individual files, TrueCrypt to encrypt thumb drives and external drives and Ubuntu's full disk encryption for my laptops. If you have a laptop and thumb drives, they should be encrypted.

A nice file encryption tool on Windows is AxCrypt.

For passwords, I use LastPass, which I believe does it all correctly and securely in a "trust no one" fashion.

Consider using an encrypted "Trust No One" backup and file syncing service. Dropbox is not encrypted, nor is SkyDrive or Google Drive or many other popular services. Do not upload anything to those systems that you wouldn't mind having broadcast on the internet or at least read by employees of those companies.

Steve Gibson (of the Security Now podcast) did a multi-episode analysis of backup and file syncing services from an encryption and "trust no one" perspective. Start with episode #349. There are a number of good solutions. I use SpiderOak on Linux.

If you already know and do this stuff, have a CryptoParty in your area. If you live in my area (Raleigh/Durham, North Carolina, USA), join the DC919 group.

/* ---[ Datomic: Mine goes to 11 ]--- */

While I did attend a Datomic training course this year and wrote a fairly long blog post about it, I just haven't made the time to really study it yet. I fully intend to, as I think it is one of the most profound and important things to have come out in 2012. I've queued it up to be on my top 10 list in 2013.

Monday, December 24, 2012

An irritation no longer: command line joy

/* ---[ pushd and dirs ]--- */

I have long been a command line person. I hate having to use the mouse. One thing that is a little cumbersome about the command line is jumping around into various deeply nested directory structures. I've long been a user of pushd, dirs and popd on Unix/Linux/Cygwin consoles. But if you are alternating between 3 or more directories with some regularity, those commands require some care to use correctly.

/* ---[ Improvement #1: pd ]--- */

An improvement on that is a small bash function that I found on stackexchange:

function pd() {
  if [ "$1" ]; then
    pushd "${1/#[0-9]*/+$1}";
  fi > /dev/null
}

which simplifies using pushd. A basic session of use would be:

midpeter444:~/lang/clojure/concurrency-fun$ pd .
midpeter444:~/lang/clojure/concurrency-fun$ dirs
 0  ~/lang/clojure/concurrency-fun
 1  ~/lang/clojure/concurrency-fun
midpeter444:~/lang/clojure/concurrency-fun$ cd ~/.mozilla/
midpeter444:~/.mozilla$ dirs
 0  ~/.mozilla
 1  ~/lang/clojure/concurrency-fun
midpeter444:~/.mozilla$ pd .
midpeter444:~/.mozilla$ cd /tmp
midpeter444:/tmp$ dirs
 0  /tmp
 1  ~/.mozilla
 2  ~/lang/clojure/concurrency-fun
midpeter444:/tmp$ pd 2

pd can take either a dot, which means "remember this directory", or a number, which refers to the position on the dirs history list. The random-access list metaphor is easier to work with than the pushd stack-based metaphor.

/* ---[ Vast Improvement #2: z ]--- */

But recently I discovered rupa/z or "z" and now I only occasionally use pd. z is the biggest change and improvement to my command line life in years. I really love it. It works with cygwin as well for my time on Windows machines in my day job.

If you use the command line much, go get it now from the rupa/z project on GitHub.

What is it?

First it is not z-shell (I'm a bash user), which is what I thought initially. (It doesn't help that the main GitHub page for it starts with "ZSH USERS BACKWARD COMPATIBILITY WARNING").

What it is, is a 200-line shell script compatible with bash and z-shell that is basically a clever LRU-type cache of your directory visitations - the cache weighting is based on both frequency and recentness, which the author dubs "frecency". As you navigate around to different directories, it keeps track of where you've been, how often you've been there and how recently.

To see your current cache in ascending order of 'frecency', just type z:

midpeter444:~$ z
0.313808   /home/midpeter444/apps/apache-ant-1.8.4/bin
0.313808   /home/midpeter444/lang/clojure/books/land-of-lisp/wizards-adventure/doc
0.313808   /home/midpeter444/lang/java/projects/mybatis-koans
0.392263   /tmp
0.429067   /home/midpeter444/lang/clojure/source-code
0.627622   /home/midpeter444/lang/lisp
0.784525   /home/midpeter444/.mozilla/firefox
0.83702    /home/midpeter444/lang/clojure/projects/clj-how-tos/clj-sockets
0.86298    /home/midpeter444/media/ebooks
1.62       /home/midpeter444/Dropbox/scripts-and-config-files
2.32335    /home/midpeter444/lang/clojure/sandbox/src/sandbox
5.6486     /home/midpeter444/lang/clojure/books/land-of-lisp/wizards-adventure
8.54205    /home/midpeter444/lang/clojure/books/land-of-lisp/orc-battle
10.7351    /home/midpeter444/lang/clojure/projects/clj-how-tos/clj-sockets
20.9559    /home/midpeter444/lang/clojure/books/land-of-lisp
30.7926    /home/midpeter444/lang/clojure/books/land-of-lisp/webserver
32.099     /home/midpeter444/Downloads
192.24     /home/midpeter444/lang/clojure/concurrency-fun

The number on the left is the frecency score. Ambiguous entries resolve in favor of the one with the higher score.
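The "higher score wins" rule can be sketched in a few lines of Clojure. This is only an illustration of the idea: the cache map, the best-match function and the plain substring test are assumptions made for the example, and the real z script applies its own matching rules on top of the score.

```clojure
;; Hypothetical cache: directory -> frecency score
;; (a few values copied from the listing above).
(def cache
  {"/tmp"                                           0.392263
   "/home/midpeter444/lang/lisp"                    0.627622
   "/home/midpeter444/Downloads"                    32.099
   "/home/midpeter444/lang/clojure/concurrency-fun" 192.24})

(defn best-match
  "Returns the cached directory containing fragment that has the
   highest score, or nil when nothing matches. Assumed simplification:
   a plain substring test stands in for z's real matching."
  [cache fragment]
  (let [hits (filter (fn [[dir _]] (.contains ^String dir ^String fragment))
                     cache)]
    (when (seq hits)
      (key (apply max-key val hits)))))

(best-match cache "fun")          ; only one entry matches
(best-match cache "midpeter444")  ; three entries match, highest score wins
```

With this toy cache, both calls resolve to the concurrency-fun directory: the first because it is the only match, the second because it carries the highest score among the three matches.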

To navigate somewhere you've been, pass a part of the path to the z command:

midpeter444:~$ z fun

midpeter444:~$ z lisp

midpeter444:~$ z moz

It also has tab-completion. If I hit tab for the example above where I typed moz it expands to:

$ z /home/midpeter444/.mozilla

Now the only part of the command line usage I once found irritating is pure joy.

Saturday, December 15, 2012

Programming Praxis Amazon Interview Question, part 2

In part 1 of this blog entry I covered two relatively efficient implementations of an algorithm to keep the top N entries in a stream of points that came from the Programming Praxis web site:

Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).

I was happy with my ad-hoc solution since it performed twice as fast as using a sorted-set data structure.

The answer chosen by the author of the Programming Praxis site used a Heap data structure, where the max value is kept at the top of the heap. His implementation for this exercise was a mutable heap, so it was not a viable candidate for a Clojure implementation. However, he has links to other praxis challenges where he implemented heaps (in Scheme), some of which use immutable data structures. I chose to (manually) transpile his Scheme-based "Leftist heap":

This is only one implementation of a heap and there might be (probably is) one that is more efficient, but I chose this as a representative to see how it would compare to the other solutions.

Here is my implementation of the "top 100" challenge using the Leftist Priority Queue Heap:

The Priority Queue does not keep track of its size and I didn't want to modify the data structure to do that. To compensate, I used the Clojure split-at function to split the points vector into two lazy-seqs: the first max-size entries from the points vector, which are just directly inserted into the heap and the rest.

Those remaining points then have to be sifted: if a point's distance from the origin is less than the first element on the heap, then that top entry on the heap needs to be pulled off and the new point is inserted. That is done with the code:

(q/pq-insert dist-lt? pt (q/pq-rest dist-lt? pq))

pq-rest is like Clojure's pop - it gives you the data structure minus its head and we insert the new point into that remaining data structure.

The dist-lt? function is a comparator function required by the Leftist Heap algorithm.
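To make the sifting step concrete, here is a minimal sketch of an immutable leftist heap and a top-N reduction built on it. It is not the original listing: pq-insert and pq-rest follow the call shape shown above, but pq-empty, pq-first, pq-merge, pq-seq, the farther? comparator (standing in for dist-lt?, whose body is not shown) and every function body are assumptions made for illustration.

```clojure
;; Assumed representation: a node is [rank value left right]; nil is the empty heap.
(def pq-empty nil)

(defn- rank [pq] (if pq (nth pq 0) 0))

(defn pq-first [pq] (nth pq 1))

(defn pq-merge
  "Merges two leftist heaps. (before? a b) is true when a must sit above b."
  [before? p1 p2]
  (cond
    (nil? p1) p2
    (nil? p2) p1
    (before? (pq-first p2) (pq-first p1)) (recur before? p2 p1)
    :else
    (let [[_ x left right] p1
          merged (pq-merge before? right p2)]   ; true (non-tail) recursion
      ;; keep the higher-rank subtree on the left (the leftist property)
      (if (>= (rank left) (rank merged))
        [(inc (rank merged)) x left merged]
        [(inc (rank left)) x merged left]))))

(defn pq-insert [before? x pq] (pq-merge before? [1 x nil nil] pq))

(defn pq-rest [before? pq]
  (let [[_ _ left right] pq] (pq-merge before? left right)))

(defn pq-seq
  "Drains the heap into a seq, top element first."
  [before? pq]
  (when pq (cons (pq-first pq) (pq-seq before? (pq-rest before? pq)))))

;; Manhattan distance, as in part 1.
(defn distance [[x y]] (+ (Math/abs (long x)) (Math/abs (long y))))

;; Assumed comparator: the farthest point sits on top of the heap.
(defn farther? [a b] (> (distance a) (distance b)))

(defn top-heap [points max-size]
  (let [[head tail] (split-at max-size points)
        pq (reduce (fn [pq pt] (pq-insert farther? pt pq)) pq-empty head)]
    (reduce (fn [pq pt]
              ;; replace the current farthest point when pt is closer
              (if (and pq (farther? (pq-first pq) pt))
                (pq-insert farther? pt (pq-rest farther? pq))
                pq))
            pq
            tail)))
```

Calling (pq-seq farther? (top-heap points 100)) drains the result from farthest to closest. The recursion in pq-merge only follows the right spine, which a leftist heap keeps logarithmic in the heap size.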

I did this additional exercise because I suspected that a heap would be a more efficient implementation than a fully sorted set.

Here are some representative benchmark results again using the Clojure criterium tool. (This time I truncated some of the output to make it easier to read.)

user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
user=> (count points)

;; this is the "ad-hoc" solution from part 1
user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 525.526800 ms

;; this is the sorted-set solution from part 1
user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.250241 sec

;; this is the new heap-based implementation
user=> (bench (top-heap points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.063965 sec

I've only shown three results, but I ran them many times in varied order and got remarkably consistent results.

So without going into specific precision, the heap-based implementation takes roughly 15% less time than the sorted-set implementation, and my ad-hoc solution takes about half the time of the heap-based implementation.

Which is about what I expected. I thought the heap-based solution might even be a little faster than this. One problem with the heap implementation I'm using is that its central function (merge) uses true recursion. I didn't see an easy way to make it use loop-recur or a lazy-seq. If anyone sees a clever way to do that, I'd love to hear it.

Friday, December 14, 2012

Programming Praxis - Would Amazon hire me?

I was finishing up my work for the day and decided to check Hacker News. High on the list that day was Programming Praxis, which I'd never seen before and the featured post was the following Amazon interview question:

Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).

I had been planning to go home and start working on Dice of Doom (in Clojure) from Chapter 16 of Conrad Barski's Land of Lisp, but this problem sounded intriguing enough that I would take it up.

After sketching out a few ideas, I concluded that a strict O(n) solution isn't possible, but something near-O(n) would be feasible. Similar to how Clojure's persistent data structures often operate in "essentially constant" time -- O(log-32 n) being close enough to constant time to be considered basically a constant factor.
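As a quick piece of arithmetic behind the "essentially constant" remark, the sketch below evaluates log base 32 for a couple of collection sizes. The helper name log32 is made up for this example.

```clojure
;; log base 32 via the change-of-base rule; log32 is a hypothetical helper.
(defn log32 [n]
  (/ (Math/log n) (Math/log 32)))

(log32 1e6)   ; just under 4
(log32 1e9)   ; just under 6
```

So even a billion-element structure is only about six levels deep, which is why the cost behaves like a small constant factor in practice.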

I decided to try an ad-hoc solution and then a sorted-set implementation. My guess was that the ad-hoc solution would be faster and that gave me a good excuse to try out Hugo Duncan's criterium benchmark tool to prove it (or prove me wrong).

/* ---[ The ad-hoc solution ]--- */

The approach I decided upon was to use an unsorted hash-map with mixed keys.

One key :ct would keep track of the count: how many entries ("points") were in the hashmap. Its max would be the max-size (100 in the challenge statement).

The second key :hd, short for highest distance, would keep track of the entry (or entries) with the farthest distance from the center point.

The rest of the keys are integers representing the distance to the origin. This distance key is mapped to a vector of points. Each point is a tuple (vector) of the x and y coordinate.

I decided to interpret distance from (0,0) not as the square root of the sum of the squares of x and y, but rather the absolute value of x plus the absolute value of y, but it wouldn't be too hard to imagine this with the other distance formula.
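For comparison, here is a small sketch of both distance interpretations. The function names are made up for this example; the squared Euclidean form is an assumed variant that avoids the square root, so the result stays an integer and could still serve as a map key.

```clojure
;; |x| + |y|: the interpretation used in this post.
(defn manhattan [[x y]]
  (+ (Math/abs (long x)) (Math/abs (long y))))

;; x^2 + y^2: orders points the same way as the true Euclidean
;; distance, without needing a square root.
(defn euclid-sq [[x y]]
  (+ (* x x) (* y y)))

(manhattan [4 3])   ; 7
(euclid-sq [4 3])   ; 25
```

The two metrics do not always rank points the same way: [5 1] is closer than [4 3] by the first measure (6 vs. 7) but farther by the second (26 vs. 25).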

So the data structure would look like this:

{:ct 8
 :hd 7
 0 [[0 0]]
 2 [[1 1] [0 2]]
 3 [[1 2] [0 3] [3 0]]
 6 [[5 1]]
 7 [[4 3]]}

In this example, the data structure has 8 points: 1 with distance 0, 2 with distance 2, and so on.

Based on showing you this data structure I probably don't need to describe the algorithm I'm going to use. Rob Pike's Rule 5 is:

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Or Brooks' famous statement:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

So now the fun is to implement this in Clojure's immutable data structures using nicely composed functions.

We know we have to iterate through all the input points - there's our O(n) cost without doing anything else - and we want to end up with one data structure. Therefore, we'll use reduce.

Even though the problem statement says the closest 100 points, for testing, I want to parameterize the max-size and what set of points I'll feed it, so the main function will look like this:

(def init-map {:hd Long/MIN_VALUE, :ct 0})

(defn topN [points max-size]
  (reduce (mk-sift-fn max-size) init-map points))

points will be a collection or seq of x-y tuples and max-size will indicate how many points the final data structure should retain. I pass max-size to a "make-sift-function", which is a higher-order function that will return the function that will "sift" through each point and determine whether it goes into the data structure and if so, where. A place for everything and everything in its place.

(defn mk-sift-fn [max-size]
  (fn [m pt]
    (let [dist (distance pt)]
      (if (< (:ct m) max-size)
        (add-point m dist pt)
        (if (< dist (:hd m))
          (-> m (remove-point (:hd m)) (add-point dist pt))
          m)))))

The flow to the function returned by mk-sift-fn is:

  • if you haven't seen max-size entries yet, add it to the map (letting the add-point fn figure out where to stick it)

  • if the data structure is at max capacity and the distance from (0,0) of the point under consideration is less than the current highest distance in the map, remove one point that maps to the highest distance and add the new point. Otherwise the map is returned unchanged.

I use the -> threading macro to weave through the remove-point and add-point functions, allowing nice composition of the functions with immutable structures.

Here is the full implementation with all the helper methods:
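The original listing was an embedded snippet that is not reproduced in this text. As a stand-in, here is a minimal sketch that is consistent with the functions shown above and with the small example run that follows. The bodies of distance, add-point and remove-point are assumptions, not the original code.

```clojure
;; Manhattan distance from the origin, as described earlier.
(defn distance [[x y]]
  (+ (Math/abs (long x)) (Math/abs (long y))))

;; Assumed helper: append pt to the bucket for dist, bump the count
;; and raise :hd if dist is a new maximum.
(defn add-point [m dist pt]
  (-> m
      (update-in [dist] (fnil conj []) pt)
      (update-in [:ct] inc)
      (update-in [:hd] max dist)))

;; Assumed helper: drop the oldest point in the bucket for dist.
;; When the bucket empties, remove it and recompute :hd from the
;; remaining integer keys.
(defn remove-point [m dist]
  (let [remaining (vec (rest (get m dist)))
        m         (update-in m [:ct] dec)]
    (if (seq remaining)
      (assoc m dist remaining)
      (let [m     (dissoc m dist)
            dists (filter integer? (keys m))]
        (assoc m :hd (if (seq dists) (apply max dists) Long/MIN_VALUE))))))

(defn mk-sift-fn [max-size]
  (fn [m pt]
    (let [dist (distance pt)]
      (if (< (:ct m) max-size)
        (add-point m dist pt)
        (if (< dist (:hd m))
          (-> m (remove-point (:hd m)) (add-point dist pt))
          m)))))

(def init-map {:hd Long/MIN_VALUE, :ct 0})

(defn topN [points max-size]
  (reduce (mk-sift-fn max-size) init-map points))
```

Recomputing :hd scans the keys of the map, but that only happens when a bucket empties, and the map never holds more than max-size distance keys.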

Here's an example run with a small data set:

user=> (require '[sandbox.top100 :refer [topN]] :reload)
user=> (def points (shuffle (for [x (range 2 5) y (range 3)] [x y])))
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (topN points 5)
{2 [[2 0]], 3 [[3 0] [2 1]], 4 [[4 0] [2 2]], :hd 4, :ct 5}

/* ---[ The sorted-set solution ]--- */

Clojure's sorted-sets are binary (persistent) trees. The sort order I use for the sorted set is distance from (0,0) descending.

As before, we'll directly add the first max-size entries we see. After that, we have to remove entries from the set if one with a shorter distance is seen.

Due to our sort order, the point with the greatest distance is the first element of the set, so it is easy to find with first and remove with disj when we find a point that is closer to the origin. However, we then have to add that new point to the sorted set, and each of these additions takes O(log n) time in the size of the set. I was pretty sure my ad-hoc solution would be more efficient overall because of this extra sorting work for every element that gets added.

To define a sorting comparator in Clojure, you use the sorted-set-by fn which takes a comparator of your choosing.

I stated above that the sort order would be by distance descending, but since this is a sorted set, not a sorted list or vector, that won't actually work:

user=> (require '[sandbox.top100 :refer [distance]] :reload)
user=> (defn by-dist [pt1 pt2]
  #_=>   (> (distance pt1) (distance pt2)))
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (apply sorted-set-by by-dist points)
#{[4 2] [3 2] [3 1] [3 0] [2 0]}

We lost some points in the set. We have [3 1], but not [4 0]. Since these have the same "value" in the eyes of the set and a set can keep only one value, the other is dropped.

So I made the comparator take equal distances into account and do a secondary sort based on the value of the x coordinate, thus keeping all the points we are fed:

(defn dist-then-first [pt1 pt2]
  (let [dist1 (distance pt1)
        dist2 (distance pt2)]
    (if (= dist1 dist2)
      (> (first pt1) (first pt2))
      (> dist1 dist2))))

As before I used reduce to iterate over all the input points. The overall solution is nicely shorter than the ad-hoc one:
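The original listing is not reproduced in this text. A minimal sketch of how such a reduce could look is given below. Only the name topNSS and the dist-then-first comparator come from the post; the body of topNSS is an assumption, and it expects max-size to be at least 1.

```clojure
;; Manhattan distance from the origin.
(defn distance [[x y]]
  (+ (Math/abs (long x)) (Math/abs (long y))))

;; Comparator from above: distance descending, then x descending.
(defn dist-then-first [pt1 pt2]
  (let [dist1 (distance pt1)
        dist2 (distance pt2)]
    (if (= dist1 dist2)
      (> (first pt1) (first pt2))
      (> dist1 dist2))))

;; Assumed body: fill the set up to max-size, then swap out the
;; farthest point (the first element) whenever a closer one arrives.
(defn topNSS [points max-size]
  (reduce (fn [ss pt]
            (cond
              (< (count ss) max-size)
              (conj ss pt)

              (< (distance pt) (distance (first ss)))
              (-> ss (disj (first ss)) (conj pt))

              :else ss))
          (sorted-set-by dist-then-first)
          points))
```

For the nine-point example used earlier, (topNSS points 5) keeps five points ordered from farthest to closest. Note that the tie-break on x is only enough when no two distinct points share both distance and x, which holds for the non-negative coordinates used here.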

/* ---[ Performance showdown ]--- */

At the Clojure Conj this year, I discovered Hugo Duncan's criterium benchmark tool, which gives you more robust benchmarks than simply using time.

I used it to compare the solutions above. I redefined the points vector to have 100,000 points. I ran bench twice (in the opposite order). The first time I kept the closest 6 points. The second time I kept the closest 100.

user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
user=> (count points)

user=> (use 'criterium.core)
user=> (bench (topNSS points 6))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.022929 sec
    Execution time std-deviation : 10.070356 ms
   Execution time lower quantile : 1.006426 sec ( 2.5%)
   Execution time upper quantile : 1.044943 sec (97.5%)

user=> (bench (topN points 6))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 545.035140 ms
    Execution time std-deviation : 4.335731 ms
   Execution time lower quantile : 538.861529 ms ( 2.5%)
   Execution time upper quantile : 554.198797 ms (97.5%)

user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 531.174287 ms
    Execution time std-deviation : 4.642063 ms
   Execution time lower quantile : 522.942875 ms ( 2.5%)
   Execution time upper quantile : 541.571260 ms (97.5%)

user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.260036 sec
    Execution time std-deviation : 15.670337 ms
   Execution time lower quantile : 1.240810 sec ( 2.5%)
   Execution time upper quantile : 1.292583 sec (97.5%)

Sweet! My ad-hoc solution runs twice as fast for this dataset. In the end, it's a trade-off between the more elegant code and performance, which is often the case.