In part 1 of this blog entry I covered two relatively efficient implementations of an algorithm to keep the top N entries in a stream of points, a challenge that came from the Programming Praxis web site:
Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).
I was happy with my ad-hoc solution since it performed twice as fast as using a sorted-set data structure.
The answer chosen by the author of the Programming Praxis site used a Heap data structure, where the max value is kept at the top of the heap. His implementation for this exercise was a mutable heap, so it wasn't a viable candidate for a Clojure implementation. However, he has links to other praxis challenges where he implemented heaps (in Scheme), some of which use immutable data structures. I chose to (manually) transpile his Scheme-based "Leftist heap":
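(In outline, a persistent leftist heap in Clojure looks something like the sketch below. The node layout [rank item left right] and the helper names are assumptions for illustration; the actual transpiled code differs in its details.)

(def pq-empty nil)

(defn pq-empty? [pq] (nil? pq))

(defn pq-first [pq] (nth pq 1))

(defn- pq-rank [pq]
  (if (pq-empty? pq) 0 (nth pq 0)))

(defn- pq-node [item a b]
  ;; leftist property: the child with the larger rank goes on the left,
  ;; and a node's rank is one more than the rank of its right child
  (if (>= (pq-rank a) (pq-rank b))
    [(inc (pq-rank b)) item a b]
    [(inc (pq-rank a)) item b a]))

(defn pq-merge [lt? p1 p2]
  ;; the root that wins under lt? stays on top; keep its left child and
  ;; merge the other heap into its right child
  (cond
    (pq-empty? p1) p2
    (pq-empty? p2) p1
    (lt? (pq-first p2) (pq-first p1))
      (pq-node (pq-first p2) (nth p2 2) (pq-merge lt? p1 (nth p2 3)))
    :else
      (pq-node (pq-first p1) (nth p1 2) (pq-merge lt? (nth p1 3) p2))))

(defn pq-insert [lt? item pq]
  (pq-merge lt? [1 item nil nil] pq))

(defn pq-rest [lt? pq]
  (pq-merge lt? (nth pq 2) (nth pq 3)))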
This is only one possible implementation of a heap, and there is probably a more efficient one, but I chose it as a representative to see how it would compare to the other solutions.
Here is my implementation of the "top 100" challenge using the Leftist Priority Queue Heap:
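(Shown here as a sketch: dist-sq, pq-empty and pq-first are assumed names for the squared-distance helper and heap accessors, and q is the require alias for the heap namespace.)

(defn dist-sq [[x y]]
  (+ (* x x) (* y y)))

(defn dist-lt? [p1 p2]
  ;; comparator handed to the heap; arranged so that the point farthest
  ;; from the origin wins the top of the heap
  (> (dist-sq p1) (dist-sq p2)))

(defn top-heap [points max-size]
  (let [[head tail] (split-at max-size points)
        ;; seed the heap with the first max-size points
        pq (reduce (fn [pq pt] (q/pq-insert dist-lt? pt pq))
                   q/pq-empty
                   head)]
    ;; sift the rest: replace the current farthest point whenever a
    ;; closer one comes along
    (reduce (fn [pq pt]
              (if (< (dist-sq pt) (dist-sq (q/pq-first pq)))
                (q/pq-insert dist-lt? pt (q/pq-rest dist-lt? pq))
                pq))
            pq
            tail)))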
The Priority Queue does not keep track of its size, and I didn't want to modify the data structure to do that. To compensate, I used Clojure's split-at function to split the points vector into two lazy seqs: the first max-size entries, which are inserted directly into the heap, and the rest.
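For example, split-at just cuts a collection at an index and returns both pieces:

user=> (split-at 3 [10 20 30 40 50])
[(10 20 30) (40 50)]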
Those remaining points then have to be sifted: if a point's distance from the origin is less than that of the first element on the heap (the current farthest point), the top entry is pulled off the heap and the new point is inserted in its place. That is done with this code:
(q/pq-insert dist-lt? pt (q/pq-rest dist-lt? pq))
pq-rest is like Clojure's pop: it gives you the data structure minus its head, and we insert the new point into that remaining data structure. The dist-lt? function is a comparator function required by the Leftist Heap algorithm.
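To make the sift concrete, here is a hypothetical REPL session on a tiny heap of three points (the output is illustrative, not from the actual run):

user=> (def pq (reduce #(q/pq-insert dist-lt? %2 %1) q/pq-empty [[9 9] [1 1] [5 5]]))
#'user/pq
user=> (q/pq-first pq)   ; the farthest point sits on top
[9 9]
user=> (q/pq-first (q/pq-insert dist-lt? [2 2] (q/pq-rest dist-lt? pq)))
[5 5]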
I did this additional exercise because I suspected that a heap would be a more efficient implementation than a fully sorted set.
Here are some representative benchmark results again using the Clojure criterium tool. (This time I truncated some of the output to make it easier to read.)
user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
#'user/points
user=> (count points)
100000
;; this is the "ad-hoc" solution from part 1
user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
Execution time mean : 525.526800 ms
;; this is the sorted-set solution from part 1
user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 1.250241 sec
;; this is the new heap-based implementation
user=> (bench (top-heap points 100))
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 1.063965 sec
I've only shown three results, but I ran them many times in varied order and got remarkably consistent results.
So, without claiming great precision: the heap-based implementation is about 20% faster than the sorted-set implementation, and my ad-hoc solution is about 50% faster than the heap-based implementation.
That is about what I expected, though I thought the heap-based solution might be a little faster than it was. One problem with the heap implementation I'm using is that its central function (merge) uses true recursion: the recursive call's result gets wrapped in the new node, so it isn't in tail position. I didn't see an easy way to make it use loop/recur or a lazy-seq. If anyone sees a clever way to do that, I'd love to hear it.