Friday, December 14, 2012

Programming Praxis - Would Amazon hire me?

I was finishing up my work for the day and decided to check Hacker News. High on the list that day was Programming Praxis, which I'd never seen before and the featured post was the following Amazon interview question:

Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).

I had been planning to go home and start working on Dice of Doom (in Clojure) from Chapter 16 of Conrad Barski's Land of Lisp, but this problem sounded intriguing enough that I would take it up.

After sketching out a few ideas, I concluded that a strict O(n) solution isn't possible, but something near-O(n) would be feasible. Similar to how Clojure's persistent data structures often operate in "essentially constant" time -- O(log-32 n) being close enough to constant time to be considered basically a constant factor.

I decided to try an ad-hoc solution and then a sorted-set implementation. My guess was that the ad-hoc solution would be faster and that gave me a good excuse to try out Hugo Duncan's criterium benchmark tool to prove it (or prove me wrong).


/* ---[ The ad-hoc solution ]--- */

The approach I decided upon was to use an unsorted hash-map with mixed keys.

One key :ct would keep track of the count: how many entries ("points") were in the hashmap. It's max would the max-size (100 in the challenge statement).

The second key :hd, short for highest distance, would keep track of the entry (or entries) with the farthest distance from the the center point.

The rest of the keys are integers representing the distance to the origin. This distance key is mapped to a vector of points. Each point is a tuple (vector) of the x and y coordinate.

I decided to interpret distance from (0,0) not as the square root of the sum of the squares of x and y, but rather the absolute value of x plus the absolute value of y, but it wouldn't be too hard to imagine this with the other distance formula.

So the data structure would look like this:

{:ct 4
 :hd 7
 0 [[0 0]]
 2 [[1 1] [0 2]]
 3 [[1 2] [0 3] [3 0]]
 6 [[5 1]]
 7 [[4 3]]}

In this example, the data structure has 7 points, 1 with distance 0, 2 with distance 2, and so on.

Based on showing you this data structure I probably don't need to describe the algorithm I'm going to use. Rob Pike's Rule 5 is:

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming

Or Brooks' famous statement:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

So now the fun is to implement this in Clojure's immutable data structures using nicely composed functions.

We know we have to iterate through all the input points - there's our O(n) cost without doing any thing else - and we want to end up with one data structure. Therefore, we'll use reduce.

Even though the problem statement says the closest 100 points, for testing, I want to parameterize the max-size and what set of points I'll feed it, so the main function will look like this:

(def init-map {:hd Long/MIN_VALUE, :ct 0})

(defn topN [points max-size]
  (reduce (mk-sift-fn max-size) init-map points))

points will be a collection or seq of x-y tuples and max-size will indicate how many points the final data structure should retain. I pass max-size to a "make-sift-function", which is a higher-order function that will return the function that will "sift" through each point and determine whether it goes into the data structure and if so, where. A place for everything and everything in its place.

(defn mk-sift-fn [max-size]
  (fn [m pt]
    (let [dist (distance pt)]
      (if (< (:ct m) max-size)
        (add-point m dist pt)
        (if (< dist (:hd m))
          (-> m (remove-point (:hd m)) (add-point dist pt))
          m)))))

The flow to the function returned by mk-sift-fn is:

  • if you haven't seen max-size entries yet, add it to the map (letting the add-point fn figure out where to stick it)

  • if the data structure is at max capacity, then if the distance from (0,0) of the current point under consideration is less than the current highest distance in the map, then we have to remove one point that maps to the highest-point and add this new point.

I use the -> threading macro to weave through the remove-point and add-point functions, allowing nice composition of the functions with immutable structures.

Here is the full implementation with all the helper methods:

Here's an example run with a small data set:

user=> (require '[sandbox.top100 :refer [topN]] :reload)
nil
user=> (def points (shuffle (for [x (range 2 5) y (range 3)] [x y])))
#'user/points
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (topN points 5)
{2 [[2 0]], 3 [[3 0] [2 1]], 4 [[4 0] [2 2]], :hd 4, :ct 5}


/* ---[ The sorted-set solution ]--- */

Clojure's sorted-sets are binary (persistent) trees. The sort order I use for the sorted set is distance from (0,0) descending.

As before, we'll directly add the first max-size entries we see. After that, we have to remove entries from the set if one with a shorter distance is seen.

Due to our sort order, the point with the greatest distance would be at the top of the tree and is easily removed in constant time using disj when we find a point that is closer to the origin. However, we then have to add that new point to the sorted set and all of these additions average O(log-n) insertion time. I was pretty sure my ad-hoc solution would be more efficient overall because of this extra sort time for all elements that get added.

To define a sorting comparator in Clojure, you use the sorted-set-by fn which takes a comparator of your choosing.

I stated above that the sort order would by distance descending, but since this is a sorted set, not a sorted list or vector, that won't actually work:

user=> (require '[sandbox.top100 :refer [distance]] :reload)
nil
user=> (defn by-dist [pt1 pt2]
  #_=>   (> (distance pt1) (distance pt2)))
#'user/by-dist
user=> points
[[3 1] [3 0] [4 0] [2 2] [2 1] [3 2] [4 1] [4 2] [2 0]]
user=> (apply sorted-set-by by-dist points)
#{[4 2] [3 2] [3 1] [3 0] [2 0]}

We lost some points in the set. We have [3 1], but not [4 0]. Since these have the same "value" in the eyes of the set and a set can keep only one value, the other is dropped.

So I made the sort-by method to take into account equal distances and then do a secondary sort based on the value of the x coordinate, thus keeping all the points we are fed:

(defn dist-then-first [pt1 pt2]
  (let [dist1 (distance pt1)
        dist2 (distance pt2)]
    (if (= dist1 dist2)
      (> (first pt1) (first pt2))
      (> dist1 dist2))))

As before I used reduce to iterate over all the input points. The overall solution is nicely shorter than the ad-hoc one:


/* ---[ Performance showdown ]--- */

At the Clojure Conj this year, I discovered Hugo Duncan's criterium benchmark tool, which gives you more robust benchmarks than simply using time.

I used it to compare the solutions above. I redefined the points vector to have 100,000 points. I ran bench twice (in the opposite order). The first time I kept the closest 6 points. The second time I kept the closest 100.

user=> (def points (shuffle (for [x (range 500) y (range 200)] [x y])))
#'user/points
user=> (count points)
100000

user=> (use 'criterium.core)
nil
user=> (bench (topNSS points 6))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.022929 sec
    Execution time std-deviation : 10.070356 ms
   Execution time lower quantile : 1.006426 sec ( 2.5%)
   Execution time upper quantile : 1.044943 sec (97.5%)

user=> (bench (topN points 6))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 545.035140 ms
    Execution time std-deviation : 4.335731 ms
   Execution time lower quantile : 538.861529 ms ( 2.5%)
   Execution time upper quantile : 554.198797 ms (97.5%)

user=> (bench (topN points 100))
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 531.174287 ms
    Execution time std-deviation : 4.642063 ms
   Execution time lower quantile : 522.942875 ms ( 2.5%)
   Execution time upper quantile : 541.571260 ms (97.5%)

user=> (bench (topNSS points 100))
Evaluation count : 60 in 60 samples of 1 calls.
             Execution time mean : 1.260036 sec
    Execution time std-deviation : 15.670337 ms
   Execution time lower quantile : 1.240810 sec ( 2.5%)
   Execution time upper quantile : 1.292583 sec (97.5%)

Sweet! My ad-hoc solutions runs twice as fast for this dataset. In the end, it's a trade off between the more elegant code and performance, which is often the case.


[Updates]

2 comments:

  1. I think that Amazon would be stupid not to hire you! :)
    btw, I would be interested to hear how you like our testing solution - you can check it out at TestDome
    we focus more on testing the real-world programming skills (stuff that you actually use in practice)

    ReplyDelete