Saturday, December 24, 2011

My CS/Programming Top 10 for 2011

As a year-end summary, I took a little time to look back through my personal wiki, where I keep notes on technical topics, to make a "Top 10" list of the papers, books, articles, blog entries, podcasts, videocasts and conferences that impacted and challenged me to expand my knowledge this year. I made a list and checked it twice.

So in no particular order, here are my top 10 with a short bit of commentary on why you may want to read/watch/investigate these topics if you haven't already.



Simple-Made-Easy, Video Presentation, by Rich Hickey

Link: http://www.infoq.com/presentations/Simple-Made-Easy

Summary and Thoughts: If this presentation hadn't been given by Rich Hickey, I probably would have skipped it. But this year I ran across a saying (which I put as a moniker on my tech wiki): When brilliant people create things, study them. I'm not alone in thinking that Rich Hickey falls into the above category. 

I ended up watching this presentation three times over the course of a couple of months, in part because I got into a debate with work colleagues over what makes one design simpler or more complex than another. This talk helped me define simplicity in terms of entanglement or intertwining: the less things are intertwined, the simpler the design.

It also challenges some articles of faith of recent software engineering thinking - in particular the idea that following TDD almost blindly will lead to proper design. All things in moderation, said the Greeks (the ancient ones, not the ones in current financial disarray), and that applies to the often polarized comparison between TDD and up-front design. On this I agree with Hickey.




The Vietnam of Computer Science, blog entry, by Ted Neward

Link: http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

Summary and Thoughts: This blog post is from 2006, so you can tell this is a 2011 top 10 from my personal experience, not publications of 2011. In addition to providing a refresher course in the history of the Vietnam War, Neward argues that Object-Relational Mapping (ORM) is in general an extremely leaky abstraction whose siren song of an easy life is alluring, but eventually a nasty law of diminishing returns kicks in. To pile on, Rich Hickey, in the presentation listed in item #1 above, says that ORMs have OMG complexity (OMG = oh my goodness), where the goal of software design is simplicity, not complexity.

Neward uses the analogy of the "Drug Trap": you start off taking low doses of a pharmaceutical your doctor prescribes and it works great for a while. Then it stops working, so you increase the dose, which works great for a while, and then you increase it again, forming a cycle of ratcheting up until unhealthy extremes are reached. For an ORM, the ratcheting is the amount of effort you need to get your complex domain model requirements into and out of your (relational) database. As the leaky abstraction begins to bleed more heavily you have to make some tough choices: either bear down and force the O/R model to do what you want with increasingly complex APIs or workarounds (and Neward outlines a number of those), decide to stop using O/R for the hard stuff and go back to good ol' SQL, or find another way altogether, such as integrating relational concepts into the language. In his summary section, he lists six possible options - to which a seventh could now be added: using a non-relational (NoSQL) database.

Given how popular Hibernate is in the Java space and ActiveRecord in the Ruby (Rails) space as the near-default way of working, this is a controversial opinion. One could argue that five years later ORMs have gotten better, and there are good arguments that in some scenarios an ORM makes good sense.

However, where I work we have settled on MyBatis (formerly iBatis), which is not an ORM tool but a SQL mapping tool, as a better model for most of our database scenarios; it also integrates nicely with Spring. I need to research LINQ (which even Rich Hickey praised in his talk on simplicity) and the Spring Data efforts to get a more rounded view of the whole area before I make up my mind. And in the Ruby world, Sequel is getting a lot of press and is also on my "to-research" list.
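For a flavor of why MyBatis feels different from an ORM, here is a minimal, hypothetical sketch (the mapper, table and columns are made up): you write the SQL yourself, and MyBatis only binds parameters and maps result rows onto plain beans, so there is no object graph to keep synchronized with the database.

    import java.util.List;
    import org.apache.ibatis.annotations.Insert;
    import org.apache.ibatis.annotations.Param;
    import org.apache.ibatis.annotations.Select;

    // A plain bean - no proxies, no lazy loading, no session to manage.
    class Order {
        private long id;
        private long customerId;
        private double total;
        // getters/setters omitted for brevity
    }

    public interface OrderMapper {

        // The SQL is explicit; MyBatis binds the #{...} parameters and maps
        // result columns onto Order properties by name/alias.
        @Select("SELECT id, customer_id AS customerId, total FROM orders WHERE customer_id = #{customerId}")
        List<Order> findByCustomer(@Param("customerId") long customerId);

        @Insert("INSERT INTO orders (customer_id, total) VALUES (#{customerId}, #{total})")
        void insert(Order order);
    }

At runtime you get a mapper instance from a SqlSession (session.getMapper(OrderMapper.class)) and call it like any other interface - the "hard stuff" stays in SQL where you can see it.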



REST to My Wife, blog entry, by Ryan Tomayko


Link:  http://tomayko.com/writings/rest-to-my-wife

Summary and Thoughts: I got deeper into REST and RESTful design this year, and this blog entry started me off early in 2011. A good read to step up a level from the details of REST vs. SOAP style designs and see the bigger picture of what we are grappling with when designing inter-system communication.



VoltDB and NewSQL, presentations and writings by Michael Stonebraker


Links: 
Presentation 1: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Presentation 2: http://bit.ly/v6ANsq
ACM article, 10 rules for scalable performance in 'simple operation' datastores: http://bit.ly/tAlaVe

Summary and Thoughts: Now that the NoSQL movement is in full swing and has gained wider corporate adoption, some of its tenets, such as dropping full ACID consistency and the relational model, are being revisited and challenged as necessary requirements for web-scale distributed databases. This movement has branded itself "NewSQL" to distinguish itself from "Old SQL" (Oracle, MySQL, PostgreSQL and the like) and from the non-relational, mostly non-ACID NoSQL data stores (Neo4j being a pleasant exception). NewSQL, and in particular VoltDB, targets the high-data-volume and high-data-throughput OLTP space - directly taking on the traditional relational databases. Michael Stonebraker, one of the early pioneers of relational databases (his work on Ingres led directly to PostgreSQL), is the CTO of VoltDB and a proponent of its new model.

I haven't seen VoltDB get a lot of press, but after I watched the presentations above, I was enamored with this approach, in part because it provides a very clean solution to the problem domain I am working on at my day job. The key aspects of their approach include:

  1. Individual node performance matters, not just parallelization: get rid of the overhead of databases by making all writes (inserts/updates) single threaded - no read/write lock contention or B-Tree latch contention to handle
  2. Have an in-memory database, with very lightweight persistence of the "commands", not the current state of the db.  
  3. Push the database logic and processing to the database using VoltDB-style stored procedures, which you write in Java (see the sketch below this list)
  4. Use shared-nothing scalability: find the keys by which to shard your data and distribute it over as many machines as you need, trying to make the vast majority of your transactions single-shard; fully replicate the rest of the data. Make sure your database can do this for you transparently.
  5. The SQL and relational models are solid foundations on which to continue to build OLTP type applications (but granted they do not fit all models/use cases).
By following this model, Stonebraker claims you can potentially achieve 40x faster database operations that are fully ACID-transactional. If you combine this with a processing model built on the LMAX Disruptor (next entry below), you could build an extraordinarily fast OLTP processing and data storage system. This is something I'm extremely interested in prototyping and considering for production use.
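To make the stored-procedure idea (item 3 above) concrete, here is a rough sketch of what a VoltDB procedure looks like, as I understand the API from the presentations - the table, columns and procedure name are made up. If the procedure is partitioned on the key it uses, it runs to completion, single-threaded, on the shard that owns that key, with no locks or latches, and the whole thing is one ACID transaction:

    import org.voltdb.SQLStmt;
    import org.voltdb.VoltProcedure;
    import org.voltdb.VoltTable;

    public class AddTrade extends VoltProcedure {

        // SQL statements are declared up front so VoltDB can plan them ahead of time.
        public final SQLStmt insertTrade =
            new SQLStmt("INSERT INTO trades (account_id, symbol, qty) VALUES (?, ?, ?);");
        public final SQLStmt totalForAccount =
            new SQLStmt("SELECT SUM(qty) FROM trades WHERE account_id = ?;");

        // Partitioned on account_id, this is the "single-sharded transaction" from item 4.
        public VoltTable[] run(long accountId, String symbol, long qty) {
            voltQueueSQL(insertTrade, accountId, symbol, qty);
            voltQueueSQL(totalForAccount, accountId);
            return voltExecuteSQL();
        }
    }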


LMAX Disruptors, software library (and presentations/blogs), by the LMAX team in the UK

Links:
Presentation: http://www.infoq.com/presentations/LMAX
Fowler bliki entry: http://martinfowler.com/articles/lmax.html

Summary and Thoughts: Many of us are trying to find ways to leverage parallel computation. The trend towards Functional Programming promises easier concurrency models. Google formalized and consolidated much of the thinking around massively parallel multi-node processing with MapReduce, now available in open source form via Hadoop and similar tools. We also see highly concurrent, fault-tolerant systems built with Erlang and Akka that spin up thousands of Actors to do the processing. Using technologies like these, many companies, Google among them, are processing astounding amounts of data on compute grids or handling massive numbers of simultaneous users.

At a time when the industry is focused on going parallel, the LMAX team comes along and says there is at least one class of problems for which massively parallel systems (and an often more functional style of programming) are actually not the answer. In fact, there are times when a single-threaded processing model solves a host of problems - but can we make it performant enough to be web-scale?

The team at LMAX faces very high throughput demands that require predictable and very low latency. They tried using a SEDA queue-based system (such as the Actor model), but found that the queue is actually, perhaps surprisingly, a rather poor data structure for a highly concurrent environment. As with the Stonebraker/Harizopoulos analysis of where multi-threaded databases spend their time (see the "VoltDB and NewSQL" entry above), a system with very high volumes that uses queues to pass data from one processing thread/stage to another actually spends far too much of its time negotiating locks rather than doing real work.

So they devoted research to understanding how to optimize L1/L2 cache usage and utilize multiple cores efficiently. They combined this with a ring buffer data structure (not something new - it is used in the Linux kernel, for example) to create a Java library they call a Disruptor, and then released it under an open source license.

With their model, they measure processing of 6 million transactions per second on a single server, including persisting "commands" or "events" to disk. There are a couple of ways to use their API (one form is a simplified "DSL" version). It manages "barriers" between writers (producers) and readers (consumers) for you: consumers will not run ahead of producers, and you can have consumers process in parallel or be gated off each other to process serially.
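Here is a hedged sketch of roughly what the DSL form looks like in Java - the event and handler are made up, and the constructor details vary between Disruptor versions:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import com.lmax.disruptor.EventFactory;
    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;

    public class DisruptorSketch {

        // Entries are pre-allocated in the ring and mutated in place - no per-message allocation.
        public static class TradeEvent {
            public long amount;
        }

        public static void main(String[] args) {
            ExecutorService executor = Executors.newCachedThreadPool();

            Disruptor<TradeEvent> disruptor = new Disruptor<TradeEvent>(
                new EventFactory<TradeEvent>() {
                    public TradeEvent newInstance() { return new TradeEvent(); }
                },
                1024,        // ring size (must be a power of 2)
                executor);

            // Consumer: the barrier guarantees it never reads past the producer's cursor.
            disruptor.handleEventsWith(new EventHandler<TradeEvent>() {
                public void onEvent(TradeEvent event, long sequence, boolean endOfBatch) {
                    System.out.println("processed trade of " + event.amount);
                }
            });

            RingBuffer<TradeEvent> ring = disruptor.start();

            // Producer: claim a slot, fill it in, then publish the sequence number.
            long seq = ring.next();
            ring.get(seq).amount = 42;
            ring.publish(seq);

            disruptor.shutdown();
            executor.shutdown();
        }
    }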

They have done a number of presentations available on the web and Martin Fowler was impressed enough to write a long and detailed bliki entry about it - those are the links above.

This model is hard to describe in a few short sentences, so I recommend watching the presentations and reading Fowler's write up. I think this is really impressive battle-tested stuff to have in your arsenal.


Neo4j koans, software learning tool via unit tests


Link: https://github.com/jimwebber/neo4j-tutorial


Summary and Thoughts: I've already written a blog entry on my experience setting up the koans: http://thornydev.blogspot.com/2011/11/neo4j-koans-how-do-i-begin.html. Neo4j is good stuff and these koans definitely help you learn most aspects of the Neo4j API (using Java).
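If you haven't seen the embedded Java API that the koans exercise, a minimal, hypothetical sketch of the style looks something like this (property names and the relationship type are made up; note that all graph mutation happens inside a transaction):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class Neo4jSketch {

        // Relationship types are just named, directed edges.
        enum RelTypes implements RelationshipType { PERFORMED_IN }

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/sketch-db");
            Transaction tx = db.beginTx();
            try {
                Node actor = db.createNode();
                actor.setProperty("name", "Tom Baker");
                Node episode = db.createNode();
                episode.setProperty("title", "Genesis of the Daleks");
                actor.createRelationshipTo(episode, RelTypes.PERFORMED_IN);
                tx.success();
            } finally {
                tx.finish();   // newer versions use try-with-resources instead
            }
            db.shutdown();
        }
    }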



The problem of Garbage Collection in the face of really really big RAM sizes, video presentation, by Gil Tene (Azul)


Link: http://www.infoq.com/presentations/Understanding-Java-Garbage-Collection

Summary and Thoughts: At my first full-time software job, during the dot-com boom, I and a team of 3 others built an internet ad server in Java. We ran with a large (at the time) heap of 512MB and tried to reuse objects through our own object pools as much as possible to avoid gc. We tried to get fancy in a few ways, including trying to use JNI to control garbage collection. In the end, despite valiant efforts, we failed, with 10-second gc pauses killing our real-time ad serving every 15 or so minutes. Management wasn't happy; we realized that gc is hard, gave up, and rewrote the ad server in C++.

This presentation brought back all those memories and pointed out that we are in an absurd dilemma now as "memory becomes the new disk". At the time of this presentation, Tene says that 512 GB (that's a "G") of memory is actually the cheapest price point on servers when evaluated on a per-unit-of-memory cost. And most Java developers don't have a prayer of using that much memory in a single application. Garbage collection pauses can be delayed, but they grow basically linearly with heap size. If we faced 10-second "stop-the-world" pauses with 512MB of memory, you might be facing stop-the-world pauses of roughly 5 minutes with 512GB (Tene's rule of thumb is about 1 second of pause per GB of live memory). Talk about management not being happy...

"Most of what people know about garbage collection turns out to be wrong" says the presenter, Tene. Kind of like trying to get advice on software security and cryptography by searching google. Hit-or-miss for sure.

You'll learn a lot about garbage collection in general during the first hour or so of the presentation, and even though the last bit is a bit rushed, there are key insights in the last 10-15 minutes well worth waiting for, rewatching and researching. This problem and concurrent programming across multiple cores are two of the key challenges facing software development for the next 5+ years (I'd throw in mobile development as another). The Azul guys seem to be leading the way on the first problem in the JVM space.

I haven't used their garbage collector, but I'm definitely keeping it in mind as we consider putting more and more data and cache in memory. (And I wonder if this would make VoltDB even better?)

More on the Azul gc: http://www.infoq.com/articles/azul_gc_in_detail




A Clojure Cornucopia:
Joy of Clojure, book by Michael Fogus, et al.

The Clojure koans, software program
The suggestion of Clojure as Guiding Light into the next phase of programming, blog by Uncle Bob Martin


Links:  


[31-Dec-2011 Update: I just discovered the online Clojure koans at: http://www.4clojure.com and highly recommend them as well.]

Summary and Thoughts:
I had to put something about Clojure in here, as there is just so much right about it. I'm still a Clojure newbie and I've dabbled on and off with it, and currently the Joy of Clojure is kicking my butt. Start with the new Programming Clojure coming out early next year or Clojure in Action, out now, do the koans, do The Little Schemer (see the last entry in this blog) and then do Joy of Clojure. In many ways, one has to re-learn programming while learning Clojure. Learning to think more functionally is the obvious part, but so is learning to think in terms of immutable data structures and how to use loop/recur (a kind of tail recursion) with accumulators.

Since I'm still swimming in the sea of newness, I don't have anything profound to say on Clojure at this point, but I will invoke argumentum ad verecundiam (argument from authority), quoting well-respected people who see great promise in Clojure:
For those of us using the JVM or .NET, Clojure feels like a minor miracle. It’s an astoundingly high-quality language, sure—in fact, I’m beginning to think it’s the best I’ve ever seen—yet somehow it has still managed to be fashionable. That’s quite a trick. It gives me renewed hope for the overall future of productivity in our industry. We might just dig ourselves out of this hole we’re in and get back to where every project feels like a legacy-free startup, just like it was in the early days of Java. 
- Steve Yegge, foreword to Joy of Clojure

How are we going to write that code? What language are we going to use to express the concepts that execute in thousands of machines?
Of course the current answer is “Functional Programming”. OK, maybe — maybe. But I gotta tell ya, the new functional languages out there aren’t looking too good to me. Scala and F# are still closely tied to the hardware. Scala is java with a few cute twists. And F#? Is it really the language that’s going to take us into the next age? My fear is that these languages suffer from the Sapir/Whorf trap. Their mode of expression does not sufficiently change our world view. They are too much like Java and C# and all the other old languages.
The only language I’ve seen so far that comes close to a different mode of expression, and that is capable for use in the enterprise, is Clojure.
- Uncle Bob Martin, blog entry above


Akka Cornucopia:
Akka library, 
Akka se-radio podcast, and 
Programming Concurrency on the JVM, book

Links: 

Summary and Thoughts: At the NFJS conference this year, Akka was one of the prominent libraries being touted, and one of the NFJS presenters, Venkat Subramaniam, released a new book, Programming Concurrency on the JVM, which I read and which explores Akka heavily.

The Akka model is an Actor implementation based on Erlang's model. Jonas Boner says in the podcast mentioned above that he learned Erlang a while back and really liked its model, but wasn't able to convince enough people to adopt Erlang, so he decided to bring Erlang's Actor model to the JVM via Scala. Scala already had an Actor implementation, but it was not as sophisticated or robust as Erlang's, so he created the Akka library. Akka now has a large following and can be used not only by Scala developers but also by Java developers. I can attest that we are starting to use it in our projects where I work.

If your problem domain either breaks down into nicely composable SEDA stages or you want a model with fault-tolerant concurrency with no shared mutable state, Akka and the Actor model are worth considering.
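For the Java side, here is a minimal sketch of the actor style - a made-up counter actor; the class and method names follow the later 2.x Java API and differ a bit in the 1.x releases current as I write this:

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.actor.UntypedActor;

    public class AkkaSketch {

        // An actor owns its state and processes one message at a time -
        // no shared mutable state, no locks.
        public static class Counter extends UntypedActor {
            private long count = 0;   // only this actor ever touches it

            @Override
            public void onReceive(Object message) {
                if ("increment".equals(message)) {
                    count++;
                } else if ("report".equals(message)) {
                    getSender().tell(count, getSelf());
                } else {
                    unhandled(message);
                }
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("sketch");
            ActorRef counter = system.actorOf(Props.create(Counter.class), "counter");

            // Messages go into the actor's mailbox and are processed serially.
            counter.tell("increment", ActorRef.noSender());
            counter.tell("report", ActorRef.noSender());   // with no sender, the reply goes to dead letters

            system.shutdown();
        }
    }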

Interestingly, however, Rich Hickey decided against incorporating the Actor model into Clojure, preferring agents instead, as he describes here: http://clojure.org/state. I have to confess I haven't fully grokked why - something that may need to be learned through experience. Of course, since Clojure is a JVM language and Akka has a nice Java API, you could use Akka from Clojure via its Java interop features :-)  However, Hickey does say at the end: "Clojure may eventually support the actor model for distributed programming, paying the price only when distribution is required, but I think it is quite cumbersome for same-process programming."

This points up another bonus of using Akka - it has remoting built in and includes integration with Apache Camel. We haven't started using these features yet, but since we have a Camel integration framework, it is nice to know we can leverage them.



The Little Schemer, book by Daniel Friedman and Matthias Felleisen

Link: http://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992/ref=sr_1_1?ie=UTF8&qid=1324704715&sr=8-1

Summary and Thoughts: I was not fortunate enough (some would say unfortunate enough) to learn Lisp/Scheme in college.  It has been on my list for years and when Clojure came along and started getting popular, I decided it was time.

Douglas Crockford says in JavaScript: The Good Parts that all computer scientists should read The Little Schemer to truly understand recursion and how to think recursively. He, in fact, re-did The Little Schemer exercises in JavaScript, which is actually a very expressive language for functional concepts (just with C-style syntax).

So this year, I finally decided to sit down and read it. I combined this reading with learning the modern Lisp-on-the-JVM, Clojure. So every time The Little Schemer challenged me to think through an algorithm, I wrote it in Clojure.

About halfway through, I started doing the exercises in Scala as well, which I am also in the early stages of learning. (For a newbie at least, figuring out how to do functional programming in Scala does not seem as natural as doing it in Clojure.)


In any case, I can attest that after "doing" this book, recursive thinking is definitely more natural now; sometimes when I approach a problem I can immediately see how to do it recursively, while the iterative approach takes longer to think through - a Necker Cube-like transition. I got The Seasoned Schemer (the sequel) for Christmas this year, so that will be on my todo list for 2012.

Sunday, December 11, 2011

Happiness requires discipline


Staying singly focused on a task in this digital era is like trying to resist eating while sitting in a bakery as cookies, pies, and cakes emerge fresh and fragrant from the oven.

We live in the era of rampant multitasking, but I've seen studies showing that people aren't as good at it as they think. Kind of like how people think they can learn to get by on less sleep. But here we fight our biology. It turns out that most of us really do need a full 8 hours of sleep. Not 6, not even 7. Empirical task-based studies show that if you don't get 8 pretty much every night, you are not performing optimally. And as programmers, we need to be optimal to do the hard task of programming.

Self-discipline, self-control and focus are the keys to success in programming and life - including health and happiness.

At the NFJS conference this year, I heard about the Pomodoro Technique for the first time - a cycle of focused, intense concentration alternated with a short break. I've been trying it out. For my environment at work, I find that it is pretty effective. But for big problems where you really need to keep a lot of state in your head for prolonged periods, a double pomodoro may be needed. Of course, one could argue that even longer than 50 minutes is optimal. Probably true, especially if you are in flow, but here too, sigh, it turns out we fight our biology.

A lot of studies lately have been showing that sitting for prolonged periods is detrimental to your health.  Sitting longer than an hour is unfortunately not advised (see links below).  So do some walking, stretching and knee bends during those pomodoro breaks.

So a programmer is happy and optimal when self-disciplined to:
  • minimize distractions (my old roommate used to program with the TV on in the background, aargh)
  • spend most of the day in periods of intense concentration mentally stepping through code paths, design considerations, testing approaches and problem solving
  • balance the mental discipline with the physical discipline of exercise, breaks and, *sigh*, 8 hours of sleep a night, even when hacking late into the night would be more fun

That's how I see it, anyway.  (And now I need to get my *rse out of the chair and get some blood flowing.)



Links about prolonged sitting:
Excellent book on our biological requirements for sleep: