Saturday, January 21, 2012

Events and Event-Driven Architecture: Part 3

In part 1 and part 2 of this series, I introduced Event Driven Architecture (EDA), compared it to Event Driven Programming and then spent time distinguishing between intra-system and inter-system EDA. In this installment, I enhance the discussion around inter-system Domain Event Driven Architecture with a hodge podge of follow-on topics:

  1. Domain Event Formats
  2. Performance Characteristics of EDA
  3. Complex Event Processing
  4. Guaranteed Delivery


/* ---[ Domain Event Formats ]--- */

Events in inter-system EDA, remember, are basically messages from one system to another that fully describe an occurrence in the system that issued them. The event cascades to some number of subscribers, unknown to the publisher (aka the event generator).

What formats could be used for a Domain Event? Basically any data format with a header and body. Let's consider a few options here and the criteria for selecting one appropriate to your domain requirements.

First, viable options include:

  1. XML
  2. JSON
  3. ASN.1
  4. Google Protocol Buffers
  5. CSV
  6. Serialized Objects
  7. Hessian
  8. Industry-specific options

Criteria by which these can be evaluated include:

  1. Is there a standard in my particular industry I should follow?
  2. Does it have support for a schema?
  3. Is it binary or text-based?
  4. Does it have cross-platform (cross-language) support?

If the answer to #1 is yes, then that is probably your default choice. Beyond that though, some argue that there are definite advantages in choosing a format that has a schema.

First off, a schema obviously allows you to validate the structure as well as the content. But probably its most important feature is that it enables versioning of formats. In part 2 I described that a key advantage of inter-system EDA is that systems do not have to be in lockstep when upgrading to new functionality and versions.

If a consumer can only process version 1 and version 2 causes it to choke, one way to handle this would be to use the Message Translator Pattern to convert from version 2 back to version 1.

Another option is for the event generator to publish both versions either to different event channels (such as different JMS Topics) or to the same channel with different metadata, such as the version itself. This allows the subscribers to selectively choose which version to subscribe to and process.
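With JMS, for example, the version could be carried as a message property and subscribers could filter on it with a message selector. Here is a minimal sketch of that approach (the topic name, property name and payload are illustrative assumptions on my part):

import javax.jms.*;

public class VersionedEventExample {

    // Publisher side: stamp each domain event with its format version.
    static void publishV2(Session session, Topic topic, String payload) throws JMSException {
        MessageProducer producer = session.createProducer(topic);
        TextMessage msg = session.createTextMessage(payload);
        msg.setStringProperty("eventVersion", "2");   // metadata, not part of the event body
        producer.send(msg);
    }

    // Subscriber side: a message selector lets a consumer that only understands
    // version 1 ignore version 2 events published to the same topic.
    static MessageConsumer subscribeToV1Only(Session session, Topic topic) throws JMSException {
        return session.createConsumer(topic, "eventVersion = '1'");
    }
}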

Binary formats will be smaller and thus faster to send and probably faster to parse. The documentation on Google Protocol Buffers, for example, claims that processing them is 20 to 100 times faster than a comparable XML document.

Lastly, if you work in an environment that needs to interoperate with multiple languages, you will need to use a format that is amenable to all those platforms.

Using the above criteria, good old XML is probably the best text-based format, as it supports schemas and is universal to all platforms. For a binary protocol, Google Protocol Buffers is quite attractive, as it is schema-based, binary and efficient. It is not inherently universal, however, since you must have language bindings to serialize and deserialize objects to and from that format. Google supports Java, C++ and Python. There are community supported open source bindings for other languages here, but you'll have to research how mature and production ready any given one is.

The documentation on Protocol Buffers is very good and provides helpful advice on planning for domain event content/format changes. For example, they recommend making most fields optional rather than required, especially new fields added in subsequent versions, and having a convention for deprecating older non-required fields, such as an OBSOLETE_ prefix.
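For a flavor of how an optional field plays out in consumer code, here is a rough Java sketch. OrderEvent is a hypothetical class generated by protoc; the class and field names are mine, not from the Protocol Buffers documentation:

import com.google.protobuf.InvalidProtocolBufferException;

public class OrderEventConsumer {

    // OrderEvent is assumed to be generated from a .proto schema in which most
    // fields, especially ones added in later versions, are declared optional.
    public void handle(byte[] rawEvent) throws InvalidProtocolBufferException {
        OrderEvent event = OrderEvent.parseFrom(rawEvent);

        // Optional fields come with has...() accessors, so an older consumer
        // can probe for the fields it knows about and simply ignore the rest.
        if (event.hasLoyaltyId()) {
            System.out.println("loyalty id: " + event.getLoyaltyId());
        }
    }
}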

Stefan Norberg's presentation on Domain EDA, which I referenced heavily in part 2, has a more detailed breakdown of Domain Event formats using these criteria and a few more (see slides 47-49).

One option not considered in the Norberg presentation is Thrift, originally built by Facebook and now an Apache project. It has support for more than a dozen programming languages, is binary and has support for versioning (which I assume means it has a schema, but I haven't been able to tell from the reading I've done on it). Unfortunately, I found complaints as far back as 2008 about its lack of documentation, and its Apache web site is still sadly lacking in that department. I wonder if it has lost out to Google Protocol Buffers? If any readers have experience using it, I'd be interested in hearing from you about it.


/* ---[ Performance Characteristics ]--- */

Event Driven Architecture gains loose coupling by basically passing data structures (event payloads) into generic, standardized queueing or messaging systems and using asynchrony. Since EDA is asynchronous by nature, it is amenable to high performance, high throughput and high scalability, with the tradeoff being that you have to live with eventual consistency between systems.

In a synchronous model, the highest throughput possible (or the lowest latency, if that is the measure you care most about) is set by the slowest system in the chain, whereas in an environment where systems are connected asynchronously, the faster systems are not constrained by the slowest one.

To illustrate, suppose we have two systems A and B. System A by itself can process 500 transactions per second (tps), while system B can process 200 tps by itself. Next assume system A has to send some sort of message or RPC call to system B. If we make the interaction synchronous, system A can now process at most 200 tps (ignoring communication overhead between A and B). But if we can connect them with asynchronous messaging, system A can continue to process at 500 tps.

In isolation this is excellent, and often it works out very well in practice. Of course, an architect considering both systems will need to consider that system B may continually fall farther and farther behind and never catch up to system A, which could be a problem in some scenarios. Nevertheless, asynchronous connectivity allows an EDA to maximize throughput (and ideally minimize the latency of individual requests), reserving synchronous request-reply mode for only the parts that absolutely must have it.

Related Idea: Overcoming SLA Inversion

A side note I want to throw in here that isn't related to performance, but follows the same line of reasoning as the above example: EDA can help overcome SLA Inversion, a term coined (as far as I know) by Michael Nygard in Release It! (which, if you haven't read it, should move to the top of your reading list).

SLA (Service Level Agreement) inversion is the problem of trying to meet an SLA about uptime (say 99.5%) when your system is dependent on other systems. If those other systems are coupled to yours with synchronous call behavior and their SLAs are lower than yours, then you cannot meet your higher SLA. In fact, in synchronously coupled systems, the best-case SLA of your app is the product of the SLAs of all the systems involved.

For example, if system A's SLA is 99.5%, system B's is 99.5% and system C's is 98.0%, if system A is synchronously dependent on B and C, then at best its SLA is (0.995 * 0.995 * 0.98), which calculates to an overall 97.0% SLA.
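The arithmetic is simple enough to sanity-check in a few lines of Java (the numbers are just the ones from the example above):

public class CompositeSla {
    public static void main(String[] args) {
        // SLAs of system A and of its synchronously coupled dependencies B and C.
        double[] slas = { 0.995, 0.995, 0.98 };

        double composite = 1.0;
        for (double sla : slas) {
            composite *= sla;   // synchronous coupling multiplies the availabilities
        }
        // Prints approximately 0.970, i.e. a best-case 97.0% SLA for system A.
        System.out.printf("Composite SLA: %.3f%n", composite);
    }
}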

One of Nygard's proposed solutions to SLA inversion is decoupling middleware, which is exactly what is provided by an Event Driven Architecture with an asynchronous messaging backbone between systems.


/* ---[ Complex Event Processing ]--- */

To define Complex Event Processing (CEP), let's first give an informal definition of "simple" or regular event processing. Simple processing is typically a situation where a single event from an event source causes a predictable, expected response that does not require fancy algorithms. Examples include a data warehouse that subscribes to events and simply consumes them into its own data store, or a "low threshold" event from, say, an inventory system that triggers a "restock" action in a downstream consumer.

In a case where patterns over time in the event stream must be monitored, compared and evaluated, then you have complex event processing. Let's return to the fraud detection system we outlined in part 2. In order to detect fraud, two types of analyses might be set up:

  1. Watch for abnormal purchase behavior against some relative scale, such as previous purchase amounts or purchase frequency.
  2. Watch for abnormal purchase behavior against some absolute scale, such as many repeated purchases in a short period of time.

The event streams can be monitored in real time and constantly compared to fraud rules and heuristics based on past behavior. Averages and other aggregations over a time series, either overall or on a per-user basis, can also be kept in the consumer's own data store. If a significant pattern is detected, the CEP system can take action itself or fire off a domain event for another system to handle.
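To make the second type of rule concrete, here is a minimal hand-rolled sketch of a sliding-window check for "many repeated purchases in a short period of time." The threshold, window size and method names are illustrative assumptions; a real CEP engine such as Esper would let you express this kind of rule declaratively over the event stream.

import java.util.ArrayDeque;
import java.util.Deque;

public class RepeatedPurchaseRule {
    private static final int MAX_PURCHASES_IN_WINDOW = 5;   // assumed threshold
    private static final long WINDOW_MILLIS = 60000L;       // assumed 1-minute window

    // In practice you would keep one deque of purchase timestamps per customer;
    // a single deque is enough to show the sliding-window idea.
    private final Deque<Long> recentPurchases = new ArrayDeque<Long>();

    /** Returns true if this purchase pushes the customer over the threshold. */
    public synchronized boolean isSuspicious(long purchaseTimeMillis) {
        recentPurchases.addLast(purchaseTimeMillis);

        // Drop purchases that have fallen out of the time window.
        while (!recentPurchases.isEmpty()
                && purchaseTimeMillis - recentPurchases.peekFirst() > WINDOW_MILLIS) {
            recentPurchases.removeFirst();
        }
        return recentPurchases.size() > MAX_PURCHASES_IN_WINDOW;
    }
}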

For the right need, CEP can be a very powerful use of event driven architecture. In addition, you don't have to build it from scratch. Tools like Esper for both Java and .NET are designed to do Event Stream Processing and make inferences against patterns in high-volume event streams using algorithms of your choice or design.


/* ---[ Guaranteed Delivery ]--- */

In part 2, I discussed using asynchronous events from a source system, such as an inventory or point-of-sale system to populate a reporting database. In many cases, this would require that the reporting database have complete data - no gaps due to lost messages. To do this via events put onto a message bus, you would need some form of guaranteed delivery, where each and every message sent can be guaranteed to be delivered to the consumer.

Guaranteed Delivery (my definition)

Once a message is placed into a message channel, it is guaranteed to ultimately reach all its targeted destinations, even if various parts of the system fail at any point between initial publication and final delivery. Or, if delivery does fail, it fails in a way that lets the broker/sender know the message needs to be resent.

Here I outline three potential ways of achieving guaranteed delivery from a messaging system. There are probably other options and I'd be eager to hear of them from readers.

  1. Synchronous transactional mode
  2. Asynchronous send + reply with acknowledgment
  3. Persistent messages with durable subscriptions (also asynchronous)

Note: I don't have much knowledge of the Microsoft/.NET side of the world, so in this discussion I will focus on JMS (Java Message Service) and some parts of AMQP (Advanced Message Queueing Protocol) that I have read about.

Option 1: Synchronous transactional mode

JMS has a synchronous mode which in theory could be used for guaranteed delivery, but obviously in a very limiting way that defeats the purpose of asynchronous messaging systems.

In AMQP (from my limited study so far) it appears that the typical way to get guaranteed delivery is via synchronous transactions, where the publisher must wait until the broker has processed the message before continuing. This is likely to be the option with the lowest throughput (see note in RabbitMQ section below).

Option 2: Asynchronous send + reply with acknowledgment

The next option is to set up an asynchronous request/reply-with-acknowledgment system to ensure that all messages delivered are received by all relevant subscribers. The ActiveMQ in Action book describes the model shown in Figure 1 below.

Figure 1: Request / reply model using asynchronous messaging

One option for doing this is to use header fields in the messages (JMSReplyTo and JMSCorrelationID) and create temporary destinations (queues) for the acknowledgment replies. Any replies not received in some time frame could be considered undelivered messages that would be resent to the receiver. To handle cases where the producer’s message was received by the consumer, but the acknowledgment was not received, the receiver will have to be able to detect and ignore duplicate messages.
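A minimal sketch of that producer-side pattern in JMS might look like this (the timeout and destination are illustrative assumptions):

import javax.jms.*;

public class AckingProducer {

    /** Sends an event and waits briefly for an acknowledgment on a temporary queue. */
    static boolean sendWithAck(Session session, Destination eventDest, String payload,
                               long ackTimeoutMillis) throws JMSException {
        // Temporary queue that lives only as long as this connection.
        TemporaryQueue ackQueue = session.createTemporaryQueue();

        TextMessage msg = session.createTextMessage(payload);
        msg.setJMSReplyTo(ackQueue);                       // tell the consumer where to reply
        session.createProducer(eventDest).send(msg);

        // Correlate the acknowledgment to the message we just sent.
        String selector = "JMSCorrelationID = '" + msg.getJMSMessageID() + "'";
        MessageConsumer ackConsumer = session.createConsumer(ackQueue, selector);
        Message ack = ackConsumer.receive(ackTimeoutMillis);

        // A null ack means "not acknowledged in time"; the caller should resend,
        // and the consumer must therefore detect and discard duplicates.
        return ack != null;
    }
}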

RabbitMQ, an AMQP compliant broker (but not JMS compliant), provides an extension to AMQP they call Lightweight Publisher Confirms. It implements a form of asynchronous messaging followed by acknowledgement of receipt.

The RabbitMQ website claims this is 100 times faster than using typical AMQP synchronous transaction model for guaranteed delivery.
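Roughly, with the RabbitMQ Java client, publisher confirms look like the following (the queue name and payload are illustrative, and the exact client API may differ between versions):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        channel.confirmSelect();                     // put the channel into confirm mode

        channel.queueDeclare("domain-events", true, false, false, null);
        channel.basicPublish("", "domain-events", null, "order-placed".getBytes("UTF-8"));

        // Blocks until the broker confirms it has taken responsibility for the message.
        channel.waitForConfirmsOrDie();

        channel.close();
        conn.close();
    }
}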

Option 3: Persistent messages with durable subscriptions (also asynchronous)

Finally, in JMS, the preferred option to get guaranteed delivery appears to be a model where:

  1. Messages are persisted to disk once they are delivered to a channel (broker)
  2. Subscribers have durable subscriptions

I describe this option in detail below.

-- Persisting Messages --

In JMS, message persistence is set by the delivery mode property on the producer (not by the subscription type): DeliveryMode.PERSISTENT or DeliveryMode.NON_PERSISTENT. If a message's delivery mode is persistent, then it must be saved to stable storage by the broker.

For Apache ActiveMQ, for example, there are two file-based persistence options (AMQ and KahaDB) as well as relational-database persistence via JDBC. The file-based options have higher throughput, and KahaDB is the recommended option for recent releases.

As shown in Figure 2 below, you can choose to have a single central broker or you can set up a local (or embedded) broker on the sender computer (or VM) where the message is persisted before being sent to a remote broker.

Figure 2: Options for persistence of JMS messages


-- Message durability vs. message persistence --

Message durability refers to what JMS calls a durable subscription: the "durability" is an attribute of the subscription with respect to the behavior of the subscriber, not of the messages themselves. In the durable subscription model, if a subscriber disconnects, its messages will be retained until it reconnects and retrieves them. If the subscription is non-durable, the subscriber will miss any messages delivered while it is not connected.

Message persistence refers to the persistence of messages in the face of a stability problem in the broker or provider itself. If the broker goes down while there are undelivered messages, will the messages be there when it starts up again? If the messages are set to persistent mode and the persistence is to disk (rather than memory cache), then and only then is the answer "yes".

With a durable subscription, the broker remembers what the last message delivered was on a per (durable) subscriber basis, as shown below:

Figure 3: Durable subscribers – the broker remembers which subscribers have received which messages. The arrows to the durable subscribers show the last message sent to that subscriber.
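Putting the two settings together, a sketch of a publisher using persistent delivery and a subscriber with a durable subscription might look like this for ActiveMQ (the broker URL, client ID, topic and subscription names are illustrative assumptions):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class GuaranteedDeliveryExample {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");

        // Subscriber: a durable subscription, identified by client ID + subscription name,
        // so the broker retains messages while this subscriber is disconnected.
        // (It is registered first, since a durable subscription only covers messages
        // published after it exists.)
        Connection subConn = factory.createConnection();
        subConn.setClientID("reporting-db");
        Session subSession = subConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = subSession.createTopic("inventory.events");
        TopicSubscriber subscriber =
                subSession.createDurableSubscriber(topic, "reporting-subscription");
        subConn.start();

        // Publisher: mark messages persistent so the broker writes them to stable storage.
        Connection pubConn = factory.createConnection();
        Session pubSession = pubConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = pubSession.createProducer(topic);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);
        producer.send(pubSession.createTextMessage("item-42 restocked"));

        System.out.println("received: " + subscriber.receive(5000));

        pubConn.close();
        subConn.close();
    }
}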

-- Performance Costs of Guaranteed Delivery with Message Persistence and Durable Subscriptions --

There is a performance hit to having message persistence and durable subscriptions. I was only able to locate a single source that had published any metrics – a 2008 report for Apache ActiveMQ. If anyone knows of others, please let me know.

I summarize the key findings of that paper with respect to the settings required for guaranteed delivery. The paper shows the following two graphs relevant to the performance hit of the "guaranteed delivery" options:

Figure 4: Performance (throughput) of ActiveMQ with non-persistent messages and non-durable subscriptions (Note: "1/1/1" refers to one producer / one consumer / one broker – the second, third or fourth options shown in Figure 2.)

Figure 5: Performance (throughput) of ActiveMQ with persistent messages and non-durable subscriptions (blue) and persistent messages with durable subscriptions (red). The tests here are otherwise the same as those shown in Figure 4.

Comparison                                                                                          | Times Faster
non-persistent msgs, non-durable subs > persistent msgs with durable producer only                  | 2.3
non-persistent msgs, non-durable subs > persistent msgs with durable producer and consumer          | 15.6
persistent msgs with durable producer only > persistent msgs with durable producer and consumer     | 6.7

It is not clearly laid out in this benchmarking document, but I believe “durable producer / non-durable consumer” means the second option shown in Figure 2, namely that messages are persisted in a single broker on the producer side. If that is correct, then “durable producer / durable consumer” means the first option in Figure 2 – that there are two brokers, one on the producer side and one on the consumer side as each has persistent messages and durable subscribers.


/* ---[ Other Uses of Events ]--- */

In this blog series we have covered events from the point of view of Event-Driven Programming (encompassing topics like "Evented I/O"), Intra-system Staged Event Driven Architecture and inter-system Event-Driven Architecture. There is at least one other major context that uses the concept of event and that is "Event-Sourcing", which leads to topics like Retroactive Event Processing and CQRS. I encourage you to investigate those areas as I believe they are becoming more popular. Greg Young has done a lot to popularize them and convince people that it is a robust model for enterprise computing.

As a reminder, nothing in this short series was intended to be original work on my part. Rather, I have tried to do a survey of the various meanings around "event" in programming and architecture. Someday I may do a piece on Event Sourcing and CQRS, but I have plans to blog on other topics in the near term. I can recommend this blog entry from Martin Krasser, which will hopefully be the first in a very interesting series giving a first-hand account of building an event-sourced web app in Scala.



Previous entries in this blog series:
Events and Event-Driven Architecture: Part 1
Events and Event-Driven Architecture: Part 2


Sunday, January 15, 2012

Events and Event-Driven Architecture: Part 2

In my last blog entry, I introduced Event Driven Architecture (EDA), compared it to Event Driven Programming and then spent most of the time diving into SEDA - Staged Event Driven Architecture, which is an intra-system design. In this blog entry, I will describe inter-system EDA, which is typically what people mean when they speak of "Event-Driven Architecture".


/* ---[ Inter-System Event Driven Architecture ]--- */

Event Driven Architecture describes a loosely coupled design between systems that communicate asynchronously via events. The format of the event is the API between systems and the bus between them is some form of messaging backbone, most typically a message queue.

Let's define "event" in the EDA sense (not necessarily the EDP sense) with a "business" definition and then dissect it from a technical point of view.

Event

a notable occurrence that is deemed important enough to notify others about. It can signal an action (such as an inventory change), a problem or impending problem (out of or nearly out of an inventory item), a threshold, a deviation or something else. It is typically a description of something that has happened, rather than a command to do something.


Event - technical breakdown

An event is communicated in a message with a header and a body.

  • The header can include: type, name, timestamp and creator, and should be defined consistently across event specifications
  • The body describes what happened. It should fully describe the event such that an interested party does not have to go back to the source system in order to get essential information.
    • This distinguishes it from a RESTful model (at least level 3 of the Richardson Maturity Model) where typically only minimal essential information is communicated. Instead URIs are provided for the client to go back and retrieve more complete information about an entity. REST is typically a request-reply interaction, whereas events in EDA are usually background asynchronous notifications broadcast to a messaging system, so a policy of providing all relevant details in the event message makes sense under those circumstances.

So what formats could be used for event info? Basically anything with a header and body. XML and JSON are reasonable choices. I'll discuss other options and considerations on how to choose a format in the next EDA blog post.

Interacting With a System

In an Event Driven Architecture, systems communicate with each other by sending events describing what just happened. These are generally not point-to-point connections and they are not synchronous request-reply connections. Instead, an event is broadcast to any and all listeners in an asynchronous fire-and-forget fashion. The broadcast is done via a queue or other messaging system. This is why EDA creates a more loosely coupled system than a traditional SOA design.

There are only two dependencies between systems now:

  1. the domain event format and content
  2. the messaging middleware

System A, publishing its events, is unaware of any subscribers to its stream of domain events. System A does not have any dependencies on subscriber systems. If downstream system B is currently down or overloaded, system A will continue publishing at its own pace. System B can get to its "inbox" whenever it is ready and it can do whatever it likes with the events it receives. None of that has any effect on system A, which can continue to go about its business as if it was the only system in the world.

Each system in an Event Driven Architecture can be carefree about other systems, but of course an Enterprise Architect still needs to ensure that all systems under his or her purview are interacting appropriately and meeting their SLAs. So with EDA there are still hard integration problems to solve. But there are circumstances where this loosely coupled design is a better and simpler choice. I'll cover an example later.

Note: there are some who would argue with the idea that an asynchronous eventually-consistent model is "simpler" than a typical SOA synchronous request-reply architecture.

As usual, whenever I use the word simple I (try to) always mean it in the sense portrayed by Rich Hickey in his Simple-Made-Easy talk - namely, simple means unentangled with concerns of others. Minding your own business is simple. Being involved and entangled in everybody else's is complex. Any repeat readers of my blog will probably get tired of me talking about it, but I think that was one of the most profound lectures I've listened to, so head there next if you haven't listened to it yet.


/* ---[ Some Formal Terminology ]--- */

Let's give some formal names to the players in an event-driven system.

First, there is the event generator. This is usually a software system that has just processed a request (say from a user) or an event (from another event generator).

Next, there is the event channel or event bus - usually a messaging system, such as a JMS- or AMQP-compliant broker.

Event subscribers are interested parties (applications) that sign up to receive events off the event channel. They then become event processors, possibly doing some action in response to the event or a series of events. The sum total of the downstream event processing is the event consequence.


/* ---[ Making It Real ]--- */

OK, enough formality, let's have an example. For this, I have found no better source than a presentation available on InfoQ by Stefan Norberg. I highly recommend watching this presentation. With apologies (or kudos) to Stefan, I'm going to steal his example here.

Suppose you have an online store where users purchase some goods from you. In your enterprise, your architects and developers have put together some front-end systems like a shopping web site and a payment system. In the back end you have your databases, inventory and reporting systems. It is composed of SOA services connected through some enterprise service bus (ESB). Currently those systems interconnect with request-reply synchronous behavior.

Now management says to the IS team: "we want to institute two new features - a real-time fraud detection system and a loyalty program". If you stick with the current design, you will have to embed a fraud detection system in multiple systems - those systems will have to become fraud aware. And the fraud business rules may need to correlate across systems, building in even more complex dependencies. I'm going to steal two slides from Norberg's presentation and here is the first to illustrate:

(Image from slides available here)

This becomes an unscalable model. The fraud and loyalty systems are not only embedded in other systems, but they also will need to query the database(s) you have - thus creating more inter-dependencies, the opposite of a simple system. If a database schema were to change, the fraud and loyalty systems also have to be updated.

Event-Driven Architecture challenges us to look for opportunities where a more loosely-coupled design is possible. What if, instead, we pulled the fraud and loyalty systems out, made them standalone, perhaps give them their own databases (to keep just what they want for as long as they need) and had them subscribe to a domain event queue from the other systems?

If we modify the Shop and Payment systems to fire off domain events that describe what they do as they do it, the fraud and loyalty systems can subscribe to those in real-time. Remember also that events should ideally contain all information about the event so that subscribers do not have to go back to the source systems to get more information. This then also decouples the existing Customer and Reporting databases from the Fraud and Loyalty system, simplifying the design.

In fact, now that we are thinking this way, we can see that the Reporting database is a read-only database that could also be populated by subscribing to the event stream. So by rearchitecting now we have this:

(Image from slides available here)

In this slide there is also a generic Business Activity Monitoring (BAM) module to show that additional feature requests from management could be added here as well. This slide illustrates nicely how EDA and SOA can work together, allowing us to choose the right model for each subdomain.


/* ---[ Integration Modes ]--- */

Integrations, as you've seen, are not point-to-point in an EDA system. Thus, an EDA system is very amenable to an ESB or EIP (Enterprise Integration Patterns) distributed system, such as Apache Camel.

Unlike batch-based ("offline") modes of integration, EDA systems can stay up-to-date in real or near-real time. For example, instead of an hourly or nightly ETL job to synchronize datastores, one datastore can publish its new entries to a queue and the other system can subscribe and update its datastore immediately.

This flexibility at each layer is the strength of EDA. What gets generated, how it is processed/formatted, who receives it and what they do with it are all loosely coupled, and each layer depends only on the one whose queue/event stream it directly subscribes to. Even then, it is not a deep, complex API-based dependency.

With CORBA, DCOM and RMI, the caller and recipient have to be in lock-step with each other through all version changes - they are hard-coded pre-compiled APIs that must match. But with Events, you need not have as much coupling. If event generators begin publishing new types of events, it will not break downstream systems and the subscribers can continue on as before, or adapt and be able to handle the new messages on their own update schedule.

You can also evolve the types of messages you process. For example, you could pass around the equivalent of HashMaps containing the data you need rather than strictly formatted XML docs. If you add more keys and values to the map, a downstream subscriber can still pull out the old ones it knows about.
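A toy illustration of that tolerance in Java (the event keys are made up for the example):

import java.util.HashMap;
import java.util.Map;

public class MapEventExample {
    public static void main(String[] args) {
        // Version 2 of the producer adds a "loyaltyId" key that older consumers
        // have never heard of.
        Map<String, String> event = new HashMap<String, String>();
        event.put("type", "purchase");
        event.put("amount", "59.95");
        event.put("loyaltyId", "L-1234");   // new in version 2

        // An old consumer keeps working: it reads only the keys it knows about
        // and silently ignores the rest.
        System.out.println("Recording purchase of " + event.get("amount"));
    }
}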

The picture below shows how a reporting system can be upgraded at a different rate than the systems generating the events it subscribes to. As long as the basic “API”, i.e., format of the message, is retained, the consumers can be updated at a slower rate than the producers without breaking the overall system.


/* ---[ Next Installment ]--- */

Hopefully this installment in the EDA series gave you a sense of the power of EDA and promoted the idea that EDA and SOA are not competitors, but rather co-exist peacefully.

In the next blog post on EDA, I'll cover four additional topics: performance characteristics, domain event/message formats, complex event processing and guaranteed delivery. The latter is something you will want to have in many circumstances. Take our example above of having the reporting database populated by an event stream from the shopping and payment systems. In order to have accurate reporting, you will want to guarantee that eventually all events reach the reporting database, by some means of guaranteed delivery.


Key References:
1. Domain Event Driven Architecture, presentation by Stefan Norberg
2. Event-Driven Architecture Overview (PDF), whitepaper
3. How EDA extends SOA and why it is important, blog post



Other entries in this blog series:
Events and Event-Driven Architecture: Part 1
Events and Event-Driven Architecture: Part 3

Saturday, January 7, 2012

Events and Event-Driven Architecture: Part 1

The terms "event" and "event-driven" in software design are actually somewhat overloaded, so I endeavored to do a little research on their various meanings. In this blog entry I start by defining and contrasting Event-Driven Programming (EDP) from Event-Driven Architecture (EDA) and then dive into some parts of the latter. Later blog entries will delve into additional areas of EDA.

There isn't any original research or design here - rather a summary of key usages I have found.

First let's make an initial distinction between event-driven programming and event-driven architecture.


/*---[ Event Driven: Programming vs. Architecture ]---*/

Event Driven Architecture in general describes an architecture where layers are partitioned and decoupled from each other to allow better or easier composition, asynchronous performance characteristics, loose coupling of APIs and dependencies and easier profiling and monitoring. Its goal is a highly reliable design that allows different parts of the system to fail independently and still allow eventual consistency. Layers in an Event Driven Architecture typically communicate by means of message queues (or some other messaging backbone).

Event Driven Programming is a programming model that deals with incoming events by having the programmer process those events one after the other in a single thread. Events come from outside the main thread - from a user action, an asynchronous server or system reply - that gets put onto the event queue for the main processing thread to handle. This is the model of many GUI programming environments, such as Java AWT/Swing. The most prominent example of EDP I'm aware of is the JavaScript model in the browser and now on the server side with Node.js.

EDP and EDA have no special co-dependent relationship to each other. One can do either, both or neither. However, they do share the central idea that events come in, typically from "outside" (the meaning of which is context dependent), are communicated in some known format, and are put onto queues to be processed one at a time.

I won't speak any more of Event Driven Programming here, but a nice introductory tutorial to Node.js is The Node Beginner Book by Manuel Kiessling.


/*---[ Staged Event Driven Architecture ]---*/

The next major divide I will consider is between inter-system EDA and intra-system EDA.

Staged Event-Driven Architecture, or SEDA, is a form of intra-system EDA. The term was coined by Matt Welsh for his Ph.D. work. It models a system by breaking it down into serial processing stages that are connected by queues. Welsh calls this a "well-conditioned" system, meaning one that can handle web-scale loads without dropping requests on the floor. When incoming demand exceeds system processing capacity, the system can simply queue up requests in its internal buffer queues.

The processing stages of a SEDA system can either be low-level or high-level subsystems. The Sandstorm web server built by Welsh for his thesis used very low-level subsystems - things like HTTP parsers, socket readers and writers, and cache handlers.

In SEDA, events are pieces of information to be processed or acted upon, at a given stage in the processing workflow. This definition of event is more compatible with the EDP definition of event than the typical EDA definition of event (which will be covered in the next blog entry).
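To make the shape of this concrete, here is a minimal hand-rolled sketch of two SEDA-style stages connected by a bounded queue. The stage names and sizes are illustrative; a real SEDA system would add per-stage thread pools, monitoring of queue depths and dynamic tuning.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MiniSeda {
    public static void main(String[] args) {
        // The bounded queue between the stages is the buffer that absorbs bursts
        // of demand; its fixed capacity is what produces backpressure.
        final BlockingQueue<String> parsedRequests = new ArrayBlockingQueue<String>(1000);

        // Stage 1: "parse" incoming requests and hand the resulting events to the next stage.
        Thread parseStage = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) {
                        parsedRequests.put("request-" + i + ":parsed");   // blocks if stage 2 is saturated
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Stage 2: handle parsed requests at its own pace.
        Thread handleStage = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) {
                        System.out.println("handled " + parsedRequests.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        parseStage.start();
        handleStage.start();
    }
}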

The Actor model is a good fit for doing SEDA. I'll speak more about this in a section below. First, let's do a quick assessment of the benefits and drawbacks of SEDA.


Proposed Benefits of SEDA Design

  • Support high demand volume with high concurrency

    • The system can be set to perform at a fixed maximum level for a given number of simultaneous requests and if overload levels are reached, then those requests pile up on the queue, rather than degrading performance by having the system process too many requests simultaneously.
  • Handle stage level "backpressure" and load management

    • without load management, requests that come in faster than they can be processed will pile up and can cause Out-of-Memory errors
  • Isolate well-conditioned modular services and simplify their construction

    • Simple here is used in Rich Hickey's objective sense: unentwined or unentangled with other systems.

    • Well-conditioned refers to a service acting like a simple pipeline, where the depth of the pipeline is determined by the number of processing stages

  • Enable introspection by watching the queues between modules - allows a system to tune responses to conditions

    • It allows self-tuning, dynamic resource management, such as adjusting thread pool count, connection pool counts

    • Queues can incorporate filtering and throttling

    • Note: I haven't seen it mentioned in any reading/examples yet, but one could also implement the queues as priority queues that push higher priority requests to the front of the line.

However, those interested in pursuing a SEDA design for systems that will need to handle heavy loads, particularly when needing predictable latency, should read Welsh's post-thesis note and the retrospective he wrote a few years later on SEDA.

It turns out that SEDA designs can be extremely finicky - they can require a lot of tuning to get the throughput desired. In many cases, the SEDA designs do not outperform alternative designs. Welsh points out that the primary intent of the SEDA model was to handle backpressure load (when requests come in faster than they can be processed), not necessarily be more performant.

The reason for the finicky performance, I believe, was uncovered by the LMAX team in London - queues are actually very poor data structures for high-concurrency environments. When many readers and writers try to access the queue simultaneously, the system bogs down, since both readers and writers have to "write" to the queue (in terms of where head and tail point) and most of the time is spent managing, acquiring and waiting on locks rather than doing productive work. There are better high-volume concurrency data structures, such as ring buffers, that should be considered instead of queues.

Two side notes

  1. I wonder if anyone is considering building an Actor model on a ring buffer or the LMAX Disruptors? I haven't studied the Akka library carefully enough to know whether this would be a good idea and potentially make the Actor model even more performant - I'd love to hear from people with experience on this.

  2. There are other data structures being designed for the "multicore age". One article worth reviewing was published in the ACM in March 2011. Currently you have to pay to get the article, but there is a version available on scribd last time I checked.


/*---[ Using The Actor Model to Implement SEDA ]---*/

An alternative that Matt Welsh (originator of the SEDA model) never discusses in his writings (at least the ones I've read) is the Erlang Actor model. In Erlang, actors communicate across processes by message passing. They can be arranged in any orientation desired, but one useful orientation would certainly be a staged pipeline.

There are a number of Actor libraries that bring the Erlang model to the JVM using threads rather than Erlang-style processes. The libraries likely have differences between them. I will focus here on the Akka library (http://akka.io/), since that is the one I am familiar with. Akka is written in Scala, but is fully usable from Java.

Five of the key strengths of the Erlang/Akka model are:

  1. Each Actor is fully isolated from other actors (and other non-actor objects), even more so than regular objects. The only way to communicate with an Actor is via message passing, so it is a shared-nothing environment except for the messages, which must/should be immutable.

  2. Because Actors communicate through mailboxes (queues), this encourages designing your application in terms of workflows, determining where stage boundaries are and where concurrent processing is needed or useful.

    • Akka allows you to do message passing like you would in a traditional ESB but with speed!
  3. It is built on a philosophy that embraces failure, rather than fighting it and losing. This is Erlang’s "Let It Crash" fault-tolerant philosophy using a worker/supervisor mode of programming: http://blogs.teamb.com/craigstuntz/2008/05/19/37819/

    • In this model you don’t try to carefully program to prevent every possible thing that can go wrong. Rather, accept that failures and crashes of actors/threads/processes (even whole servers in the Erlang model) will occur and let the framework (through supervisors) restart them for you

    • This is the key to the “5 nines” of reliability in the software of the telecom industry that Erlang was built for

  4. The Actor model is built for distributed computing. In Erlang’s model, Actors are separate processes. In Akka’s model they can either be in separate processes or in the same process.

    • Akka Actors are extremely lightweight, using only 600 bytes of memory (when bare) and consuming no resources when not actively processing a message. They typically are configured to share threads with other Actors, so many thousands, even millions of Actors can be spun up in memory on modern server architecture.
  5. Finally, Actors bring a lock-free approach to concurrency. Mutable state kept inside an Actor is not shared across threads, avoiding the evil of shared mutable state.

Programming with message passing to queues (or mailboxes, as they are called in Akka) raises the level of abstraction. You can now think in terms of modules or module-like subsystems that can be designed to have a deterministic pathway. The internals of a system become easier to reason about, especially compared to hand-rolled multi-threaded code using shared mutable state.

In addition, the Akka Actor model:

  • applies well to transactional systems, CPU-bound operations, or I/O bound operations (by using multiple concurrent Actors).
  • applies to either fully isolated actions (say number crunching or the "map" part of a MapReduce function) or to coordinated actions, but using isolated mutability, rather than shared mutability.
  • Has load balancing built in (in two ways):
    • An explicit load balancer you can create and send all messages to (optional)
    • A dispatcher (required) behind the scenes that determines how events are passed and shared between Actors. For example, one Dispatcher is “work stealing”, having actors that are heavily loaded give work to actors that are less heavily loaded (in terms of items in their mailbox)

With these features, hopefully you can see how the Actor model fits a SEDA approach. Next I describe one way to do so.


(One Model For) Applying a SEDA Design to Actors

  1. Define one Actor class per Stage (http://www.slideshare.net/bdarfler/akka-developing-seda-based-applications); a minimal sketch of this is shown after this list
  2. Use a shared dispatcher and load balancer
  3. Grab a reference to the Actors (or Load Balancer) of the next stage by looking it up in the Actor registry (set up automatically by the Akka framework).
  4. In cases where you need to throttle load by applying “backpressure” to requests, you can configure Actor mailboxes
    • Note: I need more information on exactly what this entails and when it would be useful, as I haven't researched this enough yet - feedback from readers welcome!
  5. The Actor benchmarks I have seen are very impressive – hundreds of thousands of messages per second, but as with SEDA in general, Akka provides many knobs and dials to adjust nearly everything
    • As with SEDA, it can be a two-edged sword – you have many knobs to tweak to get better throughput in your system, but you may also have to do more tweaking to reach the performance you might get by default with a more traditional programming model that doesn't require so much fiddling.
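For a flavor of the "one Actor class per stage" idea, here is a minimal sketch. Note that it uses the later Akka 2.x Java API (UntypedActor and Props), not the Akka 1.x API that was current when this was written, and the stage classes are invented for the example:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class ActorSedaSketch {

    // Stage 2: consumes whatever stage 1 produces.
    public static class HandleStage extends UntypedActor {
        @Override
        public void onReceive(Object message) {
            System.out.println("handled: " + message);
        }
    }

    // Stage 1: "parses" a raw request and forwards an event to the next stage.
    public static class ParseStage extends UntypedActor {
        private final ActorRef nextStage;

        public ParseStage(ActorRef nextStage) {
            this.nextStage = nextStage;
        }

        @Override
        public void onReceive(Object message) {
            nextStage.tell(message + ":parsed", getSelf());
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("seda-example");
        ActorRef handle = system.actorOf(Props.create(HandleStage.class), "handle");
        ActorRef parse = system.actorOf(Props.create(ParseStage.class, handle), "parse");

        parse.tell("request-1", ActorRef.noSender());   // fire-and-forget into stage 1
        // In a real application you would eventually shut the ActorSystem down.
    }
}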




I'll stop here for today. I have not yet hit on what I think is the main meaning of Event-Driven Architecture - that of inter-system EDA.

Another use of events is for "Event-Sourcing", which often fits hand-in-glove with the Command Query Responsibility Segregation (CQRS) pattern.

I will definitely do a write-up on EDA for another blog entry and possibly will dive into Event Sourcing and CQRS at a later time.

Feedback, disagreements, builds from readers with experience in these domains are welcome!




Key Links/Sources:

SEDA
http://www.eecs.harvard.edu/~mdw/proj/seda/
http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
http://www.theserverside.com/news/1363672/Building-a-Scalable-Enterprise-Applications-Using-Asynchronous-IO-and-SEDA-Model
http://www.slideshare.net/soaexpert/actors-model-asynchronous-design-with-non-blocking-io-and-seda
http://www.infoq.com/articles/SEDA-Mule


Akka
http://akka.io/
http://www.slideshare.net/bdarfler/akka-developing-seda-based-applications


Next entries in this blog series:
Events and Event-Driven Architecture: Part 2
Events and Event-Driven Architecture: Part 3

Saturday, December 24, 2011

My CS/Programming Top 10 for 2011

As a year-end summary, I took a little time to look back through my personal wiki, where I keep notes on technical topics, in order to make a "Top 10" list of the papers, books, articles, blog entries, podcasts, videocasts and conferences that impacted and challenged me to expand my knowledge this year.  I made a list and checked it twice.

So in no particular order, here are my top 10 with a short bit of commentary on why you may want to read/watch/investigate these topics if you haven't already.



Simple-Made-Easy, Video Presentation, by Rich Hickey

Link: http://www.infoq.com/presentations/Simple-Made-Easy

Summary and Thoughts: If this presentation hadn't been given by Rich Hickey, I probably would have skipped it. But this year I ran across a saying (which I put as a moniker on my tech wiki): When brilliant people create things, study them. I'm not alone in thinking that Rich Hickey falls into the above category. 

I ended up watching this presentation three times over the course of a couple of months, in part because I got into a debate with work colleagues over what is a more simple vs. a more complex design. This talk helped me define those terms in terms of entanglement or intertwining - less is simpler.

It also challenges some key articles of faith of recent software engineering thinking, such as the idea that following TDD almost blindly will lead to proper design. All things in moderation, said the Greeks (the ancient ones, not the ones in current financial disarray) - and that applies to the often polarized TDD vs. up-front design debate. On this I agree with Hickey.




The Vietnam of Computer Science, blog entry, by Ted Neward

Link: http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx

Summary and Thoughts: This blog post is from 2006, so you can tell this is a 2011 top 10 from my personal experience, not publications of 2011. In addition to providing a refresher course in the history of the Vietnam War, Neward argues that Object-Relational Mapping (ORM) is in general an extremely leaky abstraction whose siren song of an easy life is alluring, but eventually a nasty law of diminishing returns kicks in. To pile on, Rich Hickey, in the presentation listed in item #1 above, says that ORMs have OMG complexity (OMG = oh my goodness), where the goal of software design is simplicity, not complexity.

Neward uses the analogy of the "Drug Trap": you start off taking low doses of a pharmaceutical your doctor prescribes and it works great for a while. Then it stops working, so you increase the dose, which works great for a while and then you increase it again, forming a cycle of ratcheting up until unhealthy extremes are reached. For an ORM, the ratcheting is the amount of effort you need to get your complex domain model requirements into and out of your (relational) database - and as the leaky abstraction begins to bleed more heavily you have to make some tough choices: either bear down and force the O/R model to do what you want with increasingly complex APIs or workarounds (and Neward outlines a number of those),  decide to stop using O/R for the hard stuff and go back to good ol' SQL, or find another way altogether, such as integration of relational concepts into the language. In the summary section, he lists the six possible options - to which a seventh could now be added of using a non-relational (NoSQL) database.

Given how popular Hibernate is in the Java space and ActiveRecord in the Ruby (Rails) space as the near-absolute default way of working, this is a controversial opinion. One could argue that, 5 years later, ORMs have gotten better, and there are good arguments that in some scenarios ORMs make good sense.

However, where I work we have settled on MyBatis (formerly iBatis), which is not an ORM tool, as a better mapping model for most database scenarios and also integrates nicely with Spring. I need to research LINQ (which even Rich Hickey praised in his talk on simplicity) and the Spring Data efforts to get a more rounded view on the whole area before I make up my mind. And in the Ruby world, Sequel is getting a lot of press, and is also something on my "to-research" list.



REST to My Wife, blog entry, by Ryan Tomayko


Link:  http://tomayko.com/writings/rest-to-my-wife

Summary and Thoughts: I got deeper into REST and RESTful design this year and this blog entry started me off early in 2011. A good read to step up a level from the details of REST vs. SOAP type designs to see the bigger picture of what we are grappling with designing inter-system communication.



VoltDB and NewSQL, presentations and writings by Michael Stonebraker


Links: 
Presentation 1: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Presentation 2: http://bit.ly/v6ANsq
ACM article, 10 rules for scalable performance in 'simple operation' datastores: http://bit.ly/tAlaVe

Summary and Thoughts:
 Now that the NoSQL movement is in full swing and has gained wider corporate adoption, some of its tenets, such as dropping full ACID consistency and the relational model, are being revisited and challenged as to whether they are actually necessary for doing web-scale distributed databases. This movement has branded itself as "NewSQL" to distinguish itself from "Old SQL" (Oracle, MySQL, PostgreSQL and the like) and from the non-relational, mostly non-ACID NoSQL data stores (Neo4j being a pleasant exception). NewSQL, and in particular VoltDB, are targeting the high-data-volume and high-data-throughput OLTP space - directly taking on the traditional relational databases. Michael Stonebraker, one of the early pioneers of relational databases (through his work on Ingres, which led directly to PostgreSQL), is the CTO of VoltDB and a proponent of its new model.

I haven't seen VoltDB get a lot of press, but after I watched the presentations above, I was enamored by this approach, in part because it provides a very clean solution to the problem domain I am working on at my day job.  The key aspects to their approach include:

  1. Individual node performance matters, not just parallelization: get rid of the overhead of databases by making all writes (inserts/updates) single threaded - no read/write lock contention or B-Tree latch contention to handle
  2. Have an in-memory database, with very lightweight persistence of the "commands", not the current state of the db.  
  3. Push the database logic and processing to the database using VoltDB-style stored procedures, which you write in Java (see the sketch below)
  4. Use shared-nothing scalability: find the keys by which to shard your data and distribute it over as many machines as you need, while making the vast majority of your transactions single-sharded - fully replicate the other data. Make sure your database can do this for you transparently.
  5. The SQL and relational models are solid foundations on which to continue to build OLTP type applications (but granted they do not fit all models/use cases).
By following this model, Stonebraker claims you can potentially achieve 40x faster database operations that are fully ACID-transactional.  If you combine this with a processing model using the LMAX Disruptors (next entry below), one could build an extraordinarily fast OLTP processing and data storage system. This is something I'm extremely interested in prototyping and considering for production use.
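To give a feel for item 3 above, here is a rough sketch of what a VoltDB-style stored procedure looks like in Java, based on my reading of the VoltDB documentation. The table, columns and procedure are invented for the example, so treat the details as approximate rather than authoritative.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Runs inside the partition that owns the given account id, so the read-modify-write
// below needs no locking: each VoltDB partition executes single-threaded.
public class DebitAccount extends VoltProcedure {

    public final SQLStmt selectBalance =
            new SQLStmt("SELECT balance FROM accounts WHERE account_id = ?;");
    public final SQLStmt updateBalance =
            new SQLStmt("UPDATE accounts SET balance = balance - ? WHERE account_id = ?;");

    public VoltTable[] run(long accountId, double amount) {
        voltQueueSQL(selectBalance, accountId);
        VoltTable[] results = voltExecuteSQL();
        if (results[0].getRowCount() == 0) {
            throw new VoltAbortException("no such account: " + accountId);
        }
        voltQueueSQL(updateBalance, amount, accountId);
        return voltExecuteSQL(true);   // final SQL batch of this transaction
    }
}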


LMAX Disruptors, software library (and presentations/blogs), by LMAX team in the UK

Link:
 
Presentation: http://www.infoq.com/presentations/LMAX
Fowler bliki entry: http://martinfowler.com/articles/lmax.html

Summary and Thoughts:
 Many of us are trying to find ways to leverage parallel computation. The trend towards Functional Programming promises easier concurrency models. Google formalized and consolidated many threads around massively parallel multi-node processing with MapReduce - now made open source with Hadoop and similar tools. We also see implementations of highly concurrent fault-tolerant systems using Erlang and Akka to spin up thousands of Actors in a fault-tolerant system to do highly concurrent processing. Using technologies like these many, like Google, are processing astounding amounts of data on compute grids or handling massive numbers of simultaneous users. 

At a time like this when the industry is focused on going parallel, the LMAX team comes along and says there is at least one class of problems for which massively parallel systems (and often more functional style of programming) is actually not the answer.  In fact, there are times when a single threaded processing model solves a host of problems - but can we make it performant enough to be web-scale?

The team at LMAX faces very high throughput demands that requires predictable and very low latency. They tried using a SEDA queue based system (such as the Actor model), but found that the queue is actually, perhaps surprisingly, a rather poor data structure for a highly concurrent environment.  As with the Stonebraker/Harizopoulos analysis on where multi-threaded databases spend their time (see the "VoltDB and NewSQL" entry above), a system with very high volumes that uses queues to pass data from one processing thread/stage to another actually spends far too much of its time negotiating locks rather than doing real work.

So they devoted research to understand how to optimize L1/L2 cache and utilize multiple cores efficiently. They combined this with a RingBuffer data structure (not something new - it is used in the Linux kernel for example) to create a Java library they call a Disruptor and then released it with an open source license.

With their model, they measure processing 6 million transactions per second on a single server, including persisting "commands" or "events" to disk. There are a couple of ways to use their API (one form is a simplified "DSL version"). It will manage "barriers" between writers (producers) and readers (consumers) for you. Consumers will not run ahead of producers and you can have consumers process in parallel or be gated off each other to process serially.

They have done a number of presentations available on the web and Martin Fowler was impressed enough to write a long and detailed bliki entry about it - those are the links above.

This model is hard to describe in a few short sentences, so I recommend watching the presentations and reading Fowler's write up. I think this is really impressive battle-tested stuff to have in your arsenal.


Neo4j koans, software learning tool via unit tests


Link: https://github.com/jimwebber/neo4j-tutorial


Summary and Thoughts: I've already written a blog entry on my experience setting up the koans: http://thornydev.blogspot.com/2011/11/neo4j-koans-how-do-i-begin.html. Neo4j is good stuff and these koans definitely help you learn most aspects of the Neo4j API (using Java).



The problem of Garbage Collection in the face of really really big RAM sizes, video presentation, by Gil Tene (Azul)


Link: http://www.infoq.com/presentations/Understanding-Java-Garbage-Collection

Summary and Thoughts: At my first full-time software job, during the dot-com boom, I and a team of 3 others built an internet ad-server in Java.  We built a large (at the time) heap of 512M and tried to reuse objects through our own object pools as much as possible to avoid gc. We tried to get fancy in a few ways, including trying to use JNI to control garbage collection. In the end, despite valiant efforts, we failed, with 10 second gc pauses killing our real-time ad serving every 15 or so minutes. Management wasn't happy, we realized that gc is hard and decided to give up and rewrite the ad server in C++.

This presentation brought back all those memories and pointed out that we are in an absurd dilemma now as "memory becomes the new disk". At the time of this presentation, the author says that 512 GB (that's a "G") of memory is actually the cheapest price point on servers when evaluated on a per-unit-of-memory cost. And most Java developers don't have a prayer of using that much memory in a single application. Garbage collection pauses can be delayed, but grow basically linearly with the size of the live heap. If we faced 10-second gc pauses with 512M of memory, you could be facing stop-the-world pauses measured in minutes with 512GB (Tene says roughly 1 second of pause per GB of live memory, which for 512GB works out to over 8 minutes). Talk about management not being happy...

"Most of what people know about garbage collection turns out to be wrong" says the presenter, Tene. Kind of like trying to get advice on software security and cryptography by searching google. Hit-or-miss for sure.

You'll learn a lot about garbage collection in general during the first hour or so of the presentation and even though the last bit is a bit rushed, there are key insights in the last 10-15 minutes well worth waiting for and rewatching and researching. This problem, along with concurrent programming with multiple cores, are two of the key challenges that face software development for the next 5+ years (I'd throw in solving mobile development as another key challenge). The Azul guys seem to be leading the way on the first problem in the JVM space.

I haven't used their garbage collector, but I'm definitely keeping it mind as we consider putting more and more data and cache in memory.  (And I wonder if this would make VoltDB even better?)

More on the Azul gc: http://www.infoq.com/articles/azul_gc_in_detail




A Clojure Cornucopia:
Joy of Clojure, book by Michael Fogus, et al.

The Clojure koans, software program
The suggestion of Clojure as Guiding Light into the next phase of programming, blog by Uncle Bob Martin


Links:  


[31-Dec-2011 Update: I just discovered the online Clojure koans at: http://www.4clojure.com and highly recommend them as well.]

Summary and Thoughts:
I had to put something about Clojure in here, as there is just so much right about it. I'm still a Clojure newbie and I've dabbled on and off with it, and currently the Joy of Clojure is kicking my butt.  Start with the new Programming Clojure coming out early next year or Clojure in Action, out now, do the koans, do the Little Schemer (see last entry in this blog) and do Joy of Clojure. In many ways, one has to re-learn programming while learning Clojure. Learning to think more functionally is the obvious one, but also learning to think in terms of immutable data structures and how to use loop/recur kind-of-like tail recursion, using accumulators. 

Since I'm still swimming in the sea of newness, I don't have anything profound to say about Clojure at this point, but I will invoke argumentum ad verecundiam (argument from authority), quoting well-respected people who see great promise in Clojure:
For those of us using the JVM or .NET, Clojure feels like a minor miracle. It’s an astoundingly high-quality language, sure—in fact, I’m beginning to think it’s the best I’ve ever seen—yet somehow it has still managed to be fashionable. That’s quite a trick. It gives me renewed hope for the overall future of productivity in our industry. We might just dig ourselves out of this hole we’re in and get back to where every project feels like a legacy-free startup, just like it was in the early days of Java. 
 - Steve Yegge, foreword to Joy of Clojure

How are we going to write that code? What language are we going to use to express the concepts that execute in thousands of machines?
Of course the current answer is “Functional Programming”. OK, maybe — maybe. But I gotta tell ya, the new functional languages out there aren’t looking too good to me. Scala and F# are still closely tied to the hardware. Scala is java with a few cute twists. And F#? Is it really the language that’s going to take us into the next age? My fear is that these languages suffer from the Sapir/Whorf trap. Their mode of expression does not sufficiently change our world view. They are too much like Java and C# and all the other old languages.
The only language I’ve seen so far that comes close to a different mode of expression, and that is capable for use in the enterprise, is Clojure.
- Uncle Bob Martin, blog entry above


Akka Cornucopia:
Akka library, 
Akka se-radio podcast, and 
Programming Concurrency on the JVM, book

Links: 

Summary and Thoughts: At the NFJS conference this year, Akka was one of the prominent libraries being touted, and one of the NFJS presenters, Venkat Subramaniam, released a new book, Programming Concurrency on the JVM, which I read and which explores Akka heavily.

The Akka model is an Actor implementation based on Erlang's model. Jonas Bonér says in the podcast mentioned above that he learned Erlang a while back and really liked its model, but he wasn't able to convince enough people to adopt Erlang, so he decided to bring Erlang's Actor model to the JVM via Scala. Scala already had an Actor implementation, but it was not as sophisticated or robust as Erlang's, so he created the Akka library. Akka now has a large following and can be used not only by Scala developers but also by Java developers. I can attest that we are starting to use it on projects where I work.

If your problem domain either breaks down into nicely composable SEDA stages or you want a model with fault-tolerant concurrency with no shared mutable state, Akka and the Actor model are worth considering.
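To give a flavor of the model, here is a minimal sketch of a single actor using Akka's Java API. Caveat: this is just an illustration I put together, not something from the book or podcast - the class and method names follow a later Akka 2.x Java API rather than the release current as I write this, and the Worker actor and its string messages are made up.

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class WorkerDemo {

  // An actor owns its state and processes one message at a time,
  // so there is no shared mutable state to guard with locks.
  public static class Worker extends UntypedActor {
    private int processed = 0;  // only ever touched by this actor

    @Override
    public void onReceive(Object message) {
      if (message instanceof String) {
        processed++;
        System.out.println("handled '" + message + "' (" + processed + " so far)");
      } else {
        unhandled(message);
      }
    }
  }

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("demo");
    ActorRef worker = system.actorOf(Props.create(Worker.class), "worker");

    // tell() is asynchronous - the caller never blocks on the worker
    worker.tell("first job", ActorRef.noSender());
    worker.tell("second job", ActorRef.noSender());

    // shut down the actor system when done
    system.terminate();
  }
}

Each actor's mailbox serializes the messages sent to it, which is what gives you concurrency without shared mutable state.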

Interestingly, however, Rich Hickey decided against incorporating the Actor model into Clojure, preferring agents instead, as he describes here: http://clojure.org/state. I have to confess I haven't fully grokked why - something that may need to be learned through experience. Of course, since Clojure is a JVM language and Akka has a nice Java API, you could use Akka from Clojure via its Java interop features :-)  However, Hickey does say at the end: "Clojure may eventually support the actor model for distributed programming, paying the price only when distribution is required, but I think it is quite cumbersome for same-process programming."

This points up another bonus of using Akka - it has remoting built in and includes integration with Apache Camel. We haven't started using these features yet, but since we already have a Camel-based integration framework, it is nice to know we can leverage this built-in support.



Little Schemer

Link: http://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992/ref=sr_1_1?ie=UTF8&qid=1324704715&sr=8-1

Summary and Thoughts: I was not fortunate enough (some would say unfortunate enough) to learn Lisp/Scheme in college.  It has been on my list for years and when Clojure came along and started getting popular, I decided it was time.

Douglas Crockford says in JavaScript: The Good Parts that all computer scientists should read The Little Schemer to truly understand recursion and how to think recursively.  He, in fact, re-did The Little Schemer exercises in JavaScript, which is actually a very expressive language for functional concepts (just with C-style syntax).

So this year, I finally decided to sit down and read it. I combined this reading with learning the modern Lisp-on-the-JVM, Clojure.  So every time The Little Schemer challenged me to think through an algorithm, I wrote it in Clojure.

About halfway through, I started doing the exercises in Scala as well, which I am also in the early stages of learning. (Figuring out how to do functional programming in Scala does not seem as natural as doing it in Clojure, for a newbie at least.)


In any case, I can attest that after "doing" this book, recursive thinking is definitely more natural now: sometimes when I approach a problem I can immediately see how to do it recursively, while the iterative version takes longer to think through - a Necker Cube-like transition. I got The Seasoned Schemer (the sequel) for Christmas this year, so that will be on my to-do list for 2012.
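As a trivial illustration of the two mental shapes - here in Java rather than Clojure or Scala, and obviously in real life you would just call size() - counting the items in a list looks like this:

import java.util.Arrays;
import java.util.List;

public class CountItems {

  // Recursive version, Little Schemer style: the count of an empty list
  // is 0; otherwise it is 1 plus the count of the rest of the list.
  static int countRecursive(List<?> items) {
    if (items.isEmpty()) {
      return 0;
    }
    return 1 + countRecursive(items.subList(1, items.size()));
  }

  // Equivalent iterative version with an explicit counter.
  static int countIterative(List<?> items) {
    int count = 0;
    for (Object ignored : items) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> xs = Arrays.asList("a", "b", "c");
    System.out.println(countRecursive(xs) + " " + countIterative(xs));  // prints: 3 3
  }
}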

Sunday, December 11, 2011

Happiness requires discipline


Staying singly focused on a task in this digital era is like trying to resist eating while sitting in a bakery as cookies, pies, and cakes emerge fresh and fragrant from the oven.

We live in the era of rampant multitasking, but I've seen studies showing that people aren't as good at it as they think.  Kind of like people who think they can learn to get by on less sleep. But here we fight our biology. It turns out that most of us really do need a full 8 hours of sleep. Not 6, not even 7. Empirical task-based studies show that if you don't get 8 pretty much every night, you are not performing optimally. And as programmers, we need to be optimal to do the hard task of programming.

Self-discipline, self-control and focus are the keys to success in programming and life - including health and happiness.

At the NFJS conference this year, I heard about the Pomodoro technique for the first time - a form of focused, intense concentration alternated with short breaks.  I've been trying it out, and for my environment at work I find it pretty effective.  But for big problems where you really need to keep a lot of state in your head for prolonged periods, a double pomodoro may be needed.  Of course, one could argue that even longer than 50 minutes is optimal. Probably true, especially if you are in flow, but here too, sigh, it turns out we fight our biology.

A lot of studies lately have been showing that sitting for prolonged periods is detrimental to your health.  Sitting longer than an hour is unfortunately not advised (see links below).  So do some walking, stretching and knee bends during those pomodoro breaks.

So a programmer is happy and optimal when self-disciplined to:
  • minimize distractions (my old roommate used to program with the TV on in the background, aargh)
  • spend most of the day in periods of intense concentration mentally stepping through code paths, design considerations, testing approaches and problem solving
  • balance the mental discipline with the physical discipline of exercise, breaks and, *sigh*, 8 hours of sleep a night, even when hacking late into the night would be more fun

That's how I see it, anyway.  (And now I need to get my *rse out of the chair and get some blood flowing.)



Links about prolonged sitting:
Excellent book on our biological requirements for sleep:

Wednesday, November 30, 2011

Neo4j Koans - How do I begin?

Koans have become a smart way for people to learn a new language or technology.  I've seen them in the Ruby space with the Ruby koans: http://rubykoans.com/, but I've never actually done one before.  They are set up as a series of failing unit tests, and one by one you fix each test to gain enlightenment.
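To make that concrete, a koan is basically a unit test with a blank in it. Here is a tiny made-up JUnit example (not one of the actual Ruby or Neo4j koans), just to show the shape:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ExampleKoan {

  // The "blank" you are meant to fill in; the test fails
  // until you replace 0 with the value that makes it pass.
  private static final int YOUR_ANSWER = 0;

  @Test
  public void shouldGraspAddition() {
    assertEquals(YOUR_ANSWER, 1 + 1);
  }
}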

I recently became very interested in the Neo4j graph database: http://neo4j.org/, as I'm working on data structures that are hierarchical and tricky to walk in a relational database.

I came across an interesting video presentation on neo4j by Ian Robinson on the skillsmatter website - http://skillsmatter.com/podcast/design-architecture/neo4j. He and his colleagues have put together a database of facts from the Doctor Who British TV series. That's when I became aware that he, Jim Webber and some other colleagues had created a set of Neo4j koans, available on github: https://github.com/jimwebber/neo4j-tutorial/.

So I decided to tackle my first ever set of koans.


Soon after, I discovered that while Jim Webber and crew put a lot of time into creating the neo4j koan tutorial, they seem to have put very little time into explaining how a newbie to koans can actually run them - at least as of November 2011 when I downloaded it. I don't understand why you would put hours into building it and not put a few minutes into documenting what to do. Please provide a README for things like this - if you go to all the effort of creating it, why wouldn't you?

[Dec 2011 Update: There is now good starter info on the neo4j koan github site, so many thanks to Jim et al. for improving it!]

So in case it helps anyone else, here is my version of how I fumbled around to do the neo4j koans.


/* ---[ Getting the Koans Installed and Set up ]--- */

These instructions are targeted particularly at a Unix/Linux environment. My koan work was done on Linux (Xubuntu 11.10). I also tested them on Windows, where I have the MinGW32 environment that comes with the git download (and I also have Cygwin).

[Dec 2011 Update:  The neo4j koan github site now says it will work with cygwin on Windows - they actually say "sorry" about this, but being a Unix/Linux devotee my advice is that if you don't have cygwin yet, this is a good reason to get it and learn it. Consider it a bonus to learn a better way of doing things.]

I'm also primarily a command line kind of guy, though I do also use Eclipse for many things. In this case, I went commando - so I didn't use Eclipse's JUnit integration or download the neoclipse plug in (which sounds cool -- need to try it someday).

First, download it from github (URL above). Be aware that the download is around 350 MB, so it may take a while if you have a lower-speed internet connection like mine.

Second, cd to the main directory (neo4j-tutorial) and type ant - this will run ivy and download half the known universe in good ivy/maven fashion (grrr...). After waiting about an hour (or less if you have a better internet connection than I do), you can begin by wondering what to do next. First, make sure the build ran to successful completion - happily, the koan unit tests all passed out of the box for me, so you want to make sure everything is working on your system before beginning.

The authors provide a presentation in the presentation directory (for some reason in ppt and not converted to pdf for more general viewing), which can be helpful but wasn't enough for me to really know how to do the koans. I recommend coming back to the presentation periodically and reviewing it for the section you are working on.  Some of its visuals and notes are helpful, but mostly you'll just need to read the codebase and the neo4j javadocs to figure out how to get things done.

Next, run the tree command to get a look around (you may need to install tree - it's a great Unix command-line tool for viewing files/dirs in a compact tree structure).

You'll see that the koans are in the src directory (output from the tree cmd):
├── src
│   ├── koan
│   │   └── java
│   │       └── org
│   │           └── neo4j
│   │               └── tutorial
│   │                   ├── Koan01.java
│   │                   ├── Koan02.java
│   │                   ├── Koan03.java
│   │                   ├── Koan04.java
│   │                   ├── Koan05.java
│   │                   ├── Koan06.java
│   │                   ├── Koan07.java
│   │                   ├── Koan08a.java
│   │                   ├── Koan08b.java
│   │                   ├── Koan08c.java
│   │                   ├── Koan09.java
│   │                   ├── Koan10.java
│   │                   ├── Koan11.java

and if you look into them, you'll see comments and snippet sections that say "your code here", but they come pre-filled in with the answers.

However, if you go into the src/main/scripts directory, you'll notice a "release.sh" script, which extracts the relevant portion of the koans from the current github dir, copies it to a temp directory and runs the remove_snippets script. I couldn't get it to work after 15 minutes of futzing with it, and the documentation for it is basically useless.

[Dec 2011 Update: I tried it again and it works now for me. Either I did something wrong the first time or Jim Webber tweaked it.]


There is also a remove_snippets.sh script that you can run directly - you are supposed to run it from the top-level dir, not the scripts directory. Either way I got an error message, but it does work if you run it from the top-level dir, despite the error message.  Here's what I got:

$ src/main/scripts/remove_snippets.sh
sed: can't read : No such file or directory
$ git st
 M src/koan/java/org/neo4j/tutorial/Koan02.java
 M src/koan/java/org/neo4j/tutorial/Koan03.java
 M src/koan/java/org/neo4j/tutorial/Koan04.java
 M src/koan/java/org/neo4j/tutorial/Koan05.java
 M src/koan/java/org/neo4j/tutorial/Koan06.java
 M src/koan/java/org/neo4j/tutorial/Koan07.java
 M src/koan/java/org/neo4j/tutorial/Koan08a.java
 M src/koan/java/org/neo4j/tutorial/Koan08b.java
 M src/koan/java/org/neo4j/tutorial/Koan08c.java
 M src/koan/java/org/neo4j/tutorial/Koan09.java
 M src/koan/java/org/neo4j/tutorial/Koan10.java
 M src/koan/java/org/neo4j/tutorial/Koan11.java

(git st is aliased to git status -s on my machine)

So it did modify the Koans and remove the parts I'm supposed to fill in.

After doing this, I recommend doing git reset --hard to get back the filled-in koans. Copy them to another directory so you can peek at them while doing the koans, in case you get stuck or want to compare your solution with the official one.

Then run the remove_snippets.sh script again and do a git add and git commit. Now we are ready to start the koans (whew!).


/* ---[ Doing the Koans  ]--- */

The koans are unit tests. After you run the remove_snippets script, all the koan unit tests will fail (except for Koan01, which for some reason has no snippet for you to replace - it is just a reading koan, not a doing koan, I guess).

You need to fix the koans one by one and get each test passing. Unfortunately, I couldn't find a way to run each test separately; you have to run the full battery, plus the annoying-as-hell ivy checks.  <rant>Speaking of which, you won't be able to run these tests while offline, even after you've downloaded everything via ivy.  This is my biggest complaint about the ivy/maven model: I frequently want to work offline, so I curse setups that require everything to be done via ivy/maven.</rant>

One way to sift through all the noise in the output is to run the unit tests like this:
$ ant | grep Koan11

Then you will just get the output for Koan11 (though everything still runs unless you want to edit the Ant build.xml file), like this:

$ ant | grep Koan11
    [junit] Running org.neo4j.tutorial.Koan11
    [junit] TEST org.neo4j.tutorial.Koan11 FAILED
    [junit] Tests FAILED


BUILD FAILED
/home/midpeter444/java/projects/neo4j-koans/neo4j-tutorial/build.xml:68: Build failed due to Koan failures

I also found grep context lines and simple pattern matching helpful, such as:
ant | grep -C 4 Koan0[123]



/* ---[ Only Running the Tests You Want  ]--- */

As I said, there is no target in the ant file (or any helper script) to run only one test/koan at a time. And each koan takes many seconds, so the whole run can take well over a minute (depending on machine speed).

To do the easiest thing that would work, I issued the following command (you could do it with sed if you don't have perl installed):

find src/koan/java -name Koan*.java -print | xargs perl -pi -e "s/([@]Test)/\/\/\1/g"

It just comments out all the @Test annotations in the Koan files.  Run this once at the beginning, then remove the comment from each test as you work on it.  The ant file still runs all the koans, but the ones you haven't uncommented don't take very long.


/* ---[ Reading the error report output  ]--- */

Don't make the same mistake I did and spend lots of time wading through the target/koan/reports/TESTS-TestSuites.xml output. I later found that a nice html report is provided one directory further down.  Open target/koan/reports/output/index.html in your browser and refresh it after each run - this is very nice!

So the cycle is:
1. Edit the Koan test until you are ready to run (uncomment that one's @Test annotation)
2. Run: ant | grep -C 4 Koan0[34]  (modify the numbers as needed)
3. Refresh target/koan/reports/output/index.html in your browser to see the results.

Note also that if you debug by printing to stdout, there is a link on the index.html output to view it - use that as needed.

Finally, I wrote a little helper class that will print out all the properties and values of a Node - this is helpful in debugging. Here is the code:


package org.neo4j.tutorial;


import org.neo4j.graphdb.Node;


public class NodePP {
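  // Dumps all property keys and values of the given Node - handy for println-style debugging inside a koan.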
  public static String pp(Node n) {
    String s = "Node: ";
    for (String k: n.getPropertyKeys()) {
      s += String.format("\n  %s: %s", k, n.getProperty(k));
    }
    return s;
  }
}
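Then, from inside any koan, you can dump whatever Node you are inspecting - e.g. (theDoctor here is just a stand-in for whichever Node variable you happen to have in hand):

System.out.println(NodePP.pp(theDoctor));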


Complaints aside, I'm currently finishing Koan08c and I recommend the koans as a good way to learn neo4j and to think in terms of graphs. And so far, I really like the Cypher query language....


[Dec 2011 Update:] I've finished them all now. Since I only broke down and cheated once by looking at the pre-filled in version (Koan11 was a doozy), it took me a while, but I feel like I have a very good sense of how to use neo4j now.  The koans provide great coverage of the approaches to using the database.


Neo4j is a very promising database and I think it is a serious player in the NoSQL space. The next question for me is how to use it with one's domain model in POJOs. I see four options:
  1. Serialize/deserialize POJOs to/from JSON and use the neo4j REST API. This will be less performant, but it is the option if you are using a standalone database server.
  2. Make your POJOs Neo-aware - have them wrap Nodes and Relationships and keep their attributes in neo4j Node/Relationship properties. This obviously couples your domain tightly to your persistence layer (see the sketch after this list).
  3. Use CQL (cypher query language) like you do SQL when not using an ORM. Cypher is very nice and well thought out. I wonder how hard it would be to construct a MyBatis-like mapper between Cypher and your POJOs.
  4. Use the Spring Data Neo4j annotation bindings for neo4j. This looks promising; I've started looking at it but have no strong opinion yet. They do say it will be less performant than using the neo4j API directly (such as the Traversal API), since there is metaprogramming (Java reflection) going on.
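For what it's worth, here is a rough sketch of what option 2 might look like - a made-up Companion class (the class name and the "name" property are hypothetical, not from the koans or any library), just to show how the domain object's state ends up living directly in the Node's properties:

package org.neo4j.tutorial;

import org.neo4j.graphdb.Node;

// Hypothetical "option 2" domain class: it wraps a Node and keeps its
// attributes directly in the Node's properties, tightly coupling the
// domain model to the neo4j API. (Writes would also need to happen
// inside a neo4j transaction.)
public class Companion {
  private final Node underlyingNode;

  public Companion(Node node) {
    this.underlyingNode = node;
  }

  public String getName() {
    return (String) underlyingNode.getProperty("name");
  }

  public void setName(String name) {
    underlyingNode.setProperty("name", name);
  }

  public Node getUnderlyingNode() {
    return underlyingNode;
  }
}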