Saturday, January 21, 2012

Events and Event-Driven Architecture: Part 3

In part 1 and part 2 of this series, I introduced Event Driven Architecture (EDA), compared it to Event Driven Programming and then spent time distinguishing between intra-system and inter-system EDA. In this installment, I enhance the discussion around inter-system Domain Event Driven Architecture with a hodgepodge of follow-on topics:

  1. Domain Event Formats
  2. Performance Characteristics of EDA
  3. Complex Event Processing
  4. Guaranteed Delivery


/* ---[ Domain Event Formats ]--- */

Events in inter-system EDA, remember, are basically messages from one system to another that fully describe an occurrence in the publishing system. That event cascades to some number of subscribers, unknown to the publisher (aka event generator).

What formats could be used for a Domain Event? Basically any data format with a header and body. Let's consider a few options here and criteria for selecting one appropriate to your domain requirements.

First, viable options include:

  1. XML
  2. JSON
  3. ASN.1
  4. Google Protocol Buffers
  5. CSV
  6. Serialized Objects
  7. Hessian
  8. Industry-specific options

Criteria by which these can be evaluated include:

  1. Is there a standard in my particular industry I should follow?
  2. Does it have support for a schema?
  3. Is it binary or text-based?
  4. Does it have cross-platform (cross-language) support?

If the answer to #1 is yes, then that is probably your default choice. Beyond that though, some argue that there are definite advantages in choosing a format that has a schema.

First off, a schema obviously allows you to validate against the structure as well as the content. But probably its most important feature is enabling versioning of formats. In part 2 I described that a key advantage of inter-system EDA is that systems do not have to be in lockstep when upgrading to new functionality and versions.

If a consumer can only process version 1 and version 2 causes it to choke, one way to handle this would be to use the Message Translator Pattern to convert version 2 events back to version 1.

Another option is for the event generator to publish both versions either to different event channels (such as different JMS Topics) or to the same channel with different metadata, such as the version itself. This allows the subscribers to selectively choose which version to subscribe to and process.
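As a sketch of the Message Translator option, assuming events arrive as parsed key/value maps (the field names "version", "sku", "quantity" and "warehouse" are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// A minimal Message Translator sketch: a version-2 event is converted back to
// the version-1 shape so that a v1-only consumer can still process it.
public class EventTranslator {

    // Downgrade a v2 inventory event to v1 by dropping fields v1 never had
    // and restoring the old version marker.
    public static Map<String, Object> toVersion1(Map<String, Object> v2Event) {
        Map<String, Object> v1Event = new HashMap<>(v2Event);
        v1Event.put("version", 1);
        v1Event.remove("warehouse");   // field added in v2; unknown to v1 consumers
        return v1Event;
    }

    public static void main(String[] args) {
        Map<String, Object> v2 = new HashMap<>();
        v2.put("version", 2);
        v2.put("sku", "ABC-123");
        v2.put("quantity", 40);
        v2.put("warehouse", "EAST-2");

        // v1 consumers see only the fields they expect
        System.out.println(toVersion1(v2));
    }
}
```

The translator would typically sit in the middleware layer (e.g., as a Camel processor), so neither producer nor consumer knows it exists.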

Binary formats will be smaller and thus faster to send and probably faster to parse. The documentation on Google Protocol Buffers, for example, claims that processing them is 20 to 100 times faster than a comparable XML document.

Lastly, if you work in an environment that needs to interoperate with multiple languages, you will need to use a format that is amenable to all those platforms.

Using the above criteria, good old XML is probably the best text-based format, as it supports schemas and is universal to all platforms. For a binary protocol, Google Protocol Buffers is quite attractive, as it is schema-based, binary and efficient. It is not inherently universal, however, since you must have language bindings to serialize and deserialize objects to and from that format. Google supports Java, C++ and Python. There are community supported open source bindings for other languages here, but you'll have to research how mature and production ready any given one is.

The documentation on Protocol Buffers is very good and provides helpful advice on planning for domain event content/format changes. For example, they recommend making most fields optional rather than required, especially new ones being added in subsequent versions and to have a convention on how to deprecate older non-required fields, such as using an OBSOLETE_ prefix.

Stefan Norberg's presentation on Domain EDA, which I referenced heavily in part 2, has a more detailed breakdown of Domain Event formats using these criteria and a few more (see slides 47-49).

One option not considered in the Norberg presentation is Thrift, originally built by Facebook and now an Apache project. It has support for more than a dozen programming languages, is binary and has support for versioning (which I assume means it has a schema, though I haven't been able to tell from the reading I've done on it). Unfortunately, I found complaints as far back as 2008 about its lack of documentation, and its Apache web site is still sadly lacking in that department. I wonder if it has lost out to Google Protocol Buffers? If any readers have experience using it, I'd be interested in hearing from you.


/* ---[ Performance Characteristics ]--- */

Event Driven Architecture gains loose coupling by basically passing data structures (event payloads) into generic, standardized queueing or messaging systems and using asynchrony. Since EDA is asynchronous by nature, it is amenable to high performance, high throughput and high scalability, with the tradeoff being that you have to live with eventual consistency between systems.

In a synchronous model the highest throughput possible (or lowest latency, if that is the measure you most care about) is set by the slowest system in the chain, whereas in an environment where systems are connected asynchronously, the faster systems are not constrained by the slowest one.

To illustrate, suppose we have two systems A and B. System A by itself can process 500 transactions per second (tps), while system B can process 200 tps by itself. Next assume system A has to send some sort of message or RPC call to system B. If we make the interaction synchronous, system A can now process at most 200 tps (ignoring communication overhead between A and B). But if we can connect them with asynchronous messaging, system A can continue to process at 500 tps.
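The arithmetic in that example can be captured in a trivial sketch (rates in transactions per second, communication overhead ignored as above):

```java
// Back-of-the-envelope sketch of the throughput argument: with a synchronous
// call chain, the upstream system is throttled to the slowest system it
// calls; with an asynchronous channel between them, it is not.
public class ThroughputMath {

    // Effective rate of system A when it must wait on each downstream system.
    public static int synchronousTps(int... systemRates) {
        int min = Integer.MAX_VALUE;
        for (int rate : systemRates) min = Math.min(min, rate);
        return min;
    }

    // Effective rate of system A when downstream systems consume from a queue.
    public static int asynchronousTps(int producerRate) {
        return producerRate;   // A publishes and moves on at its own pace
    }

    public static void main(String[] args) {
        System.out.println("sync:  " + synchronousTps(500, 200) + " tps"); // 200
        System.out.println("async: " + asynchronousTps(500) + " tps");     // 500
    }
}
```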

In isolation, this is excellent and often this works out very well. Of course, an architect considering both systems will need to consider that system B may continually fall farther and farther behind and never catch up to system A, which could be a problem in some scenarios. Nevertheless, the principle of asynchronous connectivity allows EDA to maximize throughput (and ideally minimize latency of individual requests) for only the parts that absolutely must be in request-reply synchronous mode.

Related Idea: Overcoming SLA Inversion

A side note I want to throw in here that isn't related to performance but follows the same line of reasoning as the above example: EDA can help overcome SLA Inversion, a term coined (as far as I know) by Michael Nygard in Release It! (which, if you haven't read it, should move to the top of your reading list).

SLA (Service Level Agreement) inversion is the problem of trying to meet an SLA about uptime (say 99.5%) when your system is dependent on other systems. If those other systems are coupled to yours with synchronous call behavior and their SLAs are lower than yours, then you cannot meet your higher SLA. In fact, in synchronously coupled systems, the best achievable SLA of your app is the product of the SLAs of all the systems involved.

For example, if system A's SLA is 99.5%, system B's is 99.5% and system C's is 98.0%, and system A is synchronously dependent on B and C, then at best its SLA is (0.995 * 0.995 * 0.98), which calculates to an overall 97.0% SLA.
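A quick sketch of that arithmetic:

```java
// Sketch of the SLA-inversion arithmetic: a system synchronously dependent
// on others has, at best, the product of all the SLAs involved.
public class SlaMath {

    public static double compositeSla(double... slas) {
        double product = 1.0;
        for (double sla : slas) product *= sla;
        return product;
    }

    public static void main(String[] args) {
        // A (99.5%) depends synchronously on B (99.5%) and C (98.0%)
        double best = compositeSla(0.995, 0.995, 0.98);
        System.out.printf("best achievable SLA: %.1f%%%n", best * 100); // ~97.0%
    }
}
```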

One of Nygard's proposed solutions to SLA inversion is decoupling middleware, which is exactly what is provided by an Event Driven Architecture with an asynchronous messaging backbone between systems.


/* ---[ Complex Event Processing ]--- */

To define Complex Event Processing (CEP), let's first give an informal definition of "simple" or regular event processing. Simple processing is typically a situation where a single event from an event source causes a predictable, expected response that does not require sophisticated algorithms. Examples of simple processing include a data warehouse that subscribes to events and simply consumes them into its own data store, or a case where a "low threshold" event, such as one from an inventory system, triggers a "restock" action in a downstream consumer.

In a case where patterns over time in the event stream must be monitored, compared and evaluated, then you have complex event processing. Let's return to the fraud detection system we outlined in part 2. In order to detect fraud, two types of analyses might be set up:

  1. Watch for abnormal purchase behavior against some relative scale, such as previous purchase amounts or purchase frequency.
  2. Watch for abnormal purchase behavior against some absolute scale, such as many repeated purchases in a short period of time.

The event streams can be monitored in real time and constantly compared to fraud rules and heuristics based on past behavior. Averages and other aggregations for a time series, either overall or on a per-user basis, can also be kept in the consumer's own data store. If a significant pattern is detected, the CEP system can take action or itself fire off a domain event to another system to handle it.
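As a toy illustration of the first type of analysis (abnormal behavior against a relative scale), here is a sketch of a rule that flags a purchase when it exceeds some multiple of the rolling average of recent purchases. The window size and threshold factor are arbitrary choices; a real deployment would express rules like this in a CEP engine such as Esper rather than hand-rolling them:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Flags a purchase as suspicious when it exceeds thresholdFactor times the
// rolling average of the last windowSize purchases.
public class FraudRule {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double thresholdFactor;

    public FraudRule(int windowSize, double thresholdFactor) {
        this.windowSize = windowSize;
        this.thresholdFactor = thresholdFactor;
    }

    // Returns true if this purchase looks abnormal relative to recent history.
    public boolean onPurchase(double amount) {
        boolean suspicious = !window.isEmpty() && amount > thresholdFactor * average();
        window.addLast(amount);                          // fold event into history
        if (window.size() > windowSize) window.removeFirst();
        return suspicious;
    }

    private double average() {
        double sum = 0;
        for (double amt : window) sum += amt;
        return sum / window.size();
    }

    public static void main(String[] args) {
        FraudRule rule = new FraudRule(5, 10.0);
        double[] purchases = {20, 35, 25, 30, 2500};     // last one breaks the pattern
        for (double amt : purchases) {
            if (rule.onPurchase(amt)) System.out.println("suspicious: " + amt);
        }
    }
}
```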

For the right use case, CEP can be a very powerful application of event-driven architecture. In addition, you don't have to build it from scratch. Tools like Esper for both Java and .NET are designed to do Event Stream Processing and make inferences against patterns in high-volume event streams using algorithms of your choice or design.


/* ---[ Guaranteed Delivery ]--- */

In part 2, I discussed using asynchronous events from a source system, such as an inventory or point-of-sale system to populate a reporting database. In many cases, this would require that the reporting database have complete data - no gaps due to lost messages. To do this via events put onto a message bus, you would need some form of guaranteed delivery, where each and every message sent can be guaranteed to be delivered to the consumer.

Guaranteed Delivery (my definition)

Once a message is placed into a message channel, it is guaranteed to ultimately reach all its targeted destinations, even if various parts of the system fail at any point between initial publication and final delivery. Or, if delivery does fail, it fails in a way that lets the broker/sender know the message needs to be resent.

Here I outline three potential ways of achieving guaranteed delivery from a messaging system. There are probably other options and I'd be eager to hear of them from readers.

  1. Synchronous transactional mode
  2. Asynchronous send + reply with acknowledgment
  3. Persistent messages with durable subscriptions (also asynchronous)

Note: I don't have much knowledge of the Microsoft/.NET side of the world, so in this discussion I will focus on JMS (Java Message Service) and some parts of AMQP (Advanced Message Queueing Protocol) that I have read about.

Option 1: Synchronous transactional mode

JMS has a synchronous mode which in theory could be used for guaranteed delivery, but obviously in a very limiting way that defeats the purpose of asynchronous messaging systems.

In AMQP (from my limited study so far) it appears that the typical way to get guaranteed delivery is via synchronous transactions, where the publisher must wait until the broker has processed the message before continuing. This is likely to be the option with the lowest throughput (see note in RabbitMQ section below).

Option 2: Asynchronous send + reply with acknowledgment

The next option is to set up an asynchronous request/reply-with-acknowledgment system to ensure that all messages delivered are received by all relevant subscribers. The ActiveMQ in Action book describes the model shown in Figure 1 below.

Figure 1: Request / reply model using asynchronous messaging

One option for doing this is to use header fields in the messages (JMSReplyTo and JMSCorrelationID) and create temporary destinations (queues) for the acknowledgment replies. Any replies not received within some time frame identify messages that are considered undelivered and must be resent. To handle cases where the producer's message was received by the consumer but the acknowledgment was lost, the consumer will have to be able to detect and ignore duplicate messages.
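That duplicate-detection requirement is the classic "idempotent consumer" pattern, and it can be sketched in a few lines. Here the correlation ID (in JMS this could travel in JMSCorrelationID) is remembered so a redelivered message is processed only once; a production version would bound or persist the seen-ID set:

```java
import java.util.HashSet;
import java.util.Set;

// Idempotent consumer sketch: if an ack is lost, the producer resends, so
// the consumer must recognize and discard the second copy.
public class IdempotentConsumer {
    private final Set<String> seenIds = new HashSet<>();
    private int processedCount = 0;

    // Returns true if the message was processed, false if it was a duplicate.
    public boolean onMessage(String correlationId, String body) {
        if (!seenIds.add(correlationId)) {
            return false;              // duplicate: already handled, ignore
        }
        processedCount++;              // stand-in for real processing work
        return true;
    }

    public int processedCount() { return processedCount; }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.onMessage("msg-001", "inventory event");
        consumer.onMessage("msg-001", "inventory event");  // resent after lost ack
        System.out.println("processed: " + consumer.processedCount());  // 1
    }
}
```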

RabbitMQ, an AMQP compliant broker (but not JMS compliant), provides an extension to AMQP they call Lightweight Publisher Confirms. It implements a form of asynchronous messaging followed by acknowledgement of receipt.

The RabbitMQ website claims this is 100 times faster than using the typical AMQP synchronous transaction model for guaranteed delivery.

Option 3: Persistent messages with durable subscriptions (also asynchronous)

Finally, in JMS, the preferred option to get guaranteed delivery appears to be a model where:

  1. Messages are persisted to disk once they are delivered to a channel (broker)
  2. Subscribers have durable subscriptions

I describe this option in detail below.

-- Persisting Messages --

In JMS, message persistence is set by the "delivery mode" property (not by the subscription type) - DeliveryMode.PERSISTENT or DeliveryMode.NON_PERSISTENT. If a message's delivery mode is persistent, then it must be saved to stable storage on the broker.

For Apache ActiveMQ, for example, there are two file-based persistence options (AMQ and KahaDB) and relational database-persistence options using JDBC. For ActiveMQ, the file-based persistence options have higher throughput and KahaDB is the recommended option for recent releases.

As shown in Figure 2 below, you can choose to have a single central broker or you can set up a local (or embedded) broker on the sender computer (or VM) where the message is persisted before being sent to a remote broker.

Figure 2: Options for persistence of JMS messages


-- Message durability vs. message persistence --

Message durability refers to what JMS calls a durable subscription - the "durability" is an attribute of the subscription with respect to the behavior of the subscriber. In the durable subscription model, if a subscriber disconnects, the messages will be retained until it reconnects and gets them. If the subscription is non-durable, then the subscriber will miss any messages delivered while it is not connected.

Message persistence refers to the persistence of messages in the face of a stability problem in the broker or provider itself. If the broker goes down while there are undelivered messages, will the messages be there when it starts up again? If the messages are set to persistent mode and the persistence is to disk (rather than memory cache), then and only then is the answer "yes".
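To make the distinction concrete, here is a toy in-memory broker illustrating durable subscriptions only: a durable subscriber accumulates a per-subscriber backlog while disconnected, whereas a non-durable subscriber would simply miss those messages. Persistence to disk (the broker-crash concern above) is deliberately not modeled:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy broker: the durability bookkeeping is a per-subscriber queue of
// messages published while that subscriber was away.
public class ToyBroker {
    private final Map<String, Queue<String>> durableBacklogs = new HashMap<>();

    public void subscribeDurably(String subscriberId) {
        durableBacklogs.putIfAbsent(subscriberId, new ArrayDeque<>());
    }

    // Publish to the topic: only durable subscribers accumulate a backlog.
    public void publish(String message) {
        for (Queue<String> backlog : durableBacklogs.values()) backlog.add(message);
    }

    // On reconnect, a durable subscriber drains everything it missed.
    public List<String> reconnect(String subscriberId) {
        List<String> missed = new ArrayList<>(durableBacklogs.get(subscriberId));
        durableBacklogs.get(subscriberId).clear();
        return missed;
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.subscribeDurably("reporting-db");
        broker.publish("order-created");   // published while subscriber is offline
        broker.publish("order-paid");
        System.out.println(broker.reconnect("reporting-db"));  // both retained
    }
}
```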

With a durable subscription, the broker remembers what the last message delivered was on a per (durable) subscriber basis, as shown below:

Figure 3: Durable subscribers – the broker remembers which subscribers have received which messages. The arrows to the durable subscribers show the last message sent to that subscriber.

-- Performance Costs of Guaranteed Delivery with Message Persistence and Durable Subscriptions --

There is a performance hit to having message persistence and durable subscriptions. I was only able to locate a single source that had published any metrics – a 2008 report for Apache ActiveMQ. If anyone knows of others, please let me know.

I summarize the key findings of that paper with respect to the settings required for guaranteed delivery. The paper shows the following two graphs relevant to the performance hit of the "guaranteed delivery" options:

Figure 4: Performance (throughput) of ActiveMQ with non-persistent messages and non-durable subscriptions (Note: “1/1/1” refers to one producer / one consumer / one broker – the second, third or fourth options shown in Figure 2.)

Figure 5: Performance (throughput) of ActiveMQ with persistent messages and non-durable subscriptions (blue) and persistent message with durable subscriptions (red). The tests here are otherwise the same as those shown in Figure 4.

  Comparison (first configuration is faster)                                                 Times faster
  non-persistent msgs, non-durable subs  vs.  persistent msgs, durable producer only                 2.3
  non-persistent msgs, non-durable subs  vs.  persistent msgs, durable producer and consumer        15.6
  persistent msgs, durable producer only vs.  persistent msgs, durable producer and consumer         6.7

It is not clearly laid out in this benchmarking document, but I believe “durable producer / non-durable consumer” means the second option shown in Figure 2, namely that messages are persisted in a single broker on the producer side. If that is correct, then “durable producer / durable consumer” means the first option in Figure 2 – that there are two brokers, one on the producer side and one on the consumer side as each has persistent messages and durable subscribers.


/* ---[ Other Uses of Events ]--- */

In this blog series we have covered events from the point of view of Event-Driven Programming (encompassing topics like "Evented I/O"), intra-system Staged Event Driven Architecture and inter-system Event-Driven Architecture. There is at least one other major context that uses the concept of an event, and that is "Event Sourcing", which leads to topics like Retroactive Event Processing and CQRS. I encourage you to investigate those areas as I believe they are becoming more popular. Greg Young has done a lot to popularize them and convince people that event sourcing is a robust model for enterprise computing.

As a reminder, nothing in this short series was intended to be original work on my part. Rather, I have tried to do a survey of the various meanings around "event" in programming and architecture. Someday I may do a piece on Event Sourcing and CQRS, but have plans to blog on other topics for the near term. I can recommend this blog entry from Martin Krasser, which will hopefully be the first in a very interesting first-hand series on building an event-sourced web app in Scala.



Previous entries in this blog series:
Events and Event-Driven Architecture: Part 1
Events and Event-Driven Architecture: Part 2


Sunday, January 15, 2012

Events and Event-Driven Architecture: Part 2

In my last blog entry, I introduced Event Driven Architecture (EDA), compared it to Event Driven Programming and then spent most of the time diving into SEDA - Staged Event Driven Architecture, which is an intra-system design. In this blog entry, I will describe inter-system EDA, which is typically what people mean when they speak of "Event-Driven Architecture".


/* ---[ Inter-System Event Driven Architecture ]--- */

Event Driven Architecture describes a loosely coupled design between systems that communicate asynchronously via events. The format of the event is the API between systems and the bus between them is some form of messaging backbone, most typically a message queue.

Let's define "event" in the EDA sense (not necessarily the EDP sense) with a "business" definition and then dissect it from a technical point of view.

Event

a notable occurrence that is deemed important enough to notify others about. It can signal an action (such as an inventory change), a problem or impending problem (out of or nearly out of an inventory item), a threshold, a deviation or something else. It is typically a description of something that has happened, rather than a command to do something.


Event - technical breakdown

An event is communicated in a message with a header and a body.

  • The header can include: type, name, timestamp, creator and should be consistently defined across specifications
  • The body describes what happened. It should fully describe the event such that an interested party does not have to go back to the source system in order to get essential information.
    • This distinguishes it from a RESTful model (at least level 3 of the Richardson Maturity Model) where typically only minimal essential information is communicated. Instead URIs are provided for the client to go back and retrieve more complete information about an entity. REST is typically a request-reply interaction, whereas events in EDA are usually background asynchronous notifications broadcast to a messaging system, so a policy of providing all relevant details in the event message makes sense under those circumstances.
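Assuming Java consumers, the header/body shape just described might look like the following sketch. The field names are illustrative only, not from any standard:

```java
import java.time.Instant;
import java.util.Map;

// A domain event: a header of bookkeeping fields plus a body that fully
// describes what happened, so subscribers never need to call back to the
// source system for the essentials.
public class DomainEvent {
    public final String type;        // e.g. "inventory"
    public final String name;        // e.g. "ItemBelowRestockThreshold"
    public final Instant timestamp;
    public final String creator;     // the publishing system
    public final Map<String, Object> body;

    public DomainEvent(String type, String name, Instant timestamp,
                       String creator, Map<String, Object> body) {
        this.type = type;
        this.name = name;
        this.timestamp = timestamp;
        this.creator = creator;
        this.body = Map.copyOf(body); // events are immutable facts about the past
    }

    public static void main(String[] args) {
        DomainEvent event = new DomainEvent(
            "inventory", "ItemBelowRestockThreshold", Instant.now(),
            "inventory-service",
            Map.of("sku", "ABC-123", "quantityOnHand", 3, "restockLevel", 25));
        // A subscriber has everything it needs without querying the source:
        System.out.println(event.name + ": " + event.body);
    }
}
```

In practice the object would be serialized to whichever wire format (XML, JSON, Protocol Buffers, etc.) the enterprise has standardized on.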

So what formats could be used for event info? Basically anything with a header and body. XML and JSON are reasonable choices. I'll discuss other options and considerations on how to choose a format in the next EDA blog post.

Interacting With a System

In an Event Driven Architecture, systems communicate with each other by sending events describing what just happened. These are generally not point-to-point connections and they are not synchronous request-reply connections. Instead, an event is broadcast to any and all listeners in an asynchronous fire-and-forget fashion. The broadcast is done via a queue or other messaging system. This is why EDA creates a more loosely coupled system than a traditional SOA design.

There are only two dependencies between systems now:

  1. the domain event format and content
  2. the messaging middleware

System A, publishing its events, is unaware of any subscribers to its stream of domain events. System A does not have any dependencies on subscriber systems. If downstream system B is currently down or overloaded, system A will continue publishing at its own pace. System B can get to its "inbox" whenever it is ready and it can do whatever it likes with the events it receives. None of that has any effect on system A, which can continue to go about its business as if it was the only system in the world.

Each system in an Event Driven Architecture can be carefree about other systems, but of course an Enterprise Architect still needs to ensure that all systems under his or her purview are interacting appropriately and meeting their SLAs. So with EDA there are still hard integration problems to solve. But there are circumstances where this loosely coupled design is a better and simpler choice. I'll cover an example later.

Note: there are some who would argue with the idea that an asynchronous eventually-consistent model is "simpler" than a typical SOA synchronous request-reply architecture.

As usual, whenever I use the word simple I (try to) always mean it in the sense portrayed by Rich Hickey in his Simple-Made-Easy talk - namely, simple means unentangled with concerns of others. Minding your own business is simple. Being involved and entangled in everybody else's is complex. Any repeat readers of my blog will probably get tired of me talking about it, but I think that was one of the most profound lectures I've listened to, so head there next if you haven't listened to it yet.


/* ---[ Some Formal Terminology ]--- */

Let's give some formal names to the players in an event-driven system.

First, there is the event generator. This is usually a software system that has just processed a request (say from a user) or an event (from another event generator).

Next, there is the event channel or event bus - usually a messaging queue such as a JMS or AMQP compliant broker.

Event subscribers are interested parties (applications) that sign up to receive events off the event channel. They then become event processors, possibly doing some action in response to the event or a series of events. The sum total of the downstream event processing is the event consequence.


/* ---[ Making It Real ]--- */

OK, enough formality, let's have an example. For this, I have found no better source than a presentation available on InfoQ by Stefan Norberg. I highly recommend watching this presentation. With apologies (or kudos) to Stefan, I'm going to steal his example here.

Suppose you have an online store where users purchase some goods from you. In your enterprise, your architects and developers have put together some front end systems like a shopping web site and a payment system. In the back end you have your databases, inventory and reporting systems. It is composed with SOA services through some enterprise service bus (ESB). Currently those systems interconnect with request-reply synchronous behavior.

Now management says to the IS team: "we want to institute two new features - a real-time fraud detection system and a loyalty program". If you stick with the current design, you will have to embed a fraud detection system in multiple systems - those systems will have to become fraud-aware. And the fraud business rules may need to correlate across systems, building in even more complex dependencies. I'm going to steal two slides from Norberg's presentation and here is the first to illustrate:

(Image from slides available here)

This becomes an unscalable model. The fraud and loyalty systems are not only embedded in other systems, but they also will need to query the database(s) you have - thus creating more inter-dependencies, the opposite of a simple system. If a database schema were to change, the fraud and loyalty systems also have to be updated.

Event-Driven Architecture challenges us to look for opportunities where a more loosely-coupled design is possible. What if, instead, we pulled the fraud and loyalty systems out, made them standalone, perhaps give them their own databases (to keep just what they want for as long as they need) and had them subscribe to a domain event queue from the other systems?

If we modify the Shop and Payment systems to fire off domain events that describe what they do as they do it, the fraud and loyalty systems can subscribe to those in real-time. Remember also that events should ideally contain all information about the event so that subscribers do not have to go back to the source systems to get more information. This then also decouples the existing Customer and Reporting databases from the Fraud and Loyalty system, simplifying the design.

In fact, now that we are thinking this way, we can see that the Reporting database is a read-only database that could also be populated by subscribing to the event stream. So by rearchitecting now we have this:

(Image from slides available here)

In this slide there is also a generic Business Activity Module (BAM) to show that additional feature requests from management could be added here as well. This slide illustrates nicely how EDA and SOA can work together, allowing us to choose the right model for each subdomain.


/* ---[ Integration Modes ]--- */

Integrations, as you've seen, are not point-to-point in an EDA system. Thus, an EDA system is very amenable to an ESB or EIP (Enterprise Integration Patterns) distributed system, such as Apache Camel.

Unlike batch-based ("offline") modes of integration, EDA systems can stay up-to-date in real or near-real time. For example, instead of an hourly or nightly ETL job to synchronize datastores, one datastore can publish its new entries to a queue and the other system can subscribe and update its datastore immediately.

The flexibility at each layer is the strength of EDA. What gets generated, how it is processed/formatted, who receives it and what they do with it are loosely coupled, and each layer depends only on the one whose queue/event stream it directly subscribes to. Even then, it is not a deep, complex API-based dependency.

With CORBA, DCOM and RMI, the caller and recipient have to be in lock-step with each other through all version changes - they are hard-coded pre-compiled APIs that must match. But with Events, you need not have as much coupling. If event generators begin publishing new types of events, it will not break downstream systems and the subscribers can continue on as before, or adapt and be able to handle the new messages on their own update schedule.

You can also evolve the types of messages you process. For example, you could pass around the equivalent of HashMaps that contain the various data you need, rather than strictly formatted XML documents. If you add more keys and values to the map, a downstream subscriber can still pull out the old ones it knows about.
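This HashMap idea is essentially a "tolerant reader": the consumer extracts only the keys it understands and ignores the rest. A minimal sketch, with hypothetical field names:

```java
import java.util.Map;

// A consumer that survives message evolution by reading only the keys it
// knows about; new keys added by a newer producer are simply ignored.
public class TolerantReader {

    // Extracts the original fields this consumer understands.
    public static String describeOrder(Map<String, Object> event) {
        return event.get("sku") + " x" + event.get("quantity");
    }

    public static void main(String[] args) {
        Map<String, Object> v1 = Map.of("sku", "ABC-123", "quantity", 2);
        Map<String, Object> v2 = Map.of("sku", "ABC-123", "quantity", 2,
                                        "giftWrap", true);  // key added later
        // Both versions are handled identically by the old consumer:
        System.out.println(describeOrder(v1));
        System.out.println(describeOrder(v2));
    }
}
```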

The picture below shows how a reporting system can be upgraded at a different rate than the systems generating the events it subscribes to. As long as the basic “API”, i.e., format of the message, is retained, the consumers can be updated at a slower rate than the producers without breaking the overall system.


/* ---[ Next Installment ]--- */

Hopefully this installment in the EDA series gave you a sense of the power of EDA and promoted the idea that EDA and SOA are not competitors, but rather co-exist peacefully.

In the next blog post on EDA, I'll cover four additional topics: performance characteristics, domain event/message formats, complex event processing and guaranteed delivery. The latter is something you will want to have in many circumstances. Take our example above of having the reporting database populated by an event stream from the shopping and payment systems. In order to have accurate reporting, you will want to guarantee that eventually all events reach the reporting database, by some means of guaranteed delivery.


Key References:
1. Domain Event Driven Architecture, presentation by Stefan Norberg
2. Event-Driven Architecture Overview (PDF), whitepaper
3. How EDA extends SOA and why it is important, blog post



Other entries in this blog series:
Events and Event-Driven Architecture: Part 1
Events and Event-Driven Architecture: Part 3

Saturday, January 7, 2012

Events and Event-Driven Architecture: Part 1

The terms "event" and "event-driven" in software design are actually somewhat overloaded, so I endeavored to do a little research on their various meanings. In this blog entry I start by defining Event-Driven Programming (EDP), contrasting it with Event-Driven Architecture (EDA), and then dive into some parts of the latter. Later blog entries will delve into additional areas of EDA.

There isn't any original research or design here - rather a summary of key usages I have found.

First let's make an initial distinction between event-driven programming and event-driven architecture.


/*---[ Event Driven: Programming vs. Architecture ]---*/

Event Driven Architecture in general describes an architecture where layers are partitioned and decoupled from each other to allow better or easier composition, asynchronous performance characteristics, loose coupling of APIs and dependencies and easier profiling and monitoring. Its goal is a highly reliable design that allows different parts of the system to fail independently and still allow eventual consistency. Layers in an Event Driven Architecture typically communicate by means of message queues (or some other messaging backbone).

Event Driven Programming is a programming model that deals with incoming events by having the programmer process those events one after the other in a single thread. Events come from outside the main thread - a user action, an asynchronous server or system reply - and get put onto the event queue for the main processing thread to handle. This is the model of many GUI programming environments, such as Java AWT/Swing. The most prominent example of EDP I'm aware of is the JavaScript model in the browser and now on the server side with Node.js.

EDP and EDA have no special co-dependent relationship to each other. One can do either, both or neither. However, they do share the central idea that events come in, typically from "outside" (the meaning of which is context dependent), are communicated in some known format and put onto queues to be processed one at a time.

I won't speak any more of Event Driven Programming here, but a nice introductory tutorial to Node.js is The Node Beginner Book by Manuel Kiessling.


/*---[ Staged Event Driven Architecture ]---*/

The next major divide I will consider is between inter-system EDA and intra-system EDA.

Staged Event-Driven Architecture, or SEDA, is a form of intra-system EDA. The term was coined by Matt Welsh for his Ph.D. work. It models a system by breaking it down into serial processing stages that are connected by queues. Welsh calls this a "well-conditioned" system, meaning one that can handle web-scale loads without dropping requests on the floor. When incoming demand exceeds system processing capacity, it can simply queue up requests in its internal buffer queues.

The processing stages of a SEDA system can either be low-level or high-level subsystems. The Sandstorm web server built by Welsh for his thesis used very low-level subsystems - things like HTTP parsers, socket readers and writers, and cache handlers.
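The stages-connected-by-queues idea can be sketched with plain java.util.concurrent primitives. This is an illustration of the shape of a SEDA stage, not Welsh's Sandstorm code; the `Stage` class and all names here are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// One SEDA stage: a bounded event queue plus a worker thread that
// applies the stage's handler and forwards results to the next queue.
class Stage<I, O> {
    final BlockingQueue<I> inbox = new ArrayBlockingQueue<>(100);

    Stage(Function<I, O> handler, BlockingQueue<O> next) {
        Thread worker = new Thread(() -> {
            try {
                while (true) next.put(handler.apply(inbox.take()));
            } catch (InterruptedException e) { /* stage shut down */ }
        });
        worker.setDaemon(true);
        worker.start();
    }
}

class SedaDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> out = new ArrayBlockingQueue<>(100);
        // A hypothetical "parse" stage feeding its output queue.
        Stage<String, String> parse = new Stage<>(s -> s.trim().toUpperCase(), out);
        parse.inbox.put("  get /index  ");
        System.out.println(out.take()); // GET /INDEX
    }
}
```

Chaining several such stages, each with its own queue and thread pool, gives the staged pipeline Welsh describes; the bounded queues are what absorb bursts of demand.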

In SEDA, events are pieces of information to be processed or acted upon, at a given stage in the processing workflow. This definition of event is more compatible with the EDP definition of event than the typical EDA definition of event (which will be covered in the next blog entry).

The Actor model is a good fit for doing SEDA. I'll speak more about this in a section below. First, let's do a quick assessment of the benefits and drawbacks of SEDA.


Proposed Benefits of SEDA Design

  • Support high demand volume with high concurrency

    • The system can be tuned to perform at a fixed maximum level for a given number of simultaneous requests; if that level is exceeded, the excess requests pile up on the queues rather than degrading performance by having the system process too many requests simultaneously.
  • Handle stage-level "backpressure" and load management

    • Without such handling, requests arriving faster than they can be processed can cause Out-of-Memory errors
  • Isolate well-conditioned modular services and simplify their construction

    • Simple here follows Rich Hickey's objective definition: unentwined or unentangled with other systems.

    • Well-conditioned refers to a service acting like a simple pipeline, where the depth of the pipeline is determined by the number of processing stages

  • Enable introspection by watching the queues between modules - allows a system to tune responses to conditions

    • It allows self-tuning, dynamic resource management, such as adjusting thread pool count, connection pool counts

    • Queues can incorporate filtering and throttling

    • Note: I haven't seen it mentioned in any reading/examples yet, but one could also implement the queues as priority queues that push higher priority requests to the front of the line.
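The backpressure benefit above is easy to demonstrate: a bounded queue refuses work when a stage is saturated, so the caller can shed or redirect load instead of letting requests accumulate until memory runs out. A minimal sketch using the JDK's `ArrayBlockingQueue`:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BackpressureDemo {
    public static void main(String[] args) {
        // A stage that can buffer at most 2 pending requests.
        BlockingQueue<String> inbox = new ArrayBlockingQueue<>(2);

        // offer() returns false instead of growing without bound,
        // so the caller can reject or redirect the excess request.
        System.out.println(inbox.offer("req-1")); // true
        System.out.println(inbox.offer("req-2")); // true
        System.out.println(inbox.offer("req-3")); // false: queue full
    }
}
```

The alternative, `put()`, blocks the producer instead of failing fast - which of the two you want is exactly the kind of per-stage tuning decision SEDA exposes.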

However, those interested in pursuing a SEDA design for systems that will need to handle heavy loads, particularly when needing predictable latency, should read Welsh's post-thesis note and the retrospective he wrote a few years later on SEDA.

It turns out that SEDA designs can be extremely finicky - they can require a lot of tuning to get the desired throughput. In many cases, SEDA designs do not outperform alternative designs. Welsh points out that the primary intent of the SEDA model was to handle backpressure and overload (when requests come in faster than they can be processed), not necessarily to be more performant.

The reason for the finicky performance, I believe, was uncovered by the LMAX team in London - queues are actually very poor data structures for high-concurrency environments. When many readers and writers try to access a queue simultaneously, the system bogs down: both readers and writers have to "write" to the queue (updating where head and tail point), so most of the time is spent managing, acquiring and waiting on locks rather than doing productive work. There are better high-volume concurrency data structures, such as ring buffers, that should be considered instead of queues.
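To show the core shape of a ring buffer, here is a minimal, single-threaded sketch (the real LMAX Disruptor adds per-consumer sequence counters, memory barriers and cache-line padding; nothing below is its actual API): a fixed, pre-allocated array indexed by ever-increasing head and tail counters, so no nodes are allocated per element and claiming a slot is a simple increment.

```java
// Minimal bounded ring buffer - a single-threaded illustration only.
class RingBuffer<T> {
    private final Object[] slots;
    private final int mask;          // capacity must be a power of two
    private long head = 0, tail = 0; // ever-increasing counters

    RingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(T item) {
        if (tail - head == slots.length) return false; // full
        slots[(int) (tail++ & mask)] = item;           // wrap via bitmask
        return true;
    }

    @SuppressWarnings("unchecked")
    T poll() {
        if (head == tail) return null;                 // empty
        return (T) slots[(int) (head++ & mask)];
    }
}
```

Note that the producer only ever writes `tail` and the consumer only ever writes `head` - in the Disruptor, that separation is what lets a single producer and consumer coordinate without locks.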

Two side notes

  1. I wonder if anyone is considering building an Actor model on a ring buffer or the LMAX Disruptor? I haven't studied the Akka library carefully enough to know whether this would be a good idea and potentially make the Actor model even more performant - I'd love to hear from people with experience on this.

  2. There are other data structures being designed for the "multicore age". One article worth reviewing was published in the ACM in March 2011. Currently you have to pay to get the article, but there is a version available on scribd last time I checked.


/*---[ Using The Actor Model to Implement SEDA ]---*/

An alternative that Matt Welsh (originator of the SEDA model) never discusses in his writings (at least the ones I've read) is the Erlang Actor model. In Erlang, actors communicate across processes by message passing. They can be arranged in any orientation desired, but one useful orientation would certainly be a staged pipeline.

There are a number of Actor libraries that bring the Erlang model to the JVM using threads rather than Erlang-style processes. These libraries differ in their details. I will focus here on the Akka library (http://akka.io/), since that is the one I am familiar with. Akka is written in Scala, but is fully usable from Java.

Five of the key strengths of the Erlang/Akka model are:

  1. Each Actor is fully isolated from other actors (and other non-actor objects), even more so than regular objects. The only way to communicate with an Actor is via message passing, so it is a shared-nothing environment except for the messages, which must/should be immutable.

  2. Because every Actor has its own message queue (mailbox), the model encourages designing your application in terms of workflows, determining where stage boundaries are and where concurrent processing is needed or useful.

    • Akka allows you to do message passing like you would in a traditional ESB but with speed!
  3. It is built on a philosophy that embraces failure, rather than fighting it and losing. This is Erlang’s "Let It Crash" fault-tolerant philosophy using a worker/supervisor mode of programming: http://blogs.teamb.com/craigstuntz/2008/05/19/37819/

    • In this model you don’t try to carefully program to prevent every possible thing that can go wrong. Rather, accept that failures and crashes of actors/threads/processes (even whole servers in the Erlang model) will occur and let the framework (through supervisors) restart them for you

    • This is the key to the “5 nines” of reliability in the software of the telecom industry that Erlang was built for

  4. The Actor model is built for distributed computing. In Erlang’s model, Actors are separate processes. In Akka’s model they can either be in separate processes or in the same process.

    • Akka Actors are extremely lightweight, using only 600 bytes of memory (when bare) and consuming no resources when not actively processing a message. They typically are configured to share threads with other Actors, so many thousands, even millions of Actors can be spun up in memory on modern server architecture.
  5. Finally, Actors bring a lock free approach to concurrency. If you put any mutable state in the Actor itself, then it cannot be shared across threads, avoiding the evil of shared mutable state.

Programming with message passing to queues (or mailboxes, as they are called in Akka) raises the level of abstraction. You can now think in terms of modules or module-like subsystems that can be designed to have a deterministic pathway. The internals of a system become easier to reason about, especially compared to hand-rolled multi-threaded code using shared mutable state.
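The mailbox idea can be illustrated with a toy actor: private mutable state, a queue of messages, and a single worker thread, so the state is only ever touched by one thread. This is a sketch of the concept only, not Akka's API (Akka actors share dispatcher threads rather than each owning one, which is why they are so much cheaper); all names are illustrative.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Toy actor: the mailbox is the only way in, so the mutable
// counter is isolated - no locks, no shared mutable state.
class CounterActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private int count = 0; // touched only by the worker thread

    CounterActor(Consumer<Integer> onCount) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take(); // one message at a time
                    if (msg.equals("inc")) count++;
                    else if (msg.equals("get")) onCount.accept(count);
                }
            } catch (InterruptedException e) { /* actor stopped */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    void tell(String msg) { mailbox.add(msg); } // fire-and-forget send
}
```

Because messages are processed strictly one at a time, sending "inc", "inc", "get" from any thread always reports 2 - the determinism the paragraph above describes.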

In addition, the Akka Actor model:

  • applies well to transactional systems, CPU-bound operations, or I/O-bound operations (by using multiple concurrent Actors).
  • applies to either fully isolated actions (say number crunching or the "map" part of a MapReduce function) or to coordinated actions, but using isolated mutability, rather than shared mutability.
  • has load balancing built in, in two ways:
    • An explicit load balancer you can create and send all messages to (optional)
    • A dispatcher (required) behind the scenes that determines how events are passed and shared between Actors. For example, one Dispatcher is "work stealing": actors that are heavily loaded give work to actors that are less heavily loaded (in terms of items in their mailbox)

With these features, hopefully you can see how the Actor model fits a SEDA approach. Next I describe one way to do so.


(One Model For) Applying a SEDA Design to Actors

  1. Define one Actor class per Stage (http://www.slideshare.net/bdarfler/akka-developing-seda-based-applications)
  2. Use a shared dispatcher and load balancer
  3. Grab a reference to the Actors (or Load Balancer) of the next stage by looking it up in the Actor registry (set up automatically by the Akka framework).
  4. In cases where you need to throttle load by applying “backpressure” to requests, you can configure Actor mailboxes
    • Note: I need more information on exactly what this entails and when it would be useful, as I haven't researched this enough yet - feedback from readers welcome!
  5. The Actor benchmarks I have seen are very impressive - hundreds of thousands of messages per second - but as with SEDA in general, Akka provides many knobs and dials to adjust nearly everything
    • As with SEDA, this can be a two-edged sword - you have many knobs to tweak to get better throughput in your system, but you may also have to do a fair amount of tweaking to reach the performance you might get by default with a more traditional programming model.
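The stage-per-Actor wiring of step 1 can be sketched as follows (again with toy actors rather than Akka's API; all class and method names here are illustrative): each stage owns a mailbox and forwards its result to the next stage's tell().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;

// One actor per stage: the handler transforms a message and the
// result is pushed to whatever the next stage exposes as a sink.
class StageActor<I, O> {
    private final BlockingQueue<I> mailbox = new LinkedBlockingQueue<>();

    StageActor(Function<I, O> handler, Consumer<O> next) {
        Thread worker = new Thread(() -> {
            try {
                while (true) next.accept(handler.apply(mailbox.take()));
            } catch (InterruptedException e) { /* actor stopped */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    void tell(I msg) { mailbox.add(msg); }
}

class ActorPipelineDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> out = new LinkedBlockingQueue<>();
        // Wire the pipeline back-to-front: parse -> render -> out.
        StageActor<String, String> render = new StageActor<>(s -> "<" + s + ">", out::add);
        StageActor<String, String> parse = new StageActor<>(String::trim, render::tell);
        parse.tell("  hello  ");
        System.out.println(out.take()); // <hello>
    }
}
```

In real Akka you would look up the next stage's ActorRef (via the registry, per step 3) instead of passing a Consumer, and let a shared dispatcher schedule the stages onto a thread pool rather than dedicating a thread per actor.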




I'll stop here for today. I have not yet hit on what I think is the main meaning of Event-Driven Architecture - that of inter-system EDA.

Another use of events is for "Event-Sourcing", which often fits hand-in-glove with the Command Query Responsibility Segregation (CQRS) pattern.

I will definitely do a write-up on EDA for another blog entry and possibly will dive into Event Sourcing and CQRS at a later time.

Feedback, disagreements, builds from readers with experience in these domains are welcome!




Key Links/Sources:

SEDA
http://www.eecs.harvard.edu/~mdw/proj/seda/
http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
http://www.theserverside.com/news/1363672/Building-a-Scalable-Enterprise-Applications-Using-Asynchronous-IO-and-SEDA-Model
http://www.slideshare.net/soaexpert/actors-model-asynchronous-design-with-non-blocking-io-and-seda
http://www.infoq.com/articles/SEDA-Mule


Akka
http://akka.io/
http://www.slideshare.net/bdarfler/akka-developing-seda-based-applications


Next entries in this blog series:
Events and Event-Driven Architecture: Part 2
Events and Event-Driven Architecture: Part 3