Sunday, June 7, 2015

Merkle Tree

This week I hit upon the need to do checkpointing in a data processing system with two requirements: no data event can ever be lost, and no events can be processed and streamed out of order. I wanted a way to auto-detect violations of either requirement in production, in real time.

There are a couple of ways to do this, but since our data events already have a signature attached to them (a SHA1 hash), I decided that a useful way to build the checkpoint is to keep a hash of hashes. One could do this with a hash list: keep the hash of each data element in order, and when a checkpoint occurs, take a single hash over all of those hashes.
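A minimal sketch of that hash-list approach, with SHA-1 chosen just for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class HashListCheckpoint {
    /**
     * Computes a single checkpoint signature over an ordered list of
     * per-event signatures by hashing the concatenation of all of them.
     */
    public static byte[] checkpoint(List<String> eventSigs) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (String sig : eventSigs) {
            md.update(sig.getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }
}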

A disadvantage of this model is that if the downstream system detects a hash mismatch (due either to a lost message or to messages arriving out of order), it then has to iterate over the full list to figure out where the problem is.

An elegant alternative is a hash tree, aka a Merkle Tree named after its inventor Ralph Merkle.


/* ---[ Merkle Trees ]--- */

Merkle trees are typically implemented as binary trees where each non-leaf node is a hash of the two nodes below it. The leaves can either be the data itself or a hash/signature of the data.

If a difference at the root hash is detected between systems, a binary search can be done through the tree to determine which particular subtree has the problem. Typically only O(log N) nodes need to be inspected, rather than all N, to find the problem area.

Merkle trees are particularly effective in distributed systems where two separate systems can compare the data on each node via a Merkle tree and quickly determine which data sets (subtrees) are lacking on one or the other system. Then only the subset of missing data needs to be sent. Cassandra, based on Amazon's Dynamo, for example, uses Merkle trees as an anti-entropy measure to detect inconsistencies between replicas.

The Tree Hash EXchange format (THEX) is used in some peer-to-peer systems for file integrity verification. In that system the internal (non-leaf) nodes are allowed to use a different hashing algorithm than the leaf nodes. Below, IH = InternalHashFn and LH = LeafHashFn.

The THEX system also defines a serialization format and a way of dealing with incomplete trees. THEX ensures that all leaves are at the same depth from the root node. To do that it "promotes" nodes: when a parent has only one child, it does not take a hash of the child's hash; instead it simply "inherits" it. If that is confusing, think of the Merkle tree as being built from the bottom up: all the leaves are present and hashes of hashes are built until a single root is present.

Note that in such a tree, any chain of single-child nodes (the nodes labeled E, H, and J in the THEX example) all end up with the same signature, since each simply inherits the signature of its only child.


/* ---[ Merkle Tree as Checkpoint Data ]--- */

I recently published my implementation of a Merkle Tree: https://github.com/quux00/merkle-tree

Before I describe the implementation, it will help to see the use case I'm targeting.

The scenario is a data processing pipeline where messages flow in one direction from system A to system B to system C. All the messages that come out of A go into B, where they are processed and transformed into some new value-added structure and sent on to C. In between are queues that decouple the systems.

Throughput needs to be as high as possible and every message that comes out of A must be processed by B and sent to C in the same order. No data events can be lost or reordered. System A puts a signature (a SHA1 hash) into the metadata of the event and that metadata is present on the message event that C receives.

To ensure that all messages are received and in the correct order, a checkpoint is periodically created by A, summarizing all the messages sent since the last checkpoint. That checkpoint message is put onto the Queue between A and B; B passes it downstream without alteration so that C can read it. Between checkpoints, system C keeps a running list of all the events it has received so that it can compute the signatures necessary to validate what it has received against the checkpoint message that periodically comes in from A.


/* ---[ My Implementation of a Merkle Tree ]--- */

The THEX Merkle Tree design was the inspiration for my implementation, but for my use case I made some simplifying assumptions. For one, I start with the leaves already having a signature. Since THEX is designed for file integrity comparisons, it assumes that you have segmented a file into fixed size chunks. That is not the use case I'm targeting.

The THEX algorithm "salts" the hash functions in order to ensure that there will be no collisions between leaf hashes and internal node hashes. It prepends the byte 0x01 to the input of the internal hash function and the byte 0x00 to the input of the leaf hash function:

internal hash function = IH(X) = H(0x01, X)
leaf hash function = LH(X) = H(0x00, X)
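In Java terms, that salting amounts to something like this sketch (SHA-1 stands in here for whatever digest function is actually being used):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ThexHashes {
    // internal hash function: IH(X) = H(0x01, X)
    static byte[] internalHash(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update((byte) 0x01);   // salt byte for internal nodes
        md.update(data);
        return md.digest();
    }

    // leaf hash function: LH(X) = H(0x00, X)
    static byte[] leafHash(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update((byte) 0x00);   // salt byte for leaf nodes
        md.update(data);
        return md.digest();
    }
}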

It is useful to be able to distinguish leaf from internal nodes (especially when deserializing), so I morphed this idea into one where each Node has a type byte -- 0x01 identifies an internal node and 0x00 identifies a leaf node. This way I can leave the incoming leaf hashes intact for easier comparison by the downstream consumer.

So my MerkleTree.Node class is:

static class Node {
  public byte type;  // INTERNAL_SIG_TYPE or LEAF_SIG_TYPE
  public byte[] sig; // signature of the node
  public Node left;
  public Node right;
}
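
The repo has the full construction logic, but the bottom-up build boils down to something like this sketch (illustrative rather than the exact code in the repo; internalHash here uses the Adler32 checksum discussed in the next section):

import java.util.ArrayList;
import java.util.List;
import java.util.zip.Adler32;

// Sketch of the bottom-up build: pair up the nodes at each level, hash the
// concatenation of each pair's signatures to form the parent, and "promote"
// an unpaired node's signature unchanged.
static Node buildTree(List<Node> leaves) {
    List<Node> level = leaves;
    while (level.size() > 1) {
        List<Node> parents = new ArrayList<>();
        for (int i = 0; i < level.size(); i += 2) {
            Node parent = new Node();
            parent.type = 0x01;                // INTERNAL_SIG_TYPE
            parent.left = level.get(i);
            if (i + 1 < level.size()) {
                parent.right = level.get(i + 1);
                parent.sig = internalHash(parent.left.sig, parent.right.sig);
            } else {
                parent.sig = parent.left.sig;  // odd node out: promote its signature
            }
            parents.add(parent);
        }
        level = parents;
    }
    return level.get(0);
}

// Internal-node digest: Adler32 over the two child signatures (see next section).
static byte[] internalHash(byte[] leftSig, byte[] rightSig) {
    Adler32 adler = new Adler32();
    adler.update(leftSig);
    adler.update(rightSig);
    long checksum = adler.getValue();          // 32-bit value returned in a long
    return new byte[] {
        (byte) (checksum >> 24), (byte) (checksum >> 16),
        (byte) (checksum >> 8),  (byte) checksum
    };
}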


/* ---[ Hash/Digest Algorithm ]--- */

Since the leaf signatures are passed in, my MerkleTree does not know (or need to know) what hashing algorithm was used on the leaves. It only concerns itself with the digest algorithm for the internal nodes.

The choice of hashing or digest algorithm is important, and depends on whether you want to maximize performance or security. If you are using a Merkle tree to ensure the integrity of data between peers that should not trust one another, then security is paramount and a cryptographically secure hash, such as SHA-256, Tiger, or SHA-3, should be used.

For my use case, I was not concerned with detecting malicious tampering. I only need to detect data loss or reordering, and have as little impact on overall throughput as possible. For that I can use a CRC rather than a full hashing algorithm.

Earlier I ran some benchmarks comparing the speed of Java implementations of SHA-1, Guava's Murmur hash, CRC32 and Adler32. Adler32 (java.util.zip.Adler32) was the fastest of the bunch. The typical use case for the Adler CRC is to detect data transmission errors. It trades off reliability for speed, so it is the weakest choice, but I deemed it sufficient to detect the sort of error I was concerned with.

So in my implementation the Adler32 checksum is hard-coded. If you want to change that, the internal digest algorithm could be made injectable or configurable, or you can simply copy the code and change it to use the algorithm you want.

The rest of the code is written to be agnostic of the hashing algorithm - all it deals with are the bytes of the signature.


/* ---[ Serialization / Deserialization ]--- */

My implementation has efficient binary serialization built into the MerkleTree and an accompanying MerkleDeserializer class that handles the deserialization.

I chose not to use the Java Serialization framework. Instead the serialize method just returns an array of bytes and deserialize accepts that byte array.

The serialization format is:

(magicheader:int)(numnodes:int)
[(nodetype:byte)(siglength:int)(signature:[]byte)]

where (foo:type) indicates the name (foo) and the type/size of the serialized element. I use a magic header of 0xcdaace99 to allow the deserializer to be certain it has received a valid byte array.

The next number indicates the number of nodes in the tree. Then follows an "array" of numnodes size where the elements are the node type (0x01 for internal, 0x00 for leaf), the length of the signature and then the signature as an array of bytes siglength long.

By including the siglength field, I can allow leaf node signatures to be "promoted" to the parent internal node when there is an odd number of nodes at a level. This means internal nodes can carry signatures of different lengths, since a promoted leaf signature may be longer than the Adler32 signatures of the other internal nodes.
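
A sketch of a serializer for that format (illustrative; the code in the repo differs in its details):

import java.nio.ByteBuffer;
import java.util.List;

// Sketch of serializing the tree in the format described above:
// (magicheader:int)(numnodes:int) followed by one
// (nodetype:byte)(siglength:int)(signature:[]byte) entry per node.
static byte[] serialize(List<Node> nodesInOrder) {
    int size = 8;  // magic header int + numnodes int
    for (Node n : nodesInOrder) {
        size += 1 + 4 + n.sig.length;  // type byte + siglength int + sig bytes
    }
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putInt(0xcdaace99);            // magic header
    buf.putInt(nodesInOrder.size());   // number of nodes
    for (Node n : nodesInOrder) {
        buf.put(n.type);
        buf.putInt(n.sig.length);
        buf.put(n.sig);
    }
    return buf.array();
}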


/* ---[ Usage ]--- */

For the use case described above, you can imagine that system A does the following:

List<String> eventSigs = new ArrayList<>();

while (true) {
  Event event = receiveEvent();
  String hash = computeHash(event);
  // ... process and transmit the message to the downstream Queue
  sendToDownstreamQueue(hash, event);

  eventSigs.add(hash);

  if (isTimeForCheckpoint()) {
    MerkleTree mtree = new MerkleTree(eventSigs);
    eventSigs.clear();
    byte[] serializedTree = mtree.serialize();
    sendToDownstreamQueue(serializedTree);
  }
}

And system C would then do something like:

List<String> eventSigs = new ArrayList<>();

while (true) {
  Event event = receiveEvent();

  if (isCheckpointMessage(event)) {
    MerkleTree mytree = new MerkleTree(eventSigs);
    eventSigs.clear();

    byte[] treeBytes = event.getDataAsBytes();
    MerkleTree expectedTree = MerkleDeserializer.deserialize(treeBytes);
    byte[] myRootSig = mytree.getRoot().sig;
    byte[] expectedRootSig = expectedTree.getRoot().sig;
    if (!signaturesAreEqual(myRootSig, expectedRootSig)) {
      evaluateTreeDifferences(mytree, expectedTree);
      // ... send alert
    }

  } else {
    String hash = event.getOriginalSignature();
    eventSigs.add(hash);
    // .. do something with event
  }
}
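
The evaluateTreeDifferences step above is not something the library provides; a recursive walk along these lines (a sketch, assuming both trees were built from the same number of leaves and that the code lives somewhere MerkleTree.Node is visible) narrows the mismatch down to the offending leaves:

import java.util.Arrays;

// Sketch: walk both trees in lockstep; any subtree whose root signatures
// match is pruned immediately, so only the paths down to the differing
// leaves are inspected -- the log(N) behavior described earlier.
static void evaluateTreeDifferences(MerkleTree mine, MerkleTree expected) {
    diffNodes(mine.getRoot(), expected.getRoot());
}

static void diffNodes(MerkleTree.Node mine, MerkleTree.Node expected) {
    if (Arrays.equals(mine.sig, expected.sig)) {
        return;  // whole subtree matches; nothing below needs inspection
    }
    if (mine.left == null && mine.right == null) {
        // leaf-level mismatch: an event was lost, reordered, or altered here
        System.out.println("Leaf mismatch: expected " + Arrays.toString(expected.sig)
                + " but saw " + Arrays.toString(mine.sig));
        return;
    }
    if (mine.left != null && expected.left != null) {
        diffNodes(mine.left, expected.left);
    }
    if (mine.right != null && expected.right != null) {
        diffNodes(mine.right, expected.right);
    }
}

System C can then map the offending leaf positions back to its eventSigs list to see exactly which events were lost or reordered.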

Thursday, December 25, 2014

Hooking up Kafka to Storm with the KafkaSpout in Storm 0.9.3

I have recently published a code example for using the Kafka Spout that is now part of Storm 0.9.3. After wasting a day fighting with incompatible logging frameworks between Kafka and Storm, I found an example project in Scala that had the necessary exclusions to get the uber-jar to run in Storm cluster mode. I've converted that example to Java and Maven, rather than Scala and sbt, and considerably simplified the examples so you can focus on just one thing: getting Kafka hooked up to Storm using the KafkaSpout.

The code repo is here: https://github.com/quux00/kstorm

Sunday, October 19, 2014

Updated microbenchmarks for Java String tokenization

I've been focusing a lot on Java performance lately and ran across a Programmers StackExchange post asking the most efficient way to tokenize a string in Java. It struck a nerve with me because the most common way to do this is to use String#split. But the split method takes as its tokenization delimiter a string that gets compiled into a regular expression. Which seems nice when you need it, but likely to be rather inefficient if all you need is to split on a character, such as a space. Then I looked up the old StringTokenizer class which I used to use years and years ago. And its Javadoc has this interesting tidbit:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

Strangely, however, the accepted answer for the StackExchange question links to a blog post that shows benchmarks claiming that StringTokenizer is faster than String#split. So why shouldn't I use StringTokenizer if I don't need a regex to split a string into fields?


/* ---[ Someone is wrong on the internet ]--- */


Well, that blog post needs to be updated for two reasons:

  1. it used Java 6, so we need to see whether things are different in Java 7 and 8
  2. microbenchmarks are hard to do correctly and the author used questionable means (for example - there is not much warm up time and his timings are awfully short)

The difficulty of doing accurate microbenchmarks has been greatly assuaged by the creation of the JMH benchmark tool. If you haven't adopted it yet, put it at the top of your technical TODO list. The documentation is excellent and it comes with many useful samples.


/* ---[ Splitting a Java string into tokens ]--- */

Of the techniques used in the original post, I am interested in four candidates:

  1. tokenize the string yourself with String#indexOf
  2. use String#split
  3. use the almost deprecated StringTokenizer
  4. use Guava's Splitter

Here are the reported outcomes of the earlier measurements for these 4 options:

IndexOf: 50ms
StringTokenizer: 89ms
GuavaSplit: 109ms
Split: 366ms

I reran the author's exact code on my system using Java 7 and got the following:

IndexOf: 52ms
StringTokenizer: 104ms
GuavaSplit: 119ms
Split: 145ms

So using his benchmark technique, I got the same ordering, though the ratios have changed a bit. In Java 7, split is only about 3 times slower than indexOf, rather than 7 times slower.


/* ---[ What does JMH say? ]--- */

I mildly rewrote these benchmarks using JMH. I also made longer strings to tokenize, rather than the two token string used by the blog post author, as that seemed a more realistic data set. And I used three strings, instead of one:

String[] lines = new String[]{
    "0000 12 34 5 666 77 88888 99",
    "aa bb cc dd ee ff gg hh ii jj kk ll mm nn oo pp qq rr ss tt uu vv ww xx yy zz",
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod"
};

The JMH-based benchmark code is listed below in the code and setup section. The results are rather interesting. In the original benchmarks raw time was measured, so lower is better. With JMH I measured throughput (the default metric) in operations per millisecond, so higher is better:

With Java 7 (1.7.0_60-b19) the results were:

m.JavaStringSplits.indexOf           thrpt    10  840.147 ± 28.553 ops/ms
m.JavaStringSplits.javaSplit         thrpt    10  694.870 ±  3.858 ops/ms
m.JavaStringSplits.stringTokenizer   thrpt    10  473.812 ±  8.422 ops/ms
m.JavaStringSplits.guavaSplit        thrpt    10  392.759 ±  8.707 ops/ms

And with Java 8 (1.8.0_20-ea):

m.JavaStringSplits.indexOf           thrpt    10  827.136 ± 30.831 ops/ms
m.JavaStringSplits.javaSplit         thrpt    10  700.809 ±  7.215 ops/ms
m.JavaStringSplits.stringTokenizer   thrpt    10  480.788 ± 16.793 ops/ms
m.JavaStringSplits.guavaSplit        thrpt    10  386.398 ±  5.323 ops/ms

With the presumably more robust and accurate JMH model of measuring microbenchmarks, String#split moved up the ranks and handily outperforms both StringTokenizer and Guava Splitter.

My hand-coded splitter using indexOf had about 18% higher throughput than String#split, so we do see some overhead in String#split. (Part of the reason split now does so well is that since Java 7 it has a fast path that bypasses the regex engine when the delimiter is a single character that is not a regex metacharacter.) Still, I conclude that the Javadoc for StringTokenizer is correct: use String#split for most scenarios unless you know that 1) you don't need regular expressions to tokenize and 2) string splitting is your bottleneck.

A follow-up set of tests that could be done would be on very large strings. In his "Let the Caller Choose" Mechanical Sympathy post from a few years ago, Martin Thompson advocated devising an API where you pass in the data structure to be populated during tokenization, rather than relying on one being created in the underlying tokenization method and passed back to you. He states (but does not demonstrate):

In the Java world this idiom becomes critical to performance when large arrays are involved.


/* ---[ Find the flaw ]--- */

One of my favorite interview questions is to put a piece of code in front of someone and say "find the flaws". And Aleksey Shipilëv, the author of JMH, states:

Your [JMH] benchmarks should be peer-reviewed.

Mine have not been, so I throw it out to the internet to find the flaws in my quick analysis.


/* ---[ The code and setup ]--- */


Benchmarks were performed on:

Ubuntu (Xubuntu) 14.04 LTS with 3.13.0-37 Linux kernel with 8 Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz cores.
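
The benchmark class itself looked roughly like this (a sketch from memory rather than the exact source; it was run with JMH's default warmup and measurement iterations):

import com.google.common.base.Splitter;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

@State(Scope.Benchmark)
public class JavaStringSplits {

    private final String[] lines = new String[]{
        "0000 12 34 5 666 77 88888 99",
        "aa bb cc dd ee ff gg hh ii jj kk ll mm nn oo pp qq rr ss tt uu vv ww xx yy zz",
        "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod"
    };

    @Benchmark
    public List<String> indexOf() {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            int pos = 0;
            int end;
            while ((end = line.indexOf(' ', pos)) >= 0) {
                tokens.add(line.substring(pos, end));
                pos = end + 1;
            }
            tokens.add(line.substring(pos));
        }
        return tokens;  // return the result so JMH doesn't dead-code-eliminate the work
    }

    @Benchmark
    public List<String> javaSplit() {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            for (String tok : line.split(" ")) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    @Benchmark
    public List<String> stringTokenizer() {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line, " ");
            while (st.hasMoreTokens()) {
                tokens.add(st.nextToken());
            }
        }
        return tokens;
    }

    @Benchmark
    public List<String> guavaSplit() {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            for (String tok : Splitter.on(' ').split(line)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }
}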

Monday, March 17, 2014

Rust namespaces by example


/* ---[ Cheat sheet for Rust namespaces ]--- */

Rust namespaces can be a little mind-bending. This brief blog post is meant to provide instruction by example on setting up an application or library of Rust code that uses multiple files in a hierarchical namespace.

Though the current docs-project-structure guide on the Rust wiki is pretty sparse, you should start there. Then read the section on Crates and Modules in the tutorial.

I used those references plus an examination of how files and namespaces in the libstd code tree are structured to come up with this example.

In this simple example, I want to have an application where things are namespaced to the top-level module abc. I want a couple of files (namespaces) under abc, some additional directories (namespaces) below abc, such as abc::xyz, and modules in the abc::xyz namespace. Furthermore, I want all those namespaces to be able to refer to each other, both down the chain and up the chain.

Here is a simple example that illustrates how to do it. I am using Rust-0.10pre (built 16-Mar-2014).

First, I have a project I called "space-manatee", under which I have a src directory and then my code hierarchy starts:

quux00:~/rustlang/space-manatee/src$ tree
.
├── abc
│   ├── mod.rs
│   ├── thing.rs
│   ├── wibble.rs
│   └── xyz
│       ├── mod.rs
│       └── waldo.rs
└── main.rs

2 directories, 6 files

To provide a namespace foo in Rust, you can either create a file called foo.rs or a dir/file combo of foo/mod.rs. The content of my abc/mod.rs is:

quux00:~/rustlang/space-manatee/src$ cat abc/mod.rs 
pub mod thing;
pub mod wibble;
pub mod xyz;

All this module does is export other modules in the same directory. It could have additional code in it - functions and data structures, but I elected not to do that.



xyz is a directory, and since I created the xyz/mod.rs dir/file combo, it is a namespace that can be used and exported.

Let's look into the other files:

quux00:~/rustlang/space-manatee/src$ cat abc/thing.rs 
extern crate time;

use time::Tm;

pub struct Thing1 {
    name: ~str,
    when: time::Tm,
}

pub fn new_thing1(name: ~str) -> ~Thing1 {
    ~Thing1{name: name, when: time::now()}
}

thing.rs pulls in the rustlang time crate and then defines a struct and constructor for it. It doesn't reference any other space-manatee code.

quux00:~/rustlang/space-manatee/src$ cat abc/wibble.rs 
use abc::thing;
use abc::thing::Thing1;

pub struct Wibble {
    mything: ~Thing1
}

pub fn make_wibble() -> ~Wibble {
    ~Wibble{mything: thing::new_thing1(~"cthulu")}
}

wibble.rs, however, does reference other space-manatee modules, so it uses the fully qualified namespace from the top of the hierarchy, but it does not have to explicitly "import" anything. It can find the thing namespace without a mod declaration because thing.rs is in the same directory.



OK, let's look into the xyz directory now.

quux00:~/rustlang/space-manatee/src$ cat abc/xyz/mod.rs 
pub mod waldo;

That just exports the waldo namespace in the same directory. What's in waldo?

quux00:~/rustlang/space-manatee/src$ cat abc/xyz/waldo.rs 
use abc::wibble::Wibble;

pub struct Waldo {
    magic_number: int,
    w: ~Wibble
}

The Waldo struct references the Wibble struct that is higher than it in the hierarchy. Notice there is no "import" via a mod statement - apparently going up the hierarchy requires no import.

So that's the supporting cast. Let's see how the main.rs program uses them:

quux00:~/rustlang/space-manatee/src$ cat main.rs 
extern crate time;

use abc::{thing,wibble};
use abc::thing::Thing1;
use abc::wibble::Wibble;
use abc::xyz::waldo::Waldo;

pub mod abc;

fn main() {
    let tg: ~Thing1 = thing::new_thing1(~"fred");
    println!("{}", tg.name);

    let wb: ~Wibble = wibble::make_wibble();
    println!("{}", wb.mything.name);

    let wdo = Waldo{magic_number: 42,
                    w: wibble::make_wibble()};
    println!("{:?}", wdo);
}

The only mod "import" main.rs had to do is of the abc namespace - which is in the same directory as main.rs. In fact, that is all you can import. If you try mod abc::thing the compiler will tell you that you aren't doing it right.

By importing abc, you are importing abc/mod.rs. Go back up and look at what abc/mod.rs does - it imports other modules, which in turn import other modules, so they all end up being imported into main.rs as addressable entities.



Once all those import references are set up, nothing special has to be done to compile and run it:

quux00:~/rustlang/space-manatee/src$ rustc main.rs
quux00:~/rustlang/space-manatee/src$ ./main
fred
cthulu
abc::xyz::waldo::Waldo{
  magic_number: 42,
  w: ~abc::wibble::Wibble{
    mything: ~abc::thing::Thing1{
      name: ~"cthulu",
      when: time::Tm{tm_sec: 21i32,
        tm_min: 26i32, tm_hour: 21i32, tm_mday: 17i32, tm_mon: 2i32,
        tm_year: 114i32, tm_wday: 1i32, tm_yday: 75i32, tm_isdst: 1i32,
        tm_gmtoff: -14400i32, tm_zone: ~"EDT", tm_nsec: 663891679i32
      }
    }
  }
}

(I hand formatted the Waldo output for easier reading.)

Sunday, March 16, 2014

Select over multiple Rust Channels


/* ---[ Channels in Rust ]--- */

The channel paradigm for CSP-based concurrency has received a lot of attention lately since it is the foundational concurrency paradigm in Go and Clojure has embraced it with core.async. It turns out that Rust, the new language from Mozilla, also fully embraces channel-based message passing concurrency.

Both Go and Clojure's core.async have a select operation that allows your code to wait on multiple channels and respond to the first one that is ready. This is based, at least conceptually, on the Unix select system call that monitors multiple file descriptors.

Rust also has a select operation. And it has a select! macro to make using it easier. Here's an example:

use std::io::Timer;

fn use_select_macro() {
    let (ch, pt): (Sender<~str>, Receiver<~str>) = channel();

    let mut timer = Timer::new().unwrap();
    let timeout = timer.oneshot(1000);
    select! (
        s = pt.recv() => println!("{:s}", s),
        () = timeout.recv() => println!("timed out!")
    );
}

Channels and Ports are now called Senders and Receivers in Rust. As with select in Go, if the Receiver called pt has a message come in before the 1 second timer goes off, its code block will execute. Otherwise, the timer's Receiver will be read from and its code block executed, printing "timed out".

Note that the select! macro uses parens, like a function call, not curly braces like a code block.

The select! macro is currently labelled experimental, since it has some limitations. One I hit this week is that it will fail (as in, not compile) if you embed the Receiver in a struct:


fn does_not_compile() {
    let (ch, pt): (Sender<~str>, Receiver<~str>) = channel();
    let a = A{c: ch, p: pt};

    let mut timer = Timer::new().unwrap();
    let timeout = timer.oneshot(1000);
    select! (
        s = a.p.recv() => println!("{:s}", s), 
        () = timeout.recv() => println!("time out!")
    );
}

This fails with error: no rules expected the token '.'. I've filed an issue for it here: https://github.com/mozilla/rust/issues/12902#issuecomment-37714663


/* ---[ Using the Rust Channel Select API ]--- */

The workaround is to use the Select API directly. Here's how you do it:


use std::comm::Select;
use std::io::Timer;

fn select_from_struct() {
    let (ch, pt): (Sender<~str>, Receiver<~str>) = channel();    
    let mut timer = Timer::new().unwrap();
    let timeout = timer.oneshot(1000);

    let a = A{c: ch, p: pt};

    let sel = Select::new();
    let mut pt = sel.handle(&a.p);
    let mut timeout = sel.handle(&timeout);
    unsafe { pt.add(); timeout.add(); }
    let ret = sel.wait();

    if ret == pt.id() {
        let s = pt.recv();
        println!("ss: {:?}", s);
    } else if ret == timeout.id() {
        let () = timeout.recv();
        println!("TIMEOUT!!");
    }
}

It's a little more code, but fairly straightforward to follow. You wrap your Receivers in a select Handle and then add them to the Receiver set via the add method (which must be wrapped in an unsafe block). Each handle gets an id so you can discover which one returned first.

Finally you wait. When the wait returns, you check the returned id and execute the appropriate block of code, which starts by calling recv on the Receiver to get the incoming value (if any).

Sunday, January 26, 2014

Debugging Rust with gdb

[Update]: This blog post has been updated to Rust 0.11 as of mid-April 2014. The previous version was for Rust 0.9.

This blog entry outlines the current state of my understanding of how to use the gdb debugger for Rust programs. As of this writing, I'm using:

$ rustc -v
rustc 0.11-pre-nightly (e332287 2014-04-16 00:56:30 -0700)
host: x86_64-unknown-linux-gnu

$ gdb --version
GNU gdb (GDB) 7.6.1-ubuntu

A caveat before we begin: I'm not an expert in gdb and I'm (now less of) a newbie at Rust. I'm cataloging my findings for both my own future reference and to help anyone who needs to see an example in action. I'd welcome feedback on tips for better ways to do this.


/* ---[ GDB ]--- */

I won't give a gdb tutorial; there are lots of good ones on the web.


/* ---[ The setup ]--- */

The example I'll use is modified from an example from cmr in this Rust issue: https://github.com/mozilla/rust/issues/10350. I've chosen this one because it shows jumping into both a regular (named) function and an anonymous closure.

The codebase consists of two files, both in the same directory:

quux.rs:

pub fn quux00(x: || -> int) -> int {
    println!("DEBUG 123");
    x()
}

and bar.rs

extern crate quux;

fn main() {
    let mut y = 2;
    {
        let x = || {
            7 + y
        };
        let retval = quux::quux00(x);
        println!("retval: {:?}", retval);
    }
    y = 5;
    println!("y     : {:?}", y);
}

To compile them with debugging info included use the -g switch (this changed since version 0.9):

$ rustc -g --crate-type lib quux.rs
$ rustc -g -L . bar.rs

Start the bar program in gdb and let's set some breakpoints:

$ gdb bar
(gdb) break bar::main
Breakpoint 1 at 0x4042c0: file bar.rs, line 3.
(gdb) rbreak quux00
Breakpoint 2 at 0x43e760: file quux.rs, line 1.
int quux::quux00(int ())(int ());
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000004042c0 in bar::main at bar.rs:3
2       breakpoint     keep y   0x000000000043e760 in quux::quux00 at quux.rs:1

For the first breakpoint, I know the full path, so I just use break.

Sometimes getting the correct full path in Rust can be tricky, especially for functions that are parameterized, so it can be easier to just use rbreak, which takes a regular expression. All functions that match the regex will have a breakpoint set.

The second breakpoint does have the word "quux00" in it, but it's been mangled. There's a way to change that in Rust, but let's move on for the moment.


/* ---[ Aside: rbreak ]--- */

In my playing around so far, I've been unable to use break to set a breakpoint for a function not in the file with the main method, so rbreak has been very useful.

The rbreak command is pretty powerful. According to the documentation, if you want to set a breakpoint for all functions, rbreak . will do the trick. You don't want to do this for a Rust application, because there are hundreds of functions that get compiled in.

Instead, you'll want to limit the scope of the regex search by limiting the search to a single file with:

(gdb) rbreak bar.rs:.

But again I've only gotten this to work for the "main" file. If I type rbreak quux.rs:. it doesn't know what I'm talking about. Something for future research.


/* ---[ Let's debug ]--- */

So now we've got two breakpoints set, as indicated by the output from info breakpoints.

Let's start the debugging session:

(gdb) run
Starting program: /home/midpeter444/lang/rust/sandbox/debug/./bar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
... deleted a bunch of info about threads starting up ...
Breakpoint 1, bar::main () at bar.rs:3
3    fn main() {
(gdb) n
4        let mut y = 2;
(gdb) n
9            let retval = quux::quux00(x);
(gdb) list
4        let mut y = 2;
5        {
6            let x = || {
7                7 + y
8            };
9            let retval = quux::quux00(x);
10            println!("retval: {:?}", retval);
11        }
12        y = 5;
13        println!("y     : {:?}", y);
(gdb) p y
$1 = 2
(gdb) p x
$2 = {int ()} 0x7fffffffd4f8

Interesting that it seemed to skip lines 5-8 when I single stepped with n. But the x value was captured as an unnamed function, as you can see on the last line of the readout.

Now we are on line 9 (confirmed with the frame command below), so let's continue - our breakpoint on the quux00 function should now be tripped:

(gdb) frame
#0  bar::main () at bar.rs:9
9            let retval = quux::quux00(x);
(gdb) c
Continuing.

Breakpoint 2, quux::quux00 (x={int ()} 0x7fffffffd500) at quux.rs:1
1    pub fn quux00(x: || -> int) -> int {

Yes, it was tripped. Let's look around and single-step through it:

(gdb) frame
#0  quux::quux00 (x={int ()} 0x7fffffffd500) at quux.rs:1
1    pub fn quux00(x: || -> int) -> int {
(gdb) list
1    pub fn quux00(x: || -> int) -> int {
2        println!("DEBUG 123");
3        x()
4    }
(gdb) s
2        println!("DEBUG 123");
(gdb) p x
$4 = {int ()} 0x7fffffffd390

OK, we're inside the quux00 method. We stepped over the first instruction (the println) and inspected the x param, which is our anonymous Rust closure. Let's continue by stepping into the closure and see if that works:

(gdb) n
DEBUG 123
3        x()
(gdb) s
fn1356 () at bar.rs:6
6            let x = || {
(gdb) n
7                7 + y
(gdb) p y
$5 = 2
(gdb) n
bar::main () at bar.rs:10
10            println!("retval: {:?}", retval);

Excellent. That worked and we even had line numbers. Now we are back in the outer main fn. BTW, note that the "anonymous" closure has a name: fn1356 - remember that name, we'll come back to it later.

It's an easy walk to the finish line from here:

(gdb) list
5        {
6            let x = || {
7                7 + y
8            };
9            let retval = quux::quux00(x);
10            println!("retval: {:?}", retval);
11        }
12        y = 5;
13        println!("y     : {:?}", y);
14    }
(gdb) p retval
$3 = 9
(gdb) n
2    
(gdb) p y
$4 = 2
(gdb) c
Continuing.
retval: 9
y     : 5
[Inferior 1 (process 7007) exited normally]


/* ---[ Round 2: Set breakpoints on all methods in main file ]--- */

Let's start over and set breakpoints on all the functions in the bar.rs file:

$ gdb bar
(gdb) rbreak bar.rs:.
Breakpoint 1 at 0x4042c0: file bar.rs, line 3.
static void bar::main();
Breakpoint 2 at 0x404520: file bar.rs, line 6.
static int fn1356()();

Aah, there's the name again: fn1356. So that's another way to set a breakpoint on your closures. If we redo the session, we'll see that the breakpoint now gets tripped in the closure as it's being executed (from within the quux00 method):

(gdb) r
Starting program: /home/midpeter444/lang/rust/sandbox/debug/bar 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, bar::main () at bar.rs:3
warning: Source file is more recent than executable.
3    fn main() {
(gdb) c
Continuing.
DEBUG 123

Breakpoint 2, fn1356 () at bar.rs:6
6            let x = || {
(gdb) frame
#0  fn1356 () at bar.rs:6
6            let x = || {
(gdb) n
7                7 + y
(gdb) p y
$1 = 2


/* ---[ Round 3: Demangle function names ]--- */

In the Rust 0.9 version of this post I showed how Rust mangled the function names, but that seems to have gone away. You can still explicitly specify not to mangle function names like so:

#[no_mangle]
pub fn quux00(x: || -> int) -> int {
    println("DEBUG 123");
    x()
}

I haven't tried anything really complicated with Rust in gdb yet, but hopefully these examples serve to get you going.

Telephonic Whispers in Rust, revisited


/* ---[ Hitting the big time ]--- */

My previous blog entry was a bit of a trifle: it showed an implementation of a toy CSP (Communicating Sequential Processes) application in Rust, comparing it to an example implementation in Go by Rob Pike. I discovered, post-hoc, that it hit the Rust subreddit and Hacker News. There was some good discussion there and a few complaints.


/* ---[ Clarifications ]--- */

First, the complaints. Of course this example was tuned for how Go works: the example was created by Rob Pike. The point of his example, and my blog post, was that goroutines (Rust tasks) are lightweight (green threads) that are multiplexed onto native OS threads. The example is meant to show that you can spawn up tens of thousands to hundreds of thousands of goroutines (Rust tasks) pretty rapidly. That's not something you can do in Java, for example: I've tried on my Linux system and it cried uncle somewhere between 20,000 and 30,000 threads and hosed my system.

The Rust guide on Tasks says:

Rust can create hundreds of thousands of concurrent tasks
on a typical 32-bit system

So remembering the Pike example, I thought I'd try it with Rust. My primary intent was just to show that I had figured out how to implement it in Rust and that it worked. When you are learning Rust, getting something to compile is an achievement worth celebrating. The Rust compiler is wonderfully strict - enforcing all its rules in order to provide deterministic, safe memory management with static analysis rather than requiring a managed runtime. (But it can provide a managed runtime with garbage collection if you want that model, which is one of the things that makes Rust so intriguing.)

The simple timings in my earlier blog post were included only because the numbers were interestingly different. The reasons for that difference were the most interesting part of the community discussion, and I certainly learned a few things I didn't know about Rust, so, despite the complaints, I'm glad I included them. There wasn't much point in doing formal extensive benchmarks, as some people suggested, because, as other posters pointed out, the example is trifling and not something you would typically do.

Second, I called the thing "Chinese Whispers" only because that's what the esteemed Rob Pike called it and I was clearly doing a port of his work in the presentation I cited. The name makes no sense to me, but since there were charges of racism and suggestions of calling it "Broken Telephone", I'll now call it "Telephonic Whispers", since I'm not exactly sure what is "broken" about chaining a series of channels together and passing a message from one end to the other. Seems connected to me.

Third, I updated the blog with a few improvements. I shouldn't have described channels as "first class" in Rust, since that label seems to be reserved for things that are part of the core of a language. Rust channels are not built into the language in the way they are in Go. My point was that while they live in the std library rather than the core language, they are a central aspect of Rust's approach to concurrency. They are just as readily at hand out of the box as Go's channels, unlike in a language like Java where you have to construct them yourself from other concurrency constructs, such as a SynchronousQueue or a TransferQueue.


/* ---[ The interesting follow up: segmented stacks ]--- */

Patrick Walton profiled the Rust code and found that most of the time was spent allocating stacks. Go currently (I'm using version 1.1.2) uses something called "segmented stacks" rather than larger contiguous stacks. In the segmented stack model, you create small stacks for a thread/task/routine and when it needs more space you create another segment and keep track of them in some data structure, such as a linked list.

Since most of the work of the "Telephonic Whispers" example is spent creating a chain of 100,001 channels, it makes sense that systems that build larger stacks for each routine/task will take longer.

An advantage of a segmented stack model is that it allows you to quickly spawn up tasks and if they never need a lot of stack space, less memory is wasted. If the task needs more later, then it gets more on demand.

Rust used to have segmented stacks, but they were removed late in 2013, as announced here by Brian Anderson: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html. The Rust team found that segmented stacks, despite their apparent memory-compactness advantage, caused performance problems and had compatibility issues when integrating with foreign code (typically C) and with LLVM libc optimizations.

And now, as pointed out in the "Chinese Whispers" community discussions, it looks like Go will be moving away from segmented stacks too. The primary issue they cite is what Brian Anderson above calls "stack thrashing": when you have heavy looping and that code is continually hitting a stack boundary, causing stack allocations to happen very frequently, thus significantly affecting performance.


/* ---[ Valid consideration: Number of native threads ]--- */

Another important point made in the "whispers" community discussion was about the underlying threading defaults: in Go version 1, only a single native OS thread is created even if you spawn multiple goroutines. To change that you have to manually set it via the runtime.GOMAXPROCS(int) function. So in the Pike example, only one thread is being used.

Rust v0.9 uses multiple threads by default and has to do synchronization between threads. So that might account for some of the slow down between the Go and Rust versions. But Walton's profiling showed that most of the time is the stack allocation, so this consideration is probably secondary.

I did a small experiment on this front by setting GOMAXPROCS to 8 (the number of CPUs on my system) via runtime.GOMAXPROCS(runtime.NumCPU()).

Then I did a small simple benchmark where I looped 50 times running the Go "whispers" code either with the default GOMAXPROCS setting or GOMAXPROCS=NumCPU (8). I ran each three times and saw consistent results. Here are two representative runs:

GOMAXPROCS = default:

$ time ./whispers
100001

real    0m5.569s
user    0m5.403s
sys     0m0.168s

GOMAXPROCS = NumCPU (8)

$ time ./whispers
100001

real    0m10.801s
user    0m20.413s
sys     0m6.007s

Using more CPUs causes the Go version to run about twice as slowly. This supports the notion that part of the timing difference between Go and Rust for this example is due to the default number of threads that are spun up.


/* ---[ Final thoughts ]--- */

To conclude, I want to state something that is obvious by now and was also discussed the community discussion: if you go to the effort of spawning up lots of tasks in Rust or goroutines in Go, design your application in such a way that they do some amount of significant work. Spawning more tasks or routines can have overhead that you may not need to pay. As strncat says in the Reddit discussion, Rust does not have true coroutines. That doesn't mean you can't use tasks that way, but you'll need to be judicious about how you use them (and how many).

One example I'd like to implement in Rust is another port of some Rob Pike Go code: lexical scanning and parsing using goroutines: https://www.youtube.com/watch?v=HxaD_trXwRE