Sunday, January 26, 2014

Telephonic Whispers in Rust, revisited


/* ---[ Hitting the big time ]--- */

My previous blog entry was a bit of a trifle: it showed an implementation of an toy CSP (Communicating Sequential Process) application in Rust, comparing it to an example implementation in Go by Rob Pike. I discovered, post-hoc, that it hit the Rust subreddit and Hacker News. There was some good discussion there and a few complaints.


/* ---[ Clarifications ]--- */

First, the complaints. Of course this example was tuned for how Go works: the example was created by Rob Pike. The point of his example, and my blog post, was that goroutines (Rust tasks) are lightweight (green threads) that are multiplexed onto native OS threads. The example is meant to show that you can spawn up tens of thousands to hundreds of thousands of goroutines (Rust tasks) pretty rapidly. That's not something you can do in Java, for example: I've tried on my Linux system and it cried uncle somewhere between 20,000 and 30,000 threads and hosed my system.

The Rust guide on Tasks says:

Rust can create hundreds of thousands of concurrent tasks
on a typical 32-bit system

So remembering the Pike example, I thought I'd try it with Rust. My primary intent was just to show that I figured out how to implement it in Rust and that it worked. When you are learning Rust, getting something to compile is an achievement worth celebrating. The Rust compiler is wonderfully strict - enforcing all its rules in order to provide deterministic, safe memory management with static analysis rather than requiring a managed runtime. (But it can provide a managed runtime with Garbage Collection if you want that model, which is one of things that makes Rust so intriguing.)

The simple timings in my earlier blog post were included only because the numbers were interestingly different. The reasons for that difference was the most interesting part of the community discussion and I certainly learned a few things I didn't know about Rust, so, despite the complaints, I'm glad I included it. There wasn't much point in doing formal extensive benchmarks, as some people suggested, because, as the other posters pointed out, the example is trifling and not something you would typically do.

Second, I called the thing "Chinese Whispers" only because that's what the esteemed Rob Pike called it and I was clearly doing a port of his work in the presentation I cited. The name makes no sense to me, but since there were charges of racism and suggestions of calling it "Broken Telephone", I'll now call it "Telephonic Whispers", since I'm not exactly sure what is "broken" about chaining a series of channels together and passing a message from one end to the other. Seems connected to me.

Third, I updated the blog with a few improvements. I shouldn't have described channels as "first class" in Rust, since that label seems to be reserved for things are a part of the core of a language. Rust channels are not built into the language in the way they are in Go. My point was that while these are in the std library and not in the core language, they are a central aspect of Rust's approach to concurrency. They are just as at hand out of the box as Go's channels are, unlike a language like Java where you have to construct them yourself from other concurrency constructs, such as a SynchronousQueue or a TransferQueue.


/* ---[ The interesting follow up: segmented stacks ]--- */

Patrick Walton profiled the Rust code and found that most of the time was spent allocating stacks. Go currently (I'm using version 1.1.2) uses something called "segmented stacks" rather than larger contiguous stacks. In the segmented stack model, you create small stacks for a thread/task/routine and when it needs more space you create another segment and keep track of them in some data structure, such as a linked list.

Since most of the work of the "Telephonic Whispers" example is spent creating a chain of 100,001 channels, it makes sense that systems that build larger stacks for each routine/task will take longer.

An advantage of a segmented stack model is that it allows you to quickly spawn up tasks and if they never need a lot of stack space, less memory is wasted. If the task needs more later, then it gets more on demand.

Rust used to have segmented stacks, but they were removed late in 2013, as announced here by Brian Anderson: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html. The Rust team found that the apparent advantages of compact memory caused performance problems and had compatibility issues with integrating with foreign code (typically C) and with LLVM libc optimizations.

And now, as pointed out in the "Chinese Whispers" community discussions, it looks like Go will be moving away from segmented stacks too. The primary issue they cite is what Brian Anderson above calls "stack thrashing": when you have heavy looping and that code is continually hitting a stack boundary, causing stack allocations to happen very frequently, thus significantly affecting performance.


/* ---[ Valid consideration: Number of native threads ]--- */

Another important point made in the "whispers" community discussion was about the underlying threading defaults: in Go version 1, only a single native OS thread is created even if you spawn multiple goroutines. To change that you have to manually set it via the runtime.GOMAXPROCS(int) function. So in the Pike example, only one thread is being used.

Rust v0.9 uses multiple threads by default and has to do synchronization between threads. So that might account for some of the slow down between the Go and Rust versions. But Walton's profiling showed that most of the time is the stack allocation, so this consideration is probably secondary.

I did a small experiment on this front by setting GOMAXPROCS to 8 (the number of CPUs on my system):

Then I did a small simple benchmark where I looped 50 times running the Go "whispers" code either with the default GOMAXPROCS setting or GOMAXPROCS=NumCPU (8). I ran each three times and saw consistent results. Here are two representative runs:

GOMAXPROCS = default:

$ time ./whispers
100001

real    0m5.569s
user    0m5.403s
sys     0m0.168s

GOMAXPROCS = NumCPU (8)

$ time ./whispers
100001

real    0m10.801s
user    0m20.413s
sys     0m6.007s

Using more CPUs causes the Go version to run about twice as slow. This would add support to the notion that this is part of the timing difference between Go and Rust for this example is due to the default number of threads that are spun up.


/* ---[ Final thoughts ]--- */

To conclude, I want to state something that is obvious by now and was also discussed the community discussion: if you go to the effort of spawning up lots of tasks in Rust or goroutines in Go, design your application in such a way that they do some amount of significant work. Spawning more tasks or routines can have overhead that you may not need to pay. As strncat says in the Reddit discussion, Rust does not have true coroutines. That doesn't mean you can't use tasks that way, but you'll need to be judicious about how you use them (and how many).

One example I'd like to implement in Rust is another port of some Rob Pike Go code: Lexical scanning and parsing use Goroutines: https://www.youtube.com/watch?v=HxaD_trXwRE

No comments:

Post a Comment