Recently, while working on my current project ogonori, a Go client for the OrientDB database, I found that I had a defect in the code that encodes and decodes integer values in the way that the OrientDB binary network protocol requires (namely zigzag encoding, followed by encoding that output as a variable length integer).
After fixing the issue, first with the encoder and then with the decoder, I decided that I should do an exhaustive test of all 64-bit integers: start with MinInt64 (-9223372036854775808), zigzag-encode it, varint-encode it, then varint-decode it and zigzag-decode it, and you should get back the number you started with. Increment by 1 and try it again, until you reach MaxInt64 (9223372036854775807).
(Note: I only have to use the Min/Max range of signed integers, since OrientDB is a Java database and only allows signed ints.)
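The round trip described above can be sketched in a few lines of Go. This is a minimal illustration, not ogonori's actual code: the function names (zigzagEncode, zigzagDecode, varintEncode, varintDecode, roundTrip) are hypothetical, and the varint functions work on plain byte slices for brevity rather than on an io.Reader/io.Writer.

```go
package main

import "fmt"

// zigzagEncode maps signed integers to unsigned ones so that values with a
// small magnitude (positive or negative) get small encodings:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
func zigzagEncode(n int64) uint64 {
	return uint64(n<<1) ^ uint64(n>>63)
}

// zigzagDecode is the inverse of zigzagEncode.
func zigzagDecode(u uint64) int64 {
	return int64(u>>1) ^ -int64(u&1)
}

// varintEncode emits 7 bits per byte, least significant group first, and
// sets the high bit on every byte except the last.
func varintEncode(v uint64) []byte {
	bs := make([]byte, 0, 10) // a uint64 needs at most 10 varint bytes
	for v >= 0x80 {
		bs = append(bs, byte(v&0x7f)|0x80)
		v >>= 7
	}
	return append(bs, byte(v))
}

// varintDecode reverses varintEncode. ok is false if bs ends before a byte
// with a clear high bit is seen.
func varintDecode(bs []byte) (v uint64, ok bool) {
	for i, b := range bs {
		v |= uint64(b&0x7f) << (7 * uint(i))
		if b&0x80 == 0 {
			return v, true
		}
	}
	return 0, false
}

// roundTrip runs all four steps and returns the value that comes back out.
func roundTrip(n int64) (int64, bool) {
	u, ok := varintDecode(varintEncode(zigzagEncode(n)))
	return zigzagDecode(u), ok
}

func main() {
	for _, n := range []int64{-9223372036854775808, -1, 0, 1, 9223372036854775807} {
		got, ok := roundTrip(n)
		fmt.Println(n, got, ok && got == n)
	}
}
```

The exhaustive test is then just this check applied to every int64 in turn.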
I ran a small range of the possible 64-bit integer space and found that doing this exhaustive test was going to take a very long time. Since I have 8 CPUs on my system, I decided to first parallelize the test into 8 separate goroutines, each taking 1/8 of the total range:
With this code, I spawn 8 threads, running 10 goroutines. Eight of them do the encoding/decoding test, and if any integer fails the round trip it is written to a "failure channel" of type chan string, which the main goroutine monitors.
A sync.WaitGroup (a counting semaphore) is created and shared among the goroutines. When each "range tester" finishes, it calls Done() on the WaitGroup to decrement the semaphore. The final (nameless) goroutine waits until all "range tester" goroutines have finished and then closes the single shared failure channel.
Closing the failure channel causes the loop over that channel in the main goroutine to exit, and the whole program finishes.
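The goroutine layout just described can be sketched as follows. This is an assumption-laden reconstruction of the pattern, not the original listing: runRangeTesters and the check callback are hypothetical names, the testrange fields are guessed, and the ranges are treated as inclusive.

```go
package main

import (
	"fmt"
	"sync"
)

// testrange is an inclusive range of integers; start must be <= end.
type testrange struct{ start, end int64 }

// runRangeTesters starts one goroutine per range, calls check on every value
// and returns a message for each value where check returned false.
func runRangeTesters(ranges []testrange, check func(int64) bool) []string {
	failchan := make(chan string)
	var wg sync.WaitGroup
	wg.Add(len(ranges))

	for _, r := range ranges {
		go func(r testrange) { // one "range tester"
			defer wg.Done()
			for n := r.start; ; n++ {
				if !check(n) {
					failchan <- fmt.Sprintf("round trip failed for %d", n)
				}
				if n == r.end { // test before incrementing so end == MaxInt64 cannot overflow
					return
				}
			}
		}(r)
	}

	// The nameless goroutine: close the channel once every tester is done.
	go func() {
		wg.Wait()
		close(failchan)
	}()

	// The calling goroutine drains the channel until it is closed.
	var failures []string
	for msg := range failchan {
		failures = append(failures, msg)
	}
	return failures
}

func main() {
	ranges := []testrange{{-1000, 1000}, {100000001, 100100000}}
	alwaysOK := func(n int64) bool { return true } // stand-in for the encode/decode check
	fmt.Println("failures:", len(runRangeTesters(ranges, alwaysOK)))
}
```

Note that Go releases before 1.5 run goroutines on a single thread unless runtime.GOMAXPROCS (or the GOMAXPROCS environment variable) is raised; later releases default to the number of CPUs.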
/* ---[ Performance Baseline ]--- */
I fired this up with the following smaller testrange:
ranges := []testrange{
{100000001, 150000001},
{200000001, 250000001},
{300000001, 350000000},
{400000001, 450000000},
{500000001, 550000000},
{600000001, 650000000},
{700000001, 750000000},
{800000001, 850000000},
}
and ran top. To my surprise I was only using ~400% CPU, rather than ~800% (the max my system supports):
$ top -d1
PID USER PR ... S %CPU %MEM TIME+ COMMAND
1736 midpete+ 20 ... S 420.9 0.0 1:31.33 ogonori
I then looked at the CPU usage of each thread using the -H option to top and saw that my 8 range-tester goroutines were each using only about 50% CPU, and that there was a 9th thread that was also consistently using 40 to 50% CPU. My guess was that this was a GC thread.
$ top -d1 -H
PID USER PR ... S %CPU %MEM TIME+ COMMAND
1740 midpete+ 20 ... S 50.1 0.0 0:21.47 ogonori
1744 midpete+ 20 ... R 50.1 0.0 0:21.52 ogonori
1742 midpete+ 20 ... S 49.2 0.0 0:21.38 ogonori
1736 midpete+ 20 ... S 47.2 0.0 0:21.53 ogonori
1738 midpete+ 20 ... S 46.2 0.0 0:22.11 ogonori
1745 midpete+ 20 ... R 46.2 0.0 0:20.37 ogonori
1741 midpete+ 20 ... S 45.2 0.0 0:21.41 ogonori
1743 midpete+ 20 ... R 42.3 0.0 0:21.26 ogonori
1739 midpete+ 20 ... S 40.3 0.0 0:21.35 ogonori
1737 midpete+ 20 ... S 3.9 0.0 0:02.07 ogonori
So I have an algorithm that should be trivially parallelizable with no shared memory and no contention (in theory), but it was only using half the CPU available to it. Hmmm...
Next I ran the test on my system several times to get a baseline performance metric:
$ time ./ogonori -z # the -z switch tells the ogonori code to run only this
# benchmark rather than the usual OrientDB tests
Run1: real 3m44.602s
Run2: real 3m42.818s
Run3: real 3m28.917s
Avg ± StdErr: 218.8 ± 5 sec
Then I remembered I had not turned off the CPU power-saving throttling on my Linux system (it was set to ondemand), so I ran the following script and repeated the benchmarks:
#!/bin/bash
for i in /sys/devices/system/cpu/cpu[0-7]
do
echo performance > $i/cpufreq/scaling_governor
done
$ time ./ogonori -z
Run1: real 2m12.605s
Run2: real 2m12.382s
Run3: real 2m13.172s
Run4: real 2m18.992s
Run5: real 2m17.538s
Run6: real 2m14.437s
Avg ± StdErr: 134.9 ± 1 sec
Wow, OK. So that alone gave me about a 60% improvement in throughput. Off to a good start.
/* ---[ Profiling the Code ]--- */
If you've never read Russ Cox's 2011 blog post on profiling a Go program, put it on your list - it is a treat to read.
Using what I learned there, I profiled the zigzagExhaustiveTest code to see how and where to improve it.
$ ./ogonori -z -cpuprofile=varint0.prof
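A -cpuprofile switch like the one above is not built into ordinary Go binaries; the program has to define the flag and start the profiler itself. A minimal sketch using the standard runtime/pprof package is shown here. The helper name withCPUProfile and the placeholder workload are assumptions for illustration, not ogonori's code.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
)

// withCPUProfile runs work while writing a CPU profile to path.
func withCPUProfile(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile() // runs before f.Close and flushes the profile
	work()
	return nil
}

func main() {
	cpuprofile := flag.String("cpuprofile", "", "write a CPU profile to this file")
	flag.Parse()

	work := func() { // placeholder for the real benchmark
		sum := 0
		for i := 0; i < 100000000; i++ {
			sum += i
		}
		fmt.Println(sum)
	}

	if *cpuprofile == "" {
		work()
		return
	}
	if err := withCPUProfile(*cpuprofile, work); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The resulting file is what gets handed to go tool pprof together with the binary.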
I then opened the .prof file with golang's pprof tool and looked at the top 10 most heavily used functions:
$ rlwrap go tool pprof ogonori varint0.prof
# Using rlwrap gives you bash-like behavior and history
(pprof) top 10
171.48s of 255.92s total (67.01%)
Dropped 171 nodes (cum <= 1.28s)
Showing top 10 nodes out of 36 (cum >= 8.78s)
flat flat% sum% cum cum%
45.98s 17.97% 17.97% 45.98s 17.97% scanblock
25.63s 10.01% 27.98% 33.58s 13.12% runtime.mallocgc
19.20s 7.50% 35.48% 111.35s 43.51% g/q/o/o/b/varint.ReadVarIntToUint
14.94s 5.84% 41.32% 15.62s 6.10% bytes.(*Buffer).grow
12.44s 4.86% 46.18% 12.44s 4.86% runtime.MSpan_Sweep
11.87s 4.64% 50.82% 15.93s 6.22% bytes.(*Buffer).Read
11.33s 4.43% 55.25% 21.56s 8.42% bytes.(*Buffer).WriteByte
11.18s 4.37% 59.62% 11.18s 4.37% runtime.futex
10.13s 3.96% 63.57% 19.16s 7.49% bytes.(*Buffer).Write
8.78s 3.43% 67.01% 8.78s 3.43% runtime.memmove
(pprof) top10 -cum
110.32s of 255.92s total (43.11%)
Dropped 171 nodes (cum <= 1.28s)
Showing top 10 nodes out of 36 (cum >= 25.50s)
flat flat% sum% cum cum%
0 0% 0% 147.62s 57.68% runtime.goexit
2.94s 1.15% 1.15% 147.49s 57.63% main.func·018
19.20s 7.50% 8.65% 111.35s 43.51% g/q/o/o/b/varint.ReadVarIntToUint
0 0% 8.65% 77.81s 30.40% GC
45.98s 17.97% 26.62% 45.98s 17.97% scanblock
4.90s 1.91% 28.53% 38.48s 15.04% runtime.newobject
25.63s 10.01% 38.55% 33.58s 13.12% runtime.mallocgc
6.65s 2.60% 41.15% 31.39s 12.27% g/q/o/o/b/varint.VarintEncode
0 0% 41.15% 30.48s 11.91% System
5.02s 1.96% 43.11% 25.50s 9.96% encoding/binary.Read
We can see that a significant percentage of time (>30%) is being spent in GC, so the program is generating a lot of garbage somewhere - plus the cost of generating new heap data, which the runtime.mallocgc figure tells me is at least 13% of the program run time.
Remember that there are four steps to my algorithm:
- zigzag encode (varint.ZigzagEncodeUInt64)
- varint encode (varint.VarintEncode)
- varint decode (varint.ReadVarIntToUint)
- zigzag decode (varint.ZigzagDecodeInt64)
The zigzag encode/decode steps are simple bit manipulations, so they are fast. Typing web at the pprof prompt launches an SVG graph of where time was spent. The zigzag functions don't even show up - they were dropped off as being too small (not shown here).
So I needed to focus on steps 2 and 3, which take (cumulatively) 12.3% and 43.5%, respectively.
Since varint.ReadVarIntToUint is the biggest offender, let's look at it in detail in the pprof tool:
I've marked the biggest time sinks with an arrow on the left side. Generally one should start with the biggest bottleneck, so let's rank these by cumulative time (2nd col):
-> 32.41s 111: err = binary.Read(&buf, binary.LittleEndian, &u)
-> 16.83s 73: n, err = r.Read(ba[:])
-> 15.93s 106: buf.WriteByte(y | z)
-> 14.82s 88: var buf bytes.Buffer
-> 8.53s 110: padTo8Bytes(&buf)
It is very interesting how expensive creating a bytes.Buffer is, but first we need to deal with binary.Read.
Because I'm only ever passing in uint64's, the only real functionality I'm using in this function is:
*data = order.Uint64(bs)
/* ---[ Optimization #1 ]--- */
But it's even worse. If you look back at varint.ReadVarIntToUint you'll see that I'm creating a bytes.Buffer and copying bytes into it only so that I can pass that Buffer (as an io.Reader) into the binary.Read function:
err = binary.Read(buf, binary.LittleEndian, &u)
which then immediately copies all those bytes back out of the buffer:
if _, err := io.ReadFull(r, bs); err != nil {
return err
}
So this is nothing but wasteful data copying and the heap allocations for it.
binary.Read also does a type switch where a good percentage of time is spent:
2.01s 4.49s 151: switch data := data.(type) {
and, as stated, the only useful method ever called in it is:
--> 460ms 2.76s 167: *data = order.Uint64(bs)
So I should try just calling binary.LittleEndian.Uint64(bs) directly.
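That the two calls produce the same value for an 8-byte little-endian input can be checked with a small standalone program. The function names viaRead and direct are made up for this sketch, which only illustrates the equivalence, not the profiled code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// viaRead decodes 8 little-endian bytes the roundabout way: wrap them in a
// reader and let binary.Read copy them out and type-switch on the destination.
func viaRead(bs []byte) (uint64, error) {
	var u uint64
	err := binary.Read(bytes.NewReader(bs), binary.LittleEndian, &u)
	return u, err
}

// direct decodes the same bytes with no reader, no extra copy and no type
// switch. bs must hold at least 8 bytes.
func direct(bs []byte) uint64 {
	return binary.LittleEndian.Uint64(bs)
}

func main() {
	bs := []byte{0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}
	u, err := viaRead(bs)
	fmt.Printf("%#x %v\n", u, err)
	fmt.Printf("%#x\n", direct(bs))
}
```

The direct call skips the reader, the intermediate copy and the interface{} type switch, which is where the profile above says the time was going.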
Here's the revised varint.ReadVarIntToUint function (with everything inlined for easier reading and profiling analysis):
This change also removes the padTo8Bytes method that wrote one byte at a time to the bytes.Buffer and took >3% of program time itself.
Now let's rerun the benchmarks:
Run 1: real 0m27.182s
Run 2: real 0m27.053s
Run 3: real 0m28.200s
Run 4: real 0m25.762s
Run 5: real 0m26.031s
Run 6: real 0m26.813s
Avg ± StdErr: 26.8 ± 0.4 sec
Outstanding! Throughput increased 5x (134.9/26.8). And using top, I see that the goroutines are consuming nearly all available CPU:
$ top -d1
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12983 midpete+ 20 0 352496 5768 2736 R 763.7 0.0 1:35.64 ogonori
$ top -d1 -H
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13231 midpete+ 20 0 286960 5772 2744 R 97.5 0.0 0:22.51 ogonori
13225 midpete+ 20 0 286960 5772 2744 R 91.7 0.0 0:22.47 ogonori
13227 midpete+ 20 0 286960 5772 2744 R 90.7 0.0 0:23.09 ogonori
13232 midpete+ 20 0 286960 5772 2744 S 90.7 0.0 0:22.26 ogonori
13235 midpete+ 20 0 286960 5772 2744 R 90.7 0.0 0:09.72 ogonori
13230 midpete+ 20 0 286960 5772 2744 R 88.7 0.0 0:22.14 ogonori
13233 midpete+ 20 0 286960 5772 2744 R 73.1 0.0 0:22.70 ogonori
13228 midpete+ 20 0 286960 5772 2744 R 71.2 0.0 0:22.39 ogonori
13229 midpete+ 20 0 286960 5772 2744 R 70.2 0.0 0:23.09 ogonori
I also used pprof to profile this run, so let's compare the cumulative top10 before and after:
Before (reprinted from above):
(pprof) top10 -cum
110.32s of 255.92s total (43.11%)
Dropped 171 nodes (cum <= 1.28s)
Showing top 10 nodes out of 36 (cum >= 25.50s)
flat flat% sum% cum cum%
0 0% 0% 147.62s 57.68% runtime.goexit
2.94s 1.15% 1.15% 147.49s 57.63% main.func·018
19.20s 7.50% 8.65% 111.35s 43.51% g/q/o/o/b/varint.ReadVarIntToUint
0 0% 8.65% 77.81s 30.40% GC
45.98s 17.97% 26.62% 45.98s 17.97% scanblock
4.90s 1.91% 28.53% 38.48s 15.04% runtime.newobject
25.63s 10.01% 38.55% 33.58s 13.12% runtime.mallocgc
6.65s 2.60% 41.15% 31.39s 12.27% g/q/o/o/b/varint.VarintEncode
0 0% 41.15% 30.48s 11.91% System
5.02s 1.96% 43.11% 25.50s 9.96% encoding/binary.Read
After:
(pprof) top15 -cum
63680ms of 65970ms total (96.53%)
Dropped 33 nodes (cum <= 329.85ms)
Showing top 15 nodes out of 18 (cum >= 930ms)
flat flat% sum% cum cum%
2280ms 3.46% 3.46% 64470ms 97.73% main.func·018
0 0% 3.46% 64470ms 97.73% runtime.goexit
17760ms 26.92% 30.38% 34190ms 51.83% g/q/o/o/b/varint.ReadVarIntToUint
5890ms 8.93% 39.31% 26370ms 39.97% g/q/o/o/b/varint.VarintEncode
8550ms 12.96% 52.27% 16360ms 24.80% bytes.(*Buffer).Write
9080ms 13.76% 66.03% 11500ms 17.43% bytes.(*Buffer).Read
1460ms 2.21% 68.24% 7550ms 11.44% runtime.newobject
4370ms 6.62% 74.87% 6090ms 9.23% runtime.mallocgc
5650ms 8.56% 83.43% 5650ms 8.56% runtime.memmove
4580ms 6.94% 90.37% 4580ms 6.94% bytes.(*Buffer).grow
680ms 1.03% 91.41% 1630ms 2.47% bytes.(*Buffer).Reset
1500ms 2.27% 93.68% 1500ms 2.27% encoding/binary.littleEndian.Uint64
0 0% 93.68% 1030ms 1.56% GC
950ms 1.44% 95.12% 950ms 1.44% bytes.(*Buffer).Truncate
930ms 1.41% 96.53% 930ms 1.41% runtime.gomcache
More good news. In the previous version, GC was taking 30% of the total CPU time. Now more than 90% of the time is being spent in the two main workhorse methods: varint.ReadVarIntToUint and varint.VarintEncode. GC time has been reduced to 1.5%!
I suspect the reason that goroutines in the earlier code version only took 40-50% of a CPU is that GC was the contention point. Garbage collection in the Go runtime I'm using is a stop-the-world affair, so all other threads are paused until it finishes. With GC reduced to only 1.5%, the range-testing goroutines can spend far more time running - approaching 100%.
/* ---[ Optimization #2 ]--- */
Are there further improvements we can make? Since the program now spends 40% of its time in varint.VarintEncode, let's look at that function in detail:
Almost 75% of the time in this function is spent writing to the io.Writer (a bytes.Buffer). We write one byte at a time to it. Perhaps it would be better to write it all to a byte slice first and then issue one w.Write.
The new code is then:
And the next round of benchmarks is:
real 0m38.899s
real 0m45.135s
real 0m38.047s
real 0m42.377s
real 0m32.894s
real 0m37.962s
real 0m38.926s
real 0m37.870s
Avg ± StdErr: 39.0 ± 1.2 sec
Hmm, not good. It looks like this second revision caused my code to go backwards in performance by 30%. To be sure, I reverted the change and re-ran the benchmarks with only optimization #1 again: they returned to the ~25s/run timeframe I saw before. So it is true: this second change made things worse.
And the analysis of top agreed: the goroutines were no longer using 90%+ CPU:
$ top -d1
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22149 midpete+ 20 0 286960 5776 2744 R 593.9 0.0 1:06.66 ogonori
$ top -d1 -H
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22205 midpete+ 20 0 229620 7812 2744 R 74.6 0.0 0:10.68 ogonori
22201 midpete+ 20 0 229620 7812 2744 R 73.6 0.0 0:10.14 ogonori
22202 midpete+ 20 0 229620 7812 2744 S 71.6 0.0 0:10.77 ogonori
22207 midpete+ 20 0 229620 7812 2744 S 70.6 0.0 0:10.97 ogonori
22206 midpete+ 20 0 229620 7812 2744 R 68.7 0.0 0:11.14 ogonori
22204 midpete+ 20 0 229620 7812 2744 R 65.7 0.0 0:10.07 ogonori
22199 midpete+ 20 0 229620 7812 2744 R 56.9 0.0 0:10.98 ogonori
22203 midpete+ 20 0 229620 7812 2744 R 53.0 0.0 0:11.19 ogonori
22197 midpete+ 20 0 229620 7812 2744 R 43.2 0.0 0:11.17 ogonori
22200 midpete+ 20 0 229620 7812 2744 S 17.7 0.0 0:09.95 ogonori
22198 midpete+ 20 0 229620 7812 2744 S 3.9 0.0 0:00.77 ogonori
Let's look at the pprof data for revision #2:
(pprof) top10 -cum
59.31s of 86.44s total (68.61%)
Dropped 92 nodes (cum <= 0.43s)
Showing top 10 nodes out of 24 (cum >= 5.36s)
flat flat% sum% cum cum%
1.95s 2.26% 2.26% 74.02s 85.63% main.func·018
0 0% 2.26% 74.02s 85.63% runtime.goexit
20.71s 23.96% 26.21% 44.80s 51.83% g/q/o/o/b/varint.ReadVarIntToUint
5.57s 6.44% 32.66% 24.39s 28.22% g/q/o/o/b/varint.VarintEncode
15s 17.35% 50.01% 18.71s 21.65% bytes.(*Buffer).Read
5.64s 6.52% 56.54% 13.46s 15.57% runtime.makeslice
0 0% 56.54% 8.39s 9.71% GC
5.86s 6.78% 63.32% 8.23s 9.52% runtime.mallocgc
2.48s 2.87% 66.18% 7.82s 9.05% runtime.newarray
2.10s 2.43% 68.61% 5.36s 6.20% bytes.(*Buffer).Write
Now GC is back up to nearly 10% of the total running time. So let's look at the profile of the VarintEncode function we changed:
(pprof) list VarintEncode
Total: 1.44mins
5.57s 24.39s (flat, cum) 28.22% of Total
. . 40://
290ms 290ms 41:func VarintEncode(w io.Writer, v uint64) error {
550ms 14.01s 42: bs := make([]byte, 0, 10)
170ms 170ms 43: for (v & 0xffffffffffffff80) != 0 {
2.04s 2.04s 44: bs = append(bs, byte((v&0x7f)|0x80))
320ms 320ms 45: v >>= 7
. . 46: }
680ms 680ms 47: bs = append(bs, byte(v&0x7f))
. . 48:
1.20s 6.56s 49: n, err := w.Write(bs)
120ms 120ms 50: if err != nil {
. . 51: return oerror.NewTrace(err)
. . 52: }
. . 53: if n != len(bs) {
. . 54: return fmt.Errorf("Incorrect number of bytes written. Expected %d. Actual %d", len(bs), n)
. . 55: }
200ms 200ms 56: return nil
. . 57:}
We can see that 58% of the time of this method is spent allocating new memory (the []byte slice on line 42), thereby causing GC to take longer. Here's why: if you look at the implementation of bytes.Buffer, you'll see that it has a fixed bootstrap array it allocates to handle small buffers and another fixed byte array (runeBytes) to handle calls to WriteByte; both of these allow it to avoid memory allocation for small operations.
Since my test code is reusing the same bytes.Buffer for each iteration, no new allocations were occurring during each call to varint.VarintEncode. But with this second revision I'm creating a new byte slice of capacity 10 in each round. So this change should be reverted.
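A per-call allocation like this can also be spotted without a full profile by counting allocations directly. The sketch below uses testing.AllocsPerRun from the standard library. encodeFresh imitates the second revision, while encodeReuse (a caller-owned scratch slice) is a hypothetical alternative that was not part of the original experiment. The exact counts depend on the compiler's escape analysis, so treat the numbers it prints as indicative only.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"testing"
)

// appendVarint appends the varint encoding of v to bs.
func appendVarint(bs []byte, v uint64) []byte {
	for v >= 0x80 {
		bs = append(bs, byte(v&0x7f)|0x80)
		v >>= 7
	}
	return append(bs, byte(v))
}

// encodeFresh imitates the second revision: a new slice on every call.
//
//go:noinline
func encodeFresh(w io.Writer, v uint64) error {
	_, err := w.Write(appendVarint(make([]byte, 0, 10), v))
	return err
}

// encodeReuse is a hypothetical variant in which the caller owns a scratch
// slice (capacity >= 10) that is reused across calls.
//
//go:noinline
func encodeReuse(w io.Writer, scratch []byte, v uint64) error {
	_, err := w.Write(appendVarint(scratch[:0], v))
	return err
}

// encodedBytes returns the output of both variants for the same input.
func encodedBytes(v uint64) (fresh, reused []byte) {
	var a, b bytes.Buffer
	encodeFresh(&a, v)
	encodeReuse(&b, make([]byte, 0, 10), v)
	return a.Bytes(), b.Bytes()
}

// allocsPerCall reports the average heap allocations per call of each variant
// when writing into a single reused bytes.Buffer.
func allocsPerCall() (fresh, reused float64) {
	var buf bytes.Buffer
	scratch := make([]byte, 0, 10)
	fresh = testing.AllocsPerRun(1000, func() {
		buf.Reset()
		encodeFresh(&buf, 1<<40)
	})
	reused = testing.AllocsPerRun(1000, func() {
		buf.Reset()
		encodeReuse(&buf, scratch, 1<<40)
	})
	return fresh, reused
}

func main() {
	fresh, reused := allocsPerCall()
	fmt.Println("allocs/call with a fresh slice: ", fresh)
	fmt.Println("allocs/call with a reused slice:", reused)
}
```

Because the slice is handed to an io.Writer through an interface call, the compiler generally cannot keep it on the stack, which is consistent with the makeslice time in the profile above.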
/* ---[ Lessons Learned ]--- */
When you have an algorithm that you think should be CPU bound and your threads are not using ~100% CPU, then you have contention somewhere. In many scenarios that will be IO wait. But if you have no IO in that portion of your app, then you either have hidden thread contention (mutexes) and/or you may have a lot of garbage collection happening, which pauses all your worker threads/goroutines while GC is happening. Use the pprof tool to determine where time is being spent.
For performance sensitive algorithms, you will want to be garbage free in the main path as much as possible.
Once you know where the time is going, you should generally go after the largest bottleneck first. There's always a primary bottleneck somewhere. Removing the bottleneck in one place causes it to move to another. In my case, I wanted that bottleneck to just be CPU speed (or as is often the case, the time to get data from main memory or a CPU cache into a register).
A big lesson learned here is to be wary of convenience methods in Go's standard library. Many are provided for convenience, not performance. The binary.Read(buf, binary.LittleEndian, &u) call in my case is one such example. The third parameter to binary.Read is of type interface{}, so a type switch has to be done to detect the type. If your code is only ever passing in one type (uint64 in my case), then go read the stdlib code and figure out if there is a more direct method to call. That change contributed to a 5x throughput improvement in my case!
Next, be careful of too much data copying. While io.Reader is a nice interface, if you are working with byte slices and want to pass them to some stdlib method that requires an io.Reader, you will often copy the data into a bytes.Buffer and pass that in. If the function you call copies those bytes back out to yet another byte slice, then garbage is being generated and time is being wasted. So be aware of what's happening in the methods you call.
Finally, always measure carefully before and after any attempted optimizations. Intuition about where bottlenecks are and what will speed things up is often wrong. The only thing of value is to measure objectively. To end I'll quote "Commander" Pike:
Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is. --Rob Pike's 5 Rules of Programming
/* ---[ Misc Appendix Notes ]--- */
The overall int64 space will still take too long to run even with these improvements, so I've settled for sampling from the state space instead.
All benchmark comparisons done were statistically significant (p<0.01) using Student's t-test, as analyzed with this tool: http://studentsttest.com. The mean and standard errors were also calculated here.
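For reference, the "Avg ± StdErr" figures can be reproduced with a few lines of Go. This sketch assumes the standard definition (sample standard deviation divided by the square root of the number of runs); meanStdErr is a hypothetical helper name, and the inputs are the three baseline run times from above converted to seconds.

```go
package main

import (
	"fmt"
	"math"
)

// meanStdErr returns the sample mean and the standard error of the mean
// (sample standard deviation divided by sqrt(n)). It needs at least two samples.
func meanStdErr(xs []float64) (mean, stderr float64) {
	n := float64(len(xs))
	for _, x := range xs {
		mean += x
	}
	mean /= n

	var ss float64 // sum of squared deviations from the mean
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	sd := math.Sqrt(ss / (n - 1))
	return mean, sd / math.Sqrt(n)
}

func main() {
	// 3m44.602s, 3m42.818s and 3m28.917s expressed in seconds.
	baseline := []float64{224.602, 222.818, 208.917}
	m, se := meanStdErr(baseline)
	fmt.Printf("%.1f ± %.1f sec\n", m, se)
}
```

For the baseline runs this agrees with the 218.8 ± 5 sec reported earlier.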
I notice that even with my best optimization (#1), there is still a ninth thread using >70% CPU. I used kill -QUIT on the program to get a stack dump of all the goroutines. I get 10 goroutines: the 8 doing the fnRangeTester work, one waiting on the WaitGroup, and the main goroutine, which is waiting on the range failchan line. So I'm not sure what that 9th thread is doing churning up 50-70% CPU. Anyone know how to tell?
[Update - 08-July-2015]
In the comments, Carlos Torres asked for the pprof line-by-line output of the ReadVarIntToUint function after the first optimization. I did two profiling runs and compared the pprof outputs and they were both nearly identical. Here is one of them:
(pprof) list ReadVarIntToUint
Total: 1.13mins
ROUTINE ======================== g/q/o/o/b/varint.ReadVarIntToUint
18.20s 35.45s (flat, cum) 52.08% of Total
. . 25://
480ms 480ms 26:func ReadVarIntToUint(r io.Reader) (uint64, error) {
. . 27: var (
270ms 270ms 28: varbs []byte
120ms 3.84s 29: ba [1]byte
. . 30: u uint64
. . 31: n int
180ms 180ms 32: err error
. . 33: )
. . 34:
260ms 260ms 35: varbs = make([]byte, 0, 10)
. . 36:
. . 37: /* ---[ read in all varint bytes ]--- */
. . 38: for {
3.84s 15.72s 39: n, err = r.Read(ba[:])
530ms 530ms 40: if err != nil {
. . 41: return 0, oerror.NewTrace(err)
. . 42: }
10ms 10ms 43: if n != 1 {
. . 44: return 0, oerror.IncorrectNetworkRead{Expected: 1, Actual: n}
. . 45: }
3.21s 3.21s 46: varbs = append(varbs, ba[0])
980ms 980ms 47: if IsFinalVarIntByte(ba[0]) {
570ms 570ms 48: varbs = append(varbs, byte(0x0))
. . 49: break
. . 50: }
. . 51: }
. . 52:
. . 53: /* ---[ decode ]--- */
. . 54:
. . 55: var right, left uint
. . 56:
620ms 620ms 57: finalbs := make([]byte, 8)
. . 58:
. . 59: idx := 0
1.08s 1.08s 60: for i := 0; i < len(varbs)-1; i++ {
360ms 360ms 61: right = uint(i) % 8
20ms 20ms 62: left = 7 - right
230ms 230ms 63: if i == 7 {
. . 64: continue
. . 65: }
840ms 840ms 66: vbcurr := varbs[i]
900ms 900ms 67: vbnext := varbs[i+1]
. . 68:
120ms 120ms 69: x := vbcurr & byte(0x7f)
670ms 670ms 70: y := x >> right
670ms 670ms 71: z := vbnext << left
650ms 650ms 72: finalbs[idx] = y | z
780ms 780ms 73: idx++
. . 74: }
. . 75:
540ms 2.19s 76: u = binary.LittleEndian.Uint64(finalbs)
270ms 270ms 77: return u, nil
. . 78:}
If you compare it to the pprof output before the optimization, the top half looks about the same, but the bottom half is dramatically different. For example, more than 30s was spent in binary.Read(&buf, binary.LittleEndian, &u) in the original version. The replacement code, binary.LittleEndian.Uint64(finalbs), only takes up about 2 seconds of processing time.
The only remaining spot I see for any further optimization is the 15s (out of 35s) spent in r.Read(ba[:]). The problem, however, is that with a varint you don't know how many bytes long it is in advance, so you have to read and examine them one at a time. There is probably a way to optimize this, but I haven't attempted it yet.
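One direction that has not been measured here is to read through the io.ByteReader interface instead of io.Reader, so that each byte is fetched with ReadByte rather than by filling a one-element slice. The standard library's binary.ReadVarint does this and also applies the zigzag decoding. The sketch below assumes the wire format matches the protocol-buffers style varint that encoding/binary implements; encodeAll and decodeAll are hypothetical helpers.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// encodeAll writes each value as a zigzag varint into a new buffer.
func encodeAll(vals []int64) *bytes.Buffer {
	var buf bytes.Buffer
	var tmp [binary.MaxVarintLen64]byte
	for _, v := range vals {
		n := binary.PutVarint(tmp[:], v) // zigzag + varint in one step
		buf.Write(tmp[:n])
	}
	return &buf
}

// decodeAll reads zigzag varints until the reader is exhausted.
// binary.ReadVarint calls r.ReadByte once per encoded byte.
func decodeAll(r io.ByteReader) ([]int64, error) {
	var out []int64
	for {
		v, err := binary.ReadVarint(r)
		if err == io.EOF {
			return out, nil // clean end of input
		}
		if err != nil {
			return out, err
		}
		out = append(out, v)
	}
}

func main() {
	in := []int64{-9223372036854775808, -1, 0, 300, 9223372036854775807}
	out, err := decodeAll(encodeAll(in))
	fmt.Println(out, err)
}
```

A *bytes.Buffer already satisfies io.ByteReader, so the test harness would not need to change; a raw network connection would need a buffered wrapper, which has its own trade-offs.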
hi there, nice writeup.
If you can take stack dumps of the threads frequently, the 9th thread's responsibility will be known soon.
If you set GODEBUG you can get regular state dumps from the scheduler to see what goroutines are running
After the improvement was made on varint.ReadVarIntToUint, how did its detailed profile look in pprof?
I've posted it in an update. Thanks for your interest!
Thanks for the update.
There are 2 things I find interesting here.
1) I find surprising that it is so expensive (3.84s) to allocate a byte array of len 1;
120ms 3.84s 29: ba [1]byte
and
2) On line:
3.84s 15.72s 39: n, err = r.Read(ba[:])
I wonder how much time is attributed to just slicing the array, separate from the r.Read() call.
> I find surprising that it is so expensive (3.84s) to allocate a byte array of len 1;
Agreed. I'd like to understand this better. For example, is the backing array being created on the heap or the stack? If the heap, why? One of my performance frustrations with the io.Reader and io.Writer interfaces is that you always have to work with slices, never a single byte. And bytes.Buffer is a concrete type, not an interface.
humbled and Andrew -
Thanks for the suggestions!
I've made a first attempt at understanding what the "ninth" thread is doing using stack dumps and setting the GODEBUG env var. The multiplexing of goroutines onto threads is not something I'm used to from my JVM experience, so I'm still trying to get a handle on how to read the tea leaves. I'll post something if I make headway on it.
Have you tried using the builtin functions?
https://golang.org/src/encoding/binary/varint.go
Oh, man. It's amazing the gems that are present in the Go std library. Honestly, I never even thought to look for this there. But since the encoding is taken from protocol buffers, which is from Google and Go is from Google, it makes sense. Even so, I'm not sorry I did it on my own, just to have gotten back to doing some low-level bit fiddling, not something I get to do much at my day job :-).
And I can now do a perf comparison between my version and theirs. I'll put it on my TODO list.
Thanks for the heads-up!
Very detailed article, I am going to start learning Go soon. bookmarking it. Thanks!! Paul
ReplyDeleteJadi, pastinya hal tersebut akan membuat para pemain bermain sampai lupa waktu, namun bermain sampai lupa waktu merupakan suatu hal yang tidak baik. Karena jika para pemain terus bermain seperti itu, maka para pemain akan mengalami kekalahan dalam taruhan slot ini.
ReplyDeleteasikqq
http://dewaqqq.club/
http://sumoqq.today/
interqq
pionpoker
bandar ceme terbaik
betgratis
paito warna terlengkap
forum prediksi
Nice posting information I liked it
ReplyDeleteSanjary Kids is one of the best play school and preschool in Hyderabad,India. Give your child the best preschool experience by choosing the best playschool of Hyderabad in Abids. we provide programs like Play group,Nursery,Junior KG,Senior KG,and provides Teacher Training Program.
Preschool teacher training course in hyderabad
I believe there are many who feel the same satisfaction as I read this article!
ReplyDeleteHere you are:
I hope you will continue to have such articles to share with everyone!
คาสิโนออนไลน์
casino games
คา สิ โน ออนไลน์ ฟรี
baccarat
This is a very interesting reading value in the newspaper. In fact, I'm grateful to have given the chance to browse an informative article like this! In fact, I appreciate this article, thank you for sharing this information.
ReplyDeleteDedicatedHosting4u.com
Nice blog information provided by the author
ReplyDeleteSanjary Academy is the best Piping Design institute in Hyderabad, Telangana. It is the best Piping design Course in India and we have offer professional Engineering Courses like Piping design Course, QA/QC Course, document controller course, Pressure Vessel Design Course, Welding Inspector Course, Quality Management Course and Safety Officer Course.
Piping Design Course
Piping Design Course in Hyderabad
Piping Design Course in India
nice post
ReplyDeleteYaaron Studios is one of the rapidly growing editing studios in Hyderabad. We are the best Video Editing services in Hyderabad. We provides best graphic works like logo reveals, corporate presentation Etc. And also we gives the best Outdoor/Indoor shoots and Ad Making services.
Best video editing services in Hyderabad,ameerpet
Best Graphic Designing services in Hyderabad,ameerpet
Best Ad Making services in Hyderabad,ameerpet
Good Information
ReplyDelete"Pressure Vessel Design Course is one of the courses offered by Sanjary Academy in Hyderabad. We have offer professional
Engineering Course like Piping Design Course,QA / QC Course,document Controller course,pressure Vessel Design Course,
Welding Inspector Course, Quality Management Course, #Safety officer course."
Piping Design Course in India
Piping Design Course in Hyderabad
Piping Design Course in Hyderabad
QA / QC Course
QA / QC Course in india
QA / QC Course in Hyderabad
Document Controller course
Pressure Vessel Design Course
Welding Inspector Course
Quality Management Course
Quality Management Course in india
Safety officer course
data sgp
ReplyDeletedata sydney
datahk
syair sydney
syairsgp
datasgp
paito warna
http://warungsgp.com/
live hk 6d
live sydney
resolver
ReplyDeleteWe are an MRO parts supplier with a very large inventory. We ship parts to all the countries in the world, usually by DHL AIR. You are suggested to make payments online. And we will send you the tracking number once the order is shipped.
Nice blog, thank you so much for sharing such an amazing blog. Get the best Website Designing Services by our expert of OGEN Infosystem in Delhi, India.
ReplyDeleteWeb Design Company in Delhi
BA Revaluation Result 2019
ReplyDeleteAppslure is Best app development company in mumbai and you can get website development service at a very affordable price.
ReplyDeleteApp development company in mumbai
persentase kemenangan terbaik dan yang menawarkan harga terendah. Poker adalah salah satu permainan yang tergantung pada banyak faktor 98toto
ReplyDeleteI think you did an awesome job explaining it. Sure beats having to research it on my own. Thanks
ReplyDeleteBCOM 1st, 2nd & 3rd Year Time Table 2020
Thanks for sharing this information with us...
ReplyDeleteJava Course in Bangalore
paito warna terlengkap
ReplyDeletepaito warna cambodia
paito warna china
paito warna singapore
live draw hk
ยินดีต้อนรับสู่ g-win88.com เกมคาสิโนออนไลน์อันดับ 1 ที่ดีที่สุดในประเทศไทยและเอเชีย!
ReplyDelete⮚ รับโบนัสฟรี 100% สำหรับสมาชิกใหม่
⮚ โปรโมชั่น แทงบอลออนไลน์ รับค่าคอม x3
⮚ โปรโมชั่น ลูกค่าบาคาร่า รับคืนยอดเสีย ฟรีๆ 10%
ฝาก-ถอน โอนไว 24 ชั่วโมง⏰, ได้เงินชัว ไม่มีโกง 💯%, โปรโมชั่นดีดีไม่เหมือนที่อื่นแน่นอน
╔══════════════════╗
♛สมัครวันนี้เลย ADD LINE : https://lin.ee/1ZdZara
☎️Call Center บริการตลอด 24 ชม. 0929553889
💋💋รับสิทธิ์ ฟรี!! จำนวนจำกัด💋💋
╚══════════════════╝
We develop free teaching aids for parents and educators to teach English to pre-school children. For more info please visit here: English for children
ReplyDeleteWe are pleased to have you visit our site. English for kids
ReplyDeleteDisneyslot - Game Play -Slot Machine - Tembakikan - Agen Playtech - Joker123 - Kingkong - Casino Online
ReplyDeleteHanya Dengan Min Dp 10,000- dan Wd 50,000-, Anda Berkesempatan Meraih Keberuntungan/Kemenangan Di Disneyslot. Memudahkan Transaksi Melalui Bank BCA - BNI - MANDIRI - BRI - DANAMON - PULSA TELKOMSEL/XL dan OVO Payment.
:: Hot Promo News ::
• Big Bonus Deposit 50%
• Next Bonus Deposit 20%
• Bonus Cashback 5% Setiap Senin
Kontak Kami :
Whatsapp : +62813 9701 4667
Link Alternatif Disneyslot :
• http://156.67.217.134/disneyslot/
Yuk Jangan Tunggu Lagi Daftar Sekarang Juga Dan Nikmati Kemenangan/Keberuntungan Bersama Disneyslot!!Dapatkan Bonus Tertinggi SlotGame Hanya Di Disneyslot.com
#slotgame #agenjoker123 #bandarjudionline #agenkongkong #agenplaytech #situsgameslot #websitejudislot #agentembakikanonline #slotplaystar #agencasinoonline #agengpslot #bandarcasino #slotdisney #rajaslotgame #dewaslot #situsjoker #jackpot #promo #bonus #slot #machine #superbonus #sagaming #ebetcasino #asiagaming #allbet #sagaming #evocasino #baccarat #dragontiger #sicbo #tembakikan
ยินดีต้อนรับสู่ g-win88.com เกมคาสิโนออนไลน์อันดับ 1 ที่ดีที่สุดในประเทศไทยและเอเชีย!
ReplyDelete⮚ รับโบนัสฟรี 100% สำหรับสมาชิกใหม่
⮚ โปรโมชั่น แทงบอลออนไลน์ รับค่าคอม x3
⮚ โปรโมชั่น ลูกค่าบาคาร่า รับคืนยอดเสีย ฟรีๆ 10%
ฝาก-ถอน โอนไว 24 ชั่วโมง⏰, ได้เงินชัว ไม่มีโกง 💯%, โปรโมชั่นดีดีไม่เหมือนที่อื่นแน่นอน
╔══════════════════╗
♛สมัครวันนี้เลย ADD LINE : https://lin.ee/1ZdZara
☎️Call Center บริการตลอด 24 ชม. 0929553889
💋💋รับสิทธิ์ ฟรี!! จำนวนจำกัด💋💋
╚══════════════════╝
Disneyslot - Game Play -Slot Machine - Tembakikan - Agen Playtech - Joker123 - Kingkong - Casino Online
ReplyDeleteHanya Dengan Min Dp 10,000- dan Wd 50,000-, Anda Berkesempatan Meraih Keberuntungan/Kemenangan Di Disneyslot. Memudahkan Transaksi Melalui Bank BCA - BNI - MANDIRI - BRI - DANAMON - PULSA TELKOMSEL/XL dan OVO Payment.
:: Hot Promo News ::
• Big Bonus Deposit 50%
• Next Bonus Deposit 20%
• Bonus Cashback 5% Setiap Senin
Kontak Kami :
Whatsapp : +62813 9701 4667
Link Alternatif Disneyslot :
• http://156.67.217.134/disneyslot/
Yuk Jangan Tunggu Lagi Daftar Sekarang Juga Dan Nikmati Kemenangan/Keberuntungan Bersama Disneyslot!!Dapatkan Bonus Tertinggi SlotGame Hanya Di Disneyslot.com
#slotgame #agenjoker123 #bandarjudionline #agenkongkong #agenplaytech #situsgameslot #websitejudislot #agentembakikanonline #slotplaystar #agencasinoonline #agengpslot #bandarcasino #slotdisney #rajaslotgame #dewaslot #situsjoker #jackpot #promo #bonus #slot #machine #superbonus #sagaming #ebetcasino #asiagaming #allbet #sagaming #evocasino #baccarat #dragontiger #sicbo #tembakikan