c++ - perf annotated assembly seems off




I want to measure the amount of time a C++ atomic fetch_add takes under different settings. I wrote this:

    atomic<uint64_t> x(0);
    for (uint64_t i = 0; i < reps; i += 1) {
        x.fetch_add(1);
    }

So if reps is high enough, I assume I would be able to average the number of fetch_add calls happening per second. First, I needed to validate that most of the time is indeed spent within fetch_add, as opposed to loop overhead, for example. I ran perf for this.

This is the assembly from objdump:
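As a rough illustration of the averaging idea above, here is a minimal sketch that times the loop with std::chrono. The helper name avg_fetch_add_ns and the choice of steady_clock are assumptions for illustration, not part of the original code, and the measured time still includes whatever loop overhead exists, which is exactly what the perf run is meant to check.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original post): performs `reps` atomic
// increments on `x` and returns the average wall-clock time per call in
// nanoseconds. Loop overhead is included in the measurement.
double avg_fetch_add_ns(std::atomic<std::uint64_t>& x, std::uint64_t reps,
                        std::memory_order order = std::memory_order_seq_cst) {
    assert(reps > 0);  // avoid dividing by zero below
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < reps; ++i) {
        x.fetch_add(1, order);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

Calling avg_fetch_add_ns(x, reps) from one thread gives an uncontended baseline; calling it from several threads on the same atomic would be one way to compare "different settings".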

    400ed0: b8 00 b4 c4 04          mov    $0x4c4b400,%eax
    400ed5: 0f 1f 00                nopl   (%rax)
    400ed8: f0 83 05 7c 22 20 00    lock addl $0x1,0x20227c(%rip)
    400edf: 01
    400ee0: 83 e8 01                sub    $0x1,%eax
    400ee3: 75 f3                   jne    400ed8 <_z10incrsharedv+0x8>

perf (for the cycles event) says that 100% of the cycles go to sub $0x1,%eax, as opposed to what I would expect: lock addl $0x1,0x20227c(%rip) or the jump. Any ideas why? Is this accurate, or is it a measurement artifact? In the second case, why would perf systematically attribute the latency to the sub line rather than the addl?

tl;dr: Try using the :pp suffix; precise events on the processor can help give more accurate annotation data.

Longer version:

In trying to investigate the behavior described, I attempted to use the following, more unrolled loop. I think it answers the question to some extent.

    for (uint64_t i = 0; i < reps; i += 10) {
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
        x.fetch_add(1, order);
    }

When using perf record -e cycles, the resulting perf annotate output is:

         :  0000000000400f00 <incr(std::atomic<unsigned long>&)>:
    0.00 :    400f00:  mov    $0x3d0900,%eax
    0.00 :    400f05:  nopl   (%rax)
    0.00 :    400f08:  lock addq $0x1,(%rdi)
   10.93 :    400f0d:  lock addq $0x1,(%rdi)
    9.77 :    400f12:  lock addq $0x1,(%rdi)
   10.22 :    400f17:  lock addq $0x1,(%rdi)
    8.97 :    400f1c:  lock addq $0x1,(%rdi)
   10.39 :    400f21:  lock addq $0x1,(%rdi)
    9.87 :    400f26:  lock addq $0x1,(%rdi)
   10.48 :    400f2b:  lock addq $0x1,(%rdi)
    9.70 :    400f30:  lock addq $0x1,(%rdi)
   10.19 :    400f35:  lock addq $0x1,(%rdi)
    9.49 :    400f3a:  sub    $0x1,%rax
    0.00 :    400f3e:  jne

When I change the number of fetch_add calls in the loop body to 5, there are 5 hotspots identified. This result suggests that there is a systematic off-by-one-instruction error in attributing cycles in this case.

The perf wiki includes the following warning:

"Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter actually overflows"

"The distance between those two points may be several dozen instructions or more if there were taken branches."

So it looks like I should consider myself lucky that the annotation is off by only 1 ;).
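To make the off-by-one pattern concrete, here is a toy model of skid. It is only a sketch under a strong assumption (every sample lands exactly a fixed number of instructions after the one that caused it, wrapping around the loop body); it does not describe how perf or the PMU actually work, and the function name attribute_with_skid is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (an assumption, not perf's implementation): true_pct[i] is the
// share of cycles really spent in instruction i of a loop body. A sample
// caused by instruction i is reported at instruction (i + skid) % n, where
// the wrap-around models the jump back to the top of the loop.
std::vector<double> attribute_with_skid(const std::vector<double>& true_pct,
                                        std::size_t skid) {
    const std::size_t n = true_pct.size();
    std::vector<double> reported(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        reported[(i + skid) % n] += true_pct[i];
    }
    return reported;
}
```

With the three-instruction loop from the question (lock addl, sub, jne), true shares of {100, 0, 0} and a skid of 1, the model reports {0, 100, 0}: everything is charged to sub. Applied to the unrolled loop it likewise moves each lock addq share down one line, so the first lock addq shows 0.00 and sub picks up the last share, which is the shape of the annotate output above.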

Update: Intel processors support a feature called PEBS (Precise Event Based Sampling), which makes correlating the instruction pointer with the counter event a lot less error prone (see the forum post).

You can access this feature via perf as well, for selected counters.

Using perf record -e cycles:pp instead (notice the :pp suffix), the annotate output this time is:

         :  0000000000400f00 <incr(std::atomic<unsigned long>&)>:
    0.00 :    400f00:  mov    $0x3d0900,%eax
    0.00 :    400f05:  nopl   (%rax)
   10.75 :    400f08:  lock addq $0x1,(%rdi)
   10.15 :    400f0d:  lock addq $0x1,(%rdi)
   10.00 :    400f12:  lock addq $0x1,(%rdi)
    9.22 :    400f17:  lock addq $0x1,(%rdi)
   10.21 :    400f1c:  lock addq $0x1,(%rdi)
    9.75 :    400f21:  lock addq $0x1,(%rdi)
    9.95 :    400f26:  lock addq $0x1,(%rdi)
   10.02 :    400f2b:  lock addq $0x1,(%rdi)
   10.18 :    400f30:  lock addq $0x1,(%rdi)
    9.75 :    400f35:  lock addq $0x1,(%rdi)
    0.00 :    400f3a:  sub    $0x1,%rax
    0.00 :    400f3e:  jne    400f08

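This confirms the hunch: with precise events, the cycles are attributed to the lock addq instructions themselves and none to sub. This solution may be helpful in much trickier situations involving jumps.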

c++ c++11 x86 perf
