c++ - perf annotated assembly seems off
I want to measure the amount of time a C++ atomic fetch_add takes under different settings. I wrote this:

atomic<uint64_t> x(0);
for (uint64_t i = 0; i < reps; i++) { x.fetch_add(1); }
So if reps is high enough, I assume I should be able to average out the number of fetch_adds happening per second. First, though, I needed to validate that most of the time is indeed spent within the fetch_add, as opposed to loop overhead, for example. I ran perf for this.
This is the assembly from objdump:

  400ed0: b8 00 b4 c4 04          mov    $0x4c4b400,%eax
  400ed5: 0f 1f 00                nopl   (%rax)
  400ed8: f0 83 05 7c 22 20 00    lock addl $0x1,0x20227c(%rip)
  400edf: 01
  400ee0: 83 e8 01                sub    $0x1,%eax
  400ee3: 75 f3                   jne    400ed8 <_Z10incrSharedv+0x8>
perf (for the cycles event) says 100% of the cycles go to sub $0x1,%eax, as opposed to what I would expect, the lock addl $0x1,0x20227c(%rip) or the jump. Any ideas why? Is this accurate, or a measurement artifact? And in the second case, why does perf systematically attribute the latency to the sub line rather than the addl?
TL;DR: using the :pp suffix, for events the processor supports it on, can help give more accurate annotation data.

Longer version:
In trying to investigate the behavior described, I tried the following, more unrolled loop. I think it answers the question to some extent.

for (uint64_t i = 0; i < reps; i += 10) {
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
    x.fetch_add(1, order);
}
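As an aside, the copy-pasted body can also be generated at compile time; this is a sketch using a C++17 fold expression (the add_n name and structure are mine, not from the original code):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>

// Emits one fetch_add per element of the index pack, so the compiler
// sees N distinct call sites rather than a loop it may or may not unroll.
template <std::size_t... Is>
void add_n_impl(std::atomic<uint64_t>& x, std::memory_order order,
                std::index_sequence<Is...>) {
    (((void)Is, x.fetch_add(1, order)), ...);
}

template <std::size_t N>
void add_n(std::atomic<uint64_t>& x, std::memory_order order) {
    add_n_impl(x, order, std::make_index_sequence<N>{});
}

// Usage: the unrolled benchmark loop above becomes
//   for (uint64_t i = 0; i < reps; i += 10) add_n<10>(x, order);
```

Whether the generated code matches the hand-unrolled version exactly should still be verified with objdump, as done for the original loop.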
When using perf record -e cycles, the resulting perf annotate is:

     : 0000000000400f00 <incr(std::atomic<unsigned long>&)>:
 0.00 : 400f00: mov    $0x3d0900,%eax
 0.00 : 400f05: nopl   (%rax)
 0.00 : 400f08: lock addq $0x1,(%rdi)
10.93 : 400f0d: lock addq $0x1,(%rdi)
 9.77 : 400f12: lock addq $0x1,(%rdi)
10.22 : 400f17: lock addq $0x1,(%rdi)
 8.97 : 400f1c: lock addq $0x1,(%rdi)
10.39 : 400f21: lock addq $0x1,(%rdi)
 9.87 : 400f26: lock addq $0x1,(%rdi)
10.48 : 400f2b: lock addq $0x1,(%rdi)
 9.70 : 400f30: lock addq $0x1,(%rdi)
10.19 : 400f35: lock addq $0x1,(%rdi)
 9.49 : 400f3a: sub    $0x1,%rax
 0.00 : 400f3e: jne
When I alter the number of fetch_add calls grouped together to 5, there are 5 hotspots identified. The result suggests there is a systematic off-by-one-instruction error in attributing cycles in this case. The perf wiki includes the following warning:
"Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter overflows."

"The distance between those 2 points may be several dozen instructions or more if there were taken branches."
So it looks like I should consider myself lucky that the annotation is only off by one ;).

Update: Intel processors have support for a feature called PEBS (Precise Event Based Sampling), which makes correlating the instruction pointer with the counter event a lot less error prone; see this forum post.
You can access this feature via perf as well, for selected counters:
Using perf record -e cycles:pp instead (notice the :pp suffix), the output of perf annotate this time is:

     : 0000000000400f00 <incr(std::atomic<unsigned long>&)>:
 0.00 : 400f00: mov    $0x3d0900,%eax
 0.00 : 400f05: nopl   (%rax)
10.75 : 400f08: lock addq $0x1,(%rdi)
10.15 : 400f0d: lock addq $0x1,(%rdi)
10.00 : 400f12: lock addq $0x1,(%rdi)
 9.22 : 400f17: lock addq $0x1,(%rdi)
10.21 : 400f1c: lock addq $0x1,(%rdi)
 9.75 : 400f21: lock addq $0x1,(%rdi)
 9.95 : 400f26: lock addq $0x1,(%rdi)
10.02 : 400f2b: lock addq $0x1,(%rdi)
10.18 : 400f30: lock addq $0x1,(%rdi)
 9.75 : 400f35: lock addq $0x1,(%rdi)
 0.00 : 400f3a: sub    $0x1,%rax
 0.00 : 400f3e: jne    400f08
Which confirms my hunch. This solution may also be helpful in much trickier situations involving jumps.
c++ c++11 x86 perf