There is this talk, CppCon 2016: Chandler Carruth, “Garbage In, Garbage Out: Arguing about Undefined Behavior...”, where Mr. Carruth shows an example from the bzip2 code. It uses a uint32_t i1 as an index. On a 64-bit system the array access block[i1] then does *(block + i1). The issue is that block is a 64-bit pointer, whereas i1 is a 32-bit number. The index arithmetic might overflow, and since unsigned integers have defined overflow (wrap-around) behavior, the compiler has to emit extra instructions to make sure that this behavior is preserved even on a 64-bit system.
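As a minimal sketch of that pattern (not the actual bzip2 code; the function names here are made up), the difference between an unsigned and a signed 32-bit index would look something like this:

#include <cstdint>

// Hypothetical functions, only meant to illustrate the index-width issue.
uint32_t load_next_u(const uint32_t* block, uint32_t i1) {
    // i1 + 1 must wrap around at 32 bits (defined behavior), so the
    // compiler has to compute the wrapped 32-bit value and zero-extend
    // it before adding it to the 64-bit pointer block.
    return block[i1 + 1];
}

uint32_t load_next_s(const uint32_t* block, int32_t i1) {
    // Overflow of the signed index is undefined, so the compiler may
    // assume it never happens and fold the +1 into the 64-bit address
    // computation directly.
    return block[i1 + 1];
}

How much this actually costs in the generated code depends on the compiler and optimization level, but the unsigned version is the one where the 32-bit wrap-around must be preserved.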
I would also like to show this effect with a simpler example, so I tried ++i with various signed and unsigned integer types. Here is my test code:
#include <cstdint>
void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }
void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }
void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }
void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; } 
With g++ -c test.cpp and objdump -d test.o I get assembly listings like
this:
000000000000004e <_Z10test_int32v>:
  4e:   55                      push   %rbp
  4f:   48 89 e5                mov    %rsp,%rbp
  52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  5d:   90                      nop
  5e:   5d                      pop    %rbp
  5f:   c3                      retq   
To be honest, my knowledge of x86 assembly is rather limited, so the following conclusions and questions may be naive.
The first two instructions seem to be just the function prologue, and the last three the epilogue and return. Removing these lines leaves the following kernels for the various data types:
- int8_t:

   4:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
   8:   0f b6 45 ff             movzbl -0x1(%rbp),%eax
   c:   83 c0 01                add    $0x1,%eax
   f:   88 45 ff                mov    %al,-0x1(%rbp)

- uint8_t:

  19:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
  1d:   80 45 ff 01             addb   $0x1,-0x1(%rbp)

- int16_t:

  28:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
  2e:   0f b7 45 fe             movzwl -0x2(%rbp),%eax
  32:   83 c0 01                add    $0x1,%eax
  35:   66 89 45 fe             mov    %ax,-0x2(%rbp)

- uint16_t:

  40:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
  46:   66 83 45 fe 01          addw   $0x1,-0x2(%rbp)

- int32_t:

  52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)

- uint32_t:

  64:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  6b:   83 45 fc 01             addl   $0x1,-0x4(%rbp)

- int64_t:

  76:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
  7d:   00
  7e:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)

- uint64_t:

  8a:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
  91:   00
  92:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
Comparing the signed with the unsigned versions, I would have expected from Mr. Carruth's talk that extra masking instructions would be generated.
But for int8_t a byte is stored (movb) to the stack slot at -0x1(%rbp), then loaded zero-extended (movzbl) into the 32-bit register %eax. The addition (add) is performed on the full register, presumably because the overflow is not defined anyway, and only the low byte (%al) is written back. The unsigned version, in contrast, directly uses the byte-sized addb.
So either add and addb/addw/addl/addq all take the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all operand sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.
I have looked for a table of instruction latencies and found the instruction tables at agner.org. There, for each CPU (Sandy Bridge in my case), there is only a single entry for ADD, and I do not see separate entries for the different width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual also seems to list only a single add instruction.
Does this mean that on x86 the ++i of non-native-width integers is actually faster for unsigned types because fewer instructions are needed?
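To check this empirically, I suppose one could also just time the increment in a loop. Below is a rough sketch of such a micro-benchmark (the helper time_increments is made up for this sketch); the volatile only serves to keep a load and store of the given width in the generated code, so the timing would also include the loop overhead and store-to-load forwarding, not just the adder itself:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical helper: times n increments of a volatile variable of type T.
template <typename T>
double time_increments(std::size_t n) {
    volatile T i = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t k = 0; k < n; ++k) {
        i = static_cast<T>(i + 1);  // keeps a load, add, and store of width T
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const std::size_t n = 100000000;
    std::cout << "int8_t:   " << time_increments<int8_t>(n)   << " s\n";
    std::cout << "uint8_t:  " << time_increments<uint8_t>(n)  << " s\n";
    std::cout << "int32_t:  " << time_increments<int32_t>(n)  << " s\n";
    std::cout << "uint32_t: " << time_increments<uint32_t>(n) << " s\n";
}

One would of course have to check with objdump again how the increment is actually compiled in this loop, since the compiler is free to choose different instructions than in the plain test functions above.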