As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_align(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
static inline
uint32_t load32_native_endian(const void *vp){
typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
const unaligned_aliasing_u32 *up = vp;
return *up; // #ifndef __GNUC__ then use memcpy
}
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = load32_native_endian(a);
uint32_t bbytes = load32_native_endian(b);
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(load32_native_endian(a));
uint32_t b_native = be32toh(load32_native_endian(b));
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case.
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.