If you don't mind a little x86 inline assembly (GNU C syntax), you can take advantage of supercat's suggestion to use rotate-with-carry after an add to put the high 32 bits of the full 33-bit result into a register.
Of course, you usually should mind using inline-asm, because it defeats some optimizations (https://gcc.gnu.org/wiki/DontUseInlineAsm).  But here we go anyway:
// works for 64-bit long as well on x86-64, and doesn't depend on calling convention
unsigned average(unsigned x, unsigned y)
{
    unsigned result;
    asm("add   %[x], %[res]\n\t"
        "rcr   %[res]"
        : [res] "=r" (result)   // output
        : [y] "%0"(y),  // input: in the same reg as results output.  Commutative with next operand
          [x] "rme"(x)  // input: reg, mem, or immediate
        :               // no clobbers.  ("cc" is implicit on x86)
    );
    return result;
}
The % modifier to tell the compiler the args are commutative doesn't actually help make better asm in the cases I tried (calling the function with y being a constant or a pointer-deref, i.e. a memory operand).  Probably using a matching constraint for an output operand defeats that, since you can't use % with read-write operands.  One alternative is to express the tie with a read-write operand instead of a matching constraint; see the sketch below.
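For what it's worth, here's a minimal sketch of that alternative (the name average_rw is mine, not part of the tested version): a "+r" read-write operand initialized with y replaces the matching constraint.

// Hypothetical alternative: express the y/output tie with a read-write
// "+r" operand instead of a matching constraint
unsigned average_rw(unsigned x, unsigned y)
{
    unsigned result = y;           // y starts in the output register
    asm("add   %[x], %[res]\n\t"   // CF:result = x + y  (33-bit sum)
        "rcr   %[res]"             // rotate CF into the top bit: (x+y)>>1
        : [res] "+r" (result)      // read-write: y in, average out
        : [x] "rme" (x)            // input: reg, mem, or immediate
    );                             // no clobbers ("cc" is implicit on x86)
    return result;
}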
As you can see on the Godbolt compiler explorer, this compiles correctly, and so does a version with the operands changed to unsigned long, using the same inline asm.  clang 3.9 makes a mess of it, though: it picks the "m" option from the "rme" constraint, so it stores the value to memory and uses a memory operand.
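In case it's useful, here's a sketch of that unsigned long version (the name average_long is mine): the asm template is unchanged, and the compiler substitutes 64-bit registers for the operands on x86-64.

// Same asm template with 64-bit operands (sketch; name is hypothetical)
unsigned long average_long(unsigned long x, unsigned long y)
{
    unsigned long result;
    asm("add   %[x], %[res]\n\t"
        "rcr   %[res]"
        : [res] "=r" (result)
        : [y] "%0"(y),   // input in the same reg as the output
          [x] "rme"(x)   // "e" = sign-extended 32-bit imm, valid for 64-bit add
    );
    return result;
}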
RCR-by-one is not too slow, but it's still 3 uops on Skylake, with 2-cycle latency.  It's great on AMD CPUs, where RCR has single-cycle latency.  (Source: Agner Fog's instruction tables; see also the x86 tag wiki for x86 performance links.)  This version is still better than @sellibitze's, but worse than @Sheldon's order-dependent version.  (See code on Godbolt.)
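For comparison, a well-known pure-C trick for an overflow-free average (a generic sketch; not claimed to be identical to either of those answers) adds the shared bits to half of the differing bits:

// floor((x+y)/2) without needing a 33-bit intermediate (generic sketch)
unsigned average_pure(unsigned x, unsigned y)
{
    return (x & y) + ((x ^ y) >> 1);
}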
But remember that inline-asm defeats optimizations like constant-propagation, so any pure-C++ version will be better when the compiler can see the input values at compile time.
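For example, with the hypothetical average_pure above, a call with constant args folds to a single immediate, while the inline-asm version still executes add/rcr at run time:

unsigned demo(void)
{
    return average_pure(10, 20);   // constant-propagates to: mov eax, 15
    // return average(10, 20);     // would still run add + rcr every call
}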