TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction.  No need to mess around with inline asm.
Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for.  (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)
GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms.  (Or in older versions, __uint128_t.)  GCC doesn't implement this type on 32-bit platforms, though.
This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half.  (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)
MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64).  The docs also mention IPF (IA-64) having _umul128 available, but I don't have MSVC for Itanium to test with.  (Probably not relevant anyway.)
#include <stdint.h>   // for uint64_t
#define HAVE_FAST_mul64 1
#ifdef __SIZEOF_INT128__     // GNU C
 static inline
 uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int128 prod =  a * (unsigned __int128)b;
     return prod >> 64;
 }
#elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
   // MSVC for x86-64 or AArch64
   // possibly also  || defined(_M_IA64) || defined(_WIN64)
   // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux
  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
  #define mulhi64 __umulh
#elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umul128
  // the docs incorrectly say that _umul128 is available for ARM
  // which would be weird because there's no single insn on AArch32
  #include <intrin.h>
  static inline
  uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int64 HighProduct;
     (void)_umul128(a, b, &HighProduct);
     return HighProduct;
  }
#else
# undef HAVE_FAST_mul64
  uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
  // or you might want to define @craigster0's version here so it can inline.
#endif
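As a quick usage example (an illustrative snippet of my own, assuming one of the definitions above is available in the same file): the high half of 2^63 * 4 = 2^65 is 2.

    #include <stdio.h>
    #include <inttypes.h>     // PRIu64; also pulls in <stdint.h>

    int main(void) {
        uint64_t hi = mulhi64(1ULL << 63, 4);   // (2^63 * 4) >> 64 == 2
        printf("high half = %" PRIu64 "\n", hi);
        return 0;
    }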
For x86-64, AArch64, and PowerPC64 (and others), mulhi64 compiles to a single full-multiply or high-half multiply instruction, plus a couple of movs to deal with the calling convention (which should optimize away after it inlines).
 From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):
     # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
     # MSVC makes basically the same function, but with different regs for x64 __fastcall
    mov     rax, rsi
    mul     rdi              # RDX:RAX = RAX * RDI
    mov     rax, rdx
    ret
(or with clang -march=haswell to enable BMI2: mov rdx, rsi  /  mulx rax, rcx, rdi to put the high-half in RAX directly.  gcc is dumb and still uses an extra mov.)
For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):
test_var:
    umulh   x0, x0, x1
    ret
With a compile-time constant power of 2 multiplier, we usually get the expected right-shift to grab a few high bits.  But gcc amusingly uses shld (see the Godbolt link).
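To make that concrete, here's a tiny wrapper (my illustration; the function name is made up):

    // Grab the top 4 bits of a:  (a * 16) >> 64  is mathematically  a >> 60.
    static inline uint64_t top4bits(uint64_t a) {
        return mulhi64(a, 16);
    }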
As noted above, current compilers don't optimize @craigster0's nice portable version: on x86-64 you get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions, i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacking of the results.  So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.
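For reference, the kind of portable fallback being discussed looks roughly like this (my sketch of the usual split-into-32-bit-halves approach, not @craigster0's exact code); it needs <stdint.h> and could go in the #else branch above:

    // Split each operand into 32-bit halves, do four 32x32 => 64-bit multiplies,
    // and add up the cross terms without losing carries.
    static inline uint64_t mulhi64_portable(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t lo_lo = a_lo * b_lo;   // bits  0..63  of the full product
        uint64_t hi_lo = a_hi * b_lo;   // bits 32..95
        uint64_t lo_hi = a_lo * b_hi;   // bits 32..95
        uint64_t hi_hi = a_hi * b_hi;   // bits 64..127

        // Sum everything that lands in bits 32..95, keeping hi_lo's top half
        // separate so nothing overflows a uint64_t.
        uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
        return hi_hi + (hi_lo >> 32) + (cross >> 32);   // everything at bit 64 and above
    }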
A full-multiply mul r64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, same as imul r64,r64 which only produces a 64-bit result.  So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.
Check it out on the Godbolt compiler explorer, at the link above.
gcc does fully optimize the portable version when multiplying by a constant 16, though: you get a single right shift, more efficient than with an unsigned __int128 multiply.