(Work in progress, power just went out here; posting now to save my work.)
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0 so you need to branch on that.)  For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts.  But in C++, that's something for the compiler to figure out.  Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
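Before the per-CPU timings, here's a minimal sketch (mine, not from the question) of the zero-input check you need when `__builtin_clz` lowers to `bsr`; the helper name `clz32` is just for illustration:

```c++
#include <cstdint>

// Sketch: a countl_zero-style helper that's safe for 0 on pre-lzcnt CPUs,
// where __builtin_clz lowers to bsr and has an undefined result for input 0.
unsigned clz32(uint32_t x) {
    if (x == 0) return 32;        // bsr can't produce this; needs a branch or cmov
    return __builtin_clz(x);      // 31 - bit-index from bsr
}
```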
- 8086 to 286: no instruction available.
- 386: `bsf`/`bsr`: 10+3n cycles.  Worst-case: 10+3*31 = 103c
 
- 486: `bsf` (16 or 32-bit registers): 6-42 cycles; `bsr` 7-104 cycles (1 cycle less for 16-bit regs).
 
- P5 Pentium: `bsf`: 6-42 cycles (6-34 for 16-bit); `bsr` 7-71 cycles (or 7-39 for 16-bit).  Non-pairable.
 
- Intel P6 and later: `bsf`/`bsr`: 1 uop with 1 cycle throughput, 3 cycle latency.  (PPro / PII and later.)
 
- AMD K7/K8/K10/Bulldozer/Zen: `bsf`/`bsr` are slowish for a modern CPU.  e.g. K10: 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
 
- Intel Haswell / AMD K10: `lzcnt` introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before `tzcnt` and the rest of BMI1).
 For an input of 0, they return the operand-size, so they fully implement C++20 `std::countl_zero` / `countr_zero` respectively, unlike `bsr`/`bsf`.  (Which leave the destination unmodified on input=0.  AMD documents this; Intel implements it in practice on current CPUs at least, but documents the destination register's contents as "undefined".  Perhaps some older Intel CPUs are different; otherwise it's just annoying that they don't document the behaviour so software can take advantage.)
 - On AMD, they're fast: a single uop for `lzcnt`, with `tzcnt` taking one more (probably a bit-reverse to feed the `lzcnt` execution unit), so a nice win vs. `bsf`/`bsr`.  This is why compilers typically use `rep bsf` for `countr_zero` / `__builtin_ctz`, so it will run as `tzcnt` on CPUs that support it, but as `bsf` on older CPUs.  (They produce the same results for non-zero inputs, unlike `bsr`/`lzcnt`.)  See the sketch just after this list.
 - On Intel, same fast performance as `bsf`/`bsr`, even including the output dependency, until Skylake fixed that; it's a true dependency for `bsf`/`bsr`, but a false dependency for `tzcnt`/`lzcnt` and `popcnt`.
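To illustrate the `rep bsf` point above (my sketch; `ctz32` is a made-up name): with an input that's known non-zero, the compiler doesn't have to emit any zero check, so the same machine code can run as `bsf` on old CPUs and as `tzcnt` on new ones:

```c++
// __builtin_ctz has an undefined result for 0, so no zero-check is needed;
// as described above, this can be emitted as rep bsf (the tzcnt encoding),
// which old CPUs execute as plain bsf.  Same result either way for x != 0.
unsigned ctz32(unsigned x) {
    return __builtin_ctz(x);
}
```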
 
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine.  (Or maybe even more so for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
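For contrast, here's roughly what the log2(width) chain of shift/or operations looks like for this problem (a portable sketch of the usual bit-smearing approach, not code from the question or the other answers):

```c++
#include <cstdint>

// Portable baseline: 5 shift/or steps for 32 bits (6 for uint64_t), no bit-scan needed.
uint32_t set_leading_zeros_smear(uint32_t x) {
    uint32_t m = x;
    m |= m >> 1;  m |= m >> 2;  m |= m >> 4;
    m |= m >> 8;  m |= m >> 16;       // m = all bits at or below the highest set bit
    return x | ~m;                    // set everything above it (all-ones for x == 0)
}
```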
Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instruction timings for 8088 through Pentium.  (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD.  On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer).  Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask.  ISO C++ before C++20 doesn't guarantee that >> is an arithmetic right shift on signed integer types (it's implementation-defined), but all sane compilers choose that, and C++20 finally requires it.  (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
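That idea looks something like this (my reconstruction of it, assuming x != 0 so `__builtin_clz` is well-defined and lz is at most 31):

```c++
// Shift the leading zeros out the top, then let the arithmetic right shift
// copy the (now top) set bit back down over them.  Relies on >> being an
// arithmetic shift for signed types, as discussed above.
int set_leading_zeros_nonzero(int x) {
    int lz = __builtin_clz(x);    // undefined for x == 0
    return (x << lz) >> lz;
}
```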
Also an idea with BMI2 `bzhi`, if we're considering what's efficient with BMI2 available: something like `x | ~_bzhi_u32(-1, 32-lz)`.  Unfortunately that requires two inversions, the `32-lzcnt` and the `~`.  We have BMI1 `andn`, but not an equivalent `orn`.  And we can't just use `neg` because `bzhi` doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs.  Will probably post these as an answer tomorrow.
#include <limits.h>                             // CHAR_BIT
int set_leading_zeros(int x){
    int lz = __builtin_clz(x|1);                // x|1 keeps the input non-zero, clamping the lzcount to 31 at most
    int tmp = (x<<lz);                          // shift out leading zeros, leaving a 1 (or 0 if x==0)
    tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1);  // set the MSB in case x==0
    return tmp>>lz;                             // sign-extend with an arithmetic right shift.
}
#include <stdint.h>
#include <immintrin.h>
uint32_t set_leading_zeros_bmi2(uint32_t x){
    int32_t lz = _lzcnt_u32(x);            // returns 0 to 32
    uint32_t mask = _bzhi_u32(-1, 32-lz);  // count = 32-lz handles all 33 possible values, producing mask=0 for lz=32
    return x | ~mask;
}
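A quick sanity check of both versions (test values are mine; assumes the two functions above are in scope, and a build with something like -march=haswell for BMI2/lzcnt):

```c++
#include <cassert>

int main() {
    assert((unsigned)set_leading_zeros(0x00FF0000) == 0xFFFF0000u);
    assert(set_leading_zeros(0)  == -1);                         // x == 0: every bit is a "leading zero"
    assert(set_leading_zeros(-1) == -1);                         // MSB already set: unchanged
    assert(set_leading_zeros_bmi2(0x00FF0000u) == 0xFFFF0000u);  // needs a BMI2-capable CPU
    assert(set_leading_zeros_bmi2(0) == 0xFFFFFFFFu);
}
```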
On x86-64 you can
Combined with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend.  Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
Unfortunately `shl`/`sar reg, cl` costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 `shlx` / `sarx` for the shift version to be better than a mask sequence like `bsr ecx, src` / `mov tmp, -1` / `shl tmp, cl` / `or dst, tmp`.  (ORing in `-1 << bit-index` is fine because that bit is already set in `x`, so no `not`/`neg` of the count is needed.)
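For completeness, that mask sequence back in C (my sketch, assuming x != 0 so `bsr` / `__builtin_clz` is well-defined; `set_leading_zeros_mask` is a made-up name):

```c++
#include <cstdint>

// x | (-1 << index_of_highest_set_bit): including that bit in the mask is
// harmless because it's already set in x.
uint32_t set_leading_zeros_mask(uint32_t x) {
    unsigned idx = 31 - __builtin_clz(x);     // what bsr computes directly
    return x | (0xFFFFFFFFu << idx);          // mov -1 / shl reg,cl / or
}
```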