That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.
You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding element as the control, so you can use a float directly (without _mm_cmp_ps) if its sign bit is interesting.
__m128 mask = _mm_cmplt_ps(x, y);      // per-element 0 / -1 (all-ones) bit patterns
__m128 c = _mm_blendv_ps(b, a, mask);  // copy element from 2nd op (a) where the mask is set
Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask is set, like a conditional move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a, b, mask) instead of having them backwards. Or give them meaningful names like ones and twos.
In other cases where your control operand is a compile-time constant, there's also _mm_blend_ps / _mm_blend_pd / _mm_blend_epi16 (an 8-bit immediate can only control 8 separate elements, so 8x 2-byte elements for the integer version).
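For example, a minimal sketch of a constant blend (a and b here are just the two sources being merged):

__m128 lo_a_hi_b = _mm_blend_ps(a, b, 0xC);  // imm8 = 0b1100: elements 0,1 from a, elements 2,3 from b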
Performance
blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps. (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)
But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E). Although the uops can at least run on any of 3 vector ALU ports.
The vblendvps version is funky in asm because it has 4 operands, not overwriting any of the input registers. (The non-AVX version overwrites one input, and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (And AVX-512 vpternlogd, which can do a bitwise blend as a single uop.)
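For example, a hedged sketch of that vpternlogd bitwise select with intrinsics (needs AVX-512F + AVX-512VL for the 128-bit form; 0xCA is the truth-table immediate for (mask & b) | (~mask & a)):

__m128i blended = _mm_ternarylogic_epi32(mask, b, a, 0xCA);  // per bit: mask ? b : a, as a single uop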
AMD has fully efficient handling of vblendvps, single uop (except for YMM on Zen1) with 2/clock throughput.
Without SSE4.1, you can emulate with ANDN/AND/OR
(x&~mask) | (y&mask) is equivalent to _mm_blendv_ps(x,y,mask), except it's pure bitwise so all the bits of each mask element should match the top bit. (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31).)
Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
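For reference, a minimal sketch of the SSE2 fallback in intrinsics (mask assumed to be all-ones / all-zeros per element, e.g. a compare result):

__m128 blended = _mm_or_ps(_mm_andnot_ps(mask, x),   // ~mask & x: keep x where the mask is 0
                           _mm_and_ps(mask, y));     //  mask & y: keep y where the mask is all-ones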
For constant / loop-invariant a^b without SSE4.1
x ^ ((x ^ y) & mask) saves one operation if you can reuse x ^ y. (Suggested in comments by Aki.) Otherwise this is worse: longer critical-path latency and no instruction-level parallelism.
Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand (the one you use for y&mask) and destroy the mask with ANDN (~mask & x).
With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose an AND/XOR if a^b can be pre-computed.
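As a sketch in intrinsics, hoisting the loop-invariant XOR (assuming x and y don't change inside the loop):

__m128 x_xor_y = _mm_xor_ps(x, y);                           // precompute outside the loop
__m128 blended = _mm_xor_ps(x, _mm_and_ps(x_xor_y, mask));   // x where mask=0, y where mask=all-ones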
Blending with 0: just AND[N]
(Applies to integer and FP; the bit-pattern for 0.0f and 0.0 is all-zeros, same as integer 0.)
You don't need to copy a zero from anywhere, just x & mask, or x & ~mask.
(The (x & ~mask) | (y & mask) expression reduces to this for x=0 or y=0; that term becomes zero, and OR-ing with zero is a no-op.)
For example, to implement x = mask ? x+y : x, which would put the latency of an add and a blend on the critical path, you can simplify to x += (y or zero according to mask), i.e. x += y & mask. Or to do the opposite, x += ~mask & y, using _mm_andnot_ps(mask, vy).
This has an ADD and an AND operation (so it's already cheaper than a blend on some CPUs, and you don't need a 0.0 source operand in another register). Also, the dependency chain through x now only includes the += operation, if you were doing this in a loop with loop-carried x but independent y & mask, e.g. summing only the matching elements of an array: sum += A[i]>=thresh ? A[i] : 0.0f;
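A minimal sketch of that conditional sum (array name A, length n, and thresh are placeholders here; assumes n is a multiple of 4):

__m128 sum = _mm_setzero_ps();
__m128 vthresh = _mm_set1_ps(thresh);
for (int i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(&A[i]);
    __m128 mask = _mm_cmpge_ps(v, vthresh);      // A[i] >= thresh ? all-ones : all-zeros
    sum = _mm_add_ps(sum, _mm_and_ps(v, mask));  // add A[i] or 0.0f; only the add is loop-carried
}
// then horizontal-sum the 4 lanes of sum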
For an example of an extra slowdown due to lengthening the critical path unnecessarily, see gcc optimization flag -O3 makes code slower than -O2 where GCC's scalar asm using cmov has that flaw, doing cmov as part of the loop-carried dependency chain instead of to prepare a 0 or arr[i] input for it.
Clamping to a MIN or MAX
If you want something like a < upper ? a : upper, you can do that clamping in one instruction with _mm_min_ps instead of cmpps / blendvps. (Similarly _mm_max_ps, and _mm_min_pd / _mm_max_pd.)
See What is the instruction that gives branchless FP min and max on x86? for details on their exact semantics, including a longstanding (but recently fixed) GCC bug where the FP intrinsics didn't provide the expected strict-FP semantics of which operand would be the one to keep if one was NaN.
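For example, a minimal sketch of the upper clamp (upper is a scalar float here):

__m128 vupper = _mm_set1_ps(upper);
__m128 clamped = _mm_min_ps(a, vupper);  // per element: a < upper ? a : upper (minps keeps the 2nd operand if either is NaN)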
Or for integer, SSE2 is highly non-orthogonal (signed min/max only for int16_t, unsigned min/max only for uint8_t). Similar for saturating pack instructions. SSE4.1 fills in the missing operand-size and signedness combinations; a clamp sketch using the SSE4.1 versions follows the list below.
- Signed: SSE2 _mm_max_epi16 (and corresponding _mm_min_* for all of these); SSE4.1 _mm_max_epi32 / _mm_max_epi8; AVX-512 _mm_max_epi64
- Unsigned: SSE2 _mm_max_epu8; SSE4.1 _mm_max_epu16 / _mm_max_epu32; AVX-512 _mm_max_epu64
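A minimal clamp sketch with the SSE4.1 int32 versions (lo and hi are placeholder scalar bounds):

__m128i vlo = _mm_set1_epi32(lo);
__m128i vhi = _mm_set1_epi32(hi);
__m128i clamped = _mm_min_epi32(_mm_max_epi32(v, vlo), vhi);  // clamp each signed int32 into [lo, hi]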
AVX-512 makes masking/blending a first-class operation
AVX-512 compares into a mask register, k0..k7 (intrinsic types __mmask16 and so on). Merge-masking or zero-masking can be part of most ALU instructions. There's also a dedicated blend instruction (vblendmps and friends) that blends according to a mask register.
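As a hedged sketch with 512-bit vectors (AVX-512F; variable names are placeholders), a compare into a k register feeding either a merge-masked add or the dedicated blend:

__mmask16 m = _mm512_cmp_ps_mask(x, y, _CMP_LT_OQ);  // compare into a mask register
__m512 r = _mm512_mask_add_ps(x, m, x, y);           // merge-masking: r = m ? x+y : x
__m512 c = _mm512_mask_blend_ps(m, b, a);            // dedicated blend: c = m ? a : b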
I won't go into the details here; suffice it to say that if you have a lot of conditional stuff to do, AVX-512 is great (even if you only use 256-bit vectors to avoid turbo clock-speed penalties and so on). And you'll want to read up on the details for AVX-512 specifically.