A signed overflow will happen if (and only if):
- the signs of both inputs are the same, and
- the sign of the sum (when added with wrap-around) is different from the sign of the inputs.
Using C operators: overflow = ~(a^b) & (a^(a+b)), where a+b is computed with wrap-around and only the sign bit of the result is meaningful.
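As a scalar illustration of that expression, here is a minimal sketch. The function name `add_overflows_i32` is made up for this example, and the sum is computed on unsigned values because signed overflow is undefined behavior in C:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a + b would overflow int32_t. */
static bool add_overflows_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;                 /* wraps modulo 2^32, well defined */
    /* ~(a^b): inputs have equal signs; (a^sum): sum's sign differs from a's */
    uint32_t overflow = ~(ua ^ ub) & (ua ^ sum);
    return (overflow >> 31) != 0;           /* only the sign bit matters */
}
```

For example, `add_overflows_i32(INT32_MAX, 1)` reports an overflow, while `add_overflows_i32(INT32_MAX, INT32_MIN)` does not, because the inputs have different signs.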
Also, if an overflow happens, the saturated result will have the same sign as the inputs (which share a sign in that case). Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:
#include <smmintrin.h> // SSE4.1, needed for _mm_blendv_ps

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );
    // normal result (possibly wraps around)
    __m128i res = _mm_add_epi32( a, b );
    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    // int_max for non-negative a, int_min (= int_max+1 with wrap-around) for negative a
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);
    // saturation happened if inputs do not have different signs,
    // but sign of result is different (only the sign bit of each lane is meaningful):
    __m128i sign_xor = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));
    // blendvps selects per lane on the sign bit of the mask
    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                           _mm_castsi128_ps( saturated ),
                                           _mm_castsi128_ps( overflow ) ) );
}
If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max, using the sign bits of a as the selector.
Also, if you have only SSE2 or SSE3, you can replace the last blend by an arithmetic shift (of overflow) 31 bits to the right, and manual blending (using and/andnot/or).
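A minimal sketch of that SSE2-only variant could look as follows. The function name `adds_epi32_sse2` is made up for this example, and the rest mirrors the approach described in this answer:

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical name: saturating signed 32-bit add using only SSE2. */
static __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
    /* normal result (possibly wraps around) */
    __m128i res = _mm_add_epi32(a, b);
    /* int_max for non-negative a, int_min (= int_max + 1) for negative a */
    __m128i saturated = _mm_add_epi32(int_max, _mm_srli_epi32(a, 31));
    /* sign bit set where inputs share a sign but the result's sign differs */
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, res));
    /* broadcast the sign bit to the whole lane: all ones where overflow */
    __m128i mask = _mm_srai_epi32(overflow, 31);
    /* manual blend: (mask & saturated) | (~mask & res) */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```

Note that `_mm_andnot_si128(x, y)` computes `~x & y`, so the mask goes in the first argument of the final andnot.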
And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
Addendum: If you know the sign of either a or b at compile-time, you can directly set saturated accordingly, and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for non-negative a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
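To illustrate the known-sign case, here is a sketch that adds a fixed non-negative constant. The function name `adds_epi32_plus1000` and the constant 1000 are made up for this example, and it uses the SSE2 shift-and-blend so that it compiles without SSE4.1:

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical helper: saturating add of the non-negative constant 1000. */
static __m128i adds_epi32_plus1000(__m128i b)
{
    const __m128i a = _mm_set1_epi32(1000);
    /* a is non-negative, so the only possible saturated value is int_max */
    const __m128i saturated = _mm_set1_epi32(0x7FFFFFFF);
    __m128i res = _mm_add_epi32(a, b);
    /* ~b & res: b is non-negative like a, but the wrapped sum is negative */
    __m128i overflow = _mm_andnot_si128(b, res);
    /* all ones in lanes that overflowed */
    __m128i mask = _mm_srai_epi32(overflow, 31);
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}
```

For a negative constant, the same structure should apply with int_min as the saturated value and `_mm_andnot_si128(res, b)` as the overflow test.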
Test case / demo: https://godbolt.org/z/v1bsc85nG