Here are two ways I could shift a 128-bit value left by >= 64 bits with SSE intrinsics. The second variation treats the (shift == 64) case specially, avoiding one SSE instruction but adding the cost of an if check:
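#include <emmintrin.h> // SSE2 intrinsics: _mm_slli_si128, _mm_sll_epi64, _mm_set_epi32

// Variation 1: always issue the second (bit-granular) shift.
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )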
{
   __m128i r ;
   r = _mm_slli_si128( a, 8 ) ; // a << 64
   r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   return r ;
}
// Variation 2: skip the second shift when shift == 64.
inline __m128i shiftLeftGte64ByBits( const __m128i & a, const unsigned shift )
{
   __m128i r ;
   r = _mm_slli_si128( a, 8 ) ; // a << 64
   if ( shift > 64 )
   {
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }
   return r ;
}
I was wondering, roughly, how the cost of this if() check compares with the cost of the shift instruction itself, perhaps measured in cycles relative to a normal ALU shift-left instruction.
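Before timing anything, it seems worth confirming that the two variations really return the same result. Here is a minimal sketch of such a check, assuming an SSE2 target. The V1/V2 suffixes are only there so both versions can live in one file, and referenceShift and variantsAgree are hypothetical helper names of mine, not part of any library:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Variation 1 (renamed with a hypothetical V1 suffix): always does the second shift.
inline __m128i shiftLeftGte64V1( const __m128i & a, const unsigned shift )
{
   __m128i r ;
   r = _mm_slli_si128( a, 8 ) ; // a << 64
   r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   return r ;
}

// Variation 2 (renamed with a hypothetical V2 suffix): skips the second shift when shift == 64.
inline __m128i shiftLeftGte64V2( const __m128i & a, const unsigned shift )
{
   __m128i r ;
   r = _mm_slli_si128( a, 8 ) ; // a << 64
   if ( shift > 64 )
   {
      r = _mm_sll_epi64( r, _mm_set_epi32( 0, 0, 0, shift - 64 ) ) ;
   }
   return r ;
}

// Hypothetical scalar reference. For shift >= 64 only the original low 64 bits
// survive: the low half of the result is zero, and the high half is the old
// low half shifted left by (shift - 64), or zero once that count reaches 64.
inline void referenceShift( uint64_t lo, unsigned shift, uint64_t out[2] )
{
   const unsigned rest = shift - 64 ;
   out[0] = 0 ;
   out[1] = ( rest < 64 ) ? ( lo << rest ) : 0 ;
}

// Hypothetical helper: true when both variations match the scalar reference
// for every shift count in [64, 128].
inline bool variantsAgree( uint64_t lo, uint64_t hi )
{
   const __m128i a = _mm_set_epi64x( (long long)hi, (long long)lo ) ;
   for ( unsigned shift = 64 ; shift <= 128 ; ++shift )
   {
      uint64_t expect[2], got1[2], got2[2] ;
      referenceShift( lo, shift, expect ) ;
      _mm_storeu_si128( (__m128i *)got1, shiftLeftGte64V1( a, shift ) ) ;
      _mm_storeu_si128( (__m128i *)got2, shiftLeftGte64V2( a, shift ) ) ;
      if ( got1[0] != expect[0] || got1[1] != expect[1] ) return false ;
      if ( got2[0] != expect[0] || got2[1] != expect[1] ) return false ;
   }
   return true ;
}
```

Calling variantsAgree( lo, hi ) from main with a few bit patterns only shows that the two versions are interchangeable. It says nothing about which one is faster, which is what I am actually asking.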