PTX is an intermediate representation that C/C++ GPU code is compiled into before it is eventually translated into a specific micro-architecture's SASS assembly language. As such, it is not supposed to be encumbered by the holes/gaffes/flukes/idiosyncrasies of the actual instruction sets of specific NVIDIA GPU micro-architectures.
Now, PTX has an instruction for counting the number of leading zeros in a register: clz. Yet it lacks a corresponding ctz instruction, which would count the number of trailing zeros. These operations are 'symmetric', and one would certainly expect to see either both or neither in an instruction set, especially if it's abstract and not bound to what's available on a specific piece of hardware. Popular CPU architectures have had both for many years.
Strangely enough, the CUDA header device_functions.h declares the following function:
 * \brief Find the position of the least significant bit set to 1 in a 32 bit integer.
 *
 * [etc.]
 *
 * \return Returns a value between 0 and 32 inclusive representing the position of the first bit set.
 * - __ffs(0) returns 0.
 */
__DEVICE_FUNCTIONS_DECL__ __device_builtin__ int                    __ffs(int x);
This function:
- has almost the same semantics as count-trailing-zeros: for any nonzero input its result is the trailing-zero count plus one, and it returns 0 on an all-zero input.
- does not translate into a single PTX instruction, but rather two: a bit reversal, then a clz.
- is also missing its potential counterpart, __fls (find last set).
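To make the relationship concrete, here is a minimal sketch of how the two "missing" operations could be emulated on top of the ones that do exist. All the names in it (HOST_DEVICE, find_first_set, count_leading_zeros, count_trailing_zeros, find_last_set, check_identities) are made up for illustration and are not part of CUDA. Returning 32 from count_trailing_zeros on a zero input is a convention I chose to mirror what __clz does, not something PTX defines. The host-side loops are only there so the snippet also builds without nvcc; in device code the wrappers call the real __ffs and __clz intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Under nvcc the helpers are usable on host and device; under a plain C++
// compiler the qualifier expands to nothing. (Hypothetical macro name.)
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Same contract as CUDA's __ffs: 1-based position of the lowest set bit,
// 0 when x == 0.
HOST_DEVICE inline int find_first_set(std::uint32_t x) {
#ifdef __CUDA_ARCH__
    return __ffs(static_cast<int>(x));  // device pass: the real intrinsic
#else
    if (x == 0u) return 0;              // host pass: portable fallback
    int pos = 1;
    while ((x & 1u) == 0u) { x >>= 1; ++pos; }
    return pos;
#endif
}

// Same contract as CUDA's __clz: number of leading zero bits, 32 when x == 0.
HOST_DEVICE inline int count_leading_zeros(std::uint32_t x) {
#ifdef __CUDA_ARCH__
    return __clz(static_cast<int>(x));  // device pass: the real intrinsic
#else
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0u && (x & mask) == 0u; mask >>= 1) {
        ++n;
    }
    return n;
#endif
}

// Emulated ctz: ffs minus one for nonzero input.
// Assumption: 32 for x == 0, chosen here by analogy with clz.
HOST_DEVICE inline int count_trailing_zeros(std::uint32_t x) {
    return x == 0u ? 32 : find_first_set(x) - 1;
}

// Emulated fls: 1-based position of the highest set bit, 0 when x == 0.
HOST_DEVICE inline int find_last_set(std::uint32_t x) {
    return 32 - count_leading_zeros(x);
}

// Host-side sanity check of the identities used above.
inline void check_identities() {
    assert(count_trailing_zeros(0x00000008u) == 3);
    assert(find_first_set(0x00000008u) == 4);   // ffs is one more than ctz
    assert(find_first_set(0u) == 0);
    assert(find_last_set(0x00000008u) == 4);    // 32 - clz(8) = 32 - 28
}
```

So each missing operation is at most a subtraction and a zero check away from an existing one, which is part of why the asymmetry puzzles me.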
So, why is that? Why is an apparently obvious-to-have instruction missing from PTX, and a "fake builtin" that's almost identical to it present in the headers?