My first contact with vectorization intrinsics was SSE, where there is essentially only one integer data type, __m128i. Switching to Neon I found the data types and function prototypes to be much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors of 8 unsigned char each), uint32x4_t (a vector of 4 uint32_t), etc.
At first I was enthusiastic (it is much easier to find the exact function for the desired data type), but then I realized what a mess it becomes when you want to treat the same data in different ways. Using the specific cast intrinsics for every conversion would take me forever. The problem is also addressed here. So I came up with the idea of a union encapsulated in a struct, together with some conversion and assignment operators.
struct uint_128bit_t {
    union {
        uint8x16_t uint8x16;
        uint16x8_t uint16x8;
        uint32x4_t uint32x4;
        uint8x8x2_t uint8x8x2;
        uint8_t  uint8_array[16] __attribute__ ((aligned (16)));
        uint16_t uint16_array[8] __attribute__ ((aligned (16)));
        uint32_t uint32_array[4] __attribute__ ((aligned (16)));
    };
    // Implicit conversions so a variable can be passed directly to intrinsics.
    operator uint8x16_t& () { return uint8x16; }
    operator uint16x8_t& () { return uint16x8; }
    operator uint32x4_t& () { return uint32x4; }
    operator uint8x8x2_t& () { return uint8x8x2; }
    // Assignment operators to pick up intrinsic results.
    uint8x16_t& operator =(const uint8x16_t& in) { uint8x16 = in; return uint8x16; }
    uint8x8x2_t& operator =(const uint8x8x2_t& in) { uint8x8x2 = in; return uint8x8x2; }
};
This approach works for me: I can use a variable of type uint_128bit_t both as an argument to and as the output of different Neon intrinsics, e.g. vshlq_n_u32, vuzp_u8, and vget_low_u8 (in this last case only as input). And I can extend it with more data types if needed.
Note: the arrays are there so I can easily print the contents of a variable.
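To make the question concrete, here is a minimal usage sketch (vdupq_n_u8 and vget_high_u8 are only used to make the example self-contained; the final loop shows why I keep the arrays around):

#include <arm_neon.h>
#include <cstdio>

void demo() {
    uint_128bit_t v, d;
    v = vdupq_n_u8(1);                             // goes through the uint8x16_t assignment operator
    uint32x4_t shifted = vshlq_n_u32(v, 4);        // v converts to uint32x4_t& implicitly
    d = vuzp_u8(vget_low_u8(v), vget_high_u8(v));  // result is uint8x8x2_t, picked up by its operator=
    for (int i = 0; i < 16; ++i)                   // the arrays make printing the contents trivial
        printf("%u ", (unsigned)v.uint8_array[i]);
    printf("\n");
    (void)shifted; (void)d;                        // silence unused-variable warnings in this sketch
}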
Is this a correct way of proceeding?
Is there any hidden flaw?
Have I reinvented the wheel?
(Is the aligned attribute necessary?)