I have code that calls ismember(A,B) some 2^20 times on various gpuArrays A and B, where A is a non-sparse matrix of several million integer entries with sorted rows, and B is a non-sparse, sorted vector of a few thousand distinct integer entries. If it helps, A(:) can be obtained in sorted form via linear indexing.
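For concreteness, the call pattern looks roughly like this (the sizes and value ranges below are made up for illustration, not my actual data):

    % Illustrative sizes only -- the real A and B come from elsewhere.
    A  = sort(randi(1e6, 2000, 1000), 2);   % ~2 million integer entries, sorted rows
    B  = unique(randi(1e6, 5000, 1));       % a few thousand distinct sorted integers
    Ag = gpuArray(A);
    Bg = gpuArray(B);
    lia = ismember(Ag, Bg);                 % this call is repeated ~2^20 times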
For sorted (integer) non-GPU arrays, the fastest option is builtin('_ismemberhelper',a,b), with ismembc somewhat slower. Both are much faster than ismember on the CPU (since they skip all the input checks), but neither can operate on gpuArrays, and both are still slower than ismember on gpuArrays. That is, in terms of speed:
ismember on GPU > builtin('_ismemberhelper',a,b) > ismembc() > ismember on CPU
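That ordering comes from rough timings along these lines (single-shot tic/toc shown only as a sketch, using the illustrative arrays above; wait(gpuDevice) is there to make the GPU timing fair):

    As = sort(A(:));   % in my real data A(:) is already sorted
    tic; lia1 = ismember(A, B);                       t_cpu    = toc;
    tic; lia2 = ismembc(As, B);                       t_mbc    = toc;
    tic; lia3 = builtin('_ismemberhelper', As, B);    t_helper = toc;
    tic; lia4 = ismember(Ag, Bg); wait(gpuDevice);    t_gpu    = toc;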
Now, I have looked in the main ismember.m file to see what code it uses, but all I have been able to find that seems relevant is this:
    else % (a,b are some other class like gpuArray, sym object)
        lia = false(size(a));
        if nargout <= 1
            for i=1:numelA
                lia(i) = any(a(i)==b(:));   % ANY returns logical.
            end
        else
            for i=1:numelA
                found = a(i)==b(:); % FIND returns indices for LOCB.
                if any(found)
                    lia(i) = true;
                    found = find(found);
                    locb(i) = found(1);
                end
            end
        end
    end
(Other seemingly relevant parts of the code use functions like unique and sortrows, which do not support gpuArrays.) Not only does this not look like GPU-accelerated code, but, as expected, it also does not come close to the performance of ismember on gpuArrays. Thus:
(Question 1) Is the routine for the GPU-accelerated version of ismember openly accessible (like ismember.m is)?
(Question 2) More importantly, is there a function/algorithm that would be faster than the GPU-accelerated ismember for my specific case (sorted, integer-valued arrays of the aforementioned sizes)?
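To illustrate the kind of alternative I have in mind, the closest I have come up with myself is a dense lookup-table sketch like the one below (purely illustrative and unverified; it only helps if the span of values in B is small enough for the table to fit in the GTX 460's 1 GB, and I do not know whether it actually beats ismember on gpuArrays):

    % Sketch: membership test via a dense logical lookup table over the value
    % range of B (Ag, Bg are the gpuArrays from the sketch above).
    lo  = gather(min(Bg));
    hi  = gather(max(Bg));
    lut = gpuArray.false(hi - lo + 1, 1);
    lut(Bg - lo + 1) = true;                    % mark the members of B
    inRange = (Ag >= lo) & (Ag <= hi);
    lia = gpuArray.false(size(Ag));
    lia(inRange) = lut(Ag(inRange) - lo + 1);   % one table lookup per in-range entry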
I am currently using MATLAB R2014b and a GTX 460 with 1 GB of VRAM.
