I have two 2D tensors, A and B. I would like to write a function find_indices(A, B) which returns a 1D tensor containing the indices of rows in A that also appear in B. The function should also avoid Python for loops so it can run in parallel on the GPU. For example:
import torch
A = torch.tensor([[1, 2, 3], [2, 3, 4], [3, 4, 5]]).cuda()
B = torch.tensor([[1, 2, 3], [2, 3, 6], [2, 5, 6], [3, 4, 5]]).cuda()
indices1 = find_indices(A, B)  # tensor([0, 2])
indices2 = find_indices(B, A)  # tensor([0, 3])
assert A[indices1].equal(B[indices2])
Assume that:
- All rows within A and within B are unique.
- Rows in A and B are both sorted, so the rows common to both appear in the same order in A and in B.
- len(A) and len(B) are ~200k.
I have tried this method from https://stackoverflow.com/a/60494505/17495278:
# Broadcasts to a (len(B), 3, len(A)) boolean tensor, i.e. O(len(A) * len(B)) memory
values, indices = torch.topk(((A.t() == B.unsqueeze(-1)).all(dim=1)).int(), 1, 1)
indices = indices[values != 0]
# indices = tensor([0, 2])
It gives the correct answer for small inputs, but for my use case the broadcasted comparison materializes a len(B) x 3 x len(A) tensor, which takes >100 GB of memory and raises a CUDA out-of-memory error. Is there another way to achieve this with a reasonable memory cost (say, under 1 GB)?
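For reference, one direction I have been considering (a sketch relying only on the uniqueness assumption above; I have not profiled it at the 200k scale): since rows within each tensor are unique, a row common to A and B occurs exactly twice in their concatenation, which torch.unique(dim=0, return_inverse=True, return_counts=True) can detect with memory proportional to len(A) + len(B) rather than their product.

```python
import torch

def find_indices(A, B):
    # Rows within A and within B are unique, so a row present in both
    # tensors occurs exactly twice in the concatenation.
    combined = torch.cat([A, B], dim=0)
    _, inverse, counts = torch.unique(
        combined, dim=0, return_inverse=True, return_counts=True)
    # Keep the rows of A whose duplicate group has size 2 (i.e. also in B).
    mask = counts[inverse[:len(A)]] == 2
    return mask.nonzero(as_tuple=True)[0]

A = torch.tensor([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
B = torch.tensor([[1, 2, 3], [2, 3, 6], [2, 5, 6], [3, 4, 5]])
print(find_indices(A, B))  # tensor([0, 2])
print(find_indices(B, A))  # tensor([0, 3])
```

Because torch.unique sorts the unique rows in the same lexicographic order regardless of argument order, find_indices(A, B) and find_indices(B, A) pick out the common rows in matching order, so the A[indices1].equal(B[indices2]) invariant holds. I am unsure whether the internal sort in torch.unique is fast enough on CUDA at this scale, though.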