I have four 1D np.arrays: x1, y1, x2, y2, where x1 and y1 have the same length, and x2 and y2 have the same length, since they are the corresponding x and y values of a dataset. len(x1) and len(x2) are always different; let's assume len(x1) > len(x2) for now. The two x arrays always have common values, but only approximately: the values are not exactly equal, only equal within a tolerance (because of numerical errors, etc.). Example with tolerance = 0.01:
x1 = np.array([0, 1.01, 1.09, 1.53, -9.001, 1.2, -52, 1.011])
x2 = np.array([1, 1.1, 1.2, 1.5, -9, 82])
I want to keep only the common values (in the tolerance sense), using the shorter array as the reference, which is x2 in this case. The first value in x2 is 1, and it has a corresponding value in x1, which is 1.01. Likewise, 1.1 corresponds to 1.09, and 1.2 has a corresponding value in x1, which is 1.2. The value 1.5 has no corresponding value, because 1.53 is out of tolerance, so it is filtered out, and so on.
The full result should be:
x1 = np.array([1.01, 1.09, -9.001, 1.2])
x2 = np.array([1, 1.1, -9, 1.2])
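A side note on the tolerance itself, as a minimal sketch: in double precision, 1.01 - 1 evaluates to slightly more than 0.01, so a plain `<= 0.01` test can reject the pair (1, 1.01) from the example above. The larger tolerance of 0.02 used below is only an illustrative choice that avoids this borderline case.

```python
# Borderline case: the difference is not exactly 0.01 in floating point.
diff = abs(1.01 - 1)

within_strict = diff <= 0.01   # False: diff is about 0.010000000000000009
within_relaxed = diff <= 0.02  # True

print(within_strict, within_relaxed)
```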
To take this one step further, after filtering the x values this way I want to filter the y values at the same indices in both datasets. In other words, I want to find the longest common subsequence of the two datasets. Note that ordering matters here because each x value is tied to its y value (although it is fine to argsort x first and reindex both x and y with that).
What I have tried based on this answer:
def longest_common_subseq(x1, x2, y1, y2, tol=0.02):
    # sort them first to keep x and y connected
    idx1 = np.argsort(x1)
    x1, y1 = x1[idx1], y1[idx1]
    idx2 = np.argsort(x2)
    x2, y2 = x2[idx2], y2[idx2]
    # here I assumed that len(x2) < len(x1)
    idx = (np.abs(x1[:,None] - x2) <= tol).any(axis=1)
    return x1[idx], x2[idx], y1[idx], y2[idx]
The y values can be arbitrary in this case; only their shapes must match those of x1 and x2. For example:
y1 = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y2 = np.array([-1, 0, 3, 7, 11, -2])
Trying to run the function above raises
IndexError: boolean index did not match indexed array along dimension 0.
I understand why: the boolean index array has the length of x1, so it cannot be used to index x2 and y2, which are shorter. So far I haven't been able to work around this. Is there a nice way to achieve this?
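To make the shape problem concrete, here is a small sketch using the example arrays and tol = 0.02. The pairwise comparison is a 2D array, and reducing it along different axes gives one mask per input array. The names `close`, `mask1` and `mask2` are only illustrative.

```python
import numpy as np

x1 = np.array([0, 1.01, 1.09, 1.53, -9.001, 1.2, -52, 1.011])
x2 = np.array([1, 1.1, 1.2, 1.5, -9, 82])

# Pairwise comparison: rows follow x1, columns follow x2.
close = np.abs(x1[:, None] - x2) <= 0.02

mask1 = close.any(axis=1)  # one flag per element of x1
mask2 = close.any(axis=0)  # one flag per element of x2

print(close.shape, mask1.shape, mask2.shape)
print(mask1.sum(), mask2.sum())
```

Here mask1 can only index x1 and y1, and mask2 can only index x2 and y2. Even then the two masks need not select the same number of elements, because both 1.01 and 1.011 lie within the tolerance of 1.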
EDIT:
If multiple values are inside the tolerance, the closest should be selected.
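One possible way to express the "closest wins" rule is sketched below; it is not necessarily the best approach. The helper name `match_within_tol` is hypothetical. It assumes the values in x2 are spaced further apart than twice the tolerance, so that no two x2 values can claim the same x1 value, and it builds a full len(x1) by len(x2) distance matrix, which is only reasonable for moderately sized arrays.

```python
import numpy as np

def match_within_tol(x1, y1, x2, y2, tol=0.02):
    """Hypothetical helper: pair each x2 value with its closest x1 value."""
    x1, y1, x2, y2 = map(np.asarray, (x1, y1, x2, y2))
    dist = np.abs(x1[:, None] - x2)   # shape (len(x1), len(x2))
    nearest = dist.argmin(axis=0)     # closest x1 index for each x2 value
    # Keep only the pairs whose distance is within the tolerance.
    keep = dist[nearest, np.arange(x2.size)] <= tol
    i1 = nearest[keep]                # indices into x1 / y1
    i2 = np.flatnonzero(keep)         # indices into x2 / y2
    # Assumption: report the pairs in the original order of x1.
    order = np.argsort(i1)
    i1, i2 = i1[order], i2[order]
    return x1[i1], y1[i1], x2[i2], y2[i2]

x1 = np.array([0, 1.01, 1.09, 1.53, -9.001, 1.2, -52, 1.011])
x2 = np.array([1, 1.1, 1.2, 1.5, -9, 82])
y1 = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y2 = np.array([-1, 0, 3, 7, 11, -2])

fx1, fy1, fx2, fy2 = match_within_tol(x1, y1, x2, y2, tol=0.02)
print(fx1, fx2)
print(fy1, fy2)
```

With the example data this keeps 1.01 rather than 1.011 as the partner of 1, since it is the closer of the two.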