I am developing a python application that at its core performs the following task:
Internally, have a list of items, say, A. Go through another list of items, say B, and for every item b in list B, find the corresponding item a in A. There should always be such an item - if there is no match, the program shall fail.
Naturally, it seems that this would be a perfect application example for a binary search in a hash-ordered data structure.
The problem is that the items in the lists are complicated objects. For now, assume that each entry a or b is in itself a list of approximately 10 vectors (sometimes more, sometimes less) with such shape:
vector = [ id, status, px, py, pz ]
where id and status are integer values, and px, py and pz are floating point values, such that an item might look like this:
aExample = [ [ 2, -1, 0.5, 0.7, 0.9 ],
[ -1, -1, -0.4, -0.6, -0.8 ],
[ 25, 2, 1.1, 1.3, -1.7 ],
[ 24, 2, 1.2, 1.1, 1.6 ],
[ -24, 2, 0.9, 0.8, 2.1 ],
[ 11, 1, 1.2, 1.3, 2.6 ],
[ -11, 1, 1.4, 1.2, 2.4 ],
[ 13, 1, 1.8, 1.6, 2.1 ],
[ -11, 1, 3.2, 0.1, 3.6 ] ]
The list A stores a couple of hundred thousand of such entries, the list B stores a couple of ten thousand.
To add more complication,
- A match requires the number of vectors to be the same, but not their ordering
- A match requires all
idandstatusvalues to match - A match does not require perfect agreement of the floating point values
px,py,pz, but requires approximate agreement within some permille-level tolerance in order to be unique.
Now, as you might imagine, I'm having a hard time coming up with a data structure that will be efficient and maintainable.
The only feasible way at this point seems to me that I should try to first categorize my data in the values that require hard matches, and then performing a standard search for the "best soft match" within each category. But I'm curious: Are there any hashing algorithms that support this sort of hybrid approach of requiring some parts of the data to match exactly and others only approximately in order to produce a similar hash? Or do you have any other suggestion how to tackle this problem efficently?