Your desired output uses different values and column names compared to your sample dataframe constructor. I used your desired output dataframe for testing.
Logic:
For each sublist of links, we need to find the row index (the index of the dataframe, NOT the column named index) of the first earlier sublist it overlaps with. We then use these row indices to slice counts95 with .loc and get the corresponding values of column index. To achieve this we need several steps:
- Compare each sublist to all sublists in links. A list comprehension is fast and efficient for this task. We code a list comprehension that creates a boolean 2D-mask array, where each subarray contains True for overlapped rows and False for non-overlapped rows (look at the step-by-step on this 2D-mask and check it against column links, and it will be clearer).
- We want to compare only from the top down to the current sublist, i.e. standing at the current row, we only compare backward toward the top. Therefore, we need to set every forward comparison to False. This is what np.tril does.
- Inside each subarray of this 2D-mask, the position/index of True is the row index of a row that the current sublist overlaps. We need to find these positions of True, which is what np.argmax does. np.argmax returns the position/index of the first max element of the array. True is treated as 1 and False as 0. Therefore, on any subarray containing True, it correctly returns the first overlapped row index. However, on an all-False subarray, it returns 0. We handle all-False subarrays later with where.
- After np.argmax, the 2D-mask is reduced to a 1D array. Each element of this 1D array is the row index of the overlapped sublist. We pass it to .loc to get the corresponding values of column index. However, the result also wrongly includes rows whose subarray of the 2D-mask is all False. We want these rows turned into NaN, which is what .where does.
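The np.tril and np.argmax steps described above can be seen in isolation on a small mask. This is only an illustrative sketch: the 4x4 array full and the names first_hit and has_hit are made up for the example and are not part of the question's data.

```python
import numpy as np

# Hypothetical 4x4 overlap mask: full[i, j] is True when row i and row j share an element.
full = np.array([
    [True,  True,  False, False],
    [True,  True,  False, True],
    [False, False, True,  False],
    [False, True,  False, True],
])

# k=-1 keeps only entries strictly below the diagonal (j < i),
# so self-comparisons and forward comparisons become False.
m = np.tril(full, -1)

first_hit = np.argmax(m, axis=1)  # position of the first True per row, 0 if there is none
has_hit = m.any(axis=1)           # False for rows with no earlier overlap

print(first_hit)  # [0 0 0 1]
print(has_hit)    # [False  True False  True]
```

Rows 0 and 2 get first_hit 0 even though they have no earlier overlap, which is why has_hit is needed to mask them out afterwards.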
Method 1:
Use a list comprehension to construct the boolean 2D-mask m between each list of links and all lists in links. We only need backward comparisons, so use np.tril to set the diagonal and the upper-right triangle of the mask, which represent self and forward comparisons, to False. Finally, call np.argmax to get the position of the first True in each row of m, and chain where to turn the all-False rows of m into NaN.
import numpy as np

c95_list = counts95.links.tolist()
# m[i][j] is True when row i shares an element with an earlier row j (j < i)
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(axis=1)).to_numpy())
Out[351]:
     index  level0            links  linkoflist
0   616351      25  [1, 2, 3, 4, 5]         NaN
1   616352      30      [23, 45, 2]    616351.0
2   616353      35      [1, 19, 67]    616351.0
3  6457754     100     [14, 15, 16]         NaN
4  6566666     200          [1, 14]    616351.0
5  6457754     556          [14, 1]    616351.0
Method 2:
If your dataframe is big, comparing each sublist only to the rows above it in links is faster. It is probably about 2x faster than method 1 on a big dataframe.
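c95_list = counts95.links.tolist()
# Row i of m only covers the sublists above row i, so no np.tril is needed
m = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i, l1 in enumerate(c95_list)]
# Reindexing on NaN yields a NaN row for sublists with no earlier overlap
counts95['linkoflist'] = counts95.reindex([np.argmax(y) if any(y) else np.nan
                                           for y in m])['index'].to_numpy()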
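As a sanity check, both methods can be run side by side on the sample data. The sketch below rebuilds counts95 from the output shown for Method 1; the names m1, m2, res1, res2 and same are introduced only for this comparison.

```python
import numpy as np
import pandas as pd

# Sample data copied from the desired output used for testing
counts95 = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 6457754],
    'level0': [25, 30, 35, 100, 200, 556],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
})
c95_list = counts95['links'].tolist()

# Method 1: full mask, then keep only backward comparisons
m1 = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
res1 = counts95.loc[np.argmax(m1, axis=1), 'index'].where(m1.any(axis=1)).to_numpy()

# Method 2: compare each sublist only with the rows above it
m2 = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i, l1 in enumerate(c95_list)]
res2 = counts95.reindex([np.argmax(y) if any(y) else np.nan for y in m2])['index'].to_numpy()

# Both results hold NaN where no earlier row overlaps, so compare with equal_nan=True
same = np.allclose(res1, res2, equal_nan=True)
print(same)
```

To compare speed on your own data, you could wrap each method in a function and time it with timeit; the 2x figure above is an estimate and will depend on the size and content of links.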
Step by step (Method 1):
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)
Out[353]:
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [False, False, False, False, False, False],
       [ True, False,  True,  True, False, False],
       [ True, False,  True,  True,  True, False]])
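argmax returns the position of the first True in each row, but it also returns 0 for an all-False row. Here every overlapping row first overlaps row 0, so every result is 0.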
In [354]: np.argmax(m, axis=1)
Out[354]: array([0, 0, 0, 0, 0, 0], dtype=int64)
Slicing using the result of argmax
counts95.loc[np.argmax(m, axis=1), 'index']
Out[355]:
0 616351
0 616351
0 616351
0 616351
0 616351
0 616351
Name: index, dtype: int64
Chain where to turn the rows that are all False in m into NaN
counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(1))
Out[356]:
0 NaN
0 616351.0
0 616351.0
0 NaN
0 616351.0
0 616351.0
Name: index, dtype: float64
Finally, the index of the output is different from the index of counts95, so just call to_numpy to get an ndarray and assign it to column linkoflist of counts95.
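One caveat: passing the argmax positions to .loc only works when the dataframe has the default 0..n-1 index, because .loc looks up labels, not positions. If your index is different, a purely positional variant avoids the problem. The following is a sketch of that variation, not the method shown above: the dataframe df with labels 'a' to 'f' is hypothetical, and np.where replaces the .loc/.where chain.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data with non-default row labels
df = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 6457754],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
}, index=list('abcdef'))

lists = df['links'].tolist()
m = np.tril([[any(x in l2 for x in l1) for l2 in lists] for l1 in lists], -1)

pos = np.argmax(m, axis=1)     # positions, not labels
vals = df['index'].to_numpy()  # plain array, indexed by position
# Take the value at each position, or NaN where the row has no earlier overlap
df['linkoflist'] = np.where(m.any(axis=1), vals[pos], np.nan)
print(df['linkoflist'].tolist())
```

Because everything stays in NumPy arrays until the final assignment, there is no intermediate Series whose index could disagree with df.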