Suppose following list of lists with strings (not necessarily letters):
[['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
Each item in the list represents categorical data with underlying order like letters from the alphabet. Each string has a precursor and a successor (except for the first and last one) As you can see, some of the items ['a', 'b', 'd', 'f'] are nearly complete. This particular item does not contain the letter e for example. The item ['b', 'd', 'e', 'f'] does contain the letter e (presumably in correct order) but not the letter a. Together the items do contain the information about the underlying sequence of strings but none of the items alone can provide this information. I should mention that the letters are just an exammple. Otherwise, sorting would be easy.
I would like to obtain the unique sorted items based on alignment (alignment in the sense of sequence alignment of those lists. Like so:
['a', 'b', 'd', 'e', 'f']
I am sure this is a common problem which has been solved before but I have a hard time finding similar cases. This SO thread deals with a similar issue but the ground truth is known. Here I would like to find the underlying order of strings.
Unfortunately, the longest sequence is not guaranteed to start with e.g. 'a'
I had a look at difflib but I am not sure if this is the right toolbox. Any hints are appreciated.
EDIT:
I found a solution based on NetworkX
import networkx as nx
l = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
# get tuples (start, stop)
new_l = []
for item in l:
for index, subitem in enumerate(item):
if len(item) > 1:
if index < len(item)-1:
new_l.append((subitem, item[index+1]))
# create a graph using those tuples
uo_graph = nx.DiGraph()
for item in new_l:
uo_graph.add_edge(*item)
[item for item in nx.topological_sort(uo_graph)]
Out[10]: ['a', 'b', 'd', 'e', 'f']
Would be interesting if there are more pythonic solutions to this kind of problem. Especially, it would be interesting to know how to apply a check if there are multiple solutions.