Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
- if
targetdoesn't exist in the list then we ignore that list completely - if the list length is less than
Mthen we ignore the list completely - if the list is exactly length
Mbuttargetis not in theMthposition then we ignore it (but we count it iftargetis in theMthposition) - if the list length,
L, is longer thanMandtargetis in thei=Mposition(ori=M+1position, ori=M+2position, ...,i=Lposition) then we count the subsequence of lengthMwheretarget` is in the final position in the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
This is super simplified but, in reality, my actual list-of-lists
- consists of a few billion lists of positive integers (between 1 and 10,000)
- each list can be as short as 1 element or as long as 500 elements
NandMcan be as small as 1 or as big as 100
My questions are:
- Is there an efficient data structure that would allow for fast queries assuming that
NandMwill always be less than 100? - Are there efficient algorithms or relevant area of research for performing this kind of analysis for various combinations of
NandM?