Consider the following dataset:
set.seed(50)
d = matrix(rbinom(1000, 1, 0.9), ncol = 20)
Each row corresponds to an object, and each column corresponds to a measurement of the object. For example, rows could be individuals in a study and columns could be repeated measures through time. In this case, measurements are TRUE/FALSE indicating presence or absence of an object.
I am looking for an algorithm that will allow me to identify the
maximal collection of rows that have n coincident observations.
In other words, I'm looking for a way to filter for the rows that all
have n TRUE values in the same columns. A member of the group can
have more than n TRUE values though.
The trivial example: the rows with 20 (all) TRUE values are captured by
which(apply(d, 1, all))
which identifies rows 3, 10, 12, 24, 36, 39, 48, 50. Similarly, it's
easy to identify all the unique sequences and identify groups that share
the same observations:
unique.series = d[which(!duplicated(d)),]
groups = vector("list", nrow(unique.series))
for(i in seq_along(groups))
groups[[i]] = which(apply(d, 1, function(x)
identical(x, unique.series[i,])))
But what if I want all groups with 19 or more observations? For example,
groups 3 (rows 3, 10, 12, 24, 36, 39, 48, 50) and 21
(rows 23, 32, 40) only differ by observation 9
(group 3 has an observation, but group 21 does not). How can I
programmatically identify series that partially match, i.e. contain some
subset of matching observations? This seems like a subsequence matching
problem, but it's a bit more abstract because the subsequences don't need to be continuous.
One way might be to use regular expressions, but I can't get it to work right:
unique.strings = lapply(
apply(unique.series, 1, function(x)
which(as.logical(x))),
paste,
collapse = ","
)
reg.strings = paste0("^", lapply(
apply(unique.series, 1, function(x)
sprintf("(%d)", which(as.logical(x)))),
paste, collapse = "+(,[0-9],)*"), "$")
lapply(unique.strings, grep, x = unique.strings) # NOT CORRECT
I would appreciate any alternative algorithms, regex-based or other.