This is what I could provide so far:
fun1 <- function (S) {
n <- ncol(S)
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
k <- 1L
for (j in 1:(n - 1)) {
x <- scan(text = S[, j], what = integer(), sep = "/", quiet = TRUE)
for (i in (j + 1):n) {
y <- scan(text = S[, i], what = integer(), sep = "/", quiet = TRUE)
count <- tabulate(.colSums(x == y, 2L, length(x) / 2L) + 1L)
z[k, ] <- count / sum(count)
k <- k + 1L
}
}
z
}
It looks bad as it has a double loop nest written in R, but the innermost kernel is extremely efficient by using scan, .colSums and tabulate. The total number of iterations is choose(ncol(S), 2), not too many for your 150-column matrix. I can replace fun1 by an Rcpp version if you want.
## your data
S <- structure(c("0/0", "1/1", "1/2", "0/0", "0/0", "0/1", "1/1",
"0/3", "0/0", "0/0", "0/0", "0/1", "1/1", "2/2", "0/0", "1/1",
"0/1", "2/2", "0/0", "0/0"), .Dim = c(5L, 4L), .Dimnames = list(
NULL, c("A", "B", "C", "D")))
fun1(S)
# 0 1 2
#A&B 0.2 0.2 0.6
#A&C 0.2 0.4 0.4
#A&D 0.2 0.4 0.4
#B&C 0.4 0.4 0.2
#B&D 0.2 0.4 0.4
#C&D 0.6 0.0 0.4
Performance
Ha, when I actually test my function on a 15000 x 150 matrix I found that:
- I could move
scan out of the loop nest for speedup, that is, I could scan the character matrix into an integer matrix in one go;
scan(text = blabla) takes forever, while scan(file = blabla) is fast, so it could be worth reading data from a text file;
- working with a text file is sensitive to the format of the file so writing a robust code is tricky.
I produced a version fun2 with file access, and a version fun3 using Rcpp for the loop nest. It turns out that:
- reading from file is indeed better (but we have to provide the file in "/" separated format);
- Rcpp implemenation of the loop is rewarding.
I came back and posted them here (see revision 2), and I saw user20650's starting with strsplit. I excluded strsplit from my option when I started, because I think operation with string can be slow. Yes, it is slow, but still faster than scan. So I wrote a fun4 using strsplit and a corresponding fun5 with Rcpp (see revision 3). Profiling says that 60% of the execution time is spent in strsplit so it is indeed a performance killer. Then I replaced strsplit, unlist, as.integer and matrix with a single, simpler C++ implementation. It yields a 10x boost!! Well, this is reasonable if you think about it carefully. By using atoi (or strtol) from C library <stdlib.h>, we can directly translate strings into integers, so all string operations are eliminated!
Long story short, I only provide the final, fastest version.
library(Rcpp)
cppFunction("IntegerMatrix getInt (CharacterMatrix Char) {
int m = Char.nrow(), n = Char.ncol();
IntegerMatrix Int(2 * m, n);
char *s1, *s2;
int i, *iptr = &Int(0, 0);
for (i = 0; i < m * n; i++) {
s1 = (char *)Char[i]; s2 = s1;
while(*s2 != '/') s2++; *iptr++ = atoi(s1);
s2++; *iptr++ = atoi(s2);
}
return Int;
}")
cppFunction('NumericMatrix pairwise(NumericMatrix z, IntegerMatrix Int) {
int m = Int.nrow() / 2, n = Int.ncol();
int i, j, k, *x, *y, count[3], *end; bool b1 = 0, b2 = 0;
double M = 1 / (double)m;
for (k = 0, j = 0; j < (n - 1); j++) {
end = &Int(2 * m, j);
for (i = j + 1; i < n; i++, k++) {
x = &Int(0, j); y = &Int(0, i);
count[0] = 0; count[1] = 0; count[2] = 0;
for (; x < end; x += 2, y += 2) {
b1 = (x[0] == y[0]);
b2 = (x[1] == y[1]);
count[(int)b1 + (int)b2]++;
}
z(k, 0) = (double)count[0] * M;
z(k, 1) = (double)count[1] * M;
z(k, 2) = (double)count[2] * M;
}
}
return z;
}')
fun7 <- function (S) {
## separate rows using Rcpp; `Int` is an integer matrix
n <- ncol(S)
Int <- getInt(S)
m <- nrow(Int) / 2
## initialize the resulting matrix `z`
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
## use Rcpp for pairwise summary
pairwise(z, Int)
}
Let's generate a random 15000 x 150 matrix and try it.
sim <- function (m, n) {
matrix(sample(c("0/0", "0/1", "1/0", "1/1"), m * n, TRUE), m, n,
dimnames = list(NULL, 1:n))
}
S <- sim(15000, 150)
system.time(oo <- fun7(S))
# user system elapsed
# 1.324 0.000 1.325
Oh this is lightening fast!
Is it possible to exclude the values "0/0" between two pairs to get the proportions? i.e. when A and B are compared exclude when A=B= 0/0 and get the proportions for the rest?
This kind of adaptation is straightforward at C / C++ level. Just an addition if test.
## a new C++ function `pairwise_exclude00`
cppFunction('NumericMatrix pairwise_exclude00(NumericMatrix z, IntegerMatrix Int) {
int m = Int.nrow() / 2, n = Int.ncol();
int i, j, k, *x, *y, count[3], size, *end;
bool b1 = 0, b2 = 0, exclude = 0;
double M;
for (k = 0, j = 0; j < (n - 1); j++) {
end = &Int(2 * m, j);
for (i = j + 1; i < n; i++, k++) {
x = &Int(0, j); y = &Int(0, i);
count[0] = 0; count[1] = 0; count[2] = 0; size = 0;
for (; x < end; x += 2, y += 2) {
b1 = (x[0] == y[0]);
b2 = (x[1] == y[1]);
exclude = (x[0] == 0) & (x[1] == 0) & b1 & b2;
if (!exclude) {
count[(int)b1 + (int)b2]++;
size++;
}
}
M = 1 / (double)size;
z(k, 0) = (double)count[0] * M;
z(k, 1) = (double)count[1] * M;
z(k, 2) = (double)count[2] * M;
}
}
return z;
}')
## re-define `fun7` with a new logical argument `exclude00`
fun7 <- function (S, exclude00) {
## separate rows using Rcpp; `Int` is an integer matrix
n <- ncol(S)
Int <- getInt(S)
m <- nrow(Int) / 2
## initialize the resulting matrix `z`
ref2 <- combn(colnames(S), 2)
ref1 <- paste(ref2[1, ], ref2[2, ], sep = "&")
z <- matrix(0, choose(n, 2), 3L, dimnames = list(ref1, 0:2))
## use Rcpp for pairwise summary
if (exclude00) pairwise_exclude00(z, Int)
else pairwise(z, Int)
}
Using the example S in your question:
fun7(S, TRUE)
# 0 1 2
#A&B 0.3333333 0.3333333 0.3333333
#A&C 0.3333333 0.6666667 0.0000000
#A&D 0.3333333 0.6666667 0.0000000
#B&C 0.5000000 0.5000000 0.0000000
#B&D 0.3333333 0.6666667 0.0000000
#C&D 0.7500000 0.0000000 0.2500000