After working through this and other replies, the optimization strategies (and approximate speed-up) here seem to be
- (30x) Choose an appropriate data representation -- matrix, rather than data.frame
- (1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
- Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
- (2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix
for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.
I put your code into a function
f0 <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y-rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}
and here we have
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
##   user  system elapsed 
##  1.065   0.000   1.066 
Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.
m <- unname(as.matrix(x))  # same values as x, so the two results can be compared
set.seed(123)
system.time(res0.m <- f0(m))
##   user  system elapsed 
##  0.036   0.000   0.037 
identical(res0, res0.m)
##[1] TRUE
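As a rough illustration of where the time goes, here is a minimal sketch that repeats the same column replacement on both representations. The names m_demo, df_demo and shuffled are made up for this example, and the timings will vary by machine and R version. In f0 there is a further cost not shown here: rowMeans has to convert a data.frame to a matrix on every call.

```r
# Sketch: the same column replacement on a data.frame and on a matrix.
set.seed(1)
m_demo   <- matrix(runif(50 * 100), nrow = 50, ncol = 100)  # hypothetical demo data
df_demo  <- data.frame(m_demo)
shuffled <- sample(m_demo[, 1])

# Copy the object and replace column 1, many times, in each representation.
t_df <- system.time(for (k in 1:1000) { tmp <- df_demo; tmp[, 1] <- shuffled })
t_m  <- system.time(for (k in 1:1000) { tmp <- m_demo;  tmp[, 1] <- shuffled })

# Both representations hold the same values afterwards.
tmp_df <- df_demo; tmp_df[, 1] <- shuffled
tmp_m  <- m_demo;  tmp_m[, 1]  <- shuffled
stopifnot(isTRUE(all.equal(unname(as.matrix(tmp_df)), tmp_m)))

# Elapsed seconds for each representation.
print(c(data.frame = t_df[["elapsed"]], matrix = t_m[["elapsed"]]))
```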
That's probably the biggest speed-up. But for the specific operation here we don't need to recalculate the row means of the updated matrix: shuffling column i changes each row mean by exactly the change in that column divided by ncol(x), so the difference of the two columns is enough
f1 <- function(x) {
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}
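The identity that f1 relies on can be checked numerically. This is a small sketch on made-up data (m_demo, i and shuffled are illustrative names, not part of the original code); it compares the "rebuild and recompute rowMeans" route with the "difference of one column" route.

```r
set.seed(42)
m_demo   <- matrix(runif(5 * 4), nrow = 5, ncol = 4)  # small hypothetical example
i        <- 2
shuffled <- sample(m_demo[, i])

# Long way: rebuild the matrix and recompute every row mean.
after      <- m_demo
after[, i] <- shuffled
long_way   <- abs(rowMeans(m_demo) - rowMeans(after))

# Short way: only column i changed, so only its difference matters.
short_way <- abs(shuffled - m_demo[, i]) / ncol(m_demo)

# Equal up to floating-point rounding.
stopifnot(isTRUE(all.equal(long_way, short_way)))
```

Because the two routes agree only up to rounding, all.equal is used rather than identical.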
The for loop doesn't follow the right pattern for filling up the result vector totaldiff: you want to "pre-allocate and fill", so totaldiff <- numeric(ncol(x)), rather than growing the vector one element at a time. We can instead use sapply and let R worry about that (this memory management is one of the advantages of using the apply family of functions)
f2 <- function(x) {
    totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, x)
    which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))
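For comparison, this is roughly what the "pre-allocate and fill" version of the loop could look like. It is a sketch only: f1_prealloc and m_demo are names invented for this example, and the check at the end confirms that, with the same seed, it picks the same column as an sapply version.

```r
f1_prealloc <- function(x) {
    totaldiff <- numeric(ncol(x))      # allocate the full result once
    for (i in seq_len(ncol(x))) {      # seq_len() is safe when ncol(x) is 0
        totaldiff[i] <- sum(abs(sample(x[, i]) - x[, i])) / ncol(x)
    }
    which.max(totaldiff)
}

set.seed(123)
m_demo <- matrix(runif(50 * 100), nrow = 50, ncol = 100)  # hypothetical demo data

# Same seed, same order of sample() calls, so the results should match.
set.seed(1); a <- f1_prealloc(m_demo)
set.seed(1); b <- which.max(sapply(seq_len(ncol(m_demo)), function(i)
    sum(abs(sample(m_demo[, i]) - m_demo[, i])) / ncol(m_demo)))
stopifnot(identical(a, b))
```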
The timings are
> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
  expr      min       lq   median       uq      max neval
 f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924   100
 f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042   100
 f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030   100
@flodel points out that vapply can be faster (and provides type safety)
f3 <- function(x) {
    totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, numeric(1), x)
    which.max(totaldiff)
}
and that 
f4 <- function(x)
    which.max(colSums(abs((apply(x, 2, sample) - x))))
is faster still (ncol(x) is a constant factor that does not change the result of which.max, so it is removed) -- the abs and sum are hoisted outside the loop and applied once to the whole matrix, maybe at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings
>     microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
    expr      min       lq   median       uq       max neval
   f0(m) 32.35600 32.88326 33.12274 33.25946  34.49003   100
   f1(m) 22.21964 23.41500 23.96087 24.06587  24.49663   100
 f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667   100
   f2(m) 20.76128 21.52786 22.66352 22.79101  69.49891   100
 f2.c(m) 21.16423 21.57205 22.94157 23.06497  23.35764   100
   f3(m) 20.17755 21.41369 21.99292 22.10814  22.36987   100
   f4(m) 10.10816 10.47535 10.56790 10.61938  10.83338   100
where the ".c" functions are compiled versions.
Compilation is particularly helpful in code written with for loops but doesn't do much for vectorized code; this shows up here as a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
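The definitions of the ".c" functions are not shown above. One plausible way to produce them, assumed here rather than taken from the original, is cmpfun from the compiler package that ships with R. The sketch repeats the loop from f1 so that it runs on its own; m_demo is a made-up test matrix. Note that in R 3.4.0 and later the byte-code JIT compiler is enabled by default, so an explicit cmpfun call may make less difference than in the timings above.

```r
library(compiler)  # ships with R

# Same loop as f1 above, repeated so this block is self-contained.
f1 <- function(x) {
    totaldiff <- numeric()
    for (i in 1:ncol(x)) {
        diff <- abs(sample(x[, i]) - x[, i]) / ncol(x)
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}

f1.c <- cmpfun(f1)  # byte-compiled copy; assumed to be how f1.c was made

m_demo <- matrix(runif(50 * 100), nrow = 50, ncol = 100)  # hypothetical demo data

# Compilation should change speed only, never the result.
set.seed(123); r  <- f1(m_demo)
set.seed(123); rc <- f1.c(m_demo)
stopifnot(identical(r, rc))
```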