I was originally running a PCA to reduce a large number of correlated measures (>10 behaviours) to fewer variables (I used the first and second principal components). However, this is not appropriate (a similar situation to this OP) because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). For this reason, it has been recommended that I use a PARAFAC model instead of PCA (available through the PTAk package in R); see Leibovici (2010) for details.
My data are stored as a data.frame in which each row is one sample of one individual; an individual can be sampled multiple times within a year and across its lifetime.
Sample of my data (data available here):
individual  beh1   beh2     beh3   beh4    year
11979       0      0.0333   0      0       2014
12026       0.176  0.0882   0.441  0.0882  2014
12435       0.405  0.189    0      0.243   2014
12524       0      0        1      0       2014
12625       0      0        0      0       2014
12678       0      0        0      0       2014
To use the PTAk package, the data needs to be converted into an array. The code to do this is:
my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))
where x is the number of rows, y is the number of columns, and z is the number of arrays.
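For reference, here is a minimal sketch of how array() distributes a vector over the three dimensions. The toy values 1:24 and the name a are illustrative only and are not taken from my data. R fills the first dimension fastest (column-major order), so the slices are simply consecutive blocks of the input vector, whatever the original columns meant.

```r
# Toy example (hypothetical values): 24 numbers into a 2 x 3 x 4 array
a <- array(1:24, dim = c(2, 3, 4))

dim(a)      # rows, columns, slices
a[, , 1]    # first slice: the first 6 values, filled column by column
a[1, 2, 1]  # row 1, column 2 of slice 1 is the third value of the vector
a[2, 3, 4]  # last cell of the last slice is the last value of the vector
```

Because of this fill order, reshaping as.matrix() of a long-format data.frame places values by position only, not by the contents of a year or identifier column.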
My general question:
Which components of my data.frame should correspond to which measures in the array?
My initial guess would be that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.
Like this:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))
where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).
This generates 9 arrays with each individual’s record as the rows, and each variable as a column (identifier, 4 behaviours, and the year of sampling).  In theory each array would correspond to a certain year of sampling, but that is currently not the case.
My question in detail:
If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array1, only 2009 in array2, etc.)?
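One possible way to get one year per slice is to create an empty individual x behaviour x year array and fill it by index, rather than reshaping the whole data.frame at once. The sketch below uses only base R and a hypothetical toy data set (toy, beh_cols, and arr are illustrative names). It also assumes that repeated samples of an individual within a year can be averaged, which may or may not suit the analysis. Individual-year combinations that were never sampled are left as NA; whether the decomposition accepts missing cells is a separate question that this sketch does not address.

```r
# Hypothetical toy data in the same layout as the sample above
toy <- data.frame(
  individual = c(1, 1, 2, 2, 3, 1),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.3),
  beh2       = c(0.0, 0.1, 0.2, 0.3, 0.4, 0.2),
  year       = c(2008, 2009, 2008, 2009, 2008, 2008)
)
beh_cols <- c("beh1", "beh2")

# Assumption: average repeated samples of an individual within a year
agg <- aggregate(toy[beh_cols],
                 by = list(individual = toy$individual, year = toy$year),
                 FUN = mean)

ids   <- sort(unique(agg$individual))
years <- sort(unique(agg$year))

# Empty array: individuals x behaviours x years
arr <- array(NA_real_,
             dim = c(length(ids), length(beh_cols), length(years)),
             dimnames = list(individual = as.character(ids),
                             behaviour  = beh_cols,
                             year       = as.character(years)))

# Fill each behaviour using (row, column, slice) index triples
for (b in seq_along(beh_cols)) {
  idx <- cbind(match(agg$individual, ids), b, match(agg$year, years))
  arr[idx] <- agg[[beh_cols[b]]]
}

arr[, , "2008"]  # only values sampled in 2008
```

With this layout the identifier and the year are no longer data columns; they become the dimnames of the first and third dimensions.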
Alternatively, if my formatting is wrong, what is the correct array format for my data and question?
For example, should I group the data into arrays according to the behaviour (beh1, beh2, etc.), so the code looks like:
my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))
where there would be three columns per array corresponding to the identifier, value for the behaviour, and year of observation? If this is the proper formatting, how would I ensure that the arrays are divided based on the behaviours rather than the identifier and/or year columns?
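As a point of comparison, a three-way array indexed by individual, behaviour, and year can be rearranged so that behaviours form the third dimension using base R's aperm(). The sketch below uses made-up values and dimension names (arr and by_beh are illustrative). It only shows that the two layouts hold the same cells in a different order; it does not say which ordering the PARAFAC routine expects.

```r
# Hypothetical 3 individuals x 2 behaviours x 4 years array
arr <- array(seq_len(3 * 2 * 4), dim = c(3, 2, 4),
             dimnames = list(individual = paste0("id", 1:3),
                             behaviour  = c("beh1", "beh2"),
                             year       = as.character(2008:2011)))

# Move behaviours to the third dimension: individuals x years x behaviours
by_beh <- aperm(arr, c(1, 3, 2))

dim(by_beh)         # 3 4 2
by_beh[, , "beh1"]  # one slice per behaviour: individuals (rows) by years (columns)
```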