I have a large data frame with several categorical columns; a simplified example is below. (The true dataset has 10+ different tissues, 15+ unique cell types per tissue with variable-length names, and thousands of genes.) The Tissue columns are stored as factors.
GENENAME    Tissue1     Tissue2     Tissue3
Gene1       CellType_AA CellType_BB CellType_G
Gene2       CellType_AA CellType_BB       <NA>
Gene3       CellType_AA       <NA>        <NA>
Gene4       CellType_AA CellType_BB CellType_G
Gene5             <NA>        <NA>  CellType_G
Gene6             <NA>  CellType_BB CellType_H
Gene7       CellType_AC CellType_BD CellType_H
Gene8             <NA>        <NA>  CellType_H
Gene9       CellType_AC CellType_BD       <NA>
Gene10            <NA>  CellType_BB       <NA>
Gene11            <NA>  CellType_BD CellType_H
Gene12      CellType_AC       <NA>        <NA>
Gene13            <NA>  CellType_E  CellType_I
Gene14      CellType_F  CellType_E  CellType_I
Gene15      CellType_F  CellType_E        <NA>
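For reproducibility, the example table can be rebuilt roughly as follows. This is a sketch: the name `dataset` matches the function further down, and converting the three Tissue columns to factors is an assumption based on the description above.

```r
# Rebuild the example table; NA marks a gene with no cell type in that tissue.
dataset <- data.frame(
  GENENAME = paste0("Gene", 1:15),
  Tissue1 = c("CellType_AA", "CellType_AA", "CellType_AA", "CellType_AA", NA,
              NA, "CellType_AC", NA, "CellType_AC", NA,
              NA, "CellType_AC", NA, "CellType_F", "CellType_F"),
  Tissue2 = c("CellType_BB", "CellType_BB", NA, "CellType_BB", NA,
              "CellType_BB", "CellType_BD", NA, "CellType_BD", "CellType_BB",
              "CellType_BD", NA, "CellType_E", "CellType_E", "CellType_E"),
  Tissue3 = c("CellType_G", NA, NA, "CellType_G", "CellType_G",
              "CellType_H", "CellType_H", "CellType_H", NA, NA,
              "CellType_H", NA, "CellType_I", "CellType_I", NA),
  stringsAsFactors = FALSE
)

# Store the Tissue columns as factors, as in the real data.
dataset[-1] <- lapply(dataset[-1], factor)
```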
What I am trying to do is return a subset of genes based on the cell types present in multiple tissues, ignoring any columns I do not specify. Additionally, I want to use wildcards (in the example below, CellType_A*, in order to pick up both CellType_AA and CellType_AC). I want the function to be easily reusable for different combinations of cell types, so I added a separate argument for each column.
To do this I set up the function below, setting the default value of each argument to "*", thinking that it would then treat every value in a column as valid if I don't specify an input for that column.
Find_CoEnrich <- function(T1="*", T2="*", T3="*"){
  subset(dataset, 
         grepl(T1, dataset$Tissue1)
         &grepl(T2, dataset$Tissue2)
         &grepl(T3, dataset$Tissue3)
         ,select = GENENAME
  )  
}
However, when I test the function by specifying only a single column:
Find_CoEnrich(T1="CellType_AA")
It will return only the following:
   GENENAME
1     Gene1
4     Gene4
instead of
1     Gene1
2     Gene2
3     Gene3
4     Gene4
That is, it skips any rows which contain an NA in another column. Even more mysteriously, if I try with the wildcard, it seemingly ignores the rest of the string and returns only those rows which have values in every column, even if they don't match the rest of the string, such as Gene14:
Find_CoEnrich(T1="CellType_A*")
   GENENAME
1     Gene1
4     Gene4
7     Gene7
14   Gene14
I am pretty sure it is the presence of the NAs in the table that is causing the problem, but I have spent a long time trying to correct this and am running out of patience. If anyone can help, it would be much appreciated.
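A minimal sketch of what may be going on, plus one possible workaround. `grepl()` returns FALSE for NA input, and it reads "*" as a regular-expression quantifier rather than a shell-style wildcard. The names `match_col`, `Find_CoEnrich2`, and `toy` are hypothetical, introduced only for illustration. Using NULL as the "not specified" default and translating wildcards with `glob2rx()` are assumptions about the intended behaviour, not the only way to do it.

```r
# grepl() reports FALSE for NA input, so an unspecified column still drops rows.
grepl("CellType", NA)                 # FALSE

# As a regex, "A*" means "zero or more A", so this also matches CellType_F.
grepl("CellType_A*", "CellType_F")    # TRUE

# Hypothetical helper: NULL means "do not filter on this column".
# glob2rx() (utils package) converts a shell-style wildcard to an anchored regex.
match_col <- function(pattern, column) {
  if (is.null(pattern)) return(rep(TRUE, length(column)))
  grepl(glob2rx(pattern), as.character(column))
}

# Hypothetical variant of the function; the data frame is passed in explicitly.
Find_CoEnrich2 <- function(data, T1 = NULL, T2 = NULL, T3 = NULL) {
  keep <- match_col(T1, data$Tissue1) &
    match_col(T2, data$Tissue2) &
    match_col(T3, data$Tissue3)
  data[keep, "GENENAME", drop = FALSE]
}

# Small toy subset of the example table, used only to exercise the sketch.
toy <- data.frame(
  GENENAME = c("Gene1", "Gene3", "Gene7", "Gene14"),
  Tissue1  = factor(c("CellType_AA", "CellType_AA", "CellType_AC", "CellType_F")),
  Tissue2  = factor(c("CellType_BB", NA, "CellType_BD", "CellType_E")),
  Tissue3  = factor(c("CellType_G", NA, "CellType_H", "CellType_I"))
)

Find_CoEnrich2(toy, T1 = "CellType_AA")   # Gene1, Gene3
Find_CoEnrich2(toy, T1 = "CellType_A*")   # Gene1, Gene3, Gene7
```

With this approach a pattern without a wildcard is matched exactly, because `glob2rx()` anchors both ends. Rows with NA in a specified column are still excluded, which seems to be the desired behaviour here.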
 
    