I need a way to filter my data based on target_id. Because I have one set of 1600 target_id values that have no consistent naming and another set that all contain the word 'comp', I thought it might be easiest to create a new column whose value is based on the value in target_id. I have a data frame with a million rows that looks like this (I just grabbed random rows to show the gist of it):
        sample_id   target_id           length eff_length est_counts tpm
159     SRR3884838C CR1_Mam             2204   2005       0          0
160     SRR3884838C CYRA11_MM           617    418        0          0
161     SRR3884838C DERV2a_I            5989   5790       19         0.734541
162     SRR3884838C DERV2a_LTR          335    136        7          11.5213
1094236 SRR3884878C comp78901_c0_seq3_1 1115   916        113.4      32.3604
1094237 SRR3884878C comp85230_c0_seq1_1 1201   1002       514        134.088
1094238 SRR3884878C comp56944_c0_seq1_1 2484   2285       10.5       1.20115
I need to create a new column ("class") that has a value of 1 for rows whose target_id contains 'comp' and 0 for all others. Is this possible? The data has 40 samples (SRR3884838 --> SRR3884878), and each sample has the same set of target_ids: one set of non-uniform target names and another set that all contain 'comp'. Example (with the tpm column removed for formatting reasons):
        sample_id   target_id           length eff_length est_counts class
159     SRR3884838C CR1_Mam             2204   2005       0          0
160     SRR3884838C CYRA11_MM           617    418        0          0
161     SRR3884838C DERV2a_I            5989   5790       19         0
162     SRR3884838C DERV2a_LTR          335    136        7          0
1094236 SRR3884878C comp78901_c0_seq3_1 1115   916        113.4      1
1094237 SRR3884878C comp85230_c0_seq1_1 1201   1002       514        1
1094238 SRR3884878C comp56944_c0_seq1_1 2484   2285       10.5       1
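To make the desired result concrete, here is a minimal sketch of how such a column might be built in base R, assuming the class can be derived directly from the text of target_id with grepl(). The data frame name df and the four toy rows are made up for illustration and are not the real data.

```r
# Toy data frame (hypothetical name `df`) mirroring a few of the rows above
df <- data.frame(
  sample_id = c("SRR3884838C", "SRR3884838C", "SRR3884878C", "SRR3884878C"),
  target_id = c("CR1_Mam", "DERV2a_LTR", "comp78901_c0_seq3_1", "comp85230_c0_seq1_1"),
  length    = c(2204, 335, 1115, 1201),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE for each target_id; as.integer() turns that into 1/0.
# fixed = TRUE treats "comp" as a literal string rather than a regular expression.
df$class <- as.integer(grepl("comp", df$target_id, fixed = TRUE))

# The new column can then be used to filter, e.g. keep only the 'comp' rows
comp_rows <- df[df$class == 1, ]
```

If only names that start with 'comp' should count, a pattern such as "^comp" (without fixed = TRUE) might be a better fit than a plain substring match.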
I tried using the merge function. I first created a new data frame that had a class column with the correct value for one set of target_ids, with the (probably incorrect) expectation that merging would fill in the new column for every row where one of those target_ids is listed. When I did that, it deleted the eff_length column and messed with the format of the data. All the examples I've found where users create a new column based on another column's value used numbers, and I'm not sure how to do it using the string 'comp'. Here's what I did:
total <- merge(dfA, dfB, by="target_id")
where dfA was my original data and dfB looked like the above example with the class column.
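For comparison, here is a hedged sketch of what the merge approach might look like if the second data frame were reduced to a pure lookup table. dfA and dfB below are hypothetical toy stand-ins, not the real data. One thing that is known about merge(): columns present in both inputs but not listed in by are returned with .x and .y suffixes, so a lookup table that also carries eff_length would produce eff_length.x and eff_length.y instead of eff_length. Whether that is what happened here is only a guess.

```r
# Hypothetical stand-in for the original data
dfA <- data.frame(
  sample_id  = c("SRR3884838C", "SRR3884838C", "SRR3884878C", "SRR3884878C"),
  target_id  = c("CR1_Mam", "comp78901_c0_seq3_1", "CR1_Mam", "comp78901_c0_seq3_1"),
  eff_length = c(2005, 916, 2005, 916),
  stringsAsFactors = FALSE
)

# Lookup table: exactly one row per target_id, holding only the key and class
dfB <- data.frame(
  target_id = c("CR1_Mam", "comp78901_c0_seq3_1"),
  class     = c(0L, 1L),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of dfA, even if a target_id is missing from dfB
total <- merge(dfA, dfB, by = "target_id", all.x = TRUE)
```

Note that merge() moves the key column to the front and sorts rows by it by default, which may account for some of the change in format.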
 
     
    