I have a data frame with peptide sequences in the column `id`. The sequences are grouped into many groups of roughly 2-10 rows each. Within a group, some peptides align almost perfectly (up to 4 character differences) while others are completely different. I want to subset my data frame and create a new one with only the "unique" values, meaning only the sequences in each group that differ from one another. When several sequences in a group align, I want to keep only the longest one (I have created a column with the sequence length, `charecter_count`). I thought about using `if`/`else if` together with `adist()` and a cutoff of < 6 (fewer than 6 differences, in which case only the row with the maximum `charecter_count` is kept), but I have no idea how to start. Any ideas would be appreciated!

| index | id | Description | charecter_count | 
|---|---|---|---|
| 3 | AAGKGPLATGGIAA | vlad12 | 14 | 
| 4 | AAGKGPLATGGIAASGKK | vlad12 | 18 | 
| 5 | AAKAQYRAAALLGAAVPG | bla872 | 18 | 
| 6 | AAKPKVAKAKKVVVKKK | plm123 | 17 | 
| 7 | AAPAPAAAPAPAPAAAPEP | bbaala | 19 | 
| 8 | AAPAPAAAPAPAPAAAPEPE | bbaala | 20 | 
| 9 | AAPAPAAAPAAAPAPAPEPER | bbaala | 21 | 
| 443 | ILVRYTQPAPQVSTPT | cvacba | 16 | 
| 444 | ILVRYTQPAPQVSTPTL | cvacba | 17 | 
| 736 | NPSLPPPERPAAEAMC | cvacba | 16 | 

Here, for example, I would want a new data frame with rows 4 (row 3 is basically the same sequence but shorter), 5, 6, 9, 444 and 736 (rows 444 and 736 have the same description but different sequences).

Using `adist(all_peptides$id[3], all_peptides$id[4])` I get 4, which is below my desired cutoff, so I would like only row 4 to be selected.
However, `adist(all_peptides$id[444], all_peptides$id[736])` is 16, so I would like both rows to be included in the new data frame. What I don't know is how to implement this on a larger scale (comparing all sequences within the same group, and so on).
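
To make the intended logic concrete, here is a minimal base R sketch of one possible approach, not a definitive solution. It assumes that the groups are defined by the `Description` column, that `index` is stored as a regular column (in the real data it may just be the row position), and that a greedy rule is acceptable: within each group, go through the sequences from longest to shortest and keep a sequence only if it is at least `cutoff` edits away from every sequence already kept. The helper name `keep_distinct` is made up for illustration.

```r
# Example data copied from the table above.
all_peptides <- data.frame(
  index = c(3, 4, 5, 6, 7, 8, 9, 443, 444, 736),
  id = c("AAGKGPLATGGIAA", "AAGKGPLATGGIAASGKK", "AAKAQYRAAALLGAAVPG",
         "AAKPKVAKAKKVVVKKK", "AAPAPAAAPAPAPAAAPEP", "AAPAPAAAPAPAPAAAPEPE",
         "AAPAPAAAPAAAPAPAPEPER", "ILVRYTQPAPQVSTPT", "ILVRYTQPAPQVSTPTL",
         "NPSLPPPERPAAEAMC"),
  Description = c("vlad12", "vlad12", "bla872", "plm123", "bbaala",
                  "bbaala", "bbaala", "cvacba", "cvacba", "cvacba"),
  stringsAsFactors = FALSE
)
all_peptides$charecter_count <- nchar(all_peptides$id)

# Hypothetical helper: reduce one group to its mutually distinct sequences.
# Sequences are visited longest first, so when several of them align,
# the longest one is the one that survives.
keep_distinct <- function(grp, cutoff = 6) {
  grp <- grp[order(-grp$charecter_count), , drop = FALSE]
  d <- adist(grp$id)          # pairwise edit distances within the group
  kept <- integer(0)
  for (i in seq_len(nrow(grp))) {
    # Keep row i only if it is far enough from every row kept so far.
    if (all(d[i, kept] >= cutoff)) {
      kept <- c(kept, i)
    }
  }
  grp[kept, , drop = FALSE]
}

# Apply the helper to every group (assumed here to be `Description`).
groups <- split(all_peptides, all_peptides$Description)
unique_peptides <- do.call(rbind, lapply(groups, keep_distinct))

# Restore the original row order and drop the row names added by rbind.
unique_peptides <- unique_peptides[order(unique_peptides$index), ]
rownames(unique_peptides) <- NULL
unique_peptides
```

On the example rows this keeps indices 4, 5, 6, 9, 444 and 736. Because the rule is greedy, the result can depend on the order in "chained" cases (A is close to B, B is close to C, but A is far from C); clustering each group with `hclust()` on `as.dist(adist(...))` and cutting the tree with `cutree()` might be an alternative worth trying there.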
 
    