I am new to data.table and am trying to switch over.
I have two data.tables, variable_sites and dt_bam, and want to use each value of variable_sites$POS (call it refPOS) to look up data in dt_bam. To get the read_base column of the summary table, I want to find the rows of dt_bam where refPOS lies between pos and pos + qwidth, and extract one character from the string dt_bam$seq at the offset given by the difference between refPOS and pos.
I have it working for a single value of refPOS, but I don't know how to apply it over a vector of refPOS values (the way sapply would) in data.table syntax. Any help is appreciated.
Here is my code:
library(data.table)
dt_bam <- data.table(qname = lst[[1]], rname = lst[[2]], strand = lst[[3]],
                     pos = lst[[4]], qwidth = lst[[5]], cigar = lst[[6]],
                     seq = as.character(lst[[7]]))
refPOS <- 1000140 # renamed from POS to avoid confusion with pos
summ_tab <- dt_bam[refPOS < pos + qwidth & refPOS > pos,
                   .(locus_pos = refPOS,
                     read_base = substr(seq, abs(refPOS - pos), abs(refPOS - pos)))]
# Idea: something like sapply(variable_sites[, POS], ...), so that each value of variable_sites$POS becomes refPOS in turn
Expected output: as below, but with one row for every value of variable_sites$POS that is covered by a read:
    refPOS read_base
1: 1000140         C
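For reference, the most direct way to extend the single-site query to a vector is to wrap it in a function and loop with lapply, then stack the pieces with rbindlist. This is only a minimal sketch under stated assumptions: `lookup_base` and `sites` are illustrative names that do not appear in the code above, and the toy dt_bam holds just the first read from head(dt_bam). It keeps the same filter and the same refPOS - pos offset as the single-site code.

```r
library(data.table)

# Toy data: the first read shown in head(dt_bam); other columns omitted
dt_bam <- data.table(
  pos    = 1000135L,
  qwidth = 101L,
  seq    = "GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC"
)

# Illustrative site vector: 1000140 is the value used in the question,
# 1013855 is the first POS in variable_sites (no read covers it in this toy data)
sites <- c(1000140L, 1013855L)

# Hypothetical helper: the single-site query, parameterised by refPOS
lookup_base <- function(refPOS) {
  dt_bam[refPOS < pos + qwidth & refPOS > pos,
         .(locus_pos = rep(refPOS, .N),  # rep(..., .N) keeps column lengths equal when nothing matches
           read_base = substr(seq, refPOS - pos, refPOS - pos))]
}

# One small table per site, stacked into a single summary table
summ_tab <- rbindlist(lapply(sites, lookup_base))
print(summ_tab)
```

Sites with no covering read simply contribute zero rows. For a long vector of sites this loop scans dt_bam once per site, so a join-based approach is likely to scale better.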
Here is some sample data:
> head(variable_sites)
    CHR     POS REF
1: chr1 1013855   G
2: chr1 1045080   G
3: chr1 1051873   C
4: chr1 1083795   C
5: chr1 1091327   C
6: chr1 1091421   T    
> head(dt_bam)
                qname rname strand     pos qwidth  cigar
1: SRR709972.27609810  chr1      + 1000135    101   101M
2: SRR709972.27609810  chr1      - 1000145    101   101M
3: SRR709972.23678227  chr1      + 1000545    101 91M10S
4: SRR709972.23678227  chr1      - 1000632    101   101M
5: SRR709972.11643848  chr1      + 1000651    101   101M
6: SRR709972.18299955  chr1      + 1000669    101   101M
                                                                                                     seq
1: GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC
2: GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG
3: CGAGCCTCGGTCTCGAGCCTCTTGGCTTCCTCCGCCCTTCCCCACTCCGGTCCCGGTTTGGGCCCTGCTCTGTCTCCGAGTTTGATCCGACCCCGCCTCGC
4: CGACACCGGCTCGGCCTCCGGGGGTCCCCCCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTG
5: GGGGGTCCCACCCTCAGGTGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCGGCCGGGTCGGCAGGCG
6: TGTGCGGCCTGGAGCACGGAGGGCTGCAGAAAGCCTTGGGAGCGACAGAGCCGGGGGAAGGTTGGCTGCCGGGTCGGCAGGCGGGAGGGCGGAGTCAGCGG
> dput(head(variable_sites))
setDT(structure(list(CHR = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1"), POS = c(1013855L, 1045080L, 1051873L, 1083795L, 1091327L, 
1091421L), REF = c("G", "G", "C", "C", "C", "T")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame")))
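With data shaped like the samples above, the lookup could also be written as a data.table non-equi join, which matches every site against every covering read in one call. The following is a sketch, not a definitive solution: `sites` and the helper column `read_end` are assumed names, the toy dt_bam holds only the first two reads from head(dt_bam), and the site 1000150 is made up for illustration. It keeps the question's offset of refPOS - pos; if pos is the 1-based leftmost coordinate, as in SAM/BAM, the base aligned to refPOS would sit at refPOS - pos + 1 for a fully matched read, so the offset may need adjusting.

```r
library(data.table)

# Toy data: the first two reads shown in head(dt_bam)
dt_bam <- data.table(
  qname  = c("SRR709972.27609810", "SRR709972.27609810"),
  strand = c("+", "-"),
  pos    = c(1000135L, 1000145L),
  qwidth = c(101L, 101L),
  seq    = c(
    "GCCGCGGGGTGTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGC",
    "GTGTGAACCCGGCTCCGCATTCTTTCCCACACTCGCCCCAGCCAATCGACGGCCGCGCTCCTCCCCCGCTCGCTGTCAGTCACGCCTCGGCTCCGGGCGCG"
  )
)

# Assumed helper column: exclusive upper bound, matching refPOS < pos + qwidth
dt_bam[, read_end := pos + qwidth]

# Illustrative sites: 1000140 (from the question), 1000150 (made up),
# 1013855 (first POS in variable_sites, not covered by these two reads)
sites <- data.table(refPOS = c(1000140L, 1000150L, 1013855L))

# Non-equi join: keep read/site pairs with pos < refPOS < read_end.
# In j, x.pos is the read's own pos and i.refPOS is the site position.
res <- dt_bam[sites,
              on = .(pos < refPOS, read_end > refPOS),
              nomatch = NULL,
              .(locus_pos = i.refPOS,
                read_base = substr(seq, i.refPOS - x.pos, i.refPOS - x.pos))]
print(res)
```

Each site yields one row per covering read, and `nomatch = NULL` drops sites that no read covers. On the real data, `sites` would be replaced by variable_sites with POS as the join column, and adding rname = CHR to `on` would restrict matches to the same chromosome.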
 
    