I have two dfs and looking for an way to select (and count) rows of df1 based on rows in df2.
This is my df1:
      Chromosome  Start position  End position Reference Variant  reads  \
0       chr1       109419841     109419841         C       T      1
1       chr1       197008365     197008365         C       T      1
   variation reads  % variation                 gDNA nomencl  \
0                1          100  Chr1(GRCh37):g.109419841C>T
1                1          100  Chr1(GRCh37):g.197008365C>T
            cDNA nomencl    ...    exon transcript ID          inheritance  \
0  NM_013296.4:c.-258C>T    ...       2   NM_013296.4  Autosomal recessive
1  NM_001994.2:c.*143G>A    ...     UTR   NM_001994.2  Autosomal recessive
  test type                      Phenotype male coverage male ratio covered  \
0   Unknown  Deafness, autosomal recessief             0                  0
1   Unknown          Factor 13 deficientie             0                  0
  female coverage female ratio covered ratio M:F
0               1                    1       0.0
1               1                    1       0.0
df1 has these columns:
Chromosome                10561 non-null object
Start position            10561 non-null int64
End position              10561 non-null int64
Reference                 10415 non-null object
Variant                   10536 non-null object
reads                     10561 non-null int64
variation reads           10561 non-null int64
% variation               10561 non-null int64
gDNA nomencl              10561 non-null object
cDNA nomencl              10446 non-null object
protein nomencl           9997 non-null object
classification            10561 non-null object
status                    10561 non-null object
gene                      10560 non-null object
Sanger sequencing list    10561 non-null object
exon                      10502 non-null object
transcript ID             10460 non-null object
inheritance               8259 non-null object
test type                 10561 non-null object
Phenotype                 10380 non-null object
male coverage             10561 non-null int64
male ratio covered        10561 non-null int64
female coverage           10561 non-null int64
female ratio covered      10561 non-null int64
and this is df2:
 Chromosome  Startposition  Endposition    Bases  Meancoverage  \
0       chr1       11073785     11074022  27831.0    117.927966
1       chr1       11076901     11077064  11803.0     72.411043
   Mediancoverage  Ratiocovered>10X  Ratiocovered>20X Genename Componentnr  \
0            97.0               1.0               1.0   TARDBP           1
1            76.0               1.0               1.0   TARDBP           2
  PositionGenes          PositionGenome                       Position
0      TARDBP.1  chr1.11073785-11074022  comp.1_chr1.11073785-11074022
1      TARDBP.2  chr1.11076901-11077064  comp.2_chr1.11076901-11077064  
I want to select all rows from df1 that have in df2
- the same value for 'Chromosome'
- df1['Start position'] >= df2.Startposition
- df1['End position'] <= df2.Endposition.
If these three criteria are met in the same row of df2, I want to select the corresponding row in df1.
I already fused the three columns 'Chromosome','Startposition' and 'Endposition' in 'PositionGenome' to generate a lambda function but coundn't come up with anything.
Thus, hope you can help me ...
