I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and storing the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38        NC_008066.1     9204    9987    48395.347656
MSTRG.36        NC_008066.1     7582    8265    47979.933594
MSTRG.33        NC_008066.1     5899    7437    43807.781250
MSTRG.49        NC_008066.1     14732   15872   26669.763672
MSTRG.38        NC_008066.1     8363    9203    19514.273438
MSTRG.34        NC_008066.1     7439    7510    16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1      697     778     SRX6359746.5505370/2    8       +
JQ673480.1      744     824     SRX6359746.5505370/1    8       -
JQ673480.1      1712    1791    SRX6359746.2565519/2    27      +
JQ673480.1      3445    3525    SRX6359746.7028440/2    23      -
JQ673480.1      4815    4873    SRX6359746.6742605/2    37      +
JQ673480.1      5055    5092    SRX6359746.5420114/2    40      -
JQ673480.1      5108    5187    SRX6359746.2349349/2    24      -
JQ673480.1      7139    7219    SRX6359746.3831446/2    22      +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt
#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}
#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)
#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)
#find read counts
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
n+=1
df=pd.DataFrame.from_dict(d)
#convert to heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
 
     
    