I am working on finding a pretty and pythonic way to find open reading frames in a DNA sequence. I have found many implementations online that make use of indexing, flags and other such ugliness.
I am pretty sure that a regular expression implementation can be created, but I am bad with regex. The general idea is that I want to split a DNA sequence string on 'ATG', 'TAG', 'TGA' and 'TAA'. But I do not want to split on overlapping matches; for example, the sequence 'ATGA' should be split into 'ATG', 'A'. Basically, read the codons from left to right in each of the three frames.
edit for clarity: As said in the comments, a sequence such as ATGATTTTGA should be split into ATG, TTT, TGA despite the presence of TGA (which is in the non-zero frame)
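To make the non-overlapping split concrete, here is a minimal sketch of reading one frame in steps of three with a regular expression. The helper name `codons_in_frame` is hypothetical, and keeping a trailing partial codon is an assumption made so that 'ATGA' comes out as 'ATG', 'A'.

```python
import re

def codons_in_frame(seq, frame=0):
    """Split seq into consecutive, non-overlapping codons starting at offset `frame`.

    Hypothetical helper for illustration. A trailing chunk of one or two
    bases is kept as its own item.
    """
    # Each match consumes up to three characters, so matches cannot overlap.
    return re.findall(r'.{1,3}', seq[frame:])

print(codons_in_frame('ATGA'))  # ['ATG', 'A']

# One list of codons per reading frame.
for frame in range(3):
    print(frame, codons_in_frame('ATGATTTTGA', frame))
```

Because the pattern always advances by whole matches, a stop codon that straddles two codons of the current frame is never seen as a codon in that frame.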
Edit 2: this is how I have implemented it without regular expressions, using the list comprehension splitting linked. I hate the use of flags, though.
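def find_orf(seq):
    # Length, in codons, of the longest ORF found in any of the three forward frames.
    length = 0
    stop = ('TAA', 'TGA', 'TAG')
    for frame in range(3):
        orfFlag, thisLen = False, 0
        # Complete codons of this frame only; a trailing partial codon is skipped.
        splitSeq = [seq[start:start+3] for start in range(frame, len(seq) - 2, 3)]
        for codon in splitSeq:
            if not orfFlag:
                if codon == 'ATG':
                    # Start a new ORF and count the start codon.
                    orfFlag, thisLen = True, 1
            elif codon in stop:
                # Close the ORF; the stop codon itself is not counted.
                orfFlag = False
                length = max(length, thisLen)
            else:
                thisLen += 1
    return length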
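For comparison, a flag-free regular expression version might look like the sketch below. It is only a sketch under stated assumptions: the names `ORF_RE` and `longest_orf_regex` are hypothetical, the sequence is assumed to contain only uppercase A, C, G and T, only the forward strand is scanned, and the length is assumed to mean the number of codons from the ATG up to, but not including, the stop codon.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
# The pattern sits inside a lookahead so that a match is attempted at every
# position; each ATG therefore defines its own reading frame.
ORF_RE = re.compile(r'(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))')

def longest_orf_regex(seq):
    """Length in codons (start included, stop excluded) of the longest ORF.

    Hypothetical helper for illustration; returns 0 when no ATG is followed
    by an in-frame stop codon.
    """
    lengths = (len(m.group(1)) // 3 - 1 for m in ORF_RE.finditer(seq))
    return max(lengths, default=0)

print(longest_orf_regex('ATGATTTGA'))   # 2 (ATG, ATT, then the TGA stop)
print(longest_orf_regex('ATGATTTTGA'))  # 0 (the final TGA is out of frame)
```

The lazy `*?` consumes bases three at a time, so the match ends at the first stop codon that is in frame with its ATG, and stop codons in other frames are ignored without any explicit frame loop.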