I am currently working on the CS50 problem set https://cs50.harvard.edu/x/2021/psets/6/dna/
The problem asks us to find the DNA sequences (STRs) that repeat consecutively in a txt file and match the repeat counts against a person in a csv file.
This is the code I have so far (not complete yet):
import re, csv, sys

def main(argv):
    # Open csv file and read the STR names from the header row
    with open(argv[0], 'r') as csv_file:
        str_person = csv.reader(csv_file)
        nucleotide = next(str_person)[1:]

    # Open dna sequences file
    with open(argv[1], 'r') as txt_file:
        dna_file = txt_file.read()

    str_repeat = {}
    str_list = find_STRrepeats(str_repeat, nucleotide, dna_file)

def find_STRrepeats(str_list, nucleotide, dna):
    for STR in nucleotide:
        # Each match is one whole run of consecutive repeats of STR
        groups = re.findall(rf'(?:{STR})+', dna)
        if len(groups) == 0:
            str_list[STR] = 0
        else:
            str_list[STR] = groups
    print(str_list)
    return str_list

if __name__ == "__main__":
    main(sys.argv[1:])
Output (from print(str_list)):
{'AGATC': ['AGATCAGATCAGATCAGATC'], 'AATG': ['AATG'], 'TATC': ['TATCTATCTATCTATCTATC']}
But as you can see, each value in the dictionary is stored as one consecutive string. If I use the len function, as in str_list[STR] = len(groups), the result is 1 for each key in the dictionary, because each list holds a single match. What I actually want is to find how many times each DNA sequence repeats (the total length of the run) and store that as the value in my dict.
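To show the problem in isolation, here is a minimal sketch. The DNA string is made up for illustration and is not taken from the pset data; the pattern is the same one used in the code above.

```python
import re

# Made-up sample: "AGATC" repeated 4 times in a row, surrounded by other bases
dna = "TT" + "AGATC" * 4 + "GGAATGCC"
STR = "AGATC"

groups = re.findall(rf'(?:{STR})+', dna)
print(groups)       # a list with one string holding the whole run
print(len(groups))  # counts the runs found, not the repeats inside a run
```

Here len(groups) counts how many separate runs were matched, which is why it does not give the number of repeats.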
So I want the repeats to be stored separately, something like this:
{'AGATC': ['AGATC', 'AGATC', 'AGATC', 'AGATC'], 'AATG': ['AATG'], 'TATC': ['TATC', 'TATC', 'TATC', 'TATC', 'TATC']}
What should I add to my code so the repeats are separated by commas like that? Or is there some condition I can add to my regex, groups = re.findall(rf'(?:{STR})+', dna)?
I don't want to change the whole regex, because I think it is already useful for finding the longest run of consecutive repeats. I'm also proud that I got this far without help, since I'm a beginner with Python. Any help would be appreciated. Thank you.
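For reference, a rough sketch of the kind of post-processing step that would produce the desired dictionary while leaving the regex unchanged. The helper name split_longest_run and the sample DNA string are hypothetical, made up for illustration, and the sketch assumes the STR names contain only plain letters (A, C, G, T), so they are safe to use directly as regex patterns. It is one possible approach, not necessarily the best one.

```python
import re

def split_longest_run(STR, dna):
    """Return the longest consecutive run of STR in dna as a list of single repeats."""
    # Same pattern as in the question: each match is one whole run
    runs = re.findall(rf'(?:{STR})+', dna)
    if not runs:
        return []
    longest = max(runs, key=len)
    # Split the run back into individual copies of STR
    return re.findall(STR, longest)

# Made-up sample containing a run of 5 and a shorter run of 2 for "TATC"
dna = "TT" + "AGATC" * 4 + "GG" + "AATG" + "CC" + "TATC" * 5 + "A" + "TATC" * 2

result = {STR: split_longest_run(STR, dna) for STR in ["AGATC", "AATG", "TATC"]}
print(result)

# The repeat count is then just the length of each list
counts = {STR: len(repeats) for STR, repeats in result.items()}
print(counts)
```

If only the count is needed, len(longest) // len(STR) would give the same number without building the list.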
 
    