I have a CSV file containing around 1.4 million image links that I want to download. I want to remove duplicate links from the CSV and then assign a unique filename to each remaining one (there is an ID in each image link that I use for this).
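Each link contains an img=...& parameter, and its value is what I use as the filename. A simplified, made-up example (the real domain and parameters are different):

https://example.com/fetch?img=ab123.jpg&size=large

Here ab123.jpg would become the filename.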
Some of the images have already been downloaded, and I have saved their links in a text file:
import csv
import os

completed_file = 'downloaded_links.txt'
if os.path.isfile(completed_file):
    with open(completed_file) as f:
        downloaded = f.read().split('\n')[:-1]  # drop the empty string after the final newline
else:
    downloaded = []
main_file_name = 'all_images.csv'
with open(main_file_name) as f:
    a = [dict(row) for row in csv.DictReader(f, skipinitialspace=True)]
This is the loop where I filter the links:
from random import randint
import re
h = []  # list of filtered dicts
seen = set()  # unique names
seen_links = set()  # unique links
for i in a:
    if i['image_url'] in downloaded:  # skip links that have already been downloaded
        continue
    if i['image_url'] in seen_links:  # skip duplicate links within the CSV
        continue
    seen_links.add(i['image_url'])
    # pull the image ID out of the img=...& part of the URL and use it as the filename
    my_name = re.search(r'img=(.*?)&', i['image_url'], re.I).group(1)
    # name collision: insert a random digit before '.jpg' until the name is unique
    while my_name in seen:
        temp = my_name.split('.jpg')
        my_name = temp[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    di = {'name': my_name, 'image_url': i['image_url']}
    h.append(di)
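To make the renaming concrete, this is what the extraction and the collision handling do to one made-up link (same hypothetical URL as above, not part of my actual script):

from random import randint
import re

url = 'https://example.com/fetch?img=ab123.jpg&size=large'  # made-up link, for illustration only
name = re.search(r'img=(.*?)&', url, re.I).group(1)         # -> 'ab123.jpg'
# if 'ab123.jpg' had already been used, the while loop renames it, e.g. to 'ab1237.jpg'
name = name.split('.jpg')[0] + str(randint(1, 9)) + '.jpg'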
The loop does exactly what I want (it skips already-downloaded links and assigns unique filenames to the new ones), but it takes more than 3 hours to run. What can I do to speed it up, or is there a different way to write this logic so it runs faster?
This is how I write to downloaded_links.txt:
with open(completed_file, 'w') as f:  # downloaded is the list of links that have been downloaded
    for i in downloaded:
        f.write(f'{i}\n')