I'm looking for a solution to get full-size images from a website.
Using code I recently finished with someone's help on Stack Overflow, I can download the images, but I get a mix of full-size and downsized ones.
What I want is for all downloaded images to be full-sized.
For example, some image filenames have "-625x417.jpg" as a suffix, and some images don't have it.
https://www.bikeexif.com/1968-harley-davidson-shovelhead (has the suffix) https://www.bikeexif.com/harley-panhead-walt-siegl (no suffix)
If this suffix is removed from the URL, it points to the full-size image.
https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg (scraped) https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg (full-size URL once -625x417 is removed)
Other resolutions may also appear in filenames, so suffixes with different sizes need to be removed as well.
I guess I need a regular expression to strip a '-<3 digits>x<3 digits>' suffix (e.g. -625x417) from the URLs collected by the code below.
But I really don't have any idea how to do that.
If you can do that, please help me finish this. Thank you!
images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
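One way the suffix removal described above could be sketched with the standard `re` module. The names `SIZE_SUFFIX` and `to_full_size` are made up for illustration, and the sketch assumes the site really serves the full-size file at the same URL without the `-WIDTHxHEIGHT` part, as in the shovelhead example. It allows any number of digits rather than exactly three, so other resolutions are covered too.

```python
import re

# Matches "-<width>x<height>" only when it sits directly before the file
# extension at the end of the URL, e.g. "-625x417.jpg".
# Assumption: URLs end with the extension (no query string after it).
SIZE_SUFFIX = re.compile(r'-\d+x\d+(?=\.[A-Za-z0-9]+$)')


def to_full_size(img_url):
    """Return img_url with a trailing "-WIDTHxHEIGHT" size suffix removed.

    URLs without such a suffix are returned unchanged.
    """
    return SIZE_SUFFIX.sub('', img_url)
```

In the full code, this would be applied to each URL before downloading, for example `img_url = to_full_size(img_url)` at the top of the `for img_url in images_url:` loop. A file whose real name happens to end in something like `-800x600` would also be rewritten, so it may be worth checking `response_image.status_code` and falling back to the scraped URL if the stripped one is not found.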
Full Code:
import requests
import parsel
import os
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
for page in range(1, 310):
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
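        old_title = v.xpath('.//div[2]/h2/a/text()').get()
        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        # Skip containers without a title or link; otherwise `title` would be
        # undefined (or left over from the previous article).
        if old_title is None or title_url is None:
            continue
        title = old_title.replace(':', ' -').replace('?', '')
        print(title, title_url)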
        os.makedirs( os.path.join('bikeexif', title), exist_ok=True )
        response_article = requests.get(url=title_url, headers=headers)
        selector_article = parsel.Selector(response_article.text)
        # Need to get full-size images only
        # (remove a size suffix such as -625x417 if present; suffixes with other sizes must be removed too)
        images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                    selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
        print('len(images_url):', len(images_url))
        for img_url in images_url:
            response_image = requests.get(url=img_url, headers=headers)
            filename = img_url.split('/')[-1]
            
            with open( os.path.join('bikeexif', title, filename), 'wb') as f:
                f.write(response_image.content)
                print('Download complete!!:', filename)
 
    