
Goal: Download CSV files directly from a website link to disk.

I have gone through dozens of threads and tried different methods to download CSV files. Every method leaves me with the same broken file: instead of the original data, the downloaded file contains what looks like code.

I have tried the same methods with links from other websites and they worked perfectly, which makes me think there is something different about the files from this specific website that causes the problem.

My current code (one of many different versions, all yielding same result):

import requests

req = requests.get('https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-11/.csv', stream=True)
if req.status_code == 200:
    print(req.status_code == requests.codes.ok)
    print(req.content[:200])  # preview of the body (printing requests.Response.content just shows the class attribute)
    with open('MFID200911.csv', 'wb') as csv_file:
        csv_file.write(req.content)

I do not believe the request itself is the issue, as I get 200 and True as outputs for req.status_code and req.status_code == requests.codes.ok.

This yields a file that looks like this: https://prnt.sc/ugx7bv

Instead of the one I see when manually downloading the file from the website: https://prnt.sc/ugx7u4
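A quick sanity check I could run (just a sketch; the sample payloads below are made up, no network needed) is to test whether the downloaded bytes start like an HTML document rather than CSV:

```python
def looks_like_html(payload: bytes) -> bool:
    """True if the response body starts like an HTML document rather than CSV."""
    head = payload.lstrip()[:15].lower()
    return head.startswith(b'<!doctype') or head.startswith(b'<html')

# Dummy payloads for illustration:
print(looks_like_html(b'<!DOCTYPE html><html>...'))  # an HTML login page -> True
print(looks_like_html(b'Date,Fund,Assets\n...'))     # a real CSV body -> False
```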

My end goal is to download all the CSV files in a loop, as only the date changes in the link; right now, however, I just need to get one file to download correctly.

Edit: This is the code after implementing the loop

from datetime import date, timedelta
from webbot import Browser

web = Browser()
web.go_to('https://cranedata.com/')
web.type(username, into='username')
web.type(password, into='password')
web.click('Login', tag='login')

sdate = date(2009, 1, 1)    # start date
edate = date(2020, 9, 15)   # end date
delta = edate - sdate       # as timedelta
dates = [sdate + timedelta(days=d) for d in range(delta.days + 1)]

for dateval in dates:
    web.go_to('https://cranedata.com/publications/download/mfi-daily-data/issue/' + dateval.strftime('%Y-%m-%d') + '/csv')
Qfin
  • You are downloading the HTML page served for non-logged-in access to the file. If you open this URL in an incognito window and view the source of the login page, you will see exactly the same content. – Daniel Labbe Sep 14 '20 at 09:21
  • Ok, so despite getting 200 and True as outputs when checking whether I have access, all I actually have access to is a login prompt? – Qfin Sep 14 '20 at 09:29
  • Yes, you've got access to the "custom" login page. – Daniel Labbe Sep 14 '20 at 09:37
  • Do you have any suggestions on how to figure out which method I should use to access the site through Python? @Daniel Labbe – Qfin Sep 14 '20 at 11:36

1 Answer


You can use the twill or mechanize packages, as exemplified here, to get the file directly after login.

Or you can use an automation tool such as webbot to simulate user navigation:

from webbot import Browser

username = 'your_username'
password = 'your_password'
web = Browser()
web.go_to('https://cranedata.com/')
web.type(username, into='username')
web.type(password, into='password')
web.click('Login', tag='login')
web.go_to('https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-11/.csv')
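If browser automation proves too heavy for thousands of files, a requests.Session can reuse the login cookie and throttle the downloads. This is only a sketch: the login URL and the form field names ('username', 'password') are assumptions about the site's form, so verify them against the page source before relying on them.

```python
import datetime
import time
import requests

def issue_url(day: datetime.date) -> str:
    """Build the per-day CSV URL used in the question."""
    return ('https://cranedata.com/publications/download/mfi-daily-data/issue/'
            + day.strftime('%Y-%m-%d') + '/csv')

def download_range(session: requests.Session, start: datetime.date,
                   end: datetime.date, pause: float = 10) -> None:
    """Fetch one CSV per day, sleeping between requests to avoid overload."""
    day = start
    while day <= end:
        resp = session.get(issue_url(day))
        # Skip bodies that are really the HTML login page, not CSV data.
        if resp.status_code == 200 and not resp.text.lstrip().lower().startswith('<!doctype'):
            with open('MFID' + day.strftime('%y%m%d') + '.csv', 'wb') as f:
                f.write(resp.content)
        time.sleep(pause)  # throttle between calls
        day += datetime.timedelta(days=1)

# Hypothetical usage (login URL and field names are guesses -- inspect the real form):
# with requests.Session() as s:
#     s.post('https://cranedata.com/login', data={'username': '...', 'password': '...'})
#     download_range(s, datetime.date(2020, 9, 1), datetime.date(2020, 9, 15))
```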
Daniel Labbe
  • Thank you so much! This works; exactly what I was looking for! I guess with this method I will have to manually copy the files from Downloads to my destination folder? Not that that's a problem. – Qfin Sep 14 '20 at 15:30
  • So I am experiencing some problems with Chrome crashing due to overload. I am trying to download about 4000 files, so I guess with this method I need to slow it down somehow. Any suggestions? @Daniel Labbe I edited my original post with the loop if that is helpful. – Qfin Sep 14 '20 at 21:15
  • I would use the sleep method to create an interval between calls. – Daniel Labbe Sep 14 '20 at 21:48
  • from time import sleep; sleep(10)  # waits 10 seconds – Daniel Labbe Sep 14 '20 at 21:49
  • Thank you once again! This solved that issue as well. @Daniel Labbe – Qfin Sep 15 '20 at 14:47