I am using the script below to log in to LinkedIn and then scrape the resulting HTML with Beautiful Soup.

The login authenticates with no issue (I can see my account info), but when I try to load the page I get a "fs.config({"failureRedirect})" error.

import cookielib
import os
import urllib
import urllib2
import re
import string
import sys
from bs4 import BeautifulSoup

username = "MY USERNAME"
password = "PASSWORD"

ofile = open('Text_Dump.txt', "wb")

cookie_filename = "parser.cookies.txt"

class LinkedInParser(object):

    def __init__(self, login, password):
        """ Start up... """
        self.login = login
        self.password = password

        # Simulate browser with cookies enabled
        self.cj = cookielib.MozillaCookieJar(cookie_filename)
        if os.access(cookie_filename, os.F_OK):
            self.cj.load()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cj)
        )
        self.opener.addheaders = [
            ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                           'Windows NT 5.2; .NET CLR 1.1.4322)'))
        ]

        # Login
        title = self.loginPage()

        sys.stderr.write("Login"+ str(self.login) + "\n")

        #title = self.loadTitle()
        ofile.write(title)

    def loadPage(self, url, data=None):
        """
        Utility function to load HTML from URLs for us with hack to continue despite 404
        """
        # We'll print the url in case of infinite loop
        # print "Loading URL: %s" % url
        try:
            if data is not None:
                response = self.opener.open(url, data)
            else:
                response = self.opener.open(url)
            return ''.join(response.readlines())
        except:
            # If URL doesn't load for ANY reason, try again...
            # Quick and dirty solution for 404 returns because of network problems
            # However, this could infinite loop if there's an actual problem
            return self.loadPage(url, data)

    def loginPage(self):
        """
        Handle login. This should populate our cookie jar.
        """
        html = self.loadPage("https://www.linkedin.com/")
        soup = BeautifulSoup(html)
        csrf = soup.find(id="csrfToken-postModuleForm")['value']

        login_data = urllib.urlencode({
            'session_key': self.login,
            'session_password': self.password,
            'loginCsrfParam': csrf,
        })

        html = self.loadPage("https://www.linkedin.com/uas/login-submit", login_data)

        return

    def loadTitle(self):
        html = self.loadPage("https://www.linkedin.com/")
        soup = BeautifulSoup(html)
        return soup.get_text().encode('utf-8').strip()

parser = LinkedInParser(username, password)
ofile.close()
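One hazard worth noting in the script above: the bare `except` in `loadPage` retries forever, so a permanent failure (a real 404, bad credentials) becomes the infinite loop the comment warns about. A bounded retry is safer; this is only a sketch, and `load_with_retries` and its parameters are illustrative names, not part of the original script:

```python
import time

def load_with_retries(fetch, max_attempts=3, delay=1.0):
    """Call fetch() until it succeeds or attempts run out, then re-raise the last error."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)  # brief pause before retrying
    raise last_error
```

In `loadPage` this would be called as `load_with_retries(lambda: self.opener.open(url).read())`, so a transient network hiccup still gets retried but a permanent error surfaces after a few attempts instead of recursing forever.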

The script for the login came from: Logging in to LinkedIn with python requests sessions

Any thoughts?

1 Answer


Your syntax is wrong.

First, the CSRF token is in an input field, not a div tag. Inspect the element and you will see.

Second, to find a tag with a specified attribute and value you need to use .find('type_of_tag', {'tag_attribute': 'value'}).

Third, to access the value of a specific attribute within the tag you found, use bracket syntax or .get().

Here is the code you need to replace yours with:

html = self.loadPage("https://www.linkedin.com/")
soup = BeautifulSoup(html, 'html.parser')
csrf = soup.find('input', {'name': 'csrfToken'})
csrf_token = csrf['value']
print csrf_token
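The three points above can be sketched in isolation against a static snippet of HTML (the form below is a made-up stand-in for LinkedIn's real login form, with an invented token value):

```python
from bs4 import BeautifulSoup

html = '''
<form>
  <input type="hidden" name="csrfToken" id="csrfToken-login" value="ajax:123456" />
</form>
'''

soup = BeautifulSoup(html, 'html.parser')

# find() takes the tag name plus a dictionary of attribute/value pairs
csrf = soup.find('input', {'name': 'csrfToken'})

# bracket syntax raises KeyError if the attribute is missing;
# .get() returns None (or a supplied default) instead
token = csrf['value']
same_token = csrf.get('value')
```

Note that `csrf` itself is the whole `<input>` tag; only indexing it with `['value']` (or calling `.get('value')`) pulls out the token string.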
Dap
  • Thanks so much for the answer, just a little confused on what you mean in relation to the tag. I am looking to use .get_text() to scrape the whole page; how do those two additional lines pull that? –  Feb 03 '15 at 05:49
  • The tag is an input tag, not a div tag. The attribute you're trying to target is name="csrfToken" or id="csrfToken-login". You can use either one, but you need to find the input whose id attribute equals csrfToken-login. Copy and paste the code I gave you in place of your soup.find, print csrf_token after those lines, and you can see the output in the terminal – Dap Feb 03 '15 at 05:54
  • Ahh, I gotcha - but even replacing that line isn't working for me. Do you have a block where it is fully authenticating and grabbing for you? –  Feb 03 '15 at 06:01
  • Did you print the variable? – Dap Feb 03 '15 at 06:09
  • Yup, I get an ajax token printed out –  Feb 03 '15 at 06:11
  • Isn't that the token you wanted? You may have to strip the ajax part out of the string – Dap Feb 03 '15 at 06:13
  • I'm looking to authenticate the login so I can load another LinkedIn page - I don't want the token; rather, I'm looking for the HTML from the page. Thanks for all your help here, Dap –  Feb 03 '15 at 06:14
  • Your code for me returns TypeError: 'NoneType' object has no attribute '__getitem__' as an error on ln. The code I gave you resolves that part. Not sure of your question in this case, sorry man – Dap Feb 03 '15 at 06:19
  • No worries, thanks again Dap, really appreciate all the help man. –  Feb 03 '15 at 06:22
  • OK, I think I understand your question: you're trying to get the page after login? I believe you need the mechanize package for that. I came across this post that I think may help you out: http://stackoverflow.com/questions/20039643/how-to-scrape-a-website-that-requires-login-first-with-python – Dap Feb 03 '15 at 21:56
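For reference, the approach in the post linked in that last comment boils down to a requests.Session, which persists cookies between calls the way the cookielib/urllib2 opener in the question does by hand. A minimal offline sketch (the cookie name li_at and its value are illustrative examples, not something the code above actually sets):

```python
import requests

# A requests.Session plays the role of the cookie-enabled opener above:
# it persists cookies and default headers between requests automatically.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# After a real session.post(login_url, data={...}) the server's Set-Cookie
# headers would land in session.cookies; set one by hand to illustrate:
session.cookies.set('li_at', 'example-auth-cookie')

# Any subsequent session.get(...) sends the stored cookies back,
# so the follow-up page would load as the logged-in user.
```

With a session like this, the post-login fetch is just `session.get(page_url).text`; there is no need to rebuild an opener or reload a cookie file between requests.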