Scraping a webpage with jsoup and getting an SSL error. Is this a site-specific issue? (jsoup works on other websites)
I'm trying to run a scrape. I run scrapes like this all the time, but this one failed. Normally I use jsoup to connect to a webpage and then grab what I want from the page. This one appears to be failing during the SSL handshake.
I found a page with a similar issue, but I think that OP is having the issue on all jsoup scrapes, whereas mine is specific to this one website: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7. I have tried multiple pages on this site and all have the same issue, while all other sites I have tried don't have this issue at all and scrape normally.
I tried installing the latest version of Java and restarting the PC, but the SSL connection still failed. I also tried going into Firefox and downloading the certificate, but Firefox didn't have the same pathway described in that answer:
"more info" > "security" > "show certificate" > "details" > "export.."
I think this issue might be caused by a separate problem, since the scraper works just fine on other websites. That is why I created this as a separate question instead of a comment on that one.
Here is what happened when I tried to download the cert: instead of "Show Certificate" there is a "View Certificate" option, and it has neither the "Details" tab nor an "Export..." option. Trying to get the .cert file, I never got a prompt.
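For context, here is my fallback plan if I ever get the certificate exported. From what I understand of that answer, you either import the .cer file into the JDK's cacerts with keytool, or load it at runtime and build your own SSLContext. Below is a sketch of the runtime variant; the file path is hypothetical, since the export never worked for me:

// Sketch: load an exported certificate and build an SSLContext that trusts it.
// Assumes the export produced e.g. C:\certs\strack.cer (hypothetical path).
import java.io.FileInputStream;
import java.security.KeyStore;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class TrustStoreFromCert {
    public static SSLContext contextFor(String certPath) throws Exception {
        // Parse the exported certificate file
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        X509Certificate cert;
        try (FileInputStream in = new FileInputStream(certPath)) {
            cert = (X509Certificate) cf.generateCertificate(in);
        }
        // Put it into an empty in-memory keystore as a trusted entry
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null);
        ks.setCertificateEntry("strack", cert);
        // Build an SSLContext whose trust manager uses that keystore
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}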
Am I doing something wrong that is causing the handshake to fail, or is this some sort of functionality that disallows scraping on this website? I tried to scrape the pricing off of this webpage: https://www.strack.de/de/shop/?idm=1162&d=1&idmp=94&spb=MTQ7NzQ7MTI0OzEyMzY7
I used jsoup to try to scrape this page and it gave me an error. From googling, it seems to be an error that people commonly get when trying to connect to HTTPS servers.
It gave me this error:
Exception in thread "main" javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:732)
	at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:707)
	at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:297)
	at org.jsoup.helper.HttpConnection.get(HttpConnection.java:286)
	at scrapetestforstack.de.ScrapeTestForStackDe.main(ScrapeTestForStackDe.java:81)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 15 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 21 more
C:\Users\LeonardDME\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)
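From what I can tell, this error means the JVM could not build a chain from the server's certificate to any root in its default truststore; common causes seem to be the server not sending an intermediate certificate, or an outdated cacerts in my JDK. To see what chain the server actually sends, I put together this diagnostic snippet. It trusts everything on purpose so the handshake can complete, so it is only for inspection, not for actual scraping:

// Diagnostic only: complete the handshake with a trust-all manager
// and print the certificate chain the server presents.
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class PrintServerChain {
    public static void main(String[] args) throws Exception {
        TrustManager[] trustAll = { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
        }};
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, null);
        try (SSLSocket socket =
                (SSLSocket) ctx.getSocketFactory().createSocket("www.strack.de", 443)) {
            socket.startHandshake();
            // A short chain here (e.g. only the leaf) would point to a missing intermediate
            for (Certificate c : socket.getSession().getPeerCertificates()) {
                System.out.println(((X509Certificate) c).getSubjectX500Principal());
            }
        }
    }
}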
Here is the code that I am trying to run:
// Phase 3: scrape the URL for URLs
// Needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements, java.io.File, java.io.FileWriter, java.io.PrintWriter
Document doc = Jsoup.connect(URL).get();
title = doc.title();

// Strip characters that are unsafe in a file name.
// Note: replaceAll() takes a regex, so '|' must be escaped;
// replaceAll("|", "") only matches the empty string and removes nothing.
title = title.replaceAll(" ", "");
title = title.replaceAll("\\|", "");
title = title.replaceAll(";", "");

// Set up file writing
GimmeAName = "C:\\Users\\LeonardDME\\Documents\\NetBeansProjects\\ScrapeTestForStackDe\\Urls\\" + title + ".csv";
File f = new File(GimmeAName);
FileWriter fw = new FileWriter(f);
PrintWriter out = new PrintWriter(fw);
StuffToWrite = URLArray[counter];

// Grab the price spans from the document
Elements spangrabbers = doc.getElementsByClass("art_orginal_preis142790");
for (Element spangrab : spangrabbers) {
    holder2 = spangrab.text();
    SpanHolderArray[SpanHolderCounter] = holder2;
    SpanHolderCounter++;
}

// Get all links in the page
Elements links = doc.select("a[href]");
for (Element link : links) {
    // Get the value from the href attribute
    checker = link.attr("href");
    // Skip absolute links, javascript: links and style links
    if (checker.contains("http") || checker.contains("javascript") || checker.contains("style")) {
        continue;
    }
    counter++;
    LinkContorter = checker; // this assignment was missing in the original snippet
    // Must be '||', not '&&': with '&&' a null LinkContorter throws a NullPointerException
    if (LinkContorter == null || LinkContorter.isEmpty()) {
        // do nothing
    } else {
        System.out.println(LinkContorter);
        out.print(LinkContorter);
        out.print(",");
        out.print("\n");
        // Flush the output to the file
        out.flush();
    }
}
System.out.println(counter);

// Close the PrintWriter
out.close();
// Close the FileWriter
fw.close();
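If the cause does turn out to be a certificate my JVM doesn't trust, my plan is to wire the SSLContext from the first sketch above into the connect call. As far as I know, jsoup only added Connection.sslSocketFactory(...) in 1.12.1 (older versions had a deprecated validateTLSCertificates(false) instead), so this assumes an up-to-date jsoup; the user-agent line is just a guess in case the shop blocks the default Java agent:

// Sketch, assuming jsoup 1.12.1+ and the TrustStoreFromCert helper from above
SSLContext ctx = TrustStoreFromCert.contextFor("C:\\certs\\strack.cer"); // hypothetical path
Document doc = Jsoup.connect(URL)
        .sslSocketFactory(ctx.getSocketFactory())
        .userAgent("Mozilla/5.0") // guess: some shops block the default Java user agent
        .get();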
Is it possible that a few of you could try to scrape this site and see if you get the same result as me? I suspect there might be some safeguard against scraping, but I don't want to abandon the task unless I know that to be the case for sure. I also scraped this same website a few months ago, in February or March, without an issue.