Scraping a website with the R node

Cathi · April 28, 2015, 5:58pm

Hello,

I don't have much experience with R, but I saw that it can scrape a password-protected website. So my question is: do the R nodes in KNIME have the same functionality as the standalone R tool?

Or is there an option in KNIME to scrape data from a password-protected site without an API?

Thanks a lot for your answers!

Greetings

qqilihq · May 2, 2015, 1:02pm

Hi Cathi,

You can scrape data from websites using a combination of the Palladian and XML nodes with XPath expressions. You can find an example workflow on the examples server -> 099_Community -> 10_Palladian -> Palladian_01 Parse a webpage. However, the Palladian HttpRetriever node does not currently support authentication. What kind of authentication mechanism does your source require?

Best,
Philipp

Cathi · May 7, 2015, 10:53pm

Hey,

thanks for the answer, and sorry for my late reply.

It's a login for our backend, so I need a username and a password to get the data. Is this possible? Or do you know another way?

Ellert_van_Koperen · May 8, 2015, 3:33pm

Odd, really, that Palladian does not support HTTPS or authentication. Curl does.

It might be possible to get around your issue by using curl in an external script. Not pretty, though, and Palladian should simply support this.

qqilihq · May 8, 2015, 7:05pm

The Palladian nodes do support HTTPS. They currently do not support authentication; this is simply a matter of limited time, since I develop the nodes in my free time alongside (paid) projects :)

However, I'm willing to add authentication support in the future, and that was the reason for my question @Cathi, to understand what functionality is required by Palladian's users:

Does the source you want to access require form-based authentication (i.e. login via an HTML page), or is it standard HTTP authentication via a pop-up dialog? Having a concrete example would help to streamline our efforts in further development.

Have a nice weekend,
Philipp
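
To illustrate the two authentication styles asked about above, here is a minimal Python sketch (outside KNIME, standard library only). It only builds the data a client would send; it is not how Palladian implements anything. The form field names `username` and `password` are assumptions, since real login pages choose their own names.

```python
import base64
from urllib.parse import urlencode


def basic_auth_header(username: str, password: str) -> dict:
    """Standard HTTP (Basic) authentication: the credentials travel
    in an Authorization header that is sent with every request."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def login_form_body(username: str, password: str) -> bytes:
    """Form-based authentication: the credentials are POSTed once as
    form fields; the server typically answers with a session cookie.
    The field names used here are assumptions for illustration."""
    return urlencode({"username": username, "password": password}).encode("ascii")


print(basic_auth_header("user", "pass"))
print(login_form_body("user", "pass"))
```

The practical difference: Basic authentication needs only an extra header, while a form login needs a POST request plus cookie handling for the requests that follow.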

qqilihq · May 22, 2015, 9:01pm

Update: This is implemented now and I just have to merge the code back. Should be available during the next week. Stay tuned :)

Cathi · May 24, 2015, 7:53am

Hey Philipp,

That's really nice! :)

Is it via an HTML page or a pop-up? Or both?

thanks a lot!

Have a nice weekend

Cathi

qqilihq · May 24, 2015, 11:22am

Hi Cathi,

In a nutshell, we added support for proper cookie handling and storage, the ability to specify arbitrary HTTP headers, and the ability to send content through the HttpRetriever using different HTTP methods (previously, the node only supported GET). This functionality will allow you to model all kinds of browser- and HTTP-based authentication flows.

The updated nodes will be available tomorrow. And I'll post an example workflow during the next week.

Best,
Philipp
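
As a rough picture of why cookies, custom headers, and non-GET methods are the building blocks of a login flow, here is a hedged Python sketch using only the standard library. It is not the HttpRetriever node's implementation. The URLs, the form field names, and the `User-Agent` value are hypothetical, and the requests are only constructed, not sent.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

# The cookie jar stores the session cookie set by the login response,
# so later requests through the same opener are authenticated.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)


def build_login_request(login_url: str, username: str, password: str):
    """POST the credentials as a form, with explicit headers."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return urllib.request.Request(
        login_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "example-scraper/0.1",  # hypothetical value
        },
    )


def build_report_request(report_url: str):
    """Plain GET for the protected page; cookies are added by the opener."""
    return urllib.request.Request(report_url, method="GET")


login = build_login_request("https://backend.example.com/login", "user", "secret")
report = build_report_request("https://backend.example.com/report")
print(login.get_method(), login.full_url)
print(report.get_method(), report.full_url)

# To run the flow against a real site (network access required):
# opener.open(login)
# html = opener.open(report).read()
```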

Jonnyblacklabel · May 26, 2015, 12:12pm

Hey Philipp,

I just wanted to thank you and your team for your great work. You just solved some of my problems!

Best,

Johannes

qqilihq · May 26, 2015, 2:39pm

Hi Johannes,

great, glad to hear :)

Thanks for the feedback,
Philipp

Cathi · May 26, 2015, 9:59pm

Hey Philipp,

I agree with Johannes, you do a great job!!

I tried it and it works fine.

I've got just one problem in one workflow. My login is working, but then I get the error "Execute failed: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: 1".

It appears when my URL is too long, I think. For example:

"https: //xxxxxxxx.de/gp/associates/network/reports/report.html?__mk_de_DE=xxxxxxtag=&reportType=earningsReport&program=all&deviceType=all&periodType=preSelected&preSelectedPeriod=yesterday&startDay=25&startMonth=4&startYear=2015&endDay=25&endMonth=4&endYear=2015".

Maybe it is really simple, but I don't know the solution :(.

Thanks a lot, Cathi
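
The thread does not state what caused this error. When a long URL misbehaves, though, it can help to split it into its parts and list every query parameter, including ones with an empty value, which are easy to overlook. A minimal Python sketch of that inspection step follows; the URL is a hypothetical, shortened stand-in, not the one from the post.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical, shortened stand-in for a long report URL.
url = (
    "https://example.com/reports/report.html"
    "?tag=&reportType=earningsReport&startDay=25"
)

parts = urlsplit(url)

# keep_blank_values=True keeps parameters such as "tag=" that have
# a name but an empty value; by default parse_qsl drops them.
params = parse_qsl(parts.query, keep_blank_values=True)

print("host:", parts.netloc)
print("path:", parts.path)
for name, value in params:
    print(f"{name!r} = {value!r}")
```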

qqilihq · May 27, 2015, 1:11am

Hi Cathi,

Glad to hear that you were able to set it up. I suspect that there might still be a bug, which I would really like to fix. Could you help me with some more detailed information? Can you describe the workflow you're running? Does the new HttpRetriever node produce the error? What input data exactly do you hand to the HttpRetriever?

~~Could you please enable DEBUG log level (see preferences → KNIME → KNIME GUI). Then re-run the node which produces the error and post the stack trace as shown in the console tab?~~

[edit] I was able to reproduce the problem with your sample URL. Fix will be following tomorrow!

[edit2] Committed the fix. Thank you for catching that bug! Please update your KNIME plugins tomorrow and afterwards make sure that the console reads "Palladian version 0.6.0-SNAPSHOT (build 2015-05-26 22:57:29)". Then your problem should be solved :) If you encounter any further issues, please let me know!

Best,
Philipp

Jonnyblacklabel · May 27, 2015, 11:17am

Hey Philipp,

I also get an error on an https:// URL. Something to do with the SSL certificate, I guess.

I already tried importing the certificate into the Java keystore, but that didn't help.

Exception javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated

Best,

Johannes

qqilihq · May 27, 2015, 5:10pm

Hi Johannes,

I'll have to investigate that issue. I'll get back here.

Best,
Philipp

qqilihq · June 1, 2015, 7:04pm

Johannes,

I've added a configuration option to the next nightly build of the nodes (available by tomorrow) which explicitly allows accepting self-signed SSL certificates. I strongly expect that this will also fix the problem you described above. The option can be found in the "Advanced" section of the HttpRetriever node: "Accept self-signed certificates".

Please let me know if that solves your problem.

Philipp

PS: I will see whether I can add support for importing custom certificates in the future, but this is a larger task for which I currently do not have the time.
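
For readers wondering what an "accept self-signed certificates" switch generally means, here is a hedged Python sketch using the standard `ssl` module. It shows the general technique of turning off certificate verification and is not the Palladian implementation; the function name `make_ssl_context` is made up for this example.

```python
import ssl


def make_ssl_context(accept_self_signed: bool = False) -> ssl.SSLContext:
    """Return a TLS context; optionally skip certificate verification.

    Skipping verification lets connections to servers with self-signed
    certificates succeed, but it also removes protection against
    man-in-the-middle attacks, so it should be an explicit opt-in.
    """
    ctx = ssl.create_default_context()
    if accept_self_signed:
        # check_hostname must be disabled before verify_mode is relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = make_ssl_context()
lenient = make_ssl_context(accept_self_signed=True)
print(strict.verify_mode, lenient.verify_mode)
```

Such a context could then be passed to a client, for example `urllib.request.urlopen(url, context=lenient)`.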

Jonnyblacklabel · June 4, 2015, 11:13am

Hi Philipp,

Thank you very much for your help. You did solve my problem. Everything works just fine now.

Best,

Johannes

crsparks · June 29, 2015, 7:12pm

When I drag HttpRetriever onto an empty workflow and right-click "configure...", I get the following error:

"The dialog cannot be opened for the following reason: No Column in spec compatible to "StringValue"."

Does "HttpRetriever" have additional node/workflow requirements?

CRS

qqilihq · June 29, 2015, 9:41pm

The HttpRetriever requires one input table (first port) with at least one String column which contains the URL(s) to access.

Philipp

boraster · July 4, 2015, 11:17pm

Hi Philipp,

How can I use the new functions of the Retriever node to login to my Facebook account?

I suppose I need to use the POST method but I don't know how to present credentials.

A little help would be great :)

Bora
