i doent have so much experience with R, but i saw it is possible to scrape a website with a password. So my question is, have the nodes in Knime the same functionality like the standalone tool R?
Or is there a Option in Knime to scrape data from a site with a password without an API?
you can scrape data from websites using a combination of the Palladian and the XML nodes, using XPath expressions (you can find an example workflow on the examples server -> 099_Community -> 10_Palladian -> Palladian_01 Parse a webpage), however the Palladian HttpRetriever node does currently not support authentication. What kind of authentication mechanism does your source require?
The Palladian nodes do support HTTPS. They currently do not support authentication, and this is simply a matter of time resources and the fact that I develop the nodes in my free time beside (paid) projects :)
However, I'm willing to add authentication support in the future, and that was the reason for my question @Cathi, to understand what functionality is required by Palladian's users:
Does the source you want to access require a form-based authentication (i.e. login via HTML page), or is it a HTTP-standard authentication via a pop-up dialog? Having a concrete example would help to streamline our efforts in further developments.
in a nutshell, we added support for proper cookie handling and storage, the ability to specify arbitrary HTTP headers and to send content via the HttpRetriever via different HTTP methods (previously, the node only supported GETs). This functionality will allow you to model all kind of browser- and HTTP-based authentication flows.
The updated nodes will be available tomorrow. And I'll post an example workflow during the next week.
I ve got just one probleme in one workflow. My login is working but then i get an error "Execute failed: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: 1".
its appear when my url is too long, i think. like:
glad to hear that you were able to set it up. I suspect that there might still be some bug, which I would really like to solve. Could you help me with some more detailed information? Can you describe the workflow you're running ? Does the new HttpRetriever node produce the error? What input data do you hand to the HttpRetriever exactly?
Could you please enable DEBUG log level (see preferences → KNIME → KNIME GUI). Then re-run the node which produces the error and post the stack trace as shown in the console tab?
[edit] I was able to reproduce the problem with your sample URL. Fix will be following tomorrow!
[edit2] Commited the fix. Thank you for catching that bug! Please update your KNIME plugins tomorrow and afterwards make sure that the console reads "Palladian version 0.6.0-SNAPSHOT (build 2015-05-26 22:57:29)". Then your problem should be solved :) If you encounter any further issues, please let me know!
I've added a configuration option to the next nightly build version of the nodes (thus available by tomorrow), which explicitly allows to accept self-signed SSL certificates. I strongly assume, that this should also fix the problem you described above. The configuration can be found in the "Advanced" section of the HttpRetriever node: "Accept self-signed certificates".
Please let me know if that solves your problem.
Philipp
PS: I will see whether I can add support to import custom certificates in the future, but this is a larger task for which I currently do not have any temporal resources.