Hi all, the KNIME Webpage Retriever node fails with a null pointer exception when I try to download a specific HTML page (https://www.bmel.de/SiteGlobals/Forms/Suche/DE/Pressemitteilungssuche/Pressemitteilungssuche_Formular.html) as an XML column. A third-party node from MMI (Clean HTML Retriever) seems to have no issue with it.
Furthermore, I often don’t receive the website on the first execution, only on the second or third. My browser does not show that behavior.
With the GET node I was also not able to receive XML using the Accept option, and it shows the same issue of needing several attempts to download the page.
I added an example workflow (Get is not included).
html_example.knwf (119.2 KB)
A comment on
If you look at https://www.bmel.de/robots.txt, you will see that this website officially does not allow scraping the requested webpage, so it might already be a miracle that you get it after a few attempts. Each domain puts its scraping rules in its robots.txt, and if you scrape nicely, you follow these rules.
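For anyone who wants to check such rules programmatically rather than by reading the file, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt text, the `my-workflow` agent name, and the example.org URLs are made up for illustration; they are not the actual rules served by www.bmel.de.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; this is NOT the
# file actually served by www.bmel.de.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /SiteGlobals/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-workflow" and the example.org URL are placeholder values
    page = "https://www.example.org/SiteGlobals/Forms/search.html"
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "my-workflow", page))
```

To test a live site, the same parser can load the file itself via `set_url(...)` followed by `read()` instead of `parse(...)`.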
If you still want to scrape this page, your workflow needs to look like a normal user (as if it were “your browser”) as much as possible. So the user agent you define for the workflow has to look like the one your browser is using. Your IP address, the other part of what identifies you as the requester of a webpage, is more difficult to mimic unless you go to a provider with services for switching proxies (search for “proxy scraping”).
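Outside KNIME, the user-agent part of this can be sketched in a few lines of Python with the standard library. The user-agent string and the Accept value below are placeholders chosen for illustration; you would copy the real values from your own browser. The function only builds the request and does not send it.

```python
from urllib.request import Request

# Placeholder value: copy the real string from your own browser
# (for example from its developer tools). This one is only illustrative.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> Request:
    """Build (but do not send) a GET request that carries browser-like headers."""
    headers = {
        "User-Agent": user_agent,
        # Assumed Accept value, similar to what browsers send for HTML pages
        "Accept": "text/html,application/xhtml+xml",
    }
    return Request(url, headers=headers)
```

Passing the result to `urllib.request.urlopen` would perform the actual download.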
Scraping means you have to be nice to a domain (don’t flood it with requests); otherwise you might get banned for some time (e.g. a week). When that happens, even you in your browser will not be able to view the webpage (speaking from my own experience, from when I knew nothing about this at all).
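One way to stay polite when a page only arrives on the second or third attempt is to wait between retries instead of re-executing immediately. Below is a rough Python sketch of that idea. The function name, the default of three attempts, and the five-second base delay are assumptions for illustration, not values recommended by any site.

```python
import time
from typing import Callable, Optional


def fetch_politely(
    fetch: Callable[[], str],
    attempts: int = 3,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() up to `attempts` times, pausing between failed tries.

    Returns the first successful result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses
            if attempt < attempts:
                # wait longer after each failure so the server is not flooded
                sleep(delay_seconds * attempt)
    return None
```

The download itself is passed in as `fetch`, so the same wrapper works with any retrieval function, and `sleep` can be swapped out when testing.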
Thanks a lot for mentioning that. I was not aware that I must not download this website. The information there is meant to inform the general public, and I was only doing this to work around their laziness in updating the RSS feed they provide.
I checked the robots.txt and did not find the setting you described. I understood that they just want to make sure it is not downloaded too often. But I am no expert in that^^.
However, the reason I opened the ticket was to report a null pointer exception in the node. I would prefer a nicer error message.
You are right, a null pointer exception is not informative. My Java skills are pretty close to zero, so I’m afraid I cannot help you with that.
The combination of the HTTP Retriever and HTML Parser nodes seems to work all the time without any problems.
Also, if you want to “scrape” the website, the Selenium nodes would do that for you. It seems the page contains some event-based content which cannot be collected by the retriever nodes, as @JanDuo said. The Selenium nodes can handle that because they are browser based.
html_example.knwf (19.2 KB)
Thanks for mentioning the Selenium nodes. I did use them in the past but had issues as well. Nowadays I want to focus on native KNIME nodes and hope they are well maintained in the long run. Furthermore, I don’t need to worry about paying for licenses.
Since the one from MMI runs perfectly fine, I guess KNIME can achieve the same, or at least a better error message.
The HTTP Retriever and the HTML Parser nodes are from the Palladian extension, which is free. I mentioned them because you have to re-execute the Clean HTML Retriever node to make it work in this particular case. The nodes from Palladian don’t have this issue.
Thanks for reporting this issue together with a workflow example. It has been added to an existing ticket (internal reference: AP-14916), which aims to solve the same or similar behavior with the Webpage Retriever.