html parsing as xml fails

laval · August 29, 2020, 9:51am

Hi all, the KNIME node Webpage Retriever fails with a null pointer exception when I try to download a specific HTML page (https://www.bmel.de/SiteGlobals/Forms/Suche/DE/Pressemitteilungssuche/Pressemitteilungssuche_Formular.html) as an XML column. A third-party node from MMI (Clean HTML Retriever) seems to have no issue with it.

Furthermore, I often don’t receive the page on the first execution, only on the second or third. My browser does not show that behavior.

With the GET node I was also not able to receive XML using the Accept option, and it shows the same issue of needing several attempts to download the page.

I added an example workflow (Get is not included).

Best,
Lars

html_example.knwf (119.2 KB)

JanDuo · August 29, 2020, 1:33pm

Hi @laval,

A comment on

If you look at https://www.bmel.de/robots.txt, you will see that this website officially does not allow scraping the requested webpage, so it might already be a miracle that you get it after a few attempts. Each domain puts its scraping rules in its robots.txt, and if you scrape nicely, you follow these rules.
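As a rough sketch of how such a check can be done outside KNIME, Python's standard `urllib.robotparser` can evaluate robots.txt rules. The robots.txt content below is made up for illustration and is not the actual file served by www.bmel.de; the helper names `is_allowed` and `crawl_delay` are also just assumptions for this example.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the real file from www.bmel.de.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /SiteGlobals/
Crawl-delay: 10
"""


def _parser_for(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return _parser_for(robots_txt).can_fetch(user_agent, url)


def crawl_delay(robots_txt: str, user_agent: str) -> Optional[float]:
    """Return the Crawl-delay for user_agent in seconds, or None if unset."""
    return _parser_for(robots_txt).crawl_delay(user_agent)
```

To check a live site instead, `RobotFileParser.set_url(...)` followed by `read()` downloads the real robots.txt before calling `can_fetch`.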

If you still want to scrape this page, your workflow needs to mimic a normal user (as if it were “your browser”) as much as possible. So the user agent you define for the workflow has to look like the one your browser is using. Your IP address, the other part of what identifies you as the requester of a webpage, is more difficult to mimic unless you go to a provider with services for switching proxies (search for “proxy scraping”).
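For illustration, here is a minimal Python sketch of sending a browser-like user agent using only the standard library. The user agent string and Accept value are placeholders (copy the real ones from your own browser), and `build_request` and `fetch` are hypothetical helper names. Note that an Accept header only states a preference; it does not make a server return XML if it only serves HTML.

```python
from urllib.request import Request, urlopen

# Placeholder values: replace with what your own browser actually sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"
)
ACCEPT_HEADER = "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> Request:
    """Create a request that carries browser-like headers."""
    return Request(
        url,
        headers={"User-Agent": user_agent, "Accept": ACCEPT_HEADER},
    )


def fetch(url: str, timeout: float = 30.0) -> str:
    """Download url and decode the body using the declared charset."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```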

Scraping means you have to try to be nice to a domain (don’t flood it with your requests); otherwise you might get banned for some time (e.g. a week). When that happens, even you in your browser will not be able to view the webpage (speaking from my own experience, from when I knew nothing about this at all).
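One way to stay polite when a page only arrives on the second or third attempt is to retry with growing pauses instead of re-executing immediately. The following Python sketch assumes a hypothetical `fetch` callable that raises `OSError` (which includes `urllib.error.URLError`) on failure; the attempt count and delays are arbitrary example values, not recommendations from the site.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() up to max_attempts times, doubling the pause each time."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError as error:  # urllib.error.URLError subclasses OSError
            last_error = error
            # Wait before the next try, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"Giving up after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is fine.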

laval · September 3, 2020, 2:27pm

Hi @JanDuo,
thanks a lot for mentioning that. I was not aware that I must not download this website. The information there is meant to inform the general public, and I was just doing it to work around their laziness in updating the RSS feed they provide.
I checked the robots.txt and did not find the setting you described. I understood that they just want to make sure it is not downloaded too often. But I am no expert in that^^.

However, the reason I opened the ticket was to report a null pointer exception in the node. I would prefer a nicer error message.
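To illustrate what I mean by a nicer message, here is a small Python sketch (not the node's Java code, and not based on any knowledge of what the node does internally). It assumes two possible causes, an empty download or HTML that is not well-formed XML, and reports each one explicitly. `parse_as_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_as_xml(markup: Optional[str], source: str) -> ET.Element:
    """Parse markup as XML, raising a descriptive error if that fails."""
    # Assumed failure mode 1: nothing was downloaded at all.
    if not markup or not markup.strip():
        raise ValueError(
            f"No content was retrieved from {source}; nothing to parse."
        )
    try:
        return ET.fromstring(markup)
    except ET.ParseError as error:
        # Assumed failure mode 2: ordinary HTML is often not valid XML,
        # for example because of unclosed tags such as <br>.
        raise ValueError(
            f"Content from {source} is not well-formed XML ({error}); "
            "an HTML cleaner may be needed first."
        ) from error
```

With this, `parse_as_xml("<p>Hi<br></p>", "example page")` raises a `ValueError` that names the source and the parse problem, while `parse_as_xml("<p>Hi<br/></p>", "example page")` returns the parsed element.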

Best,
Lars

JanDuo · September 3, 2020, 2:59pm

Hi @laval

You are right, a null pointer exception is not informative. My Java skills are pretty close to zero, so I’m afraid I cannot help you with that.

armingrudd · September 5, 2020, 3:34am

Hi @laval,

The combination of the HTTP Retriever and HTML Parser nodes seems to work all the time without any problems.

Also, if you want to “scrape” the website, the Selenium nodes would do that for you. It seems the page contains some event-based content that cannot be collected by the retriever nodes, as @JanDuo said. The Selenium nodes can handle that since they are browser-based.

html_example.knwf (19.2 KB)

laval · September 5, 2020, 11:42am

Hi @armingrudd,

thanks for mentioning the Selenium nodes. I did use them in the past but had issues as well. Nowadays, I want to focus on native KNIME nodes and hope they are well maintained in the long run. Furthermore, I don’t need to care about paying for licenses.

Since the one from MMI runs perfectly fine, I guess KNIME can achieve the same, or at least a better error message.

Best,
Lars

armingrudd · September 5, 2020, 4:10pm

The HTTP Retriever and HTML Parser nodes are from the Palladian extension, which is free. I mentioned them since you have to re-execute the Clean HTML Retriever node to make it work in this particular case. The nodes from Palladian don’t have this issue.

ipazin · September 7, 2020, 9:37am

Hello @laval,

thanks for reporting this issue together with a workflow example. It has been added to an existing ticket (internal reference: AP-14916), which aims to solve the same or similar behavior with the Webpage Retriever.

Br,
Ivan

laval · September 7, 2020, 12:40pm

@ipazin

Great! thank you!

Best wishes,
Lars