Parsing a String with HtmlParser

I am retrieving a URL with the Palladian HttpRetriever, which returns JSONP that looks like this (truncated):

jsonp121("<div class=\"skin-box-bd\"> ...

As you can see, the JSONP contains HTML that gets merged back into the parent HTML. Note that everything inside the parentheses is a single string in which the quotes and ampersands have been escaped. This retrieved JSONP contains all of the juicy data that I need to parse; the parent HTML is just a shell.

What I need to do is strip off the "jsonp121" wrapper, un-escape the quotes and ampersands, add back the html-head-body tags, and then parse the resulting HTML.
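The clean-up described above could be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class and method names (`JsonpToHtml`, `toHtmlDocument`, `unescape`) are made up, the payload is assumed to be a single double-quoted JavaScript string literal, and how the ampersands are escaped is not visible in the truncated sample. If they arrive as `\u0026` they are decoded here, while HTML entities such as `&amp;` are left for the HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonpToHtml {

    // Matches callbackName("...") with an optional trailing semicolon.
    // The callback name (jsonp121 in the sample) is not assumed to be fixed.
    private static final Pattern JSONP = Pattern.compile(
            "^\\s*[\\w$.]+\\s*\\(\\s*\"(.*)\"\\s*\\)\\s*;?\\s*$", Pattern.DOTALL);

    // Strips the JSONP wrapper, un-escapes the payload and wraps it in a
    // minimal html/head/body skeleton.
    static String toHtmlDocument(String jsonp) {
        Matcher m = JSONP.matcher(jsonp);
        if (!m.matches()) {
            throw new IllegalArgumentException("Input does not look like callback(\"...\")");
        }
        return "<html><head></head><body>" + unescape(m.group(1)) + "</body></html>";
    }

    // Undoes the backslash escapes of a JavaScript string literal.
    // A malformed four-digit hex escape raises NumberFormatException.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            switch (next) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'u':
                    if (i + 4 < s.length()) {
                        // Four hex digits follow the 'u'
                        out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append(next);
                    }
                    break;
                default:
                    // Covers escaped quotes, backslashes and forward slashes
                    out.append(next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "jsonp121(\"<div class=\\\"skin-box-bd\\\">Hi</div>\")";
        System.out.println(toHtmlDocument(sample));
    }
}
```

In a KNIME Java Snippet only the body of `toHtmlDocument` and the `unescape` helper would be needed; the `main` method is there just to make the sketch runnable on its own.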

In other words, what I would like to do is this:

HttpRetriever (output HttpResultCell) --> JavaSnippet (output String) --> HtmlParser (output XML)

Unfortunately, if HtmlParser is passed a String, it assumes the string is the path to a local file, so this workflow won't work. If I first save the output string from the JavaSnippet as a file, HtmlParser works just fine, but this is a very clunky way of doing it.

I think the solution is to:

1. Have an intermediate node that converts a String back to an HttpResultCell, or

2. Have an HtmlParser that will directly parse a string

A super-simple fix might be to add a selector to the HtmlParser that tells the node to treat the string as a file or as an HTTP Result. But perhaps you have a better idea?

Thanks!

Hi there,

there is already one solution (which is not quite obvious though):

Install the "KNIME File Handling Nodes" and use the "String to Binary Objects" node to convert your string input to a binary blob. This blob can then be parsed through the HtmlParser. (example is attached)

String input is treated as a file path for historical reasons (and because of backwards compatibility with existing workflows I don't want to break this), but I will consider adding a more convenient way in the future.

Best,
Philipp

PS: If you're extracting data from JS-heavy sites which pull in content through AJAX/XHR, you might also want to have a look at our Selenium nodes, which are currently in beta. Feedback and bug reports appreciated!

I installed the KNIME File Handling Nodes and the "String to Binary Objects" node works great! I was worried at first because there is a lot of Chinese in the string, but encoding it as UTF-16 worked without any trouble.
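For anyone wondering why the encoding choice matters here, the plain-Java sketch below shows the general idea of turning a string into bytes and back. It is not the node's implementation, the class and method names (`EncodingRoundTrip`, `survives`) are made up, and the sample text is a placeholder. Whether the HtmlParser detects the blob's encoding on its own is not something this thread establishes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {

    // Encodes the text to bytes with the given charset, decodes it again,
    // and reports whether the original text came back unchanged.
    static boolean survives(String text, Charset charset) {
        byte[] blob = text.getBytes(charset);
        return new String(blob, charset).equals(text);
    }

    public static void main(String[] args) {
        // Two Chinese characters inside a tag, written as escapes so the
        // source file encoding does not matter
        String sample = "<div>\u4e2d\u6587</div>";
        Charset[] candidates = {
            StandardCharsets.UTF_16, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + " keeps the text intact: " + survives(sample, cs));
        }
    }
}
```

UTF-16 and UTF-8 can both represent Chinese characters, whereas a single-byte charset such as ISO-8859-1 replaces them with a substitute byte, so the text does not survive the round trip.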

I also downloaded the Selenium nodes. These will definitely come in handy!

Thanks for the help!