HTML parser incorrectly normalizes XML tags

Dear all,

Hope you can help me with the following.

When I use the HttpRetriever to request information from an API server, I sometimes receive an "empty" XML tag that serves as both the opening and the closing tag. Here is an example: <prism:pageRange /> (see point 1 below).

It seems that the HTML parser nodes in KNIME "normalize" these types of "empty" XML tags, but the current HtmlParser does not always do this correctly: it treats the empty tag as the parent of the next tag (see point 2 below). The old, deprecated NekoHtmlParser seems to have no problems "normalizing" these "empty" XML tags correctly (see point 3 below).

Why does the HtmlParser node cause this problem, and how can I best solve it? Should I simply use the NekoHtmlParser instead?

Many thanks in advance,

Ruben

 

1. Retrieved result via Web Browser (Chrome):

    <entry>
      <prism:url>***</prism:url>
      <dc:title>***</dc:title> 
      <prism:pageRange /> 
      <prism:doi>***</prism:doi> 
    </entry>

2. Parsed result via HtmlParser (Palladian for KNIME 1.6.100.v201607071900):

    <entry ...>
        <prismU00003Aurl>***</prismU00003Aurl>
        <dcU00003Atitle>***</dcU00003Atitle>
        <prismU00003Apagerange>
            <prismU00003Adoi>***</prismU00003Adoi>
        </prismU00003Apagerange>
    </entry>

3. Parsed result via NekoHtmlParser:

    <entry ...>
        <prism:url>***</prism:url>
        <dc:title>***</dc:title>
        <prism:pagerange>
        </prism:pagerange>

        <prism:doi>***</prism:doi>
    </entry>

 

Dear Ruben,

looks like you're processing XML data? I would rather recommend using KNIME's integrated XML parser instead of the HTML parsers. The latter are made for sanitizing and parsing websites which are usually not in a valid XML structure.
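
To illustrate the difference, here is a minimal sketch in Python (not KNIME, and not the Palladian implementation) of how a standard XML parser reads the entry from point 1. The namespace URIs are placeholders, since the real declarations are elided ("...") in the post, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: the real ones are not shown in the thread.
SAMPLE = """\
<entry xmlns:prism="http://example.org/prism" xmlns:dc="http://example.org/dc">
  <prism:url>***</prism:url>
  <dc:title>***</dc:title>
  <prism:pageRange />
  <prism:doi>***</prism:doi>
</entry>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.split("}")[-1]


def child_names(xml_text):
    """Return the local names of the root's direct children, in order."""
    root = ET.fromstring(xml_text)
    return [local_name(child) for child in root]


def children_of(xml_text, name):
    """Return the local names of the children of the first child called `name`."""
    root = ET.fromstring(xml_text)
    for child in root:
        if local_name(child) == name:
            return [local_name(grandchild) for grandchild in child]
    raise KeyError(name)
```

With an XML parser, `child_names(SAMPLE)` lists `pageRange` and `doi` as siblings, and `children_of(SAMPLE, "pageRange")` is empty, so the self-closing tag stays an empty element instead of swallowing the next tag.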

Best regards,
Philipp

Hi Philipp,

Thank you for your quick response.

I am indeed processing XML data; however, I first need to retrieve it from a server via an API call to a URL. For example: http://xisbn.worldcat.org/webservices/xid/isbn/9783527406647?method=getEditions

Perhaps I am simply doing it the wrong way, but the only way I know to retrieve data from URL-based APIs in KNIME is via the HttpRetriever node. And the only way I'm able to process the "result" from the HttpRetriever is by parsing it first with the HtmlParser or NekoHtmlParser. I've already tried to find a way around the HtmlParser, but no luck so far.

I've been doing this since I started using KNIME (early 2014) and never had this problem before, although this is the first time I have encountered these self-closing XML tags :(

Hope you can help and/or have tips.

Again, many thanks in advance!

Ruben

Hi Ruben,

I would recommend the following workflow, which uses the native XML parser. You can still use the HttpRetriever node, followed by an HttpResultDataExtractor to get the result string, which you can then parse into an XML column using the "String to XML" node.
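
For readers outside KNIME, the same three steps can be sketched roughly as follows in Python. This is only an analogue of the workflow, not what the nodes do internally; the function names are made up for this example, and UTF-8 is assumed as the response encoding.

```python
import urllib.request
import xml.etree.ElementTree as ET


def retrieve(url, timeout=30):
    """Step 1 (HttpRetriever analogue): fetch the raw HTTP response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_string(body, encoding="utf-8"):
    """Step 2 (HttpResultDataExtractor analogue): decode the bytes to a string.

    The encoding is an assumption; a real client should honor the
    charset announced by the server.
    """
    return body.decode(encoding)


def string_to_xml(text):
    """Step 3 ("String to XML" analogue): parse with an XML parser, not an HTML one."""
    return ET.fromstring(text)
```

Chaining the calls as `string_to_xml(extract_string(retrieve(url)))` yields an element tree in which self-closing tags are plain empty elements.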

Hope this solves your issue!

Kind regards,
Philipp

Hi Philipp,

Thanks for your help! This solves the problem.

Kind regards,

Ruben
