Reading out XML tags

jgawin · March 8, 2023, 10:11am

Hello all,
I am still new to the KNIME world and am trying to achieve the following. I have a large XML file with any number of XML tags. I would like to read this. So far I have read in the XML file with the XML reader and walked over it with XPath. Unfortunately, this always returns the value of the XML tag and not the tag itself. Do you have any idea what I am doing wrong here or how I can achieve this?

Here is an excerpt of the XML file as an example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<product product-id="12345">
        <tag1>value1</tag1>
        <tag2>value2</tag2>
        <tag3>value3</tag3>
        <custom-attributes>
            <custom-attributes attribute-id="tag4">Value4</custom-attributes>
            <custom-attributes attribute-id="tag5">value5</custom-attributes>

What I am trying to read out are tag1-tag5. The tags are also on two different levels. I hope that was understandable.

Thanks a lot

AlexanderFillbrunn · March 8, 2023, 10:43am

Hi @jgawin,
The following XPath will extract all values:

/product//*[starts-with(@attribute-id, 'tag') or starts-with(local-name(), 'tag')]/text()

The return type must be String Cell and you need to check “Multiple Rows” at the bottom of the XPath’s configuration, as seen in the attached screenshot.

Kind regards,
Alexander

jgawin · March 8, 2023, 10:52am

Thank you for the quick answer. Does this also work if I don’t know the tag names? The XML file is very large and I just want to read out all the tags that could be contained there.

AlexanderFillbrunn · March 8, 2023, 11:07am

Hi,
Well, you need to know something about them. If you want to get all the leaf elements in your structure, that is all elements without children, you can use the following XPath query:

/product//*[count(.//*) = 0]/text()

Kind regards,
Alexander

jgawin · March 9, 2023, 10:03am

Unfortunately, the solution doesn’t quite work for me yet. But maybe I’m still doing something wrong. I’ll try to define it more clearly again.
I have a very large XML file with any number of nodes, as in the example above. Unfortunately, I don’t know the names of the nodes. I only know the top node product. I would like to read out everything below it. If I look again at the first example of mine and I run XPath over it, I get these results:
value1
value2
value3
value4
value5

but I need the result:
tag1
tag2
tag3
tag4
tag5

Is there a way to retrieve this via some kind of wildcard? I hope that was more understandable

Thanks a lot.

AlexanderFillbrunn · March 9, 2023, 10:41am

Hi,
You need to figure out some rules to apply. The problem is that you have some nodes and some attributes that you want to extract. How do you know that is not a tag element? Do they all start with “tag”? Then you can just enter 2 queries in the XPath node:

/product//*[starts-with(@attribute-id, 'tag')]/@attribute-id
/product//*[starts-with(local-name(), 'tag')]/local-name()

The first one finds all attributes named “attribute-id” with a value that starts with “tag” and extracts their value, and the second one finds all nodes where the name starts with “tag” and extracts their node name. Once you got those two lists, you can join them back together in KNIME, e.g. using Ungroup, Column Splitter, Column Rename, Concatenate.
Kind regards,
Alexander

system · June 7, 2023, 10:42am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.