I have inconsistent results with XML processing

Thiemo.Kellner · April 18, 2024, 2:39pm

Hi

I am trying to process an XML file, but some XPath processing does not work as expected. The very same XPath node gives a good result for some rows and a bad one for others.

Good result:

Bad result:

Config:

XML snippet:

I uploaded a test case to the Hub.

I am not sure whether I am missing something or whether there is a bug (which would hugely surprise me).

Kind regards

Thiemo

takbb · April 26, 2024, 6:47pm

Hi @Thiemo.Kellner ,

I’m not sure if it is a bug, but I cannot see that the behaviour is desirable, and I cannot see a way to avoid what is happening using just a single XPath node.

What it looks like is that all of the various attributes are returned, and then they effectively get shifted to the top of the table rather than leaving “blanks” where an attribute is missing.

If, for example, I restrict the XPath to just acting on row 4 of your input data, it returns this:

What is evident to me is that all non-null values have been shifted upwards regardless of which row they should be associated with!
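The shifting can be reproduced outside KNIME. What follows is a minimal Python sketch, not the XPath node's actual implementation. The sample XML and the attribute names `name`, `type`, and `length` are invented for illustration, since the real data from the thread is not shown here. It runs one independent query per attribute and lays the results side by side, which I assume mirrors what the node output looks like.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real XML from the thread is not shown.
XML = """
<table>
  <column name="id" type="int"/>
  <column name="title" type="varchar" length="50"/>
  <column name="created"/>
  <column name="note" type="text" length="200"/>
</table>
"""

root = ET.fromstring(XML)

# One independent query per attribute over the whole table.
# A column that lacks the attribute contributes nothing to the list.
names = [c.get("name") for c in root.findall("./column[@name]")]
types = [c.get("type") for c in root.findall("./column[@type]")]
lengths = [c.get("length") for c in root.findall("./column[@length]")]


def pad(values, size):
    """Fill the tail with None so the lists can be zipped into rows."""
    return values + [None] * (size - len(values))


# Placing the lists side by side loses the link to the source element:
# every non-missing value moves up to fill the gaps above it.
rows = list(zip(names, pad(types, len(names)), pad(lengths, len(names))))
for row in rows:
    print(row)
```

In this made-up data, `created` ends up paired with `text` (which belongs to `note`), and `id` gets the length `50` (which belongs to `title`). The blanks collect at the bottom instead of staying in their rows.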

The only way I see at the moment of returning attributes associated with their “Columns” (and I mean the “Column” elements in your XML data) is to first use XPath to return a collection of column elements and ungroup these. Then do your further XPath on the returned elements:

Note that the following is your original XPath, but now at the /column/ level rather than the /table/column/ level, as it is processing the “column” XML generated by the previous XPath node:
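The same two-step idea can be sketched in Python. This is only an illustration under assumptions: the sample XML, the attribute names, and the helper `extract_columns` are hypothetical and stand in for the two chained XPath nodes with Ungroup in between. Step 1 selects one element per output row, and step 2 queries relative to that single element, so a missing attribute stays a missing value in its own row.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real XML from the thread is not shown.
XML = """
<table>
  <column name="id" type="int"/>
  <column name="title" type="varchar" length="50"/>
  <column name="created"/>
  <column name="note" type="text" length="200"/>
</table>
"""


def extract_columns(xml_text):
    """Return one (name, type, length) tuple per <column> element."""
    root = ET.fromstring(xml_text)
    records = []
    # Step 1: one element per row (the "collection of columns, ungrouped").
    for column in root.findall("./column"):
        # Step 2: look up attributes relative to this element only.
        # Element.get returns None when the attribute is absent.
        records.append(
            (column.get("name"), column.get("type"), column.get("length"))
        )
    return records


records = extract_columns(XML)
for record in records:
    print(record)
```

Because every lookup is scoped to one element, the number of output rows always equals the number of `column` elements, and no value can drift into another row.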

My thoughts - it feels like a bug as I can’t see a config option that fixes it, and it isn’t a correct or useful result!

Have you tried using XML To JSON and then JSON Path, selecting each of the required data elements as a collection query like so:

And then using Ungroup

If there is a bug in the XPath node, you may find that the JSON Path is better supported.

mwiegand · April 26, 2024, 8:45pm

Hi @Thiemo.Kellner and @takbb,

Ahh, XML, that brings back some nice memories. What you experienced, Thiemo, caused me quite some pain until I understood what needs to be done.

Takbb already pointed towards the solution:

“I cannot see a way to avoid what is happening using just a single XPath node.”

You need to follow the “Divide et impera” principle or, in other words, try not to do everything at once. Use one XPath node to extract the first axis / elements as nodes, then another XPath node to get their data or, again, XML data.

That way you can also parallelize the extraction and ensure data coherency. Another trick would be to extract the data as a collection, always ensuring that “return missing cell on empty string” is checked.

To add a bit more context: if an XPath does not match, it does not give back a result, not even an empty one. The values that do match are then no longer aligned with the rows they came from. Here is an example of the issue you experienced:

Here is the example workflow:

Cheers
Mike