Nested XML parsing with XPath?

serendipitytech · April 3, 2019, 8:14pm

I’ve been working on this problem for a while now and have read several of the topics on here, but I haven’t made much progress.

I’ve got XML files that contain resource metadata; the structure is pasted below for reference. Each resource is a collection of nested tags:
<imsmd:lom>
then under that are categories:

and within those category tags is the data I’m looking for, but it sits in still more nested tags:
<imsmd:title>
<imsmd:langstring>Valeria la vaca</imsmd:langstring>
</imsmd:title>
From this, the output I’m looking to get is in two columns:
title | Valeria la vaca

Further in I have things like this:
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>word count</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>126</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>

From which I’m looking for output like this:
word count | 126

So, I’m trying to build a key/value pair type of output, all tied to the identifier attribute on the opening resource tag.

Each resource can have a different number of entries. They will all have a title entry under the “general” tag, but under the classification tag they could have any number of entries, so I wanted to build the key/value pair setup to cover whatever might be included.
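
For readers outside the workflow tool, here is a minimal Python sketch of that key/value idea, assuming the standard library's xml.etree.ElementTree and a trimmed-down copy of the sample resource. The function name classification_pairs and the NS prefix mapping are illustrative choices, not part of the original files. It reads every taxonpath, however many there are, and treats a whitespace-only entry as an empty string.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample resource; the prefix is a local choice.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}

# Trimmed-down sample with two classification entries, one of them empty.
SAMPLE = """<resource identifier="123" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
          xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <metadata><imsmd:lom><imsmd:classification>
    <imsmd:taxonpath>
      <imsmd:source><imsmd:langstring>word count</imsmd:langstring></imsmd:source>
      <imsmd:taxon><imsmd:entry><imsmd:langstring>192</imsmd:langstring></imsmd:entry></imsmd:taxon>
    </imsmd:taxonpath>
    <imsmd:taxonpath>
      <imsmd:source><imsmd:langstring>passage category</imsmd:langstring></imsmd:source>
      <imsmd:taxon><imsmd:entry><imsmd:langstring>
      </imsmd:langstring></imsmd:entry></imsmd:taxon>
    </imsmd:taxonpath>
  </imsmd:classification></imsmd:lom></metadata>
</resource>"""


def classification_pairs(root):
    """Return {source label: entry text} for every taxonpath under root."""
    pairs = {}
    for path in root.iterfind(".//imsmd:taxonpath", NS):
        key = path.findtext("imsmd:source/imsmd:langstring", default="", namespaces=NS)
        value = path.findtext("imsmd:taxon/imsmd:entry/imsmd:langstring", default="", namespaces=NS)
        # Strip so that a langstring holding only a line break becomes "".
        pairs[key.strip()] = value.strip()
    return pairs


pairs = classification_pairs(ET.fromstring(SAMPLE))
```

Because the loop visits each taxonpath in turn, the number of entries per resource does not need to be known in advance.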

As an alternative, I’ve considered extracting only the specific data I’m looking for. However, when building the path in the XPath reader, it appears to be tied to a specific position, which is not consistent across files. For example, this one looks for the 6th entry:
/dns:resource/dns:metadata/imsmd:lom/imsmd:classification/imsmd:taxonpath[6]/imsmd:taxon/imsmd:entry/imsmd:langstring
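
One way around the positional index is to pick the taxonpath by its source label instead of its position. In standard XPath 1.0 that would be a predicate such as imsmd:taxonpath[imsmd:source/imsmd:langstring='LastModifiedDate'] in place of imsmd:taxonpath[6]; whether a given XPath tool accepts that predicate is an assumption to check. The Python sketch below does the same lookup with the standard library's xml.etree.ElementTree, using a plain loop because ElementTree supports only a subset of XPath. The helper name entry_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample resource; the prefix is a local choice.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}

# Two taxonpath entries taken from the sample classification block.
SAMPLE = """<imsmd:classification xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <imsmd:taxonpath>
    <imsmd:source><imsmd:langstring>passage type</imsmd:langstring></imsmd:source>
    <imsmd:taxon><imsmd:entry><imsmd:langstring>Reading</imsmd:langstring></imsmd:entry></imsmd:taxon>
  </imsmd:taxonpath>
  <imsmd:taxonpath>
    <imsmd:source><imsmd:langstring>LastModifiedDate</imsmd:langstring></imsmd:source>
    <imsmd:taxon><imsmd:entry><imsmd:langstring>2018-08-24T12:41:54Z</imsmd:langstring></imsmd:entry></imsmd:taxon>
  </imsmd:taxonpath>
</imsmd:classification>"""


def entry_for(root, source_name):
    """Return the entry whose source label equals source_name, or None if absent."""
    for path in root.iterfind(".//imsmd:taxonpath", NS):
        label = path.findtext("imsmd:source/imsmd:langstring", default="", namespaces=NS)
        if label.strip() == source_name:
            entry = path.findtext("imsmd:taxon/imsmd:entry/imsmd:langstring", default="", namespaces=NS)
            return entry.strip()
    return None


root = ET.fromstring(SAMPLE)
modified = entry_for(root, "LastModifiedDate")  # found regardless of position
missing = entry_for(root, "word count")         # not in this trimmed sample
```

Matching on the label means the lookup keeps working when a file has fewer or more taxonpath entries, or lists them in a different order.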

I’m kinda lost at how to parse through these kinds of files that aren’t structured consistently. Any pointers would be greatly appreciated!

Full resource example:

<?xml version="1.0" encoding="UTF-8"?>
<resource href="passages/123.html" identifier="123" type="webcontent" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1" xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2" xmlns:imsqti="http://www.imsglobal.org/xsd/imsqti_v2p1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <metadata>
        <imsmd:lom>
            <imsmd:general>
                <imsmd:title>
                    <imsmd:langstring>resource title</imsmd:langstring>
                </imsmd:title>
            </imsmd:general>
            <imsmd:classification>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>passage type</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>Reading</imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>word count</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>192</imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>passage category</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>
                            </imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>LanguageID</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>2</imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>LanguageEquivalentID</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>E26426</imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
                <imsmd:taxonpath>
                    <imsmd:source>
                        <imsmd:langstring>LastModifiedDate</imsmd:langstring>
                    </imsmd:source>
                    <imsmd:taxon>
                        <imsmd:entry>
                            <imsmd:langstring>2018-08-24T12:41:54Z</imsmd:langstring>
                        </imsmd:entry>
                    </imsmd:taxon>
                </imsmd:taxonpath>
            </imsmd:classification>
        </imsmd:lom>
    </metadata>
    <file href="passages/123.html">
    </file>
    <file href="style/123.css">
    </file>
    <file href="images/664c1e120c5a43d7afd63b64361f1a12_W.png">
    </file>
</resource>

Martyna · April 8, 2019, 7:51am

Hi,

I played around a bit using your example. I think it’s possible using two XPath nodes plus some further table manipulation.
Is this something you want to get out of your xml file?

serendipitytech · April 8, 2019, 12:49pm

That’s awesome! That’s the exact format I’m looking for, with a couple little adjustments -
The resource title would be another key value pair of “header” “entry”
And in the column where “title” is, would be the identifier value from the initial resource tag.
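
As a rough illustration of that adjusted layout, the Python sketch below builds (identifier, header, entry) rows, with the title as just another row. It assumes the standard library's xml.etree.ElementTree and a shortened copy of the sample resource; the function name resource_rows is hypothetical and this is not the workflow discussed in this thread.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample resource; the prefix is a local choice.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}

# Shortened sample: one title and one classification entry.
SAMPLE = """<resource identifier="123" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
          xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <metadata><imsmd:lom>
    <imsmd:general><imsmd:title>
      <imsmd:langstring>resource title</imsmd:langstring>
    </imsmd:title></imsmd:general>
    <imsmd:classification><imsmd:taxonpath>
      <imsmd:source><imsmd:langstring>word count</imsmd:langstring></imsmd:source>
      <imsmd:taxon><imsmd:entry><imsmd:langstring>192</imsmd:langstring></imsmd:entry></imsmd:taxon>
    </imsmd:taxonpath></imsmd:classification>
  </imsmd:lom></metadata>
</resource>"""


def resource_rows(root):
    """Return (identifier, header, entry) tuples, starting with the title row."""
    identifier = root.get("identifier")  # attribute on the opening resource tag
    title = root.findtext(".//imsmd:general/imsmd:title/imsmd:langstring",
                          default="", namespaces=NS)
    rows = [(identifier, "title", title.strip())]
    for path in root.iterfind(".//imsmd:taxonpath", NS):
        header = path.findtext("imsmd:source/imsmd:langstring", default="", namespaces=NS)
        entry = path.findtext("imsmd:taxon/imsmd:entry/imsmd:langstring", default="", namespaces=NS)
        rows.append((identifier, header.strip(), entry.strip()))
    return rows


rows = resource_rows(ET.fromstring(SAMPLE))
```

Each tuple corresponds to one output row, so the identifier repeats in the first column for every header/entry pair of the same resource.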

Martyna · April 8, 2019, 1:07pm

Good to hear!
Is the identifier in your example then “123”?

Martyna · April 8, 2019, 1:44pm

Like that?

serendipitytech · April 8, 2019, 3:01pm

Yes, exactly! I have been trying every permutation of the XPath tool to get there, but I’m just not getting it right.

Martyna · April 8, 2019, 3:08pm

Actually, it’s not only the XPath node. Some additional manipulation was necessary to turn the title into a header + entry pair. I’m not sure it’s the easiest way to do that, but at least it’s a workaround.

For the solution please check the workflow below.
example_workflow.knar (47.1 KB)

serendipitytech · April 9, 2019, 9:04pm

That is so awesome! I keep making the same mistake of trying to do everything with one tool instance, seeing it broken down definitely makes more sense.

Next I’ll see if I can get this to loop through a directory of XML files. I’ve had some success getting that to work on some projects, but it seems slightly different each time I try it.
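
For comparison, looping over a directory outside the workflow tool could look like the Python sketch below. It is a sketch under assumptions: the files are taken to end in .xml and sit directly in one folder, the helper names rows_from_file and rows_from_directory are made up, and the demo writes two tiny stand-in files to a temporary folder instead of reading real resources.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace URI copied from the sample resource; the prefix is a local choice.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}

# Stand-in resource used only to create demo files.
TEMPLATE = """<resource identifier="{ident}" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
          xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <metadata><imsmd:lom><imsmd:classification><imsmd:taxonpath>
    <imsmd:source><imsmd:langstring>word count</imsmd:langstring></imsmd:source>
    <imsmd:taxon><imsmd:entry><imsmd:langstring>{count}</imsmd:langstring></imsmd:entry></imsmd:taxon>
  </imsmd:taxonpath></imsmd:classification></imsmd:lom></metadata>
</resource>"""


def rows_from_file(xml_path):
    """Return (identifier, header, entry) tuples for one resource file."""
    root = ET.parse(str(xml_path)).getroot()
    identifier = root.get("identifier")
    rows = []
    for path in root.iterfind(".//imsmd:taxonpath", NS):
        header = path.findtext("imsmd:source/imsmd:langstring", default="", namespaces=NS)
        entry = path.findtext("imsmd:taxon/imsmd:entry/imsmd:langstring", default="", namespaces=NS)
        rows.append((identifier, header.strip(), entry.strip()))
    return rows


def rows_from_directory(directory):
    """Collect rows from every .xml file in directory, in file-name order."""
    rows = []
    for xml_path in sorted(Path(directory).glob("*.xml")):
        rows.extend(rows_from_file(xml_path))
    return rows


# Demo: write two stand-in files to a temporary folder and read them back.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text(TEMPLATE.format(ident="123", count="192"), encoding="utf-8")
    Path(tmp, "b.xml").write_text(TEMPLATE.format(ident="124", count="126"), encoding="utf-8")
    all_rows = rows_from_directory(tmp)
```

Sorting the file names keeps the row order stable between runs, which makes it easier to compare results when the folder contents change.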

Thank you so much for taking the time to work this out, I so greatly appreciate it!
