I’ve been working on this problem for a while now and have read several of the topics on here, but I haven’t made much progress.
I’ve got XML files that contain resource metadata; the structure is pasted below for reference. Each resource is a collection of nested tags, starting with:
<imsmd:lom>
Under that are category tags (general and classification in the full example below), and within those category tags is the data I’m looking for, nested in still more tags:
<imsmd:title>
<imsmd:langstring>Valeria la vaca</imsmd:langstring>
</imsmd:title>
From this, the output I’m looking to get is in two columns:
title | Valeria la vaca
Further in I have things like this:
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>word count</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>126</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
From which I’m looking for output like this:
word count | 126
So, I’m trying to build key/value pair output, all tied to the identifier attribute on the opening resource tag (identifier="123" in the full example below).
Each resource can have a different number of entries. They will all have a title entry under the “general” tag, but under the “classification” tag they could have any number of entries, so I wanted to build the key/value pair setup to cover whatever might be included.
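To make the key/value idea concrete, here is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (nothing in this post depends on that choice). The helper names `_text` and `resource_rows` and the trimmed `SAMPLE` string are hypothetical. The namespace URI is copied from the full example. The loop visits every taxonpath, so it does not matter how many entries a resource has.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the resource example; the prefix used here is arbitrary.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}


def _text(parent, path):
    """Return the stripped text found at `path` under `parent`, or '' if absent or empty."""
    node = parent.find(path, NS)
    return (node.text or "").strip() if node is not None else ""


def resource_rows(xml_text):
    """Hypothetical helper: return (identifier, key, value) rows for one resource."""
    root = ET.fromstring(xml_text)
    ident = root.get("identifier")
    # The title lives under the "general" category.
    rows = [(ident, "title", _text(root, ".//imsmd:general/imsmd:title/imsmd:langstring"))]
    # Every taxonpath under "classification" yields one key/value pair.
    for tp in root.iterfind(".//imsmd:classification/imsmd:taxonpath", NS):
        key = _text(tp, "imsmd:source/imsmd:langstring")
        value = _text(tp, "imsmd:taxon/imsmd:entry/imsmd:langstring")
        rows.append((ident, key, value))
    return rows


# Trimmed-down copy of the resource example, for illustration only.
SAMPLE = """<resource identifier="123" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
          xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <metadata><imsmd:lom>
    <imsmd:general><imsmd:title>
      <imsmd:langstring>resource title</imsmd:langstring>
    </imsmd:title></imsmd:general>
    <imsmd:classification>
      <imsmd:taxonpath>
        <imsmd:source><imsmd:langstring>word count</imsmd:langstring></imsmd:source>
        <imsmd:taxon><imsmd:entry><imsmd:langstring>192</imsmd:langstring></imsmd:entry></imsmd:taxon>
      </imsmd:taxonpath>
      <imsmd:taxonpath>
        <imsmd:source><imsmd:langstring>passage category</imsmd:langstring></imsmd:source>
        <imsmd:taxon><imsmd:entry><imsmd:langstring>
        </imsmd:langstring></imsmd:entry></imsmd:taxon>
      </imsmd:taxonpath>
    </imsmd:classification>
  </imsmd:lom></metadata>
</resource>"""

for ident, key, value in resource_rows(SAMPLE):
    print(f"{ident} | {key} | {value}")
```

An entry whose langstring holds only whitespace, like “passage category” in the example, comes back as an empty value rather than being dropped.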
As an alternative, I’ve considered extracting only the specific data I’m looking for. However, when building the XPath reader, the path appears to be tied to a specific position, which is not consistent across files. For example, this one looks for the 6th taxonpath entry:
/dns:resource/dns:metadata/imsmd:lom/imsmd:classification/imsmd:taxonpath[6]/imsmd:taxon/imsmd:entry/imsmd:langstring
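A positional index like [6] breaks as soon as a resource has its entries in a different order or count. One possible alternative is to select a taxonpath by the text of its source instead of by position. The sketch below assumes Python's standard xml.etree.ElementTree rather than the XPath reader mentioned above; `taxon_value` and the small `SAMPLE` fragment are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the resource example.
NS = {"imsmd": "http://www.imsglobal.org/xsd/imsmd_v1p2"}


def taxon_value(root, label):
    """Hypothetical helper: entry text of the taxonpath whose source text equals `label`.

    Returns None when no taxonpath carries that label.
    """
    for tp in root.iterfind(".//imsmd:taxonpath", NS):
        source = tp.findtext("imsmd:source/imsmd:langstring", default="", namespaces=NS)
        if source.strip() == label:
            entry = tp.findtext("imsmd:taxon/imsmd:entry/imsmd:langstring",
                                default="", namespaces=NS)
            return entry.strip()
    return None


# Minimal fragment for illustration; the order of the entries does not matter.
SAMPLE = """<imsmd:classification xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <imsmd:taxonpath>
    <imsmd:source><imsmd:langstring>LanguageID</imsmd:langstring></imsmd:source>
    <imsmd:taxon><imsmd:entry><imsmd:langstring>2</imsmd:langstring></imsmd:entry></imsmd:taxon>
  </imsmd:taxonpath>
  <imsmd:taxonpath>
    <imsmd:source><imsmd:langstring>word count</imsmd:langstring></imsmd:source>
    <imsmd:taxon><imsmd:entry><imsmd:langstring>192</imsmd:langstring></imsmd:entry></imsmd:taxon>
  </imsmd:taxonpath>
</imsmd:classification>"""

ROOT = ET.fromstring(SAMPLE)
print(taxon_value(ROOT, "word count"))
```

In a tool with full XPath 1.0 support, the same idea can usually be written as a predicate on the source text, for example imsmd:taxonpath[imsmd:source/imsmd:langstring='word count'], though whether a given XPath reader accepts that is something to check.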
I’m kinda lost on how to parse these kinds of files that aren’t structured consistently. Any pointers would be greatly appreciated!
Full resource example:
<?xml version="1.0" encoding="UTF-8"?>
<resource href="passages/123.html" identifier="123" type="webcontent" xmlns="http://www.imsglobal.org/xsd/imscp_v1p1" xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2" xmlns:imsqti="http://www.imsglobal.org/xsd/imsqti_v2p1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<imsmd:lom>
<imsmd:general>
<imsmd:title>
<imsmd:langstring>resource title</imsmd:langstring>
</imsmd:title>
</imsmd:general>
<imsmd:classification>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>passage type</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>Reading</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>word count</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>192</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>passage category</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>
</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>LanguageID</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>2</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>LanguageEquivalentID</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>E26426</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
<imsmd:taxonpath>
<imsmd:source>
<imsmd:langstring>LastModifiedDate</imsmd:langstring>
</imsmd:source>
<imsmd:taxon>
<imsmd:entry>
<imsmd:langstring>2018-08-24T12:41:54Z</imsmd:langstring>
</imsmd:entry>
</imsmd:taxon>
</imsmd:taxonpath>
</imsmd:classification>
</imsmd:lom>
</metadata>
<file href="passages/123.html">
</file>
<file href="style/123.css">
</file>
<file href="images/664c1e120c5a43d7afd63b64361f1a12_W.png">
</file>
</resource>