XPath query help needed

I’m trying to use XPath to extract sections from HTML snippets like the one shown below:

<?xml version='1.0' encoding='UTF-8'?>
<div class="large-12 columns" xmlns="http://www.w3.org/1999/xhtml">
    <h3>Overview</h3>
    <p>Siemens Totally Integrated Administrator (TIA) fails to properly set the module search path to be used by a privileged Node.js component, which can allow an unprivileged Windows user to run arbitrary code with SYSTEM privileges. The PCS neo administration console is reported to be affected as well.</p>
    <h3>Description</h3>
    <p>Siemens TIA runs a privileged Node.js component. The Node.js server fails to properly set the module search path. Because of this, Node.js will look for modules in the <code>C:\node_modules\</code> directory when the server is started. Because unprivileged Windows users can create subdirectories off of the system root, a user can create this directory and place a specially-crafted <code>.js</code> file in it to achieve arbitrary code execution with SYSTEM privileges when the server starts.</p>
    <h3>Impact</h3>
    <p>By placing a specially-crafted JS file in the <code>C:\node_modules\</code> directory, an unprivileged user may be able to execute arbitrary code with SYSTEM privileges on a Windows system with the vulnerable Siemens TIA or PCS neo administration console software installed.</p>
    <h3>Solution</h3>
    <h4>Apply an update</h4>
    <p>This issue is addressed in TIA Administrator <a href="https://support.industry.siemens.com/cs/ww/en/view/114358/">V1.0 SP2 Upd2</a>. PCS neo administration console users should apply the mitigations described in <a href="https://support.industry.siemens.com/cs/ww/en/view/109771524">Industrial Security in SIMATIC PCS neo</a>.</p>
    <p>For more details see Siemens Security Advisory <a href="https://cert-portal.siemens.com/productcert/pdf/ssa-428051.pdf">SSA-428051</a>.</p>
    <h3>Acknowledgements</h3>
    <p>This vulnerability was reported by Will Dormann of the CERT/CC.</p>
    <p>This document was written by Will Dormann.</p>
</div>

I want to extract all the content between pairs of h3 tags, for example the ‘Overview’ section, which is bracketed by the h3 headings ‘Overview’ and ‘Description’.

The query:

//*[preceding-sibling::h3[. = 'Overview'] and following-sibling::h3[. = 'Description']]

works in some of the online testers but not in the KNIME XPath node.

What do I need to change to make it work in the KNIME XPath node?

P.S. KNIME 3.7.2

Hi acommons,

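Your snippet declares a default namespace (xmlns="http://www.w3.org/1999/xhtml"), so an unprefixed name like h3 in the XPath does not match the namespaced h3 elements. It seems that you have to use the namespace prefix that the XPath node sets automatically in its Namespace tab (it defaults to “dns”). So your query would be: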

//*[preceding-sibling::dns:h3[. = 'Overview'] and following-sibling::dns:h3[. = 'Description']]
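
If you want to see the namespace effect outside KNIME, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not what the XPath node does internally: the snippet is a shortened, made-up version of your document, and the `section` helper is hypothetical. Note also that `xml.etree.ElementTree` only supports a subset of XPath (no `preceding-sibling` axis), so the "between two headings" part is done with a plain loop instead of the XPath above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# "dns" mirrors the KNIME default prefix; any prefix bound to the same URI works.
NS = {"dns": XHTML_NS}

# Shortened, illustrative stand-in for the original snippet.
SNIPPET = """<div class="large-12 columns" xmlns="http://www.w3.org/1999/xhtml">
    <h3>Overview</h3>
    <p>First overview paragraph.</p>
    <h3>Description</h3>
    <p>Description paragraph.</p>
</div>"""

root = ET.fromstring(SNIPPET)

# Unprefixed name: looks for h3 in *no* namespace, so nothing matches.
unprefixed = root.findall("h3")
# Prefixed name: resolves to {http://www.w3.org/1999/xhtml}h3 and matches.
prefixed = root.findall("dns:h3", NS)


def section(parent, start, end):
    """Hypothetical helper: child elements between the h3 titled
    `start` and the h3 titled `end` (both headings excluded)."""
    h3_tag = f"{{{XHTML_NS}}}h3"
    collected, inside = [], False
    for child in parent:
        if child.tag == h3_tag:
            if child.text == start:
                inside = True
                continue
            if child.text == end:
                break
        if inside:
            collected.append(child)
    return collected


overview = section(root, "Overview", "Description")
print(len(unprefixed), len(prefixed), [el.text for el in overview])
# prints: 0 2 ['First overview paragraph.']
```

The unprefixed lookup finds nothing while the prefixed one finds both headings, which is the same reason the original query comes back empty in the XPath node.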

Hope this helps :slight_smile:

Regards Lars


Hi Lars,

That does it! Many thanks, a nice simple solution :smiley:

