Using XPath functions in the XPath query field of the XPath node

gchatzip · March 23, 2014, 1:40pm

Hello,

I am wondering whether, and how, it is possible to use XPath functions in the XPath node that KNIME provides.

To clarify what I want, consider the following example:

        <h2 class="foo"> bar </h2>
        <p> some text .. </p>

        <h2 class="chapter">Chapter One</h2>
        <p>This is a truly fascinating chapter.</p>

        <h2 class="chapter">Chapter Two</h2>
        <p>A worthy continuation of a fine tradition.</p>

What I want to do is select a tag based on its text content. In the above example, I would like to get the <h2> tags that contain the word "chapter".

Using XPath regular expressions, something like the following query would give me the wanted results (other alternatives exist, but they would still need built-in functions such as text()):

//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or section)

If I enter the above expression in the XPath node, I get the following error:

"WARN XPath XPath query cannot be parsed."

Is there a way to work with XPath built-in functions within the KNIME environment?

Many thanks,

George

thor · March 23, 2014, 10:10pm

KNIME doesn't support any XPath extensions, only plain XPath 1.0. So neither h:h2 nor re:test will work. Something like //h2[contains(text(), 'Chapter') or contains(text(), 'Section')] should do.
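
For readers who want to check the logic of that predicate outside KNIME, here is a minimal Python sketch. It is not how the XPath node works internally. Python's built-in ElementTree does not implement contains(), so the substring test is written by hand. The <root> wrapper and the function name matching_headings are assumptions made for illustration. Two XPath 1.0 details are mirrored: contains() is case-sensitive, and contains(text(), ...) only looks at the first text node of the element.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the first post, wrapped in a hypothetical <root>
# element so that it parses as one well-formed document.
SAMPLE = """<root>
  <h2 class="foo"> bar </h2>
  <p> some text .. </p>
  <h2 class="chapter">Chapter One</h2>
  <p>This is a truly fascinating chapter.</p>
  <h2 class="chapter">Chapter Two</h2>
  <p>A worthy continuation of a fine tradition.</p>
</root>"""


def matching_headings(xml_text, words=("Chapter", "Section")):
    """Return the text of every <h2> whose leading text contains one of the words."""
    root = ET.fromstring(xml_text)
    hits = []
    for h2 in root.iter("h2"):
        text = h2.text or ""  # leading text of the element; None when absent
        # Case-sensitive substring test, like XPath 1.0 contains().
        if any(word in text for word in words):
            hits.append(text)
    return hits


print(matching_headings(SAMPLE))
```

With the sample above this prints ['Chapter One', 'Chapter Two']. Searching for lowercase "chapter" returns nothing for the <h2> elements, because the match is case-sensitive.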

gchatzip · March 26, 2014, 5:50pm

Dear thor,

Thanks a lot for your reply. I got to the point that I wanted now!

I had assumed that the KNIME XPath node did not support XPath built-in functions at all, but I'm happy to see that XPath 1.0 functionality is available.

Thanks again,

George

boraster · November 7, 2014, 3:02pm

Dear Thor,

I have recently joined the KNIME community and I have been trying to extract data from several web pages. However, the XPath node does not work for me with predicates such as [contains(text(), 'Chapter') or contains(text(), 'Section')], which you suggested in your post above.

Can you please tell me if I am doing something wrong or I have missed something in the whole argument?

I have another problem with the HTML Parser node, which I use in conjunction with the XPath node: the web site I am trying to parse is in Turkish and contains characters like ş, ç, ü, ö, etc. These characters appear as a square-shaped symbol, meaning the character is missing or broken. Is there any way to correct this?

By the way, the source of the page I am parsing declares UTF-8 encoding.

Thanks in advance.

Bora

thor · November 7, 2014, 5:26pm

Usually this is a problem with the namespaces in the document and in the XPath expression. If you post the input file and the node's configuration, we can have a look at it.
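
The namespace effect can be reproduced outside KNIME. The following is a minimal Python sketch, not the node's implementation. It assumes the input declares the standard XHTML namespace as its default namespace, and it binds that URI to the prefix dns, which is the prefix used later in this thread. The tiny document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the parsed page declares the XHTML namespace as its default.
XHTML_NS = "http://www.w3.org/1999/xhtml"

DOC = f"""<html xmlns="{XHTML_NS}">
  <body><h2>Chapter One</h2></body>
</html>"""

root = ET.fromstring(DOC)

# Without a prefix, the path looks for <h2> in no namespace and finds nothing.
unprefixed = root.findall(".//h2")

# With a prefix bound to the document's namespace, the element is found.
prefixed = root.findall(".//dns:h2", {"dns": XHTML_NS})

print(len(unprefixed), len(prefixed))
```

This prints 0 1: the same element is invisible to the unprefixed path and visible to the prefixed one.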

boraster · November 10, 2014, 8:38am

Hi Thor,

Please take a look at the attached workflow. What I am trying to accomplish is to isolate the hotels located in Ankara from the source of the site. I don't really know what I should do.

By the way, I am looking forward to your suggestion on the HTML Parser node's character encoding problem. You can see in the output table that Turkish characters such as ş, ç, ğ, ü appear as a square-shaped symbol. Since they are all represented by the same symbol, I have no chance to replace them later.

Thanks.

tur.zip

thor · November 10, 2014, 12:54pm

The <a> element does not contain any text in your example workflow. The text you are looking for is enclosed in other XML elements, so an XPath that tests the text in <a> won't find any matches. You have to use an XPath such as "//dns:*[contains(text(), '...')]".
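
A small Python sketch of why the test on <a> fails. The markup below is hypothetical, since the real page structure is not shown in this thread, and namespaces are left out to keep it short. The second list comprehension is a rough analogue of //*[contains(text(), '...')], written by hand because ElementTree has no contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the link text sits in a nested <span>, not in <a> itself.
DOC = (
    '<div>'
    '<a href="#"><span>Ankara Hotel</span></a>'
    '<a href="#"><span>Izmir Hotel</span></a>'
    '</div>'
)

root = ET.fromstring(DOC)

# Test only the text directly inside <a>: nothing matches, because the text
# belongs to the nested <span>.
direct = [a for a in root.iter("a") if "Ankara" in (a.text or "")]

# Test the direct text of every element, whatever its name.
any_element = [el.tag for el in root.iter() if "Ankara" in (el.text or "")]

print(len(direct), any_element)
```

This prints 0 ['span']: the match is found only once the element name is no longer fixed to <a>.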

The HTML Parser is part of the Palladian community contributions, I cannot tell what the problem in their node is. You have to ask the question in their forum.

boraster · November 10, 2014, 4:20pm

thanks a lot, thor.

jonathan.schwarz · August 20, 2015, 11:36am

Hi,

I am struggling to extract all text from an XML document, excluding certain nodes like <script>, using XPath.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<html>
    <head>
        <div>
            <h1>Headtitle</h1>
            <div>
                <script>var grandpa;</script>
                <p>Head Paragraph</p>
            </div>
        </div>
        <script type="text/javascript">var newpeh;</script>
    </head>
    <body>
        <div>
            <h1>Main title</h1>
            <div>
                <p>Main paragraph</p>
                <script type="text/javascript">var grandson;</script>
                <style>teststlye</style>
                <p>Paragraph</p>
            </div>
        </div>
        <script>var child;</script>
    </body>
</html>

I tried the following XPath: //body//text()[not(parent::script)]

But this does not work, because //body//text() (or //body/*//text()) only shows me "Main title", which is only the text from the first node.

Does anyone have an idea on that?

Thanks a lot in advance!

Kind regards,

Jonathan

jonathan.schwarz · August 20, 2015, 1:27pm

Got it :)

I had not noticed the option to output the result as a CollectionCell, and in my case I also needed to take care of the namespace:

bodyText

//dns:body/*//text()[not(parent::dns:script|parent::dns:style|parent::dns:noscript)]

String(CollectionCell)
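
For comparison, a minimal Python sketch of the same idea outside KNIME: collect every text node below the children of <body> unless its parent is <script>, <style> or <noscript>. It uses the body of the sample document above without a namespace. The function name visible_text is made up for illustration. Unlike the XPath, the sketch also drops whitespace-only text to keep the result readable.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style", "noscript"}

DOC = """<html><body>
  <div>
    <h1>Main title</h1>
    <div>
      <p>Main paragraph</p>
      <script type="text/javascript">var grandson;</script>
      <style>teststlye</style>
      <p>Paragraph</p>
    </div>
  </div>
  <script>var child;</script>
</body></html>"""


def visible_text(element):
    """Collect non-blank text under element, skipping text whose parent is in SKIPPED."""
    pieces = []
    keep = element.tag not in SKIPPED
    if keep and element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        pieces.extend(visible_text(child))
        # A child's tail is text that sits inside `element`, after the child.
        if keep and child.tail and child.tail.strip():
            pieces.append(child.tail.strip())
    return pieces


body = ET.fromstring(DOC).find("body")
# Start from each child of <body>, as //body/*//text() does.
texts = [t for child in body for t in visible_text(child)]
print(texts)
```

This prints ['Main title', 'Main paragraph', 'Paragraph']: one entry per text node, which corresponds to returning the XPath result as a collection rather than as a single value.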