I am wondering whether, and how, it is possible to use XPath functions in the XPath node that KNIME provides.
To clarify my situation and what I want, consider the following example:
<h2 class="foo"> bar </h2>
<p> some text .. </p>
<h2 class="chapter">Chapter One</h2><p>This is a truly fascinating chapter.</p><h2 class="chapter">Chapter Two</h2><p>A worthy continuation of a fine tradition.</p>
What I want to do is select a tag based on its text content. In the example above, I would like to get the <h2> tags that contain the word "chapter".
Using XPath regular expressions, something like the following query would give me the wanted results (other alternatives exist, but they would still need built-in functions like text()):
//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or section)
If I enter the above expression in the XPath node, I get the following error:
"WARN XPath XPath query cannot be parsed."
Is there a way to work with XPath built-in functions within the KNIME environment?
KNIME doesn't support any XPath extensions, only plain XPath 1.0, so neither h:h2 nor re:test will work. Something like //h2[contains(text(), 'Chapter') or contains(text(), 'Section')] should do.
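To make the logic of that expression concrete, here is a minimal sketch in Python that runs outside KNIME. It is only an illustration: the standard-library ElementTree module supports a small XPath subset without contains(), so the substring test is done in plain Python instead. The wrapping <body> element and the function name headings_containing are assumptions added for this example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, wrapped in a <body> root
# (an assumption) so that it parses as well-formed XML.
SAMPLE = """<body>
<h2 class="foo"> bar </h2>
<p> some text .. </p>
<h2 class="chapter">Chapter One</h2><p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2><p>A worthy continuation of a fine tradition.</p>
</body>"""


def headings_containing(xml_text, words):
    """Return the stripped text of every <h2> whose leading text contains one of `words`."""
    root = ET.fromstring(xml_text)
    matches = []
    for h2 in root.iter("h2"):
        # h2.text is the text before the first child element, roughly what
        # contains(text(), ...) inspects for a simple element like these.
        text = h2.text or ""
        if any(word in text for word in words):
            matches.append(text.strip())
    return matches


print(headings_containing(SAMPLE, ["Chapter", "Section"]))
# prints ['Chapter One', 'Chapter Two']
```

Note that the test is case-sensitive, just like contains() in XPath 1.0, so searching for 'chapter' in lower case finds nothing here. The 'i' flag of the original re:test has no direct counterpart; in XPath 1.0 the usual workaround is to lower-case the text with translate() before calling contains().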
I have recently joined the KNIME community and have been trying to extract data from several web pages. However, the XPath node does not work with expressions such as [contains(text(), 'Chapter') or contains(text(), 'Section')], which you suggested in your last post above.
Can you please tell me whether I am doing something wrong or have missed something in the discussion?
I have another problem with the HTML Parser node, which I use in conjunction with the XPath node: the website I am trying to parse is in Turkish and contains characters like ş, ç, ü, ö, etc. These characters appear as a square-shaped character, meaning the character is missing or broken. Is there any way to correct this?
By the way, the website I am parsing declares UTF-8 encoding in its page source.
Usually this is a problem with the namespaces in the document and in the XPath expression. If you post the input file and the node's configuration, we can have a look at it.
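The namespace effect can be reproduced outside KNIME with a short Python sketch. The XHTML fragment is hypothetical, and the prefix name "dns" is an arbitrary choice for this example; what matters is that the prefix is mapped to the document's namespace URI.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A hypothetical XHTML fragment: the default namespace declared on <html>
# applies to every element below it.
DOC = f'<html xmlns="{XHTML_NS}"><body><h2>Chapter One</h2></body></html>'

root = ET.fromstring(DOC)

# Without a prefix, the path looks for <h2> in no namespace and finds nothing.
unprefixed = root.findall(".//h2")

# Mapping a prefix to the namespace URI makes the same element reachable.
prefixed = root.findall(".//dns:h2", {"dns": XHTML_NS})

print(len(unprefixed), len(prefixed))
# prints 0 1
```

So an expression that looks correct can return an empty result simply because the element names in the query are not qualified with the namespace the document uses.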
Please take a look at the attached workflow. What I am trying to accomplish is to isolate the hotels located in Ankara from the source of the site. I don't really know what I should do.
By the way, I am looking forward to your suggestion on the HTML Parser node's character encoding problem. As you can see in the output table, Turkish characters such as ş, ç, ğ, ü appear as a square-shaped symbol. Since they are all represented by the same symbol, I have no chance of replacing them later.
The <a> element does not contain any text in your example workflow. The text you are looking for is enclosed in other XML elements, so an XPath that tests for text directly in <a> won't find any matches. You have to use an XPath such as "//dns:*[contains(text(), '...')]".
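A small Python sketch shows why the test on <a> comes back empty. The markup is made up for illustration (it is not the page from the workflow, and it leaves out namespaces to keep the example short), and both helper functions are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the link text sits inside a nested <span>,
# not directly inside <a>.
DOC = (
    '<div>'
    '<a href="/hotel/1"><span>Ankara Hotel</span></a>'
    '<a href="/hotel/2"><span>Izmir Hotel</span></a>'
    '</div>'
)

root = ET.fromstring(DOC)


def direct_text_matches(root, word):
    """Tags of elements whose own leading text contains `word`
    (similar in spirit to //*[contains(text(), word)])."""
    return [el.tag for el in root.iter() if word in (el.text or "")]


def links_with_descendant_text(root, word):
    """href of every <a> whose text, including nested elements, contains `word`."""
    return [a.get("href") for a in root.iter("a") if word in "".join(a.itertext())]


print(direct_text_matches(root, "Ankara"))         # prints ['span']
print(links_with_descendant_text(root, "Ankara"))  # prints ['/hotel/1']
```

The first call matches the <span>, not the <a>, because only the <span> holds the text directly. The second call looks at all descendant text of each <a>, which is the kind of test to use when the link itself is what you want to select.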
The HTML Parser is part of the Palladian community contributions, so I cannot tell what the problem in their node is. You will have to ask the question in their forum.