I'm new to this forum and KNIME so this will be a noob question. I've searched all the topics regarding xpath but haven't found an answer that would help me solve my problem. I'm probably missing a tiny detail.
I'm trying to parse a website and extract all the links on it - much like in the KNIME forum example, only much simpler. My workflow consists of:
Table Creator node - HtmlParser node - XPath node
My XPath query is very simple: //a/@href . I've tested it outside of KNIME and it returns the correct results; in KNIME, however, I get a question mark in the output. The output table of the XPath node contains the URL I want to parse (from the table creator), the XML document (from the parser node), and the XML Result, where I get a question mark.
Can anyone shed some light on where I went wrong? I would really appreciate it.
Best regards, Ziga
Try the Retriever node before the parser node.
This can also happen if you didn't add the correct namespace(s) in the configuration. You can either add one or more there (one of the necessary ones for websites is even in the node description) or use something like [local-name()="a"], which is a more flexible approach but can get awkward quickly.
Also, there shouldn't be a problem with your particular example, but here's one vital piece of information I didn't find in the node description: KNIME only supports XPath 1.0.
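To see the namespace effect outside of KNIME, here's a small sketch with Python's standard xml.etree module (not part of the workflow, just an illustration of the XPath 1.0 rule; the dns prefix and the sample URL are made up for the demo):

```python
import xml.etree.ElementTree as ET

# Minimal XHTML snippet with a default namespace, like real XHTML pages have.
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><a href="http://example.com/page.html">link</a></body>
</html>'''

root = ET.fromstring(xhtml)

# In XPath 1.0, an unprefixed name matches only elements in *no* namespace,
# so a plain "a" finds nothing in an XHTML document:
print(root.findall('.//a'))     # []

# Binding a prefix (here "dns") to the XHTML namespace makes it work:
ns = {'dns': 'http://www.w3.org/1999/xhtml'}
links = [a.get('href') for a in root.findall('.//dns:a', ns)]
print(links)                    # ['http://example.com/page.html']
```

This is the same reason the KNIME node needs the namespace entered in its configuration (or a local-name() workaround in the query itself).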
Hi, thanks for quick replies.
@boraster - I've tried that, it gives the same result. Maybe it's the node config but I doubt it.
@Marlin - thanks, I'll look into that. However, I don't believe that's the case. I've tested it by taking the example query from the node description. The XPath node description says:
For the example when querying XHTML documents with the XPath Query:
the following namespace must be defined:
I've put in this query, added a namespace like the description says, and unselected the "incorporate namespace of the root element" option. It still returns the question mark. Any suggestions?
Can you post your workflow? Or at least the input file/address and the XPath you are using.
There are a lot of ways this node can fail, and many of them fail in this manner...
- Are you sure what you're working with is actually XHTML, not some other HTML variant?
- Have you tried my suggestion for a workaround to test if the problem is namespace-related?
- Are there actually links on the site you're testing? (I've made mistakes like this far too often...)
- Do you get any related messages in the console?
Of course there's always the possibility that the XPath node chokes on the file encoding or something, but let's assume that bad interaction is its only flaw for now...
Thanks for your help. I was able to resolve the problem. I don't know why, but my original query must have been wrong. A simple //dns:a works. If someone knows why my original query was incorrect I'd really like to find out, but right now I'm just happy I got it to work.
Best regards, Ziga
I've decided to use my original thread even though the main problem has been resolved. There is another thing causing me problems regarding XPath/XML.
When I extract links from a site I can ungroup them and put them in rows, but they come with XML tags around them; they're not pure links. I used the Link Extractor node (Palladian), but that only extracts part of each link, not the complete link. I checked whether the links in the XML are complete, and they are: if I copy them into a browser manually, they work. The Link Extractor, however, won't extract them. Any ideas on how to remove the XML tags from the links so I can extract them?
<?xml version="1.0" encoding="UTF-8"?>
<a href="http://www.XXX.com/ISKALNIK_Navodila_za_uporabo_exported_10-38-01.html" target="_blank" xmlns="http://www.w3.org/1999/xhtml">Pomoč</a>
I just need the link, nothing more. Why is the XML included?
That's probably because this tag is required according to the standard.
Maybe you could try a second xpath after the ungroup, with output type string?
Hi, thanks for the reply. Yes, I did that; I basically get the same result as with the URL extractor node.
Ugh... ok, I didn't expect that...
But I think I found a way: XML cells are accessible to the String Replacer, so if the "clean" methods don't work, the dirty ones can help. A String Replacer with pattern "(?Us)<\?.*?>\s*(.*)" and replacement "$1" worked for me.
Meaning: match Unicode, let the dot match line terminators; match the XML declaration lazily and then some whitespace; pack everything else into group 1 and return it.
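For anyone who wants to test the idea outside of KNIME, here's roughly the same replacement sketched in Python on the XML cell posted above (Python's re doesn't accept the Java (?U) flag, but it isn't needed for this pattern; the second step, pulling out the bare href, is my own addition, not part of the String Replacer setup):

```python
import re

# The serialized XML cell from the example above.
cell = '''<?xml version="1.0" encoding="UTF-8"?>
<a href="http://www.XXX.com/ISKALNIK_Navodila_za_uporabo_exported_10-38-01.html" target="_blank" xmlns="http://www.w3.org/1999/xhtml">Pomoč</a>'''

# Same idea as the String Replacer pattern: lazily match the XML
# declaration "<?xml ...?>" plus trailing whitespace, keep the rest.
stripped = re.sub(r'(?s)^<\?.*?>\s*(.*)', r'\1', cell)
print(stripped)   # just the <a ...>Pomoč</a> element

# If only the bare URL is needed, a second step reads the href value:
url = re.search(r'href="([^"]*)"', stripped).group(1)
print(url)
```

The lazy `.*?` stops at the first `>`, which is the one closing `?>`, so only the declaration is removed and the anchor element survives intact.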