I am trying to use the "XML Reader" and "XPath" components to parse Swissprot XML files. After failing for a while I found that, strangely, the original XML files, whose top-level tags contain attributes, cannot be parsed, but after removing the attributes the files parse fine. Does anyone know why this happens and how it could be overcome?
Any hints & help would be greatly appreciated!
Cheers, Andreas
==========
The original file with attributes in the top level tags cannot be parsed: (XPath: /uniprot/entry/accession; XPath query: accession)
Could you post or send me two small files demonstrating the problem, or even a workflow? This very likely has to do with the namespace declarations in the first file, which require the XPath expressions to be namespace-aware.
The original file that does not work with the attached workflow is P21524.xml, the one that works is test.xml. The only differences between the two files are in the <uniprot> and <entry> tags in the first couple of lines.
It's as I suspected. The attribute in the root element declares the so-called default namespace for the document (see http://en.wikipedia.org/wiki/XML_namespace if you are interested in details). Each element then resides in the namespace "http://uniprot.org/uniprot", and if you want to access certain elements you also have to address them with this namespace. The XML Reader and XPath nodes have an option at the bottom to assign a so-called "prefix" to the root namespace ("dns" by default). You now need to use this prefix in all XPath expressions, e.g. /dns:uniprot/dns:entry/dns:accession.
If you delete the attribute the namespace is also removed and XPath expressions without prefixes work (again) because everything belongs to the default (empty) namespace then.
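The same behaviour can be reproduced outside KNIME. Below is a minimal sketch in Python's standard xml.etree.ElementTree, not the code the KNIME nodes use. The two inline documents are made-up stand-ins for P21524.xml and test.xml, and the helper name accessions is hypothetical; only the namespace URI and the "dns" prefix come from the posts above.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the original file: the root declares a default namespace.
NAMESPACED = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P21524</accession>
  </entry>
</uniprot>"""

# Same content with the namespace attribute removed.
PLAIN = """<uniprot>
  <entry>
    <accession>P21524</accession>
  </entry>
</uniprot>"""

# Bind the prefix "dns" to the namespace URI (the prefix name itself is arbitrary).
ns = {"dns": "http://uniprot.org/uniprot"}


def accessions(xml_text, path, namespaces=None):
    """Return the text of every element matching path, relative to the root."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path, namespaces)]


# Unprefixed path against the namespaced file: no element matches.
no_prefix_on_namespaced = accessions(NAMESPACED, "entry/accession")
# Prefixed path against the namespaced file: the accession is found.
with_prefix_on_namespaced = accessions(NAMESPACED, "dns:entry/dns:accession", ns)
# Unprefixed path against the stripped file: works, nothing is namespaced.
no_prefix_on_plain = accessions(PLAIN, "entry/accession")

print(no_prefix_on_namespaced, with_prefix_on_namespaced, no_prefix_on_plain)
```

Note that the paths here are relative to the root element, so they start at entry rather than /uniprot as in the KNIME XPath node.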
I found Thor's post very useful; however, I'm trying to do something similar with an SBML file (SBML is an XML format for systems biology).
I have to parse the whole KEGG database to obtain data on reactions, which I have to extract from the XML text.
Each file corresponds to a metabolic pathway from a specific organism.
I load the file list with a "List Files" node, and the idea is to build a loop that creates a table with the reaction info.
So I use a "TableRow To Variable Loop Start" connected to an XML Reader and a Variable Based File Reader. The XML Reader is then connected to the Loop End node.
I use "/sbml/model/listOfreactions/reactions" as the XPath query and "/dns:sbml/dns:model/dns:listOfreactions/dns:reactions" as the prefix of the root's namespace.
The prefix of the root's namespace is not a path; it's a simple identifier that is then used in the XPath expression. So just enter, e.g., dns for the prefix and prefix every path segment in your expression with "dns:" (what you entered as the prefix is in fact the correct XPath expression). If the XML document does not have any namespace, you can ignore the prefix setting and leave the path segments unprefixed.
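To illustrate that the prefix is only a label bound to the namespace URI, here is a small sketch in Python's xml.etree.ElementTree. Everything in it is an assumption for illustration: the inline document is a toy, the namespace URI is the SBML Level 2 one and may differ in the KEGG files, and the element names listOfReactions and reaction follow the SBML specification rather than the query quoted above. The helper reaction_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy SBML-like document; URI and element names are assumptions (see lead-in).
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy">
    <listOfReactions>
      <reaction id="R1"/>
      <reaction id="R2"/>
    </listOfReactions>
  </model>
</sbml>"""

root = ET.fromstring(SBML)
uri = "http://www.sbml.org/sbml/level2"


def reaction_ids(prefix):
    """Collect reaction ids, using the given prefix on every path segment."""
    path = f"{prefix}:model/{prefix}:listOfReactions/{prefix}:reaction"
    # The mapping {prefix: uri} plays the role of the "prefix" option in the node.
    return [r.get("id") for r in root.findall(path, {prefix: uri})]


ids_dns = reaction_ids("dns")   # prefix "dns", as in the node's default
ids_other = reaction_ids("s")   # any other identifier works the same way

print(ids_dns, ids_other)
```

Both calls return the same ids, because matching is done on the namespace URI; the prefix only has to be used consistently between the setting and the expression. XML element names are case-sensitive, so the spelling in the path must match the file exactly.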
I quickly tested your example, and according to the Node Description of the XML Reader, the XPath it uses offers only limited functionality.
The source code of the XPath used in the XML Reader doesn't seem to handle brackets at all. Therefore, for cases like this, I suggest reading the XML with the XML Reader and then using the XPath node.