Dear all,
I am really really beginner on KNIME analytics platform, and my work is to extract some part of my xml files (certains metadata) and then to create a table with these selectionned data from my XML file. I tried some answers from the forum, but I failed, and it doesn’t work. Could someone help me to find the right workflow please?
I know that I have to start with : XML reader --> Xpath Node but then I am not sure how to continue.
The XPath node can create a data table based on a XML, so what exactly is your issue?
It is much convenient for us to have a look at your input and desired output. Maybe you can provide an example XML and show us what kind of output you need.
Hi @armingrudd,
Thank you for you reply. Sure, I give you an example about an XML file I have with some information I want to extract and an example of a final table I would like with a column containing the name of my object and the information extracted from the XML file.
Thank you very much for your help,
Melanie
Easy case, Just the elements you want then click on “Add XPath” button. If there are multiple instances of the element, select “Multiple Rows” in the new window, if not leave it as it is and press OK in both windows.
Thank you very much, indeed it is very easy for one file. This is a first step. Then I will have to upload a lot of xml files and extract all the same objects I need and include them into one table (File 1/File 2/File 3…). I will try and I’ll come back if I need.
Okay, I still need help please. I have several issues.
How to write the right query
I want to get all the element “oai_pmh” written in my XML file to compile all the attributes “code” in my table. It works if I select one element then I click on “Add Xpath”. But there are a lot of these elements “oai_pmh” in my XML file. So which query should I write to get all of them without clicking one by one?
I want to take the name of an element in my XML file, in my example the name of the 3rd node “oai_pmh”. If I want to get the name of my root, it is ok with name(/*), but I can’t do the same for the 3rd node “oai-pmh”. I tried with “name(/source_oai_pmh/source/)” but it doesn’t wotk. Any idea?
Select the path > remove the identifier number of the element in the path > select “Multiple Rows” in the configuration window of the path.
e.g. /source_oai_pmh/source/oai_pmh[1]/@code to /source_oai_pmh/source/oai_pmh/@code
Since the XPath node supports XPath 1, you can only take a single name:
The first element: name(/source_oai_pmh/source/*). Which here would be “oai_pmh”.
The second element: name(/source_oai_pmh/source/*[2]) and so on.
Well, thank you @armingrudd . I come back about my question on uploading multiple xml files and extract my data in a single table. I read on the forum that you suggest this:
Thank you. It works for XML files with exactly the same name of root and item, but do you know if there is a possibility to do it with XML files which have the same structure but with diffrent name of root and items? In my below example, I have a file with a root “sources_sitemap” etc… and on the other file I have “sources_oai_pmh”. Is there a possibility or not at all? Thank you very much (I very beginner in Xpath/XML language so I try to understand step by step)
I am unfortunately not enough advanced levelto success in this issue. I am very grateful if you could show me with these two XML files. Thank you abes-calames-rdfa.xml (3.9 KB) hal-oai.xml (32.8 KB)
Well thank you it works well.
Then I would like another item “baseUrl”, so I added it in Column Expression with this expression “join(”/dns:sources_", column(“item”),"/dns:source/dns:", column(“item”),"/dns:parameters/dns:crawl/@baseUrl")" and in Xpath Query but it is weird that it doesn’t work for this specific item. I mean it works for one type of “source” (oai-pmh) I can can find in my table but not for the other (?sitemap). Is there an error in my query? Do I need to change something in the RegexReplace in String Manipulation? Thank you
The XPath you have created in the Column expressions node is converted to a flow variable which should be assigned to the corresponding xpath variable option the XPath node. Go to the Flow Variables tab of the XPath node and assign it to the xpath option. the first one is xpath0, then xpath1 and so on.
Excellent, thank you, you’re great! If I want to convert my end table into a CSV files, should I use the XLS writer after the End loop or this one has to be included somewhere into my workflow? And should I use a new string manipulation with a new loop?
If you want to export the final table (after reading all XML files) to a CSV file, you have to use a CSV Writer node after the Loop End node. But if you want to export the extracted data from each XML file to a separated CSV file, then use it after the String Manipulation inside the loop (the output port of the String Manipulation connects to the Loop End and CSV Writer nodes). Remember you have to create different file names using the Create File Name node if you want to use the writer node inside the loop.
Thank you, it worked well yesterday. However, today, I tried again and I had a problem with a warning “Input file ‘C:\Program Files\KNIME\missing’ does not exist”. In XML reader, I have chosen “location” as you told me, but when I choose this there the word “missing” which appears and then it doesn’t work. I don’t understant what happened.