Reading XML files ... Flat File Document Parser ... problems !

Dear All,

I have a folder with hundreds of XML files that I need to process.

1. I first tested whether the format is okay using the "XML Reader" node, and that gives a green light.

However, since "XML Reader" cannot read a directory of files, I've tried:

a) "Flat File Document Parser", followed by "String to XML" --> fails with the console error
         column "Document" could not be parsed

b) "Flat File Document Parser", followed by "Document Data Extractor" configured with "Abstract" and "Text", followed by "String to XML". Whether I use "Abstract" or "Text", I get the same error: "could not be parsed".

How can one read a whole folder of XML files in KNIME?

Hi,

Using the List Files node, you can create a list of all XML file names in your folder.

Then use the Iterate List of Files meta node.

Inside the meta node, replace the Variable Based File Reader with the XML Reader.

Connect the loop start to the XML Reader via the flow variable port.

Connect the XML Reader output to the loop end.
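
For readers who want to see the logic of these steps outside KNIME, here is a minimal Python sketch of the same pattern: list the XML files in a folder, read one file per iteration, and collect one row per file. It is only an illustration and not part of the attached workflow; the function names `list_xml_files` and `read_all` and the row layout are made up for the example.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def list_xml_files(folder):
    """Return all *.xml files in `folder`, sorted by name (the "List Files" step)."""
    return sorted(Path(folder).glob("*.xml"))


def read_all(folder):
    """Parse each file in turn and collect one row per file (loop start to loop end)."""
    rows = []
    for path in list_xml_files(folder):
        # One file is read per iteration, like the XML Reader inside the loop.
        root = ET.parse(path).getroot()
        rows.append({"file": path.name, "root_tag": root.tag})
    return rows
```

Calling `read_all("some_folder")` on a hypothetical folder returns one dictionary per XML file, which corresponds to one output row per loop iteration.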

 

I've also attached a sample workflow :-)

 

Cheers, Iris

Thank you, Iris, for your prompt and detailed reply, and for the example! I will try it tomorrow morning!

Dear Iris,

It works perfectly! Thank you so much!

Working through your example, I've also learned in practice:

1) how to use the iterations

2) how to use the "Variable inport"

IMHO your example should be on a top tutorial page.

Thank you so much!
Luca

Hello,

I tried to follow your advice and Iris's sample workflow, but I'm facing a problem with the results of the "Iterate List of Files" node.

More precisely, my problem is the following: I can extract categories from a single XML file using the XML Reader, XPath and Ungroup nodes. But when I try to extract the same categories from all of my XML files, using the List Files and Iterate List of Files nodes as in Iris's sample workflow, the collected results are not correct. I get the right number of rows (equal to the number of iterations, i.e. the number of files in my folder), but it seems that the information from only one XML file has been used. So if I parse each row with an XPath node after the Iterate List of Files node, the resulting output table has identical rows.

Attached are my workflow file and 2 XML files.

Do you have any idea how to resolve my problem? What are the right options or basic settings to use in the "Variable Based File Reader"?

Thank you. :)

 

How did you solve the problem with the various <?xml?> headers?

When I combine my XML files, the resulting document (over 500 MB) contains the <?xml?> header from every single XML document.
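
The thread does not record how this was solved. One common way to avoid repeated <?xml?> headers, sketched here in Python rather than KNIME, is to parse each file and append its root element under a single wrapper element instead of concatenating the raw text. The function name `merge_xml_files` and the wrapper tag `collection` are assumptions made for this example.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def merge_xml_files(paths, out_path, wrapper_tag="collection"):
    """Parse each file and append its root element under one wrapper root.

    Parsing discards each file's own XML declaration, so the merged
    document is written with a single declaration at the top.
    """
    wrapper = ET.Element(wrapper_tag)
    for path in paths:
        wrapper.append(ET.parse(path).getroot())
    ET.ElementTree(wrapper).write(out_path, encoding="utf-8", xml_declaration=True)
```

Note that this sketch holds all documents in memory at once, so for a combined size of several hundred MB it may be more practical to process the files one at a time and never build the merged document at all.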

Hi Chris,

You cannot use the Variable Based File Reader to read an XML file. You need to use the XML Reader: in its configuration, go to the Flow Variables tab and select the URL flow variable as the file URL.

If I do so, the flow executes correctly. In the Slashdot data, the first columns are unique per file; only the last one has different values.

Best, Iris

 

PS: Have you seen our white paper about the Slashdot data?
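
To illustrate the symptom reported earlier in the thread (the right number of rows, but every row built from the same file), here is a small Python sketch, not KNIME code. It contrasts a loop whose reader follows the loop variable with one whose reader stays fixed on a single file. This is one plausible reading of the problem based on the fix described above; the function names and the `category` tag used in the usage note are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def extract_per_file(paths, tag):
    """Correct loop: the reader is pointed at the current file on every iteration."""
    rows = []
    for path in paths:
        root = ET.parse(path).getroot()
        rows.append((Path(path).name, [el.text for el in root.iter(tag)]))
    return rows


def extract_fixed_source(paths, tag):
    """Faulty loop: one iteration per file, but the reader ignores the loop variable."""
    rows = []
    for path in paths:
        root = ET.parse(paths[0]).getroot()  # always reads the first file
        rows.append((Path(path).name, [el.text for el in root.iter(tag)]))
    return rows
```

With two files that contain different `category` elements, `extract_per_file` returns different values per row, while `extract_fixed_source` returns the same values in every row even though the row count matches the number of files.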