I have a folder with hundreds of XML files that I need to process.
1. I first tested whether the format is okay using the "XML Reader", and that gave the green light.
However, since "XML Reader" cannot read a directory of files, I've tried:
a) "Flat File Document Parser", followed by "String to XML" --> fails with the console error:
column "Document" could not be parsed
b) "Flat File Document Parser", followed by "Document Data Extractor" configured with "Abstract" and "Text", followed by "String to XML". Whether I use "Abstract" or "Text", I get the same error: "could not be parsed".
I tried to follow your advice and Iris's sample workflow, but I'm facing a problem with the results of the "Iterate List of files" node.
More precisely, my problem is the following: I'm able to extract categories from one XML file using the XML Reader, XPath and Ungroup nodes. But when I try to extract the same categories from all of the XML files, using the List Files and "Iterate List of files" nodes as in Iris's sample workflow, my collected results are not correct. I obtain the right number of rows (equal to the number of iterations, i.e. the number of files in my folder), but it seems that the information from only one XML file has been used. So if I parse each row with an XPath node after the "Iterate List of files" node, the resulting output table has identical rows.
Attached are my workflow file and 2 XML files.
Would you have an idea how to resolve my problem? What are the right options or basic settings to use in the "Variable Based File Reader"?
You cannot use the Variable Based File Reader to read an XML file. You need to use the XML Reader: go to its flow variable tab and select the URL flow variable as the file URL.
If I do so, the flow executes correctly. In the Slashdot data, the first columns are unique per file; only the last one has different values.
Best, Iris
PS: Did you see our white paper about Slashdot data?
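
For anyone who wants to check the loop logic outside KNIME, here is a minimal Python sketch of what the List Files, loop, XML Reader and XPath chain described above is meant to do. It is only an illustration under assumptions: the element name `category`, the function name `extract_categories` and the two sample files are made up for the example and are not taken from the attached workflow. The key point is that the file path has to change on every iteration, which is what selecting the URL flow variable in the XML Reader achieves.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_categories(folder, xpath=".//category"):
    """Return one (file name, [category texts]) row per XML file in folder.

    The default XPath and the 'category' element name are hypothetical.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.xml")):
        # Parse the file belonging to *this* iteration. Reusing one fixed
        # path here would give the right number of rows, all identical.
        root = ET.parse(path).getroot()
        rows.append((path.name, [el.text for el in root.findall(xpath)]))
    return rows


if __name__ == "__main__":
    # Two tiny made-up files stand in for the real folder of XML documents.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.xml").write_text("<doc><category>news</category></doc>")
        Path(tmp, "b.xml").write_text(
            "<doc><category>tech</category><category>science</category></doc>"
        )
        for name, categories in extract_categories(tmp):
            print(name, categories)
```

If every row comes back with the same content, the reader inside the loop is most likely still pointing at a fixed file rather than at the per-iteration path.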