XML Reader gives "GC Overhead" or "Java heap space" Error for 800MB XML file

j.schoonemann · March 27, 2017, 1:25pm

Hi,

I'm trying to load an 800MB XML file with the XML Reader node, using its default settings, in KNIME. My computer has 4GB of RAM and a 64-bit OS.

What I've done so far (but nothing worked):

After browsing the forum I already changed my Xmx settings in the Knime.ini file to -Xmx3g;
I tried to run the XML Reader with an Xpath, but unfortunately the XML Reader node can't compile the Xpath, probably because it has a colon >> //lei:LEI;
Changed the memory policy settings of the XML Reader node to "Write tables to disc".

Just to be sure I could set up a working flow for this XML file, I've created a test flow which used a small subset of the original XML file. This all worked like a charm. My test flow looks like this:

So it contains only two steps: the first step reads the (small) XML file, and the second step extracts the necessary information from the XML. The settings of the Xpath node are:

After running this test flow I end up with exactly what I need. But now I want to run this flow using the original XML file of 800MB. I can't understand why 3GB of available memory isn't enough to read an 800MB file.

Can anybody provide me with some tips and tricks on how I can accomplish this?

Thanks!
Jeanine

thor · March 27, 2017, 3:10pm

If you read XML files into memory, their size will be much larger than the size on disk due to internal data structures. However, using the correct XPath to extract only the relevant part should work - if this part is small enough.

j.schoonemann · March 27, 2017, 3:36pm

Hi Thor,

thank you for your reply. I'm using the correct Xpath, but the XML Reader node apparently has restrictions that the Xpath node doesn't have: the same path that works in the Xpath node gives an immediate error (before executing) in the XML Reader node.

Do you know anything else I can do to solve my memory issues and accomplish what I want? The parts I want to extract are small enough, but the problem is that I can't extract them directly, and since I can't load the whole file either, I'm stuck now.

Regards,
Jeanine

thor · March 27, 2017, 5:05pm

Did you set up the namespace correctly in the XML Reader? Otherwise it doesn't recognize the "lei" prefix in your XPath expression.
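To illustrate why the prefix mapping matters, here is a minimal sketch in Python rather than KNIME. The namespace URI `http://example.org/lei` and the sample document are made up for illustration, so substitute the URI declared in your own file. A prefix such as `lei` is only shorthand for a namespace URI, and a prefixed query can only be resolved if the prefix is bound to the same URI the document uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only; use the one declared in your XML.
LEI_NS = "http://example.org/lei"

# Made-up sample document that only mimics the element names from this thread.
SAMPLE = f"""<lei:LEIData xmlns:lei="{LEI_NS}">
  <lei:LEIRecords>
    <lei:LEIRecord>
      <lei:LEI>EXAMPLE-LEI-0001</lei:LEI>
    </lei:LEIRecord>
  </lei:LEIRecords>
</lei:LEIData>"""

root = ET.fromstring(SAMPLE)


def find_lei_values(namespaces):
    """Run the prefixed query with the given prefix-to-URI mapping."""
    return [el.text for el in root.findall(".//lei:LEI", namespaces)]


# Prefix bound to the document's URI: the element is found.
found = find_lei_values({"lei": LEI_NS})

# Prefix bound to a different URI: the query is valid but matches nothing.
wrong_uri = find_lei_values({"lei": "http://example.org/other"})

# No mapping at all: ElementTree rejects the query with a SyntaxError.
try:
    find_lei_values(None)
    unmapped_error = None
except SyntaxError as exc:
    unmapped_error = str(exc)
```

In this Python sketch an unmapped prefix fails immediately, while a prefix mapped to the wrong URI silently returns nothing. Whether the KNIME nodes behave the same way is an assumption, but it is worth checking that the URI entered for `lei` matches the one in the file exactly.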

j.schoonemann · March 28, 2017, 8:07am

I've tried that as well, but then all I get is an empty data table from the node.

The XML I try to load looks like this (this is the smaller version):

what I did in the Xpath node (and which works if I use the smaller XML and don't run into memory issues):

and

I actually have two Xpaths there:
//lei:LEI
//lei:LegalName

If I try to repeat this logic in the XML Reader node, I use the same namespaces, and for the Xpath I've tried numerous statements, from //lei:LEI to lei:LEIRecords and almost everything in between. Every option gives me an empty data table after executing the node. What am I missing?

thor · March 28, 2017, 2:14pm

Can you attach a sample file or even better the workflow you are using (including the input file)?

j.schoonemann · March 29, 2017, 8:34am

I've attached the flow, unfortunately I can't attach my input file, but you can download it here.

xml.zip

thor · March 29, 2017, 9:25am

You cannot use arbitrary XPath expressions in the XML Reader (see also the node description). I suggest first extracting all records into individual rows (using the XPath /lei:LEIData/lei:LEIRecords/lei:LEIRecord in the XML Reader) and then extracting the relevant parts from each record with the XPath node.
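The same two-stage idea (one record at a time, then the fields inside each record) can be sketched outside KNIME as well. The following is a minimal Python illustration, not a description of how the XML Reader works internally. The namespace URI, the sample data, and the `Entity` wrapper around `LegalName` are assumptions made up for the example; only the element names `LEIData`, `LEIRecords`, `LEIRecord`, `LEI` and `LegalName` come from this thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only; use the one declared in your XML.
LEI_NS = "http://example.org/lei"
NS = {"lei": LEI_NS}
RECORD_TAG = f"{{{LEI_NS}}}LEIRecord"


def iter_records(source):
    """Yield (LEI, LegalName) for each LEIRecord, parsing the input incrementally."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == RECORD_TAG:
            # Stage 2: pull the small parts out of the completed record.
            lei = elem.findtext(".//lei:LEI", namespaces=NS)
            name = elem.findtext(".//lei:LegalName", namespaces=NS)
            yield lei, name
            # Free the record's children; an empty element shell stays in the parent.
            elem.clear()


# Made-up sample; the nesting of LegalName under Entity is an assumption.
SAMPLE = f"""<lei:LEIData xmlns:lei="{LEI_NS}">
  <lei:LEIRecords>
    <lei:LEIRecord>
      <lei:LEI>EXAMPLE-LEI-0001</lei:LEI>
      <lei:Entity><lei:LegalName>Example One B.V.</lei:LegalName></lei:Entity>
    </lei:LEIRecord>
    <lei:LEIRecord>
      <lei:LEI>EXAMPLE-LEI-0002</lei:LEI>
      <lei:Entity><lei:LegalName>Example Two Ltd</lei:LegalName></lei:Entity>
    </lei:LEIRecord>
  </lei:LEIRecords>
</lei:LEIData>"""

rows = list(iter_records(io.BytesIO(SAMPLE.encode("utf-8"))))
```

For a real file you would pass the file path to `iter_records` instead of the in-memory sample. Because each record is cleared after it has been read, only one full record needs to be held in memory at a time.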

j.schoonemann · May 30, 2017, 2:42pm

Hi Thor,

excuse me for my late reply, do you know if it's possible to receive a notification when you get a reply on your post in the forum?

I've tried to set the Xpath in the XML Reader, something I had tried before as well, but it didn't seem to work and I don't know what I'm doing wrong. I suspect it has something to do with the colon in the Xpath, since the description of the node says it can only handle simple Xpaths (or something like that).

What I've done:

And the error I get:

thor · May 30, 2017, 9:56pm

If you are using the "lei" namespace prefix in the XPath Query field, you also have to define it in the table below (as in the XPath node).

j.schoonemann · June 19, 2017, 10:54am

Thanks Thor, works like a charm now!