As background: I am running KNIME 4 on a Windows 10 machine with 32 GB of RAM, about 25 GB of which is allotted to KNIME.
I am trying to work with some large JSON structures, issuing a GET request to retrieve and export them. My first issue is retrieving the JSON at all: the GET request keeps timing out. @qqilihq suggested I look at the Download node, but to access that JSON through the API I need to pass a token, and I don’t see a way to do that in the Download node. Am I missing something, or is there another way?
Trying to push forward, I figured out how to access and save the JSON using cURL. The downloaded JSON file is 0.95 GB.
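For reference, this is roughly the shape of the cURL call I mean — the endpoint URL and token here are placeholders, not the vendor’s real values:

```shell
# Sketch only: API_TOKEN and URL are hypothetical placeholders.
API_TOKEN="${API_TOKEN:-replace-me}"
URL="${URL:-file:///dev/null}"   # stand-in target so the sketch is runnable

# --fail makes curl exit non-zero on HTTP errors; the Authorization header
# carries the bearer token the API expects.
curl --fail --silent --show-error \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -o records.json \
  "$URL"
```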
I tried to load this into KNIME using the JSON Reader. Technically it succeeded, but practically RAM usage maxed out and I can’t really do anything because KNIME and the machine are pegged. I have selected the option to write everything to disk, but that doesn’t seem to help enough. I’m hoping I’m missing something elementary here as well.
The kicker is that I only need two fields from each record in the JSON, but the vendor API does not allow selecting specific fields.
Is this a hopeless pursuit? Is everyone else running on machines with hundreds of GB in RAM?
Any guidance, assistance, or reference here would be greatly appreciated.
Thanks so much,
Just saw your reply in the other thread (thanks for linking me here). Regarding my suggestion of the “Download” node – I think it’s not possible to pass HTTP header information there. It might be worth checking whether the API you’re downloading from supports supplying the API key in an alternative form, though (e.g. via a query parameter).
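To illustrate what I mean — whether the vendor API actually accepts a key as a query parameter is exactly what would need checking, and the names below are made up:

```shell
# Hypothetical example only: parameter name, endpoint, and key are invented.
API_KEY="replace-me"
BASE="https://api.example.com/records"

# If the API accepts this form, the resulting plain URL could be given to the
# Download node directly, since no HTTP header is needed any more:
URL="${BASE}?api_key=${API_KEY}"
echo "$URL"
```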
This will not solve your parsing issue, though. I’m quite sure that all JSON parsing in KNIME happens in-memory, and the associated Java object structure overhead is quite fat – an X MB JSON file can easily require 10 × X MB of RAM when parsed (YMMV).
As you mention curl, I’d probably take the following approach: download the file via the command line (curl), then use jq to pull out the fields you actually require and write them to disk (consider writing CSV, because it can be processed in a streaming manner), and finally read that file into KNIME. (Should you still hit memory limits with the default jq settings, have a look at its streaming mode.)
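To make the jq step concrete, here is a minimal sketch. The field names `id` and `value` and the array-of-records shape are assumptions — your JSON will differ:

```shell
# Tiny stand-in for the downloaded file (hypothetical field names).
printf '[{"id":1,"value":"a","junk":0},{"id":2,"value":"b","junk":0}]' > records.json

# Keep only the two needed fields and emit one CSV row per record.
# -r outputs raw text instead of JSON-quoted strings; @csv formats each
# array as a CSV line.
jq -r '.[] | [.id, .value] | @csv' records.json > fields.csv

cat fields.csv
```

For files too large even for this, jq’s `--stream` option parses the input incrementally instead of building the whole document in memory.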
Hope this helps!
Thanks once again. As I’m sure is obvious, I’m learning all this API/JSON work on the fly. I wasn’t aware of jq and I will certainly take a look. Thanks also for the explanation of the object overhead. I’ve requested access to a VM with much more memory, which will hopefully come through in the near future.
I wanted to report back that, after downloading jq and spending some time on Stack Overflow, I was able to extract the values I needed from that JSON very quickly and with essentially zero memory impact. I believe that’s because I stuck with jq’s default mode, which processes one value at a time, rather than the slurp option, which reads the whole input into memory. I also have a request in to get the KNIME client onto a machine with larger memory specs, as I keep running into these memory issues everywhere with my large datasets.
Thanks again @qqilihq!