Import JSON data that is not standard-compliant

kowisoft · April 8, 2020, 11:01am

Dear KNIMErs,

I recently got access to the Amazon Review dataset and thought it could be quite useful for experimenting with data analytics, sentiment analysis, etc. (you can request it here: https://nijianmo.github.io/amazon/index.html ). I downloaded the “category: software” subset, which contains only approx. 500k reviews.

However, it seems that this data is NOT compliant with the JSON standard. For example, the file is not wrapped in square brackets “[” and “]”, and there is no comma after each line.

Each line contains several fields that are always present, plus some that seem to be optional. In addition, the “separator” comma of course also appears inside the review texts.

Is there a way to bring this into a manageable KNIME table format?

This is what I tried:

JSON Reader: If I read the json.gz file, I get the error message “Execute failed: Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens
at [Source: java.io.BufferedInputStream@2b612f61; line: 1, column: 2]”. After researching it, I found out that the input file does not seem to be compliant with the JSON standard.
Turning it into a CSV: that was just another pain, as it added lots of semicolons for the optional columns in rows where the optional data is missing. I finally got it into a one-column data table with this approach, but I can’t find a way to extract the content from it.
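
A side note on the error message above: character code 31 is 0x1F, which is also the first byte of the gzip magic number (0x1F 0x8B). So one possible reading is that the reader was handed the still-compressed bytes rather than malformed JSON text. The following is a minimal Python sketch for checking this outside KNIME; the function name `looks_gzipped` and the sample record are made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Demo with in-memory data (an invented record, not the real download).
plain = b'{"rating": 5}\n'
packed = gzip.compress(plain)

print(looks_gzipped(plain))   # False
print(looks_gzipped(packed))  # True
print(packed[0])              # 31, the code reported in the error
```

If the check returns True for the first bytes of the file, decompressing before parsing would be the first thing to try.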

Any hints on how to proceed?

Thank you in advance!

qqilihq · April 8, 2020, 11:06am

I didn’t download the file, so the following suggestions are based only on my understanding of your post:

As far as I understand, each line of the file is a valid JSON object. So you could read the file line-wise into string cells using the Line Reader node. Afterwards, you should be able to parse each string cell into a JSON object using the String to JSON node and perform further processing from there.
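
For comparison outside KNIME, the same line-wise idea can be sketched in Python. This assumes the file is gzip-compressed with one JSON object per line; the helper names, field names, and sample records are invented for the demo and are not taken from the dataset.

```python
import gzip
import io
import json


def parse_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        records.append(json.loads(line))
    return records


def to_rows(records):
    """Build a column list and rows; missing optional fields become None."""
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


# Invented sample data, compressed in memory to stand in for the .json.gz file.
sample = (
    '{"rating": 5, "text": "Great, works well"}\n'
    '{"rating": 2, "text": "Crashes, often", "vote": "3"}\n'
)
packed = gzip.compress(sample.encode("utf-8"))

# gzip.open accepts a file object as well as a path.
with gzip.open(io.BytesIO(packed), mode="rt", encoding="utf-8") as fh:
    records = parse_json_lines(fh)

columns, rows = to_rows(records)
print(columns)
print(rows)
```

For a real file, the in-memory buffer would be replaced by the actual path passed to `gzip.open`. Commas inside the review text cause no trouble here, because the JSON parser handles the quoted strings itself.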

Would that work?

– Philipp

kowisoft · April 8, 2020, 11:33am

Whoaaa… @qqilihq

What a lightning-fast response, and it actually did the trick. I wasn’t aware (or overlooked) that each line is a JSON element in itself (sure, this is not the right term, but hey, I’m only starting my data science journey).

It works like a charm

PS: I just discovered how to read your nickname (it really made me laugh).
