I recently got access to the Amazon Review dataset and thought it could be quite useful for experimenting with data analytics, sentiment analysis, etc. (you can request it here: https://nijianmo.github.io/amazon/index.html). I downloaded the “category: software” subset, which contains only approx. 500k reviews.
However, it seems that this data is NOT compliant with the JSON standard: for example, the file is not wrapped in square brackets “[” and “]”, and there is no comma after each line.
Each line contains several data points that are always present, and some lines also seem to carry optional data. In addition, the comma “separator” of course also appears inside the review texts.
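To illustrate the layout I mean, here is a minimal Python sketch that treats every line as its own JSON object and fills missing optional fields with empty values. The two sample records and their field names (`overall`, `reviewText`, `vote`) are made up for illustration, and `parse_json_lines` and `to_rows` are hypothetical helper names, not part of KNIME or of the dataset.

```python
import json

# Two made-up records in the one-object-per-line layout described above.
# The field names are assumptions for illustration only.
sample = (
    '{"overall": 5.0, "reviewText": "Great, works well, no issues", "vote": "3"}\n'
    '{"overall": 2.0, "reviewText": "Crashed twice"}\n'
)


def parse_json_lines(text):
    """Parse each non-empty line as a separate JSON object."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def to_rows(records):
    """Build a table over the union of all keys; absent fields become None."""
    columns = sorted({key for rec in records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


records = parse_json_lines(sample)
columns, rows = to_rows(records)
```

Because each line is parsed by a JSON parser rather than split on commas, the commas inside the review text stay inside their string, and the optional `vote` field simply ends up empty in the second row.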
Is there a way to bring this into a manageable KNIME table format?
This is what I tried:
- JSON Reader: If I read the json.gz file, I get the error message “Execute failed: Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens
at [Source: java.io.BufferedInputStream@2b612f61; line: 1, column: 2]”. Researching it, I found that the input file does not seem to be compliant with the JSON standard.
- Turning it into a CSV: that was just another pain, as it added lots of semicolons for the optional columns in rows where this optional data is missing. I finally got it into a one-column data table with this approach, but I can’t find a way to extract the content from it.
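One possible reading of the “CTRL-CHAR, code 31” message: a gzip file starts with the bytes 0x1f 0x8b, and 0x1f is decimal 31, so the reader may have been handed the compressed bytes rather than the decompressed text. The Python sketch below, written under that assumption, checks for the gzip header and then reads the file one JSON object per line. `is_gzip` and `read_json_lines` are hypothetical helper names, and UTF-8 encoding is assumed.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # gzip header bytes; 0x1f is decimal 31


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def read_json_lines(path):
    """Yield one parsed JSON object per non-empty line, decompressing if needed."""
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Calling `list(read_json_lines("Software.json.gz"))` (the file name is an assumption) would give a list of dictionaries, one per review, which could then be turned into a table.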
Any hints on how to proceed?
Thank you in advance!