Text processing for some huge JSON files containing tweets

Hi guys,

I'm really new to KNIME but have already followed some tutorials and webinars, and I'm really starting to like it :-)

For a project at university I have to do some text processing on Twitter data. The analysis should include some frequency analysis of tweet keywords and hashtags. I found out which nodes are involved by following this tutorial: http://tech.knime.org/files/knime_text_processing_introduction_technical_report_120515.pdf

I have huge amounts of Twitter data in JSON format and I don't really know how to get it into KNIME and then analyse it.

So, as I said, I'm quite new, but willing to get into the topic. It would be nice if anyone could recommend a workflow that substitutes for the IO section of the tutorial mentioned above. Can anybody point me in the right direction?

Any help will be greatly appreciated!

Regards,

Markus

Hi Markus,

I've just recently parsed some big JSON data within KNIME and used the following workflow:

  1. Line Reader (assuming that each JSON entry you're interested in is on a separate line)
  2. Row Filter (for getting rid of undesired entries, comments, etc.)
  3. Java Snippet for splitting up the JSON data and putting everything I was interested in into a separate column

The splitting will get tedious if you have complicated, nested JSON objects, but in my case I had a simple flat structure which was relatively easy to parse manually. If your data is more complex, you can consider using a dedicated JSON parser (json.org is the simplest one I'm aware of) and include it as an external library in the Java Snippet node.
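Just to give you an idea, here is a rough standalone sketch of what the parsing could look like with the json.org classes. The field names "text", "created_at" and "entities.hashtags" are what the Twitter API usually delivers, so adjust them to whatever your export actually contains; in the Java Snippet node you would of course read the line from the input column and write the results to output columns via the dialog instead of using main():

    import org.json.JSONArray;
    import org.json.JSONObject;

    public class TweetParseExample {
        public static void main(String[] args) throws Exception {
            // One tweet per line, as assumed above (adjust to your export).
            String line = "{\"created_at\":\"Mon Sep 01 12:00:00 +0000 2014\","
                    + "\"text\":\"Trying out #KNIME for #TextProcessing\","
                    + "\"entities\":{\"hashtags\":[{\"text\":\"KNIME\"},{\"text\":\"TextProcessing\"}]}}";

            JSONObject tweet = new JSONObject(line);

            // Flat fields map directly onto separate output columns.
            String text = tweet.getString("text");
            String createdAt = tweet.getString("created_at");

            // Nested structures (here: the hashtags) need one extra hop.
            JSONArray hashtags = tweet.getJSONObject("entities").getJSONArray("hashtags");
            StringBuilder tags = new StringBuilder();
            for (int i = 0; i < hashtags.length(); i++) {
                if (i > 0) tags.append(",");
                tags.append(hashtags.getJSONObject(i).getString("text"));
            }

            System.out.println(text + " | " + createdAt + " | " + tags);
        }
    }

The text and timestamp columns produced this way can then feed straight into the downstream text processing nodes.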

Hope that helps,
Philipp

Great, thanks, I will give that a try.

Just one more question on text processing in KNIME. Following the tutorial mentioned above, it seems that text processing in KNIME is all about Documents, right?

If I get my data into database columns, how can I proceed from there with text processing? It seems that the IO process creates a table with only one column containing DocumentCells ("The output of all parser nodes is a data table consisting of one column with DocumentCells.").

But what I intend to do is some text processing on one column (representing the content of the tweets) of a database table. Another dimension to be taken care of would be, e.g., the timestamp of the tweets.

So, hopefully I am not confusing what text processing means. All the text processing IO nodes in KNIME seem to be only about documents, not database tables. Can anybody tell me if / how the workflow for text processing could work with database data as an input?

Thanks a lot in advance!

Regards,

Markus

UPDATE / answering my own question:

Found a possible solution to this by using the "Strings to Document" node in Text processing -> Transformation.

For what it's worth, even if JSON elements span multiple lines in the file, you can always concatenate all lines into a single string with GroupBy, and then parse the single JSON string with the (K)REST nodes - whether it's a huge one or not.
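And if you'd rather stay in a Java Snippet for that step as well, json.org can handle the concatenated string just the same. Here is a minimal sketch, assuming the file is one big JSON array of tweets (adjust the structure to your actual export):

    import org.json.JSONArray;
    import org.json.JSONObject;

    public class TweetArrayParseExample {
        public static void main(String[] args) throws Exception {
            // Pretty-printed export spanning many lines, joined back into one
            // string (e.g. with GroupBy as described above).
            String joined = "[\n"
                    + "  {\"text\": \"first tweet\"},\n"
                    + "  {\"text\": \"second tweet\"}\n"
                    + "]";

            JSONArray tweets = new JSONArray(joined);
            for (int i = 0; i < tweets.length(); i++) {
                JSONObject tweet = tweets.getJSONObject(i);
                System.out.println(tweet.getString("text"));
            }
        }
    }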

Cheers
E