Parsing multi-line fields into tables

 

Hello, I am stuck on a rather "simple" issue. We are loading a 2 GB data file. It is in multi-line format:

product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop: Greatest Early Rock
product/price: unknown
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or 

I usually parse it with Python:

pyOut = {}

# Each entry of Col0 holds one "field: value" line
for line in kIn['Col0']:
    tokens = line.split(':', 1)
    if len(tokens) == 2:  # skip lines without a "field: value" separator
        column = pyOut.get(tokens[0], [])
        column.append(tokens[1].strip())
        pyOut[tokens[0]] = column

It works fine, but when I try to load larger sizes (~32 GB) I get this error:
ERROR Python Snippet       0:325:5    Execute failed: java.lang.RuntimeException: java.io.IOException: Cannot run program "python": CreateProcess error=2, The system cannot find the file specified

I would rather just skip Python. Given the above data, are there any recommendations for a native node to transform it?

 

Thank you

 

In KNIME you can use a RegEx Split node to separate the field name (e.g. product/title) from the field content (e.g. the actual title). Once that is done, you can use the Pivot node in a creative way to spread each field into its own column. You can collect all records in sets/lists, then split them.
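For illustration, a pattern along the lines of ([^:]+):\s*(.*) puts the field name in the first capture group and the field content in the second. Here is a minimal Python sketch of that split, just to show the pattern idea, not the node's actual dialog settings:

import re

# Split "field: value" at the first colon only, so colons inside the
# value (e.g. in the title) stay in the content part.
line = "product/title: Rock Rhythm & Doo Wop: Greatest Early Rock"
match = re.match(r"([^:]+):\s*(.*)", line)
if match:
    field_name, field_content = match.groups()
    # field_name    -> "product/title"
    # field_content -> "Rock Rhythm & Doo Wop: Greatest Early Rock"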

This is much easier if your data has a regular pattern (e.g. the same number of fields per record, no missing fields); otherwise it may require some form of looping over the rows to extract them into the proper columns.
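For example, if every record really has the same fields in the same order, a running row index already tells you which record each line belongs to. A minimal Python sketch of that idea (the fields-per-record count and the sample rows are assumptions for illustration, not read from your data):

# Assumes a fixed number of "field: value" rows per record, in a fixed order
# (10 per record in the sample above; 2 here to keep the sketch short).
FIELDS_PER_RECORD = 2
lines = [
    "product/productId: B00006HAXW",
    "review/score: 5.0",
    "product/productId: B000XYZ123",
    "review/score: 4.0",
]
records = {}
for i, line in enumerate(lines):
    name, value = line.split(":", 1)
    records.setdefault(i // FIELDS_PER_RECORD, {})[name.strip()] = value.strip()

In KNIME terms, the i // FIELDS_PER_RECORD value is the grouping column you would pivot on.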

There are also alternative approaches to consider. If your records do not all have a constant number of rows, you can use a Java Snippet node (or a Rule Engine node) to identify where each record starts and ends, then use a GroupBy node to chunk them.
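The boundary logic those nodes would implement can be sketched like this, assuming every record starts with the product/productId field as in your sample (again with made-up rows for illustration):

# Start a new record whenever the assumed start-of-record field reappears.
lines = [
    "product/productId: B00006HAXW",
    "review/score: 5.0",
    "product/productId: B000XYZ123",
    "review/score: 4.0",
]
records = []
for line in lines:
    name, value = line.split(":", 1)
    if name == "product/productId":   # assumed start-of-record marker
        records.append({})
    records[-1][name.strip()] = value.strip()

Each dictionary in records then corresponds to one output row after the GroupBy/Pivot step.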

I can think of a few other ways, but it all depends on the structure of your data.

Give it a try and share your workflow here for further recommendations.

Cheers,
Marco.