KNIME Python node and gensim

Hi,

I'm trying to configure a Python node to run the Phrases module from gensim (it replaces frequently collocated tokens with a single 'bigram' token, e.g. 'new', 'york' becomes 'new_york'), but I'm running into issues because gensim works mainly with lists of lists, and I cannot get the output as a Document DataFrame for further text processing.
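
For readers unfamiliar with the idea, here is a rough sketch of the kind of merging described above. It is a simplified count-threshold illustration in plain Python, not gensim's actual scoring, and the helper name merge_frequent_bigrams, the min_count value, and the sample tokens are made up for this example:

```python
from collections import Counter


def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent token pairs that occur at least min_count times."""
    # Count every adjacent pair of tokens across all sentences
    pair_counts = Counter(
        pair for tokens in sentences for pair in zip(tokens, tokens[1:])
    )
    merged = []
    for tokens in sentences:
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                # Frequent pair: emit one joined token and skip both words
                out.append('_'.join(pair))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged


# Hypothetical sample input: a list of token lists
tokens = [['i', 'love', 'new', 'york'], ['new', 'york', 'is', 'big']]
tokens_bigram = merge_frequent_bigrams(tokens)
print(tokens_bigram)
```

The result has the same list-of-lists shape as the input, which is the structure that then has to be turned into a DataFrame.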

I can get the list of lists (tokenized strings) processed correctly (tokens vs. tokens_bigram), but I'm not sure how to transform it into a DataFrame as the output of the node so that other text processing nodes can read it. See script attached.

Does anyone have experience using KNIME and gensim for text preprocessing?

Thanks

Hi Diego,

You are nearly there in getting your bigrams from Python to KNIME. The Python nodes support sending DataFrame columns that contain lists back to KNIME, so you can use the following line to generate your output_table:

output_table = pd.DataFrame(list(map(lambda x: [x], tokens_bigram)), columns=['tokens_bigram'])

This will create a pandas DataFrame with a single column containing each of your bigram lists in a separate row.
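
As a minimal sketch of what that line does, assuming pandas is imported as pd and using made-up sample tokens in place of the real tokens_bigram from the gensim step (the list comprehension is just an equivalent spelling of the map/lambda above):

```python
import pandas as pd

# Hypothetical sample data standing in for the gensim output
tokens_bigram = [['i', 'love', 'new_york'], ['new_york', 'is', 'very', 'big']]

# Wrap each token list in its own one-element row, so every list ends up
# in a single cell instead of being spread across several columns
output_table = pd.DataFrame(
    [[tokens] for tokens in tokens_bigram],
    columns=['tokens_bigram'],
)

print(output_table.shape)
print(output_table['tokens_bigram'][0])
```

Each row of output_table then holds one complete token list in the tokens_bigram column, regardless of how many tokens the list contains.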

I hope that helps you get your data back into KNIME and use the KNIME text processing nodes. If you have any further questions, I will be happy to answer them.

Best

Clemens

Thanks Clemens, it worked!