Text processing - NLP, document type - and Python

I'm playing around with documents and NLP and found that KNIME's text processing nodes work very poorly with the Cyrillic alphabet and symbols, so I am now trying to do some of the work through Python.
But I can't figure out how the document elements (title, body, date) end up in the pandas DataFrame. From what I can see, the Document field itself is presented as a column, i.e. each document sits in a single cell.
What do I do to get at the individual elements?
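For reference, this is roughly the shape I would like to end up with in pandas. It is a made-up sketch in plain pandas, nothing KNIME-specific, and the column names are just my guess at how the extracted elements might be called:

```python
import pandas as pd

# Hypothetical target layout: one row per document, with the Document's
# elements split into ordinary columns instead of a single Document cell.
df = pd.DataFrame(
    {
        "Title": ["Заголовок 1", "Заголовок 2"],
        "Body": ["Текст первого документа ...", "Текст второго документа ..."],
        "Date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    }
)

print(df.dtypes)  # Title/Body as object (strings), Date as datetime64[ns]
print(df.head())
```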

I am not a big expert in text analytics. The first question would be how the elements would have to be presented for a Python module to do such an analysis; I assume that is the format you would have to provide.

You could check whether the new columnar data storage, which is supposed to work better with Python, can represent these text elements.

Since these document types might be specific to KNIME, I would assume that it does not help.

Then the question is whether you could convert the documents to simple string variables or arrays and present them to Python.

Maybe an example would help, one that demonstrates how the data would have to appear in Python.
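Something like the following is what I mean, as a rough sketch. It assumes the documents have already been split into plain String columns before the Python node (the Document Data Extractor node should do that, if I remember the name right), and the KNIME-side variable names in the comment are assumptions to verify against your KNIME version:

```python
import re

import pandas as pd

# In Python 3, \w matches Unicode word characters, so Cyrillic tokens are kept.
TOKEN_RE = re.compile(r"\w+")


def process_documents(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase Title and Body and add a simple token count, as plain strings."""
    out = df.copy()
    out["Title"] = out["Title"].str.lower()
    out["Body"] = out["Body"].str.lower()
    out["NumTokens"] = out["Body"].apply(lambda s: len(TOKEN_RE.findall(s)))
    return out


# Stand-alone demo; inside a KNIME Python Script node the input would already
# arrive as a pandas DataFrame (input_table_1 in the legacy node, or
# knio.input_tables[0].to_pandas() in the newer scripting API -- both names
# are assumptions to check against your KNIME version).
sample = pd.DataFrame(
    {
        "Title": ["Первый документ", "Второй документ"],
        "Body": ["Какой-то текст на русском.", "Ещё немного текста."],
    }
)
print(process_documents(sample))
```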


Truth is, just transferring the data back and forth to Python takes the better part of 30 minutes.
I wanted to process the Title and Body of each document together, but I guess I would have to extract just those, combine them into one string, transfer that to Python, return it to Java / KNIME, and then rebuild the Document again.
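Something like this on the Python side is what I have in mind. It is only a sketch: the processing step is a placeholder, and the node name for rebuilding the Document is from memory:

```python
import pandas as pd


def combine_and_process(df: pd.DataFrame) -> pd.DataFrame:
    """Combine Title and Body into one text field and run a placeholder
    processing step, keeping plain string columns throughout."""
    out = df.copy()
    out["FullText"] = out["Title"].str.strip() + "\n" + out["Body"].str.strip()
    # Placeholder "processing": lowercase and normalise whitespace.
    out["Processed"] = (
        out["FullText"]
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    return out


demo = pd.DataFrame({"Title": ["Заголовок"], "Body": ["Текст   документа."]})
print(combine_and_process(demo)[["Title", "Processed"]])

# Back in KNIME the returned string columns could then be turned into
# Document cells again, e.g. with the Strings To Document node
# (node name from memory -- check the Textprocessing extension).
```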
