Reading Rows of Text from CSV File

Hi there!

I am new to KNIME and already struggling with it. I am thinking of using KNIME as my main tool in my bachelor project.

Now to the problem. I have got a CSV file in which each row represents a document (one string). I want to transform each document into a vector representation (document-term matrix, or term-document matrix).

I am able to read in the file and KNIME recognises each row as a string, but when I use the Strings to Document node, it outputs a Document column in which each row consists only of "".

Do you have any ideas why?

Thanks.

Hi Calogero,

only the document title is shown in the data output table view. If the title is empty, "" is shown.

To create a numerical representation, use the Document Vector node on a bag of words, as shown e.g. in the classification example (http://tech.knime.org/document-classification-example).
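To make the idea concrete outside KNIME, here is a minimal Python sketch of what a document vector built from a bag of words looks like. It only illustrates the concept and is not what the Document Vector node does internally; the whitespace tokenization, the lower-casing and the function name `document_term_matrix` are assumptions made for this example.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    # Bag of words: count the lower-cased, whitespace-separated terms
    # of each document (a deliberately naive tokenizer).
    bags = [Counter(doc.lower().split()) for doc in documents]
    # Vocabulary: every distinct term, sorted so the column order is stable.
    vocabulary = sorted(set().union(*bags))
    # One vector per document; a term that does not occur gets 0.
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix

# Two toy documents, one string each, as in the CSV file described above.
docs = ["the cat sat", "the dog sat on the mat"]
vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` is one document and each column is one term, i.e. a document-term matrix; transposing it would give the term-document matrix.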

Here are some links which may help you get started with KNIME Textprocessing:

Introduction:

http://tech.knime.org/files/knime_text_processing_introduction_technical_report_120515.pdf

Online documentation:

http://tech.knime.org/documentation-3

Example workflows:

http://tech.knime.org/examples

Cheers, Kilian

Hi there!

Thanks for the links. I already had a look at these.

As far as I understand, I have to use the Strings to Document node. The problem with the Strings to Document node is that I have to specify columns for Title, Full Text and Authors. Anyway, whatever I do, Strings to Document does not seem to be the right node for me. But then, which one am I supposed to use?

Hi Calogero,

the Strings to Document node is the right node to create documents from strings. To create authors and titles, simply use the Java Snippet node to add a column containing an author string, e.g. "John Doe", and a column containing a title, e.g. "TitleX", with X as the row index.
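The logic of those two extra columns can be sketched in a few lines of Python. This is only an illustration of the idea, not Java Snippet code and not part of the attached workflow; the column names `Title`, `Author` and `Text`, the helper `add_title_and_author`, the zero-based row index and the inline sample data are all assumptions made for the example.

```python
import csv
import io

def add_title_and_author(texts, author="John Doe"):
    """Attach a generated title and a constant author to each text row."""
    return [
        # "TitleX" with X as the (zero-based) row index.
        {"Title": f"Title{index}", "Author": author, "Text": text}
        for index, text in enumerate(texts)
    ]

# Stand-in for the CSV file: one document (a single string) per row.
csv_data = io.StringIO('"first document text"\n"second document text"\n')
texts = [row[0] for row in csv.reader(csv_data) if row]

table = add_title_and_author(texts)
```

With a non-empty title column like this, the output table view should no longer show "" for each document, since only the title is displayed there.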

Attached you will find an example workflow (requiring KNIME Textprocessing >= 2.9) showing how to create document vectors from strings.

Cheers Kilian
