Clean data sentence by sentence

Hi all,

I have a CSV file with 2 rows and 7 columns. However, I am only interested in preprocessing the last column (highlighted in blue).

An example of the input file is attached as inputproduct.jpg.

How should I clean/preprocess the text sentence by sentence?

I have used KNIME to do this, but it gives me the terms on separate lines.

An example of the cleaned data file is attached as outputproduct.jpg.

I want the terms in the "Term as String" column to be put on one line per sentence, for instance:

pleasant maneuvering coupe drove

aware dishonest nearly cabin otherwise

Can KNIME do that for me?

Many thanks.


If you want to (pre)process your documents sentence by sentence, you can use the "Sentence Extractor" node right after the Parser node (or "Strings To Document" node). This node extracts all sentences of all documents as a string column. After that, create a title for each sentence (e.g. "") and filter out the old document column, then use the "Strings To Document" node to convert the sentence strings into documents. Now you have one document per sentence. These documents can then be preprocessed the regular way, meaning that you can use the "BoW creator" node followed by any preprocessing node. If you want to get the text of the documents after the preprocessing is done, you can group by the Document column ("GroupBy" node) and then use the "Document Data Extractor" node to extract the full text of each document, which is now preprocessed, as a string column.

[Any Parser]->[Sentence Extractor]->[Java Snippet (to create a title string e.g. "")]->[Column Filter (to filter the old Document column)]->[Strings To Document (create a Doc for each sentence)]->[BoW]->[Preprocessing nodes e.g. Stopword filter, ...]->[GroupBy (group on Document column)]->[Document Data Extractor (extract Fulltext as string)]
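For readers who want to see the logic of this chain outside KNIME, here is a rough plain-Python sketch of the same idea: split into sentences, build one row per term, filter, then group the terms back per sentence. It does not use any KNIME API. The function names, the naive sentence splitter and the tiny stopword list are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "was", "and", "i", "of", "to", "it", "is", "in"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def to_term_rows(sentences):
    """Bag-of-words style table: one (sentence_id, term) row per term."""
    rows = []
    for sid, sentence in enumerate(sentences):
        for term in re.findall(r"[a-z']+", sentence.lower()):
            rows.append((sid, term))
    return rows

def filter_stopwords(rows):
    """Preprocessing step: drop rows whose term is a stopword."""
    return [(sid, term) for sid, term in rows if term not in STOPWORDS]

def group_terms(rows):
    """Group rows by sentence id and join the remaining terms into one line."""
    grouped = defaultdict(list)
    for sid, term in rows:
        grouped[sid].append(term)
    # A sentence that consisted only of stopwords has no rows left and is dropped.
    return [" ".join(grouped[sid]) for sid in sorted(grouped)]

def clean_by_sentence(text):
    return group_terms(filter_stopwords(to_term_rows(split_sentences(text))))

for line in clean_by_sentence("The coupe was pleasant. I drove it to the cabin!"):
    print(line)
```

Each printed line corresponds to one input sentence, which is the "one line per sentence" shape asked for above.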

Hope this helps.

Cheers, Kilian

If you just need the preprocessed sentences (at the end of your preprocessing procedure) as a string column, there is also an easier way to do this. After you have read the documents with any parser node, use the "BoW" node and then directly apply any preprocessing node, e.g. Stopword filter (with the "deep preprocessing" setting, which is switched on by default anyway). Now that you have the preprocessed documents, group by the Document column, and after that use the "Sentence Extractor" node.

[Any Parser]->[BoW]->[Preprocessing nodes e.g. Stopword filter, ...]->[GroupBy (group on Document column)]->[Sentence Extractor]

The result includes, among others, one string column containing the sentences of all preprocessed documents.
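The difference to the first approach is only the order of the steps: the whole document is preprocessed first and the sentences are extracted afterwards. A minimal plain-Python sketch of that order follows. Again, this is not KNIME code, and the helper names and the stopword list are hypothetical.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "was", "and", "i", "of", "to", "it", "is", "in"}
SENTENCE_END = {".", "!", "?"}

def preprocess_document(text):
    """Filter stopwords over the whole document, keeping sentence-ending punctuation."""
    tokens = re.findall(r"[A-Za-z']+|[.!?]", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def extract_sentences(tokens):
    """Cut the already preprocessed token stream into one string per sentence."""
    sentences, current = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            if current:
                sentences.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    if current:  # text after the last sentence terminator
        sentences.append(" ".join(current))
    return sentences

tokens = preprocess_document("The cabin was nearly dishonest. I was aware of it")
print(extract_sentences(tokens))
```

This only works because the preprocessing keeps the sentence boundaries intact, which is what the sentence extraction step relies on.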


Can you help me by providing the 20 Newsgroups dataset for use in KNIME? Or can you provide a dataset for text preprocessing that contains only text documents?


How can I read a Word file or a PDF in KNIME, and how can I read multiple files?