Clean data sentence by sentence

noorhafhizah · May 30, 2012, 8:29pm

Hi all,

I have a CSV file which has 2 rows and 7 columns. However, I am only interested in preprocessing the last column (highlighted in blue).

An example of the input file is attached as inputproduct.jpg.

How should I clean/preprocess the text sentence by sentence?

I have used KNIME to do that, but it gives me the terms on separate lines.

An example of the cleaned data file is attached as outputproduct.jpg.

I want the terms in the "Term as String" column to be put on one line per sentence, for instance:

pleasant maneuvering coupe drove

aware dishonest nearly cabin otherwise

Can KNIME do that for me?

Many thanks.

Fizah.

kilian.thiel · June 5, 2012, 4:02pm

If you want to (pre)process your documents sentence by sentence, you can use the "Sentence Extractor" node right after the Parser node (or "Strings to Document" node). This node extracts all sentences of all documents as a string column. After that, create a title for each sentence (e.g. "") and filter out the old document column, then use the "Strings to Document" node to convert the sentence strings into documents. Now you have one document per sentence. These documents can then be preprocessed the regular way, meaning that you can use the "BoW creator" followed by any preprocessing node. If you want to get the text of the preprocessed documents after the preprocessing is done, you can group by the Document column ("GroupBy" node) and then use the "Document Data Extractor" to extract the Fulltext of each document, which is now preprocessed, as a string column.

[Any Parser]->[Sentence Extractor]->[Java Snippet (to create a title string e.g. "")]->[Column Filter (to filter the old Document column)]->[Strings To Document (create a Doc for each sentence)]->[BoW]->[Preprocessing nodes e.g. Stopword filter, ...]->[GroupBy (group on Document column)]->[Document Data Extractor (extract Fulltext as string)]
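For readers who want to see the same idea outside KNIME, here is a minimal Python sketch of the order of operations described above: split the text into sentences first, then preprocess each sentence on its own and put the remaining terms back on one line. It is not what the KNIME nodes do internally. The stop word list, the naive sentence splitter, and the function names `split_sentences`, `preprocess` and `clean_by_sentence` are all assumptions made for illustration.

```python
import re

# Hypothetical stop word list for illustration; not KNIME's built-in list.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "to", "of", "i", "in"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def preprocess(sentence):
    """Lowercase, keep alphabetic tokens, drop stop words, join into one line."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def clean_by_sentence(text):
    """Return one cleaned string per sentence, skipping sentences left empty."""
    cleaned = (preprocess(s) for s in split_sentences(text))
    return [c for c in cleaned if c]


# Made-up example text, standing in for one cell of the last CSV column.
review = "The coupe was pleasant to drive. It is a dishonest dealer!"
for line in clean_by_sentence(review):
    print(line)
```

For a CSV file like the one in the question, the same function could be applied to the last field of each row (for example `row[-1]` when iterating with Python's `csv.reader`).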

Hope this helps.

Cheers, Kilian

kilian.thiel · June 5, 2012, 4:15pm

If you just need the preprocessed sentences (at the end of your preprocessing procedure) as a string column, there is also an easier way to do this. After you have read the documents with any parser node, use the "BoW" node, then directly apply any preprocessing node, e.g. Stopword filter etc. (with the deep preprocessing setting, which is switched on by default anyway). Now that you have the preprocessed documents, group by the Document column, and after that use the Sentence Extractor.

[Any Parser]->[BoW]->[Preprocessing nodes e.g. Stopword filter, ...]->[GroupBy (group on Document column)]->[Sentence Extractor]

The result of this is, among others, one string column containing the sentences of all preprocessed documents.
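The reversed order (preprocess the whole document first, extract sentences afterwards) can also be sketched in plain Python. This is only a rough analogue and does not reproduce KNIME's deep preprocessing: it assumes a hypothetical stop word list and simply removes those words from the text while leaving the sentence-ending punctuation in place, so that the sentences can still be separated afterwards. The names `filter_stop_words` and `extract_sentences` are made up for this example.

```python
import re

# Hypothetical stop word list for illustration; not KNIME's built-in list.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "to", "of", "i", "in"}


def filter_stop_words(text):
    """Drop stop words and lowercase the rest, keeping punctuation in place."""
    def keep_or_drop(match):
        word = match.group().lower()
        return "" if word in STOP_WORDS else word

    kept = re.sub(r"[A-Za-z]+", keep_or_drop, text)
    kept = re.sub(r"\s+", " ", kept).strip()    # collapse leftover whitespace
    return re.sub(r"\s+([.!?])", r"\1", kept)   # no space before end punctuation


def extract_sentences(text):
    """Split on sentence-ending punctuation and strip it from each sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = (p.rstrip(".!?").strip() for p in parts)
    return [s for s in sentences if s]


# Step 1: preprocess the whole document. Step 2: extract the sentences.
document = "The coupe was pleasant to drive. It is a dishonest dealer!"
preprocessed = filter_stop_words(document)
for sentence in extract_sentences(preprocessed):
    print(sentence)
```

The design point is that the preprocessing step must not destroy the sentence boundaries; a filter that strips all punctuation would leave nothing for the later sentence extraction to split on.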

Kilian

suresh_reddy · August 7, 2012, 1:47pm

Can you help me by providing the 20 Newsgroups dataset for KNIME? Or can you provide a dataset for text preprocessing that contains only text documents?

How can I read a Word file or a PDF in KNIME, and how can I read multiple files?
