Export documents after pre-processing



First of all, thanks for providing these highly useful tools!

I have parsed a set of plain text files, set up a pre-processing workflow (tagger, BoW, filter, with deep processing enabled), and am now looking for a way to export the results back to plain .txt files. In other words, I want to arrive at text files that contain the results as shown in the "Document Viewer" node.

Is there a handy way to do this?

I have thought about doing this via a detour through the CSV Writer node, but I am lacking the necessary data structure after the filter node (which provides one row per term rather than one row per document).

Any hints are highly appreciated.

Thanks, Christian

Found a way via the "Document Data Extractor" (text) and "GroupBy" nodes. This at least allows the processed texts to be exported as strings in a .csv table.
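For getting from such a .csv table to one plain .txt file per document, a small script outside KNIME is one possible last step. This is only a sketch: the function name export_rows_to_txt, the column name "Document Text" and the output file naming are assumptions for illustration and need to be adapted to the actual table.

```python
import csv
from pathlib import Path


def export_rows_to_txt(csv_path, out_dir, text_column="Document Text"):
    """Write the text of each CSV row to its own .txt file.

    The column name is an assumption; use whatever name the
    extracted text column has in the exported table.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader uses the header row to name the columns.
        for index, row in enumerate(csv.DictReader(handle), start=1):
            target = out / f"document_{index:04d}.txt"
            target.write_text(row[text_column], encoding="utf-8")
            written.append(target)
    return written
```

The files are numbered in row order; if the table also carries a title column, that could be used for the file names instead.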


If there are smarter solutions, hints are appreciated!

Best, Christian

Hi Christian,


I assume that you still have a bag-of-words-like data table structure after the filtering/preprocessing. First you need to group the data to get one row per distinct document. For this you can use the "GroupBy" node and group over the column containing the preprocessed documents. Next you need to extract the data contained in the documents as string cells in order to use the CSV Writer. The "Document Data Extractor" node can be used for this. In its dialog you can specify which data should be extracted, e.g. title, full text, authors ... This results in a data table containing the documents and the extracted fields as string columns. Then just filter out the document column with the "Column Filter" node, and after that you can use the CSV Writer.
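To illustrate what the grouping step does to the table, here is a minimal sketch in plain Python. It is not KNIME code; the function name one_row_per_document and the sample rows are made up for illustration, and each document is represented simply by its text.

```python
def one_row_per_document(bow_rows):
    """Collapse (term, document_text) rows to the distinct documents.

    Mirrors grouping over the document column: every document is
    kept once, in the order it first appears.
    """
    seen = set()
    documents = []
    for _term, document_text in bow_rows:
        if document_text not in seen:
            seen.add(document_text)
            documents.append(document_text)
    return documents


# Hypothetical bag-of-words rows: one row per term, document repeated.
rows = [
    ("tool", "useful tool"),
    ("useful", "useful tool"),
    ("export", "export text"),
]
print(one_row_per_document(rows))  # ['useful tool', 'export text']
```

The term column is dropped here because only the documents themselves are needed for the export.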

If you just want to write the documents to a file and read them back into KNIME later on, the "Table Writer" / "Table Reader" nodes are an easy alternative. With these nodes you can write and read a complete data table without any transformations.

Cheers, Kilian

