Keep relation to original document

Hello,

I have two sequences of nodes in my KNIME workspace that filter documents for specific terms listed in a dictionary. In both sequences I lose the relation to the original document. Is it possible to keep the row number or the filename of each document used?

The first one looks like this:

http://share-me.de/filter-jobnames.png

These nodes should filter job offers that are listed in a dictionary. Unfortunately, the BoW Creator discards the RowIDs of the original documents. Is it possible to keep the relation between the job offers and the output data in the "Interactive Table"?

The second node sequence should find all documents containing terms that have a Levenshtein distance of 1 to the terms of a dictionary. Here I have the same problem: the relation to the original documents is lost when using the String Matcher. Is it possible to keep the original RowID or the filename of each document here as well?

http://share-me.de/lookup-dictionary.png

Thank you for any advice!

Greetings, Cls

Hello Cls,

During preprocessing (e.g. filtering the BoW), the original documents can be kept in an additional column, which is specified by default. In the dialog of the filter node you can check the box "keep original documents". The output table of the filter node then contains a term column, a document column (with the filtered documents) and a column containing the original documents. With the Document Data Extractor, the filename can be extracted from the original documents afterwards.

It is indeed a bit inconvenient that the original RowIDs get lost. One thing you could do is set the specific information you want to extract (e.g. the filename) as the category or source of the document (using Strings to Document). This information can be extracted afterwards with the Document Data Extractor. In version 2.8 it will be possible to insert different kinds of meta information into documents (via key-value pairs), which can be extracted afterwards. That will be much handier than exploiting the category or source field.
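
To illustrate the idea outside of KNIME, here is a minimal Python sketch. It is not KNIME API: the Document class, its source field and the filter_terms function are hypothetical names made up for this example. It only shows why the relation survives the filtering when the filename travels inside the document itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a document; not a KNIME class."""
    text: str
    source: str  # assumption: the filename is stored in the source field


def filter_terms(docs, dictionary):
    """Return (term, source) pairs for every dictionary term found in a document."""
    wanted = {term.lower() for term in dictionary}
    rows = []
    for doc in docs:
        for term in doc.text.lower().split():
            if term in wanted:
                # The filename comes from the document, not from a row ID,
                # so it is still available after the filtering step.
                rows.append((term, doc.source))
    return rows


docs = [
    Document("Senior Java developer wanted", source="offer_01.txt"),
    Document("Nurse for night shifts", source="offer_02.txt"),
]
rows = filter_terms(docs, ["developer", "nurse"])
print(rows)
```

Each output row carries its filename, regardless of what happened to the row IDs along the way.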

 

Cheers, Kilian

Hey Kilian,

thank you for your response!

Is it possible to somehow combine the "Strings to Document" node with the "Flat File Document Parser" and the "List Files" node to automatically assign each filename as the "source of the document" to the corresponding document?

Thank you!
Cls

Hi Cls,

It depends on the format of your text (data) files. It is possible if the data is in a CSV-like format, i.e. each row of your file represents one document and contains all the necessary information, e.g.:

"title1";"author";"fulltext1"
"title2";"author";"fulltext2"

and so on. If your data is formatted like this, you don't have to use the "Flat File Document Parser". Instead, use the "File Reader" node and then the "Strings to Document" node.
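
For reference, this is roughly what reading such a file amounts to, sketched in plain Python rather than KNIME. The column names title, author and text and the function read_documents are assumptions chosen for the example; the sample rows are the ones from above.

```python
import csv
import io

# The two sample rows from above, as they would appear in a file.
sample = '"title1";"author";"fulltext1"\n"title2";"author";"fulltext2"\n'


def read_documents(handle):
    """Parse semicolon-separated, double-quoted rows into one dict per document."""
    documents = []
    for row in csv.reader(handle, delimiter=";", quotechar='"'):
        if not row:  # skip empty lines
            continue
        title, author, text = row
        documents.append({"title": title, "author": author, "text": text})
    return documents


documents = read_documents(io.StringIO(sample))
print(documents)
```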

 

If you have many of these files, use the "List Files" node to create a data table containing all the file names and then loop over the rows of this table. Pass the file name of the current loop iteration to the File Reader as a flow variable. Add an additional column to the File Reader's output table, e.g. via a Java Snippet node, containing the value of the flow variable (which is the file name). Finally, use the "Strings to Document" node to create the documents.
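
The loop described above corresponds roughly to the following Python sketch. Again, this is not KNIME code: the function read_folder, the *.csv file pattern and the demo file names are assumptions for illustration. The comments mark which KNIME node each step stands in for.

```python
import csv
import tempfile
from pathlib import Path


def read_folder(folder):
    """Read every *.csv file in a folder and append the filename to each row."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):  # "List Files" plus the loop
        with path.open(newline="", encoding="utf-8") as handle:  # "File Reader"
            for row in csv.reader(handle, delimiter=";"):
                if row:
                    # Extra column holding the file name (the Java Snippet step).
                    rows.append(row + [path.name])
    return rows


# Small demo with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text('"title1";"author";"fulltext1"\n', encoding="utf-8")
    Path(tmp, "b.csv").write_text('"title2";"author";"fulltext2"\n', encoding="utf-8")
    table = read_folder(tmp)

print(table)
```

The last column of each row can then be used as the document source when the documents are created.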

 

Cheers, Kilian

Thanks, that works!
Cls