How to keep original data

First off, my documents come from a string column in a database, and I want to analyze its content.

However, I’m kind of lost on how to link terms back to a specific ID (e.g. the primary key of that row in the database). Note that I do not want the document names to appear among the terms (they can also occur within other documents), so they are filtered out with the RegEx Filter, and then I lose all document names.

I can set the ID as the document name and then use the “Keep original document” setting of the preprocessing nodes to keep the name. But some nodes, like the POS Tagger, then throw an error:

Only one column containing DocumentCells allowed !

In the end, my goal is to have each record (ID) associated with the terms found in this string column.

How can I achieve this?


You can use the database IDs as the source field. That can be specified in the Strings To Document node. Later on, you can extract the source information (which is the database IDs) using the Document Data Extractor.

Attached you find an example workflow showing how to insert IDs as sources and extract them later on.
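
For readers without the workflow at hand, here is a rough sketch of the same idea in plain Python. It is not the KNIME API: the Document class, the function names and the sample rows are all made up for illustration. It only shows the principle of storing the ID as source metadata, separate from the text, so the ID never shows up as a term but can still be read back afterwards.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus source metadata."""
    text: str
    source: str  # the database ID, kept apart from the text


def strings_to_documents(rows):
    # Rough analogue of turning string rows into documents with the ID as source.
    return [Document(text=text, source=str(row_id)) for row_id, text in rows]


def extract_terms(doc):
    # Very simple tokenizer (an assumption): lowercase alphabetic words only.
    return re.findall(r"[a-z]+", doc.text.lower())


def terms_by_source(docs):
    # Rough analogue of reading the source back out after processing.
    result = {}
    for doc in docs:
        result.setdefault(doc.source, []).extend(extract_terms(doc))
    return result


# Made-up sample rows: (primary key, string column)
rows = [(101, "Red apples and green pears"), (102, "Green tea")]
linked = terms_by_source(strings_to_documents(rows))

for source_id, terms in linked.items():
    print(source_id, terms)
```

Because the ID lives in the source field and not in the title or text, no filtering step has to remove it from the terms.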

Cheers, Kilian