Text processing - carry identifier

Hi,

I have the following problem. I am reading in a file of about 3000 rows with two columns: the first column is an identifier and the second is text. After I execute the "BOW Creator" node, my identifier is gone, so I cannot join term frequencies or other results back to the identifier. I was able to join via the original document, but that is not an ideal solution. Is it possible to carry the identifier through the whole text mining process?

Kind regards

Thomas Grau

Hey Thomas!

I guess you used the Strings to Document node to create the documents, right?

Please have a look at your Strings to Document node settings. If you set the identifier column as "title" and the text column as "full text", it should work. The title is what you will see in the document column after the Strings to Document node. If you then use the BOW Creator node, each output row contains a term and the related document, which in turn contains the identifier (as its title) and the text.
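To make the idea concrete, here is a minimal sketch in plain Python. It is not KNIME code and does not use any KNIME API; the `Document` class and the `bag_of_words` function are hypothetical stand-ins for the document cell and the bag-of-words step. It only illustrates why the identifier survives: every term row keeps a reference to its document, and the document carries the identifier as its title.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a document cell."""
    title: str  # the row identifier is stored here
    text: str   # the full text


def bag_of_words(docs):
    """Return one (term, document) row per distinct term in each document."""
    rows = []
    for doc in docs:
        for term in sorted(set(doc.text.lower().split())):
            rows.append((term, doc))
    return rows


# Two input rows: (identifier, text). The values are made up for illustration.
rows_in = [("ID-001", "knime reads text"), ("ID-002", "text mining with knime")]
docs = [Document(title=ident, text=text) for ident, text in rows_in]

bow = bag_of_words(docs)

# The identifier can be read back from the document attached to each term row.
term_counts = Counter((doc.title, term) for term, doc in bow)
ids_for_text = sorted({doc.title for term, doc in bow if term == "text"})
```

Here `ids_for_text` lists the identifiers of all documents that contain the term "text", without any join against the original table.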

I hope this will help you.

Cheers,

Julian

Hi Thomas,

Alternatively, you can set your identifier as the category or source meta information of the documents (using the Strings to Document node). This information can be extracted from the documents later on, e.g. with the Document Data Extractor node.
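As a rough illustration of this variant, again in plain Python rather than KNIME: the `Document` class and the `extract_fields` helper below are hypothetical and only mimic the idea of storing the identifier as meta information and pulling it out into a column later. The title stays free for a real title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a document with meta information."""
    title: str
    text: str
    category: str = ""  # the identifier is stored here in this sketch
    source: str = ""


def extract_fields(doc, fields):
    """Return the requested document fields as a dict of column values."""
    return {name: getattr(doc, name) for name in fields}


# The identifier "ID-002" is a made-up example value.
doc = Document(
    title="Report on text mining",
    text="text mining with knime",
    category="ID-002",
)

# One (term, document) row per term, as in a bag of words.
bow = [(term, doc) for term in doc.text.split()]

# Extract the identifier from each row's document into its own column.
table = [{"term": term, **extract_fields(d, ["category"])} for term, d in bow]
```

Each row of `table` now has the term next to the identifier, so results can be joined back to the original rows on that column.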

Cheers, Kilian