How to remove tags from documents

I was wondering if there is a way to remove POS tags from documents once they have served their purpose. I want to do this so that when I apply TF-IDF on the processed documents, different senses of the same term are treated identically.

This seems like a pretty logical thing to do, but I can't yet see any way to do it. There doesn't appear to be any 'tag stripper' node, and I don't know how to adapt any other node to perform the same function.

The simplest workaround seems to be to convert the documents into strings (via the Document Data Extractor node), then convert the strings back into documents. This isn't too painful, but if there's a simpler way, I'd love to know about it.

That is a very good point :-). I had not thought about stripping tags before. For a bag of words, it is possible with a workaround: first use the Term to String node, then the String to Term node. This will cut away the tags. Please note that for counting and frequency computation, only the word of a term is compared, not the tags.
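
To make the effect of the round trip concrete, here is a minimal Python sketch of the idea. It is not KNIME code: the `Term` class and the `strip_tags` and `term_frequencies` helpers are hypothetical names invented for this illustration, and the toy model assumes the tags are part of a term's identity, which is not how the counting nodes behave according to the note above.

```python
from collections import Counter
from typing import NamedTuple


class Term(NamedTuple):
    """Hypothetical stand-in for a tagged term: a word plus its POS tags."""
    word: str
    tags: frozenset


def strip_tags(terms):
    """Mimic the term -> string -> term round trip: keep the word, drop the tags."""
    return [Term(word=t.word, tags=frozenset()) for t in terms]


def term_frequencies(terms):
    """Relative frequency of each distinct term in the list."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# "book" appears once as a noun and once as a verb.
doc = [
    Term("book", frozenset({"NN"})),
    Term("book", frozenset({"VB"})),
    Term("flight", frozenset({"NN"})),
]

tagged = term_frequencies(doc)                # tags still distinguish the two "book" terms
untagged = term_frequencies(strip_tags(doc))  # both "book" terms collapse into one

print(len(tagged), len(untagged))
```

In this toy model the tagged list has three distinct terms, while the stripped list has two, because both senses of "book" are merged once the tags are gone.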

Anyways, I will put the Tag Stripper on the list. Thank you for the hint.

Cheers, Kilian
