Having tagged multipart terms (i.e. ngrams) using the Dictionary Tagger, I sometimes want to later un-tag these terms and have them revert back to their component parts (so that a bag of words would no longer reveal multi-part terms).
As far as I can see, the Tag Stripper node doesn’t do this: it removes the tags, but does not split the terms. So at present I do it by finding all multi-part terms in the bag of words, splitting them as strings, grouping the results to list all the individual words, and then tagging all instances of these individual words using the Dictionary Tagger.
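To make the splitting and grouping steps concrete, here is a minimal sketch of the same logic in plain Python. It is only an illustration, not the actual workflow: the function name `split_multiword_terms` and the sample terms are hypothetical, and it assumes terms are separated by whitespace.

```python
def split_multiword_terms(terms):
    """Collect the unique single words that occur inside multi-word terms."""
    words = set()
    for term in terms:
        parts = term.split()      # assumption: words are whitespace-separated
        if len(parts) > 1:        # only multi-part terms are of interest
            words.update(parts)
    return sorted(words)


# Hypothetical bag-of-words terms, some of them tagged ngrams
bag_of_words = ["machine learning", "data", "text mining", "machine translation"]
single_words = split_multiword_terms(bag_of_words)
```

The resulting list corresponds to the dictionary that is then fed into the final (and expensive) Dictionary Tagger step.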
This works, but it is a fairly expensive process to do what is a conceptually simple action. (The final tagging operation is particularly intensive.) Is there a simpler way to do this that I have missed? Or could the option to do this perhaps be added to the Tag Stripper or a new node?
The Tag Stripper helps you if you want to remove all the tags of terms that are contained in a document.
It won’t help you in this case.
Does your dictionary already have multiple terms per row?
Would it make sense to create another dictionary table with one term per row only and tag those terms within another branch in your workflow? You could use the Cell Splitter node to split the content of a selected column into parts. Then you could use the Dictionary Tagger (Multi Column) to tag the terms based on the different columns.
In this way you will end up with a workflow that has two branches:
- tagging multiple terms
- tagging one term
Would that help?
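As a rough illustration of what splitting the dictionary column produces, here is a small Python sketch. It only mimics the idea of splitting one column into several and is not the Cell Splitter's implementation; the function name `split_into_columns`, the space delimiter, and the sample entries are assumptions.

```python
def split_into_columns(entries, delimiter=" "):
    """Split each dictionary entry into parts, padding rows to equal width."""
    rows = [entry.split(delimiter) for entry in entries]
    width = max((len(row) for row in rows), default=0)
    # Pad shorter rows with None so every row has the same number of columns
    return [row + [None] * (width - len(row)) for row in rows]


# Hypothetical dictionary with one (possibly multi-word) term per row
dictionary = ["machine learning", "text", "natural language processing"]
columns = split_into_columns(dictionary)
```

Each resulting column could then serve as a separate dictionary column for the single-term branch.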
Currently, there is no option to split up multi-word terms into several terms.
One thing you can do is extract all String information using the Document Data Extractor node, selecting all the needed information in the dialog. Then you can recreate the documents using the Strings to Document node. It will tokenize the Strings based on the selected tokenizer.
This might also be fairly expensive, and you will lose all document-related information (e.g. tags), but it is probably faster than using the Dictionary Tagger to split the terms.
Thanks for the suggestion, Vincenzo. That approach makes good sense, except that I want to iterate some operations over the same documents. For example, first tag ngrams, then make some corrections or replacements, then process the corrected documents without the ngrams tagged.
Julian - I would use that method, but in the past I have found that the Document Data Extractor corrupts the documents by merging some terms together. I did find a workaround by adding a marker to the end of every word and then splitting terms that were joined by the marker, but that is also an expensive solution.
I’ve posted here about that issue previously, and I think there was some talk of it being resolved. Has there been any progress in this regard?
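For anyone wondering what the marker workaround mentioned above amounts to, here is a minimal Python sketch of the idea. It is not the actual workflow: the marker character and the function names are hypothetical, and the "merge" is simulated by hand rather than produced by the Document Data Extractor.

```python
MARKER = "\u241f"  # hypothetical marker; any character absent from the text works


def add_markers(words):
    """Append the marker to the end of every word."""
    return [word + MARKER for word in words]


def split_merged(terms):
    """Split terms on the marker and drop the empty pieces left at the ends."""
    restored = []
    for term in terms:
        restored.extend(part for part in term.split(MARKER) if part)
    return restored


marked = add_markers(["New", "York", "city"])
# Simulate two terms being merged together during extraction
merged = [marked[0] + marked[1], marked[2]]
restored = split_merged(merged)
```

Because every word carries its own marker, a merged term can always be split back into its original words, at the cost of touching every term twice.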