String Replacement (with multiple words and spaces) in a Document

I have a set of documents in which references appear both as full titles and as acronyms (e.g. Too Much Information vs. TMI). I'd like to replace the full titles with their acronyms, to establish consistency and reduce noise in the document-term matrix that I eventually end up working on.

In a previous post on a similar topic, the recommendation was to convert the documents to a Bag of Words, use the Dictionary Replacer or String (RegEx) Replacer node (with deep preprocessing enabled) to make the replacements, and then work from the resulting documents after re-grouping them. In my case the terms are compound titles, so the Bag of Words conversion destroys the reference by splitting each title into its component terms (e.g. Too Much Information --> [Too] [Much] [Information]), which prevents the desired matching and replacement. The String Replacer looks like it can work directly on the document, but there are a fairly large number of these titles, so I'd need a long chain of String Replacer nodes, which would be cumbersome.

Have I overlooked a node (or option in a node) that will let me deal with this situation? 

Thanks much!

Hi, to stop the terms being split up into individual words, you need to use the Dictionary Tagger node (in the Enrichment section) before the Bag of Words node. There you can define all the terms you want to keep whole, so that they can be replaced later with the Dictionary Replacer node.
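
For anyone who wants to see the idea outside of a workflow, here is a minimal Python sketch of the same ordering: match the multi-word titles as whole units first, swap in the acronyms, and only then split into individual words. It is not how the KNIME nodes are implemented. The dictionary contents and the names TITLE_TO_ACRONYM, build_pattern, replace_titles and bag_of_words are all made up for illustration.

```python
import re

# Hypothetical dictionary: full title -> acronym (illustrative entries only).
TITLE_TO_ACRONYM = {
    "Too Much Information": "TMI",
    "Document Term Matrix": "DTM",
}


def build_pattern(titles):
    """Compile one regex that matches any of the titles as a whole phrase."""
    # Longest titles first, so a long title wins over a shorter one it contains.
    ordered = sorted(titles, key=len, reverse=True)
    # Allow any run of whitespace between the words of a title.
    alternatives = [
        r"\s+".join(re.escape(word) for word in title.split())
        for title in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def replace_titles(text, mapping):
    """Replace every full title in text with its acronym in a single pass."""
    pattern = build_pattern(mapping)
    # Normalised lookup: lower case, single spaces between words.
    lookup = {" ".join(k.lower().split()): v for k, v in mapping.items()}

    def swap(match):
        key = " ".join(match.group(0).lower().split())
        return lookup[key]

    return pattern.sub(swap, text)


def bag_of_words(text):
    """Very rough tokeniser standing in for the bag-of-words step."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    doc = "That was too much  information for one document term matrix."
    cleaned = replace_titles(doc, TITLE_TO_ACRONYM)
    print(cleaned)
    print(bag_of_words(cleaned))
```

The point of the sketch is only the order of operations: because the whole dictionary is applied in one pass before tokenising, there is no need for one replacement step per title, and each title survives as a single term.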

simon.

Thanks very much Simon!  I hadn't considered using the Dictionary Tagger to make the initial matches.  Haven't had time to cobble the whole process together yet, but the initial tests seem good.

Daniel