Multiword replacement - Dict Replacer (2 in-ports)?

I am trying to replace instances of specific strings in a series of documents with a controlled vocabulary from a dictionary, e.g.:
“Bernard” is replaced with “Dog”
“St Bernard” is replaced with “Dog”
(For those interested, I’m aiming to map MeSH terms in PubMed abstracts.)

I am using the Dict Replacer with 2 inputs, but it appears that only “Bernard” is ever replaced with “Dog”. So in cases such as “St Bernard is hairy” I am getting “St Dog is hairy”. I have tried sorting my dictionary from the longest replacement string to the shortest, but this doesn’t have any effect.

Can anyone suggest a solution or an alternative approach?

Thanks

Chris

Hey @arthuc01,

Most preprocessing nodes can only process single terms.
So unfortunately you can’t use the Dict Replacer directly for this task.
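
To illustrate why this happens, here is a minimal sketch in plain Python. It is only an analogy and not how the Dict Replacer is actually implemented; the function name `replace_per_term` is made up for this example. When each term is looked up on its own, the entry “St Bernard” can never match, because no single term contains a space.

```python
# Illustration only: plain Python, not KNIME's actual implementation.
replacements = {"Bernard": "Dog", "St Bernard": "Dog"}

def replace_per_term(text, mapping):
    """Look up every whitespace-separated term on its own."""
    return " ".join(mapping.get(term, term) for term in text.split())

result = replace_per_term("St Bernard is hairy", replacements)
print(result)  # St Dog is hairy
```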

However, you could tag your documents with the Dictionary Tagger node (use any arbitrary tag and uncheck the “set unmodifiable” option). This will cause terms like “St” and “Bernard” to be concatenated into one term: “St Bernard”.
Note: You have to sort the dictionary properly. For example, if “Bernard” occurs after “St Bernard” in the dict, the previously concatenated term in the document will be split again.

After building the multi-words with the Dictionary Tagger, you can use the Tag Stripper node to get rid of the tags. Then, you can use the Dict Replacer (2 in-ports), since the multi-words were concatenated to one term.
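
The idea behind these steps can be sketched in plain Python. This is only an analogy under my own assumptions, not the behavior of the KNIME nodes: `merge_multiword_terms` and `replace_terms` are hypothetical helpers, and the greedy longest-match merge stands in for the tagging step (it does not model the dictionary ordering issue mentioned above).

```python
# Illustration only: a plain-Python analogy, not KNIME node behavior.
replacements = {"Bernard": "Dog", "St Bernard": "Dog"}

def merge_multiword_terms(terms, dictionary):
    """Join adjacent terms that together form a dictionary entry."""
    entries = {tuple(entry.split()) for entry in dictionary}
    max_len = max(len(entry) for entry in entries)
    merged, i = [], 0
    while i < len(terms):
        # Try the longest candidate window first, then shrink it.
        for size in range(min(max_len, len(terms) - i), 0, -1):
            window = tuple(terms[i:i + size])
            if window in entries:
                merged.append(" ".join(window))
                i += size
                break
        else:
            # No dictionary entry starts here: keep the single term.
            merged.append(terms[i])
            i += 1
    return merged

def replace_terms(terms, mapping):
    """Replace whole terms; merged multi-words now count as one term."""
    return [mapping.get(term, term) for term in terms]

terms = merge_multiword_terms("St Bernard is hairy".split(), replacements)
print(terms)  # ['St Bernard', 'is', 'hairy']
print(" ".join(replace_terms(terms, replacements)))  # Dog is hairy
```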

I hope this helps.

Best,
Julian

Hello,

Just a thought, not sure if it would work, but it may be a simpler solution if it does. You could use the String Manipulation node to replace all “ ” with “-” or something similar. Then try replacing “St-Bernard” with “Dog”. Then use String Manipulation again to replace all “-” with “ ”.
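
Roughly, in plain Python (just to illustrate the idea; this is not the String Manipulation node's own syntax, and `replace_via_placeholder` is a made-up name), the three steps would look like the sketch below. One thing to watch: any hyphen that was already in the text is also turned into a space in the last step.

```python
# Illustration only: plain Python string operations, not the node's own functions.
def replace_via_placeholder(text, search, replacement, placeholder="-"):
    """Join words with a placeholder, replace the joined phrase, then restore spaces."""
    joined = text.replace(" ", placeholder)
    joined_search = search.replace(" ", placeholder)
    replaced = joined.replace(joined_search, replacement)
    return replaced.replace(placeholder, " ")

print(replace_via_placeholder("St Bernard is hairy", "St Bernard", "Dog"))
# Dog is hairy

# Caveat: a hyphen that was already in the text is also turned into a space.
print(replace_via_placeholder("A well-known St Bernard", "St Bernard", "Dog"))
# A well known Dog
```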

Kind regards,
Yush

Hey @Yush,

Yes, that would be another solution. However, I guess he would have to extract all the information from the documents with the Document Data Extractor, do the replacing with the String Manipulation node, and rebuild the documents with the Strings To Document node.

The String Manipulation node only works on String columns and not on Document columns.
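
Once the text is available as a plain string, the replacement itself could be done in one pass. The following is a minimal Python sketch of that string-level step only, assuming the text has already been extracted; the helper names `build_pattern` and `replace_all` are hypothetical, and the extracting and rebuilding of documents is not shown.

```python
import re

# Illustration only: operates on plain strings, not on KNIME Document cells.
replacements = {"Bernard": "Dog", "St Bernard": "Dog"}

def build_pattern(mapping):
    """Build one regex that matches any dictionary entry as whole words."""
    # Longer entries come first so they are preferred when two entries
    # could match starting at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternatives + r")\b")

def replace_all(text, mapping):
    """Replace every dictionary entry found in the text in a single pass."""
    pattern = build_pattern(mapping)
    return pattern.sub(lambda match: mapping[match.group(0)], text)

print(replace_all("St Bernard is hairy, and so is Bernard", replacements))
# Dog is hairy, and so is Dog
```

Because the text is scanned once from left to right, “St Bernard” is consumed as a whole before “Bernard” alone can match inside it.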

Best,
Julian

Thank you both. I followed Julian’s approach and managed to get it to work. For anyone following this, I had to set the following:

Dictionary table - sorted by string length, ascending
Dictionary Tagger options - check “set named entities unmodifiable”
Dict Replacer - check “Ignore unmodifiable flag”

Thanks again
Chris
