For a document similarity workflow I need to preprocess the documents using a dictionary. This dictionary consists of two long columns, one with search phrases and one with replacement phrases. A phrase can consist of 1-6 words. I have already tried three options mentioned on the forum:
- Dict Replacer (2 ports): This only works for single terms. “Xxx yyy” cannot be replaced by “ppp qqq”.
- Dictionary Tagger: It cannot replace. Moreover, it cannot deal with strings that are also part of longer strings: “Xxx” is not tagged when “xxx yyy zzz” is also in the dictionary. The tagging also seems to be limited to a certain number of characters.
- Recursive loop together with string manipulation: This works fine but is unfortunately very slow. Processing one document with the complete dictionary (14,000 phrases) takes more than an hour, which is not acceptable.
So for this specific task I can’t find a suitable solution. Any ideas?
You can use the Dict Replacer (2 ports) node for this task, but first you need to send your documents through a Dictionary Tagger node, using the key column from your dictionary as tagging input. This way the sentences you want to replace later will be treated as single terms and will not be split further (i.e. tokenized on spaces), so the dictionary replacement will work as expected.
Give it a try and feel free to report here in case of further problems.
Thanks for your suggestion, I tried it. The issue is that the Dictionary Tagger has problems with multi-word strings that overlap. An example:
|aaa bbb ccc ddd eee fff ggg
|aaa bbb ccc ddd eee fff
aaa bbb ccc ddd eee
|aaa bbb ccc ddd
|aaa bbb ccc
All these strings need to be replaced by xxx (this is an example, the real wordlist has various versions of these strings).
The result is:
|aaa bbb ccc ddd eee fff ggg
||ccc ddd eee fff ggg
|aaa bbb ccc ddd
|aaa ooo ppp
||aaa ooo ppp
|ooo aaa bbb ccc ddd eee fff ggg ppp
||ooo ccc ddd eee fff ggg ppp
So in this case only the string aaa bbb ccc is tagged. The other strings are ignored. In other cases only the first word is tagged and the longer strings ignored. The result with a dictionary with various combinations is unpredictable. I tried exact match in the dictionary tagger and set unmodifiable but the results are the same.
I am not entirely sure this would work, but I was thinking you could try to do the tagging/replacement in multiple steps, starting with the longest sentences and working down to the shortest ones. This way you would eliminate most of the overlapping cases between the long and the short sentences.
You can chunk the replacement dictionary according to the length of each replacement sentence (characters or words) and go through each chunk in a loop, from the longest to the shortest.
I didn't have time to try out this concept, so please take it as such.
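As a rough illustration of the longest-to-shortest idea outside KNIME, here is a minimal Python sketch. It works on plain strings rather than on tagged documents, and the function name `replace_longest_first` and the sample dictionary are made up for this example. It groups the phrases by word count and runs one pass per group, longest first, so a short phrase cannot win over a longer phrase that contains it.

```python
import re
from collections import defaultdict


def replace_longest_first(text, replacements):
    """Replace phrases in passes, chunked by word count, longest chunk first."""
    # Group the dictionary entries by the number of words in the search phrase.
    chunks = defaultdict(list)
    for phrase, target in replacements.items():
        chunks[len(phrase.split())].append((phrase, target))

    # One pass per chunk, from the longest phrases down to the shortest.
    for n_words in sorted(chunks, reverse=True):
        for phrase, target in chunks[n_words]:
            # The lookarounds keep "aaa" from matching inside "baaa" or "aaab".
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            # A function replacement inserts the target literally (no backslash escapes).
            text = re.sub(pattern, lambda _match, t=target: t, text)
    return text


# Hypothetical dictionary modelled on the overlapping example in this thread.
dictionary = {
    "aaa bbb ccc": "xxx",
    "aaa bbb ccc ddd": "xxx",
    "aaa bbb ccc ddd eee": "xxx",
    "aaa bbb ccc ddd eee fff": "xxx",
    "aaa bbb ccc ddd eee fff ggg": "xxx",
}

print(replace_longest_first("ooo aaa bbb ccc ddd eee fff ggg ppp", dictionary))
# prints "ooo xxx ppp"
```

Two caveats: a replacement text that itself contains a shorter search phrase would be replaced again in a later pass, and running one regex per phrase is only meant to show the ordering, not to be fast with thousands of phrases.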
Thanks for your suggestions. I executed the idea to do the tagging in multiple steps in two different ways:
By using a recursive loop and filtering the strings to tag by a decreasing number of words. Unfortunately the problem remains: longer tags are overwritten by shorter ones. I was thinking of replacing the tagged words with something else, but these kinds of nodes work on string level and change the recurring document, which doesn’t work.
By using a Table Row to Variable loop. In every iteration, word combinations with a decreasing number of words are filtered and tagged, a bag of words is made, and the tags are converted to a string, filtered, and appended to the table at the end of the loop. This delivers a collection of tagged words of decreasing length per document:
Document 1, aaa bbb ccc ddd
Document 1, aaa bbb ccc
Document 1, aaa bbb
Document 1, aaa
So now I need something to filter out the last 3 versions automatically. Moreover, words like ccc are sometimes part of the wordlist themselves and need to be excluded from the filter, which is quite complicated. Besides, the output needs to be converted to a document vector in some way. So this is not finished yet.
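To make the filtering step concrete, here is a small Python sketch of one possible approach, assuming the loop output is available as (document, phrase) pairs. The function name `keep_longest_phrases` and the row layout are hypothetical. Per document it keeps a phrase only if it is not contained, word for word, in a longer phrase that was already kept.

```python
def keep_longest_phrases(rows):
    """rows: iterable of (document_id, tagged_phrase) pairs."""
    phrases_by_doc = {}
    for doc_id, phrase in rows:
        # Normalise whitespace so phrases can be compared word by word.
        phrases_by_doc.setdefault(doc_id, []).append(" ".join(phrase.split()))

    kept = []
    for doc_id, phrases in phrases_by_doc.items():
        selected = []
        # dict.fromkeys drops duplicates while keeping the input order.
        unique = list(dict.fromkeys(phrases))
        for phrase in sorted(unique, key=lambda p: len(p.split()), reverse=True):
            # Padding with spaces makes the containment test respect word boundaries.
            if not any(f" {phrase} " in f" {longer} " for longer in selected):
                selected.append(phrase)
        kept.extend((doc_id, phrase) for phrase in selected)
    return kept


rows = [
    ("Document 1", "aaa bbb ccc ddd"),
    ("Document 1", "aaa bbb ccc"),
    ("Document 1", "aaa bbb"),
    ("Document 1", "aaa"),
]
print(keep_longest_phrases(rows))
# prints [('Document 1', 'aaa bbb ccc ddd')]
```

This does not solve the case where a word like ccc also occurs on its own elsewhere in the same document: the phrase list alone cannot tell a standalone ccc from the one inside the longer phrase, so that would need the match positions as well.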
I have two additional findings:
In the Table Row to Variable loop I can't append the original document in the bag of words. If I use this option I get the error “Index of specified document column is not valid! Check your settings!” If I use the same nodes without the loop this doesn't occur. No idea why this happens. <Caused by wrong preprocessing and already solved>
I use the Document Data Extractor to determine the number of terms in the table of words/strings to tag (after using Strings To Document). For some reason the number of terms is twice the number of words. Are spaces counted as terms?
I have learned a lot from my experiments above. Hopefully I can find a satisfying solution for this complicated task. I would appreciate it very much if you could spend some time on my additional questions.
As you mentioned, the Dictionary Replacer nodes work only on a term level, not on a sentence level. Recursive loops are a bit unwieldy for this use case, so I would suggest doing the replacement before creating the documents. If you have string inputs, e.g. from a file, use the replacer nodes that work on strings before creating the documents.
If you have PDF files and you are using a parser node, I suggest extracting the text as a string from the documents using the Document Data Extractor node. Then preprocess the strings, do the replacement, and create documents again. This is a workaround, but it will be faster than a recursive loop (and probably easier).
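For anyone who prefers to do the string-level replacement in a scripting step, here is a minimal Python sketch of a single-pass replacement. It is only an illustration under some assumptions: the dictionary is a plain mapping from search phrase to replacement phrase, matching is case-sensitive, and the name `build_replacer` is made up. All phrases are compiled into one regular expression, sorted longest first, so each document is scanned once instead of once per phrase, and the longest phrase wins at every position.

```python
import re


def build_replacer(replacements):
    """Return a function that applies all phrase replacements in one pass."""
    if not replacements:
        return lambda text: text

    # Python tries alternatives left to right, so longer phrases must come first.
    phrases = sorted(replacements, key=len, reverse=True)
    alternation = "|".join(re.escape(phrase) for phrase in phrases)
    # The lookarounds make sure only whole words are matched.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

    def replace(text):
        # Look up the replacement for whichever phrase matched.
        return pattern.sub(lambda match: replacements[match.group(0)], text)

    return replace


# Hypothetical dictionary using the examples from this thread.
replace = build_replacer({
    "xxx yyy": "ppp qqq",
    "aaa bbb ccc": "xxx",
    "aaa bbb ccc ddd": "xxx",
})

print(replace("ooo aaa bbb ccc ddd ppp"))  # prints "ooo xxx ppp"
print(replace("some xxx yyy text"))        # prints "some ppp qqq text"
```

Because replaced text is not scanned again, a replacement phrase that happens to equal another search phrase is left alone, which differs from a loop that applies the dictionary entry by entry.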
I hope that helps.