I have a set of documents (several hundred pages each, in PDF) written by many authors.
In particular, the same word often appears with a hyphen or dash, with a space, or as a single word, e.g. “roll-out”, “roll out”, “rollout”.
After tokenising and converting to strings, I am trying to harmonise across the documents by removing the hyphens and separating the words, so as to come up with e.g. “roll out” across all documents. This way I hope to have a much more robust feature selection for further processing.
Splitting the hyphenated words is straightforward with the String Manipulation node. But then I’d like to store the parts and reuse them to detect the equivalent single words (e.g. regexReplacer("$col$", "([A-z]+)-([A-z]+)", "$1 $2") and then apply a regexReplacer searching for "$1$2"). But it does not seem to work in a nested regex.
Is there a way forward or am I searching the wrong way?
Is it possible for you to share a small example workflow? That makes it much easier for others to understand what the data looks like and to quickly try out ideas.
Here it is: Preprocessing_test.knwf (26.2 KB)
The other parts of the flow took me a long time to solve. Now this issue of harmonising hyphenated words is coming back as an important one. I added representative sentences from the texts I am processing. They include the words roll-out, semi-conductor, and end-to-end, which also appear either in non-hyphenated (compact) form or with space(s) instead of the dash.
The String Manipulation node tries to perform this harmonisation but fails to detect the non-hyphenated versions of the words.
Any advice would be highly appreciated.
Hi @EricXL -
Sorry for the delay in responding here. I am a novice with RegEx, but in your case it seems like the RegEx Extractor node from the Palladian extension might be useful.
Apart from that, if you would rather try a dictionary-based approach, you could try the Dictionary Replacer (for Document columns) or the String Replace (Dictionary) (for String columns).
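To make the dictionary-based idea more concrete, here is a minimal sketch in plain Python rather than in KNIME nodes. It assumes the texts are available as ordinary strings; the function names `build_dictionary` and `harmonise` are made up for this illustration and are not part of any KNIME node or extension. The first pass collects every hyphenated word, the second pass replaces both the hyphenated and the compact variant with the space-separated form, which is the step a single nested regex replacement cannot do on its own.

```python
import re

# A word made of letter groups joined by one or more hyphens, e.g. "end-to-end".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def build_dictionary(texts):
    """Map each hyphenated word and its compact variant to the spaced form."""
    mapping = {}
    for text in texts:
        for word in HYPHENATED.findall(text):
            parts = word.lower().split("-")
            spaced = " ".join(parts)
            mapping[word.lower()] = spaced      # "roll-out" -> "roll out"
            mapping["".join(parts)] = spaced    # "rollout"  -> "roll out"
    return mapping


def harmonise(text, mapping):
    """Replace every known variant with its spaced form (output is lower-cased
    for the replaced words only) and collapse repeated spaces."""
    if not mapping:
        return text
    # Try longer keys first so "end-to-end" wins over any shorter overlap.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    replaced = pattern.sub(lambda m: mapping[m.group(0).lower()], text)
    return re.sub(r"[ \t]{2,}", " ", replaced)


texts = [
    "The roll-out of semi-conductor fabs",
    "The rollout was end-to-end",
    "A semiconductor roll  out",
]
mapping = build_dictionary(texts)
harmonised = [harmonise(t, mapping) for t in texts]
```

One caveat with this approach: the compact form is replaced wherever it occurs as a whole word, so a pair such as "re-sign" and "resign" would be merged even though the meanings differ. It is probably worth reviewing the generated dictionary before applying it to the full corpus.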