Harmonise hyphenated and non-hyphenated words

EricXL · November 23, 2021, 2:20pm

Dear all,

I have a set of documents (several hundred pages each, in PDF) written by many authors.
In particular, there are many cases of the same word written with a hyphen, with a space, or as a single word, e.g. “roll-out”, “roll out”, “rollout”.
After tokenising and converting to strings, I am trying to harmonise across the documents by removing the hyphens and separating the words, so as to come up with e.g. “roll out” across all documents. This way I hope to have a much more robust feature selection for further processing.

Splitting the hyphenated words is straightforward with String Manipulation. But then I’d like to store the parts and reuse them to detect the equivalent single words (e.g. regexReplacer(“$col$”, “([A-z]+)-([A-z]+)”, “$1 $2”), and then apply a second regexReplacer searching for “$1$2”). But the captured groups do not seem to carry over when the regex calls are nested.
Is there a way forward or am I searching the wrong way?

Thanks
Eric

marvin_kickuth · November 25, 2021, 9:48am

Hi Eric,

Is it possible for you to share a small example workflow? That makes it much easier for others to understand what the data looks like and to quickly try out ideas.

Kind regards
Marvin

EricXL · December 8, 2021, 9:35am

Hi Marvin,

Here it is: Preprocessing_test.knwf (26.2 KB)
The other parts of the flow took me a long time to solve. Now this issue of harmonising hyphenated words is coming back as an important one. I added representative sentences from the texts I am processing. They include the words roll-out, semi-conductor, and end-to-end. These also appear in non-hyphenated versions, either compact or with space(s) instead of the dash.
The String Manipulation node tries to do this harmonisation but fails to detect the non-hyphenated versions of the words.

Any advice would be highly appreciated.
Thanks,

Eric

ScottF · December 21, 2021, 5:01pm

Hi @EricXL -

Sorry for the delay in responding here. I am a novice with RegEx, but in your case it seems like the RegEx Extractor node from the Palladian extension might be useful.
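
To illustrate the extraction idea outside KNIME, here is a minimal Python sketch. It is not how the Palladian node works internally, just the general technique: collect every hyphenated term once, so the list can be reused later. The pattern, the function name `extract_hyphenated`, and the sample sentences are assumptions made for this example.

```python
import re

# Two or more alphabetic parts joined by hyphens, e.g. "roll-out" or "end-to-end".
# (Assumed pattern: it ignores digits and accented letters.)
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def extract_hyphenated(texts):
    """Return the set of distinct hyphenated terms (lower-cased) found in texts."""
    found = set()
    for text in texts:
        for match in HYPHENATED.finditer(text):
            found.add(match.group(0).lower())
    return found


# Made-up sentences using the words mentioned in this thread.
sentences = [
    "The roll-out of the semi-conductor plant is delayed.",
    "We need an end-to-end view of the rollout.",
]
terms = extract_hyphenated(sentences)
```

The resulting set can then serve as the starting point for a replacement dictionary.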

Apart from that, if you would rather take a dictionary-based approach, you could use the Dictionary Replacer (for Document columns) or the String Replace (Dictionary) (for String columns).
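
As a rough sketch of the dictionary-based idea, written in Python rather than as a KNIME workflow: each known hyphenated term yields two dictionary entries (the hyphenated and the compact spelling), both mapped to the space-separated form. The function names `build_dictionary` and `harmonise` and the example terms are assumptions for illustration, not part of any KNIME node.

```python
import re


def build_dictionary(hyphenated_terms):
    """Map the hyphenated and the compact spelling to the spaced form."""
    dictionary = {}
    for term in hyphenated_terms:
        parts = term.lower().split("-")
        canonical = " ".join(parts)               # "roll out"
        dictionary["-".join(parts)] = canonical   # "roll-out"
        dictionary["".join(parts)] = canonical    # "rollout"
    return dictionary


def harmonise(text, dictionary):
    """Replace every known variant in text with its spaced form (lower-cased)."""
    if not dictionary:
        return text
    # Try longer variants first so the longest spelling wins at any position.
    variants = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: dictionary[m.group(0).lower()], text)


lookup = build_dictionary(["roll-out", "semi-conductor", "end-to-end"])
result = harmonise("The rollout of Semi-Conductor fabs needs end-to-end tests.", lookup)
```

One caveat with this approach: the compact spelling can collide with an unrelated word (for example "re-sign" and "resign"), so the dictionary is worth reviewing by hand before it is applied to the whole corpus.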
