Removing 'hard returns' + ‘dashes’ from content pasted from PDF

Hi,

I pasted an article's abstract (text) from a PDF into Excel. I am using the String Replacer node to remove the ‘hard returns’, with this regex: [\r\n]+
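For readers outside KNIME, here is a rough Python sketch of what that replacement does. It assumes the String Replacer is set up with an empty replacement string (the post does not say), and the function name `remove_hard_returns` is made up for illustration.

```python
import re


def remove_hard_returns(text: str) -> str:
    # Delete every run of carriage-return / line-feed characters.
    # Assumption: the replacement string is empty, as in a String Replacer
    # configured with the pattern [\r\n]+ and nothing as replacement.
    return re.sub(r"[\r\n]+", "", text)


sample = "address social inter-\nests and groups as well."
print(remove_hard_returns(sample))
# prints: address social inter-ests and groups as well.
```

The line break is gone, but the hyphen that split the word is still there.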

 

However, I don’t know how to remove the dashes that sit right before a hard return. I would appreciate help with that.

For instance, in the sentence below I would like to remove, together with the ‘hard return’, the dash that splits the word ‘inter- ests’:

address social inter-

ests and groups as well.

So, the output after removing the ‘hard return’ and the ‘dash’ would be:

address social interests and groups as well.

Is it possible to do this in KNIME? If so, how?

Many thanks in advance,

Cadu

Hi Cadu,

To combine two word parts that have been split by hyphenation, you need three steps:

1.) Remove the new lines and hard returns, so that you end up with e.g. ‘inter-ests’.

2.) Use the Wildcard Tagger node with a regex like "\s+[\w]+-[\w]+\s+" as the only dictionary entry, to combine the two word parts, which are otherwise treated as two terms, into one term.

3.) Remove the dashes using the Replacer node.
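Outside of KNIME, the same end result can be sketched with plain regex substitution. This is not the Wildcard Tagger approach from the steps above but a different technique: it drops a hyphen only when it is directly followed by a line break, then turns the remaining line breaks into spaces. The function name `dehyphenate` and both patterns are illustrative assumptions, written in Python; whether the same patterns behave identically in a KNIME String Replacer has not been checked here.

```python
import re


def dehyphenate(text: str) -> str:
    # First pass: a hyphen between two word characters that is directly
    # followed by a line break is taken to be a split word, so the hyphen
    # and the break (plus any spaces or tabs around it) are dropped together.
    joined = re.sub(r"(?<=\w)-[ \t]*[\r\n]+[ \t]*(?=\w)", "", text)
    # Second pass: every remaining run of line breaks becomes one space,
    # so words from neighbouring lines do not get glued together.
    return re.sub(r"[ \t]*[\r\n]+[ \t]*", " ", joined)


sample = "address social inter-\nests and groups as well."
print(dehyphenate(sample))
# prints: address social interests and groups as well.
```

One caveat: a genuinely hyphenated compound that happens to break at the end of a line (e.g. "well-" / "known") would lose its hyphen too, so the result may need a manual check.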

 

Cheers, Kilian
