Remove Hyphenation and Sentence Split

sven-abx · June 21, 2021, 1:36pm

Hello KNIME-Community,

I imported a PDF file with the Tika Parser node. Now I want to remove the hyphenation, because the lines need to be stored in a database.
Now:
Row100 This line was im-
Row101 ported.
Goal:
Row100 This line was imported.

Edit:
I also found something like this:
Now:
Row200 This sentence is also in the
Row201 document.
Goal:
Row This sentence is also in the document.

I also tried the PDF Parser node and the Sentence Extractor node, but the result looks the same.

Any suggestions?

BR,
Sven

takbb · June 22, 2021, 10:06am

Hi @sven-abx, would you be able to upload a reasonable sized sample of your data, or is it sensitive?

Also is there a maximum line length that you can store in your database?

On the face of it, it looks like hyphenation at the end of lines needs to be handled, but lines also need to be joined together, possibly using “.” as the actual line terminator… Or do you have some lines that contain multiple sentences?
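
As a rough sketch of that idea in plain Python (not a KNIME node configuration): the function name `merge_rows` and the sample rows are made up for illustration, and it assumes one extracted line per row and that every sentence ends in ".", "!" or "?".

```python
def merge_rows(rows, terminators=(".", "!", "?")):
    """Join extracted text rows into sentences (illustrative helper).

    A buffer ending in "-" is glued to the next row without a space
    and the hyphen is dropped; otherwise rows are joined with one space.
    A sentence is emitted as soon as the buffer ends with a terminator.
    """
    sentences = []
    buffer = ""
    for row in rows:
        text = row.strip()
        if not text:
            continue  # skip empty rows
        if buffer.endswith("-"):
            buffer = buffer[:-1] + text   # "im-" + "ported." -> "imported."
        elif buffer:
            buffer = buffer + " " + text  # ordinary line wrap
        else:
            buffer = text
        if buffer.endswith(terminators):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)  # keep any unterminated remainder
    return sentences


rows = [
    "This line was im-",
    "ported.",
    "This sentence is also in the",
    "document.",
]
print(merge_rows(rows))
# ['This line was imported.', 'This sentence is also in the document.']
```

This will get some cases wrong: a genuine compound that happens to break at its hyphen loses the hyphen, and a row ending in an abbreviation such as "e.g." is treated as the end of a sentence.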

sven-abx · June 24, 2021, 11:11am

Hello @takbb Brian,

thanks for your reply. There is no restriction in length.
With a RegEx I could search for a “-”, but I cannot find a pattern that uniquely identifies the affected lines/rows.
It looks like cleaning the data by hand is much easier.

Thanks and BR,
Sven

takbb · June 24, 2021, 11:22am

Hi Sven, indeed, for a “one-off” with not too much data, a manual fix is sometimes more productive than a very complex and possibly still imperfect solution. If it becomes too much, come back here and we’ll have another think…
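
For anyone landing here later, one such imperfect heuristic can be sketched in a few lines of Python. It treats a "-" as a line-break hyphen only if it is the last character of the row, directly follows a letter, and the next row starts with a lowercase letter. The name `is_line_break_hyphen` is hypothetical, and the lowercase rule is an assumption that will not hold for every document.

```python
import re

# A letter immediately followed by "-" at the very end of the row.
# [^\W\d_] matches any Unicode letter (word character that is not a digit or "_").
TRAILING_HYPHEN = re.compile(r"[^\W\d_]-$")


def is_line_break_hyphen(row, next_row):
    """Guess whether `row` ends in a hyphen introduced by a line break."""
    continuation = next_row.lstrip()
    ends_with_hyphen = bool(TRAILING_HYPHEN.search(row.rstrip()))
    # An empty continuation yields "" here, and "".islower() is False.
    starts_lowercase = continuation[:1].islower()
    return ends_with_hyphen and starts_lowercase


print(is_line_break_hyphen("This line was im-", "ported."))               # True
print(is_line_break_hyphen("I imported a PDF-File", "with a parser."))    # False
print(is_line_break_hyphen("KNIME-", "Community"))                        # False
```

Hyphens inside a row, such as the one in "PDF-File", are never touched because the pattern is anchored to the end of the row. Rows flagged this way would still need a manual check where the hyphen belongs to a real compound word.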
