I imported a PDF file with the Tika-Parser Node. Now I want to remove the hyphenation, because the lines need to be stored in a database.
Now:
Row100 This line was im-
Row101 ported.
Goal:
Row100 This line was imported.
Edit:
I also found something like this:
Now:
Row200 This sentence is also in the
Row201 document.
Goal:
Row This sentence is also in the document
I also tried the PDF-Parser Node and the Sentence Extractor, but the result looks the same.
Hi @sven-abx, would you be able to upload a reasonably sized sample of your data, or is it sensitive?
Also, is there a maximum line length that you can store in your database?
On the face of it, it looks like hyphenation at the end of lines needs to be handled, but then lines also need to be joined together, possibly using “.” as the actual line terminator… Or do you have some lines that contain multiple sentences?
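To make that idea concrete, here is a minimal Python sketch of the two steps (drop a trailing hyphen and glue the word back together, then keep joining rows until one ends a sentence). It is only an illustration under assumptions that the real data may well break: the function name `merge_rows` is made up, a row ending in a letter plus “-” is assumed to be a hyphenated word break, and “.”, “!” or “?” at the end of a row is assumed to end a sentence.

```python
import re

# Assumption: a letter directly followed by "-" at the end of a row marks a
# word that was split across two rows. [^\W\d_] matches any Unicode letter.
HYPHEN_BREAK = re.compile(r"[^\W\d_]-$")

# Assumption: a row ending in one of these characters ends a sentence.
SENTENCE_END = (".", "!", "?")


def merge_rows(rows):
    """Join extracted rows into one string per sentence (hypothetical helper)."""
    merged = []
    buffer = ""
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip empty rows
        if HYPHEN_BREAK.search(buffer):
            # Previous row ended mid-word: drop the hyphen, join without a space.
            buffer = buffer[:-1] + row
        elif buffer:
            # Previous row ended between words: join with a single space.
            buffer = buffer + " " + row
        else:
            buffer = row
        if buffer.endswith(SENTENCE_END):
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # keep a trailing, unterminated fragment
    return merged


rows = [
    "This line was im-",
    "ported.",
    "This sentence is also in the",
    "document.",
]
for line in merge_rows(rows):
    print(line)
```

This will get some cases wrong: a genuine compound such as “well-known” that happens to break at the hyphen would be joined as “wellknown”, an abbreviation like “e.g.” at the end of a row would be treated as a sentence end, and rows that contain several sentences are not split. Whether that is acceptable depends on the actual data.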
Thanks for your reply. There is no restriction on length.
With a RegEx I could search for a “-”, but there is no unique pattern that holds for every line/row.
It looks like cleaning the data by hand is much easier.
Hi Sven, indeed, if it’s a “one-off” and there isn’t too much data, a manual solution is sometimes more productive than a very complex and possibly still-imperfect one. But if it becomes too much, come back here and we’ll have another think…