Suggestion:
Would it be possible to add an option to get the output as HTML?
Since the node is based on PDFBox (the pdfbox-app already includes that option: ExtractText -html), this could help to reconstruct “paragraphs”.
Tika and PDF Parser use the same library.
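For reference, here is a rough sketch of how the pdfbox-app option could be called from a script. The `ExtractText -html` form is the one mentioned above (PDFBox 2.x style); the jar file name and the two helper functions are only placeholders for illustration, and newer PDFBox releases may use a different command syntax.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, html_path):
    """Build the pdfbox-app command line for HTML text extraction.

    Assumes the `ExtractText -html <input> <output>` form (PDFBox 2.x style).
    """
    return [
        "java", "-jar", str(jar_path),
        "ExtractText", "-html",
        str(pdf_path), str(html_path),
    ]


def extract_html(jar_path, pdf_path, html_path):
    """Run pdfbox-app (needs Java and the jar) and return the HTML as text."""
    subprocess.run(build_extract_command(jar_path, pdf_path, html_path), check=True)
    return Path(html_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_extract_command("pdfbox-app.jar", "input.pdf", "output.html")))
```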
Regex-processable (HTML output):
<p><b>United Nations
</b></p>
<p><b>Report of the Special
Committee on the Charter of
the United Nations and on the
Strengthening of the Role of
the Organization
</b>
</p>
Not regex-processable (plain text output):
United Nations
Report of the Special
Committee on the Charter of
the United Nations and on the
Strengthening of the Role of
the Organization
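To illustrate the difference, this is a minimal sketch of what the tagged version allows. The function name `paragraphs_from_html` is made up for this example, and a regex like this only works on simple, flat markup such as the sample above; it is not a general HTML parser.

```python
import re

# One match per <p>...</p> block; DOTALL lets "." span the line breaks inside.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Any remaining inline tag such as <b>, </b>, <i>, </i>.
TAG = re.compile(r"<[^>]+>")


def paragraphs_from_html(html):
    """Return one cleaned string per <p>...</p> block."""
    result = []
    for body in PARAGRAPH.findall(html):
        text = TAG.sub("", body)        # drop inline tags
        text = " ".join(text.split())   # collapse line breaks and extra spaces
        if text:
            result.append(text)
    return result


if __name__ == "__main__":
    sample = (
        "<p><b>United Nations\n</b></p>\n"
        "<p><b>Report of the Special\nCommittee on the Charter of\n"
        "the United Nations and on the\nStrengthening of the Role of\n"
        "the Organization\n</b>\n</p>\n"
    )
    for paragraph in paragraphs_from_html(sample):
        print(paragraph)
```

With the plain text version there is no equivalent marker to match on, so the title and the report name cannot be separated reliably.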
Recreating paragraphs is complicated without those guides, and almost impossible without the <p></p> tags.
Example from a scanned document (UN Charter):
<p>CHAPTER I
PURPOSES AND PRINCIPLES
</p>
<p><i>Article 1
</i></p>
<p>The^Purposes of the United Nations are:
1. To maintain international peace and se-
</p>
<p>curity, and to that end: to take effective collec-
tive measures for the prevention and removal of
threats to the peace, and for the suppression of
acts of aggression or other breaches of the peace,
and to bring about by peaceful means, and in con-
formity with the principles of justice and inter-
national law, adjustment or settlement of inter-
national disputes or situations which might lead
to a breach of the peace;
</p>
<p>2. To develop friendly relations among nations
based on respect for the principle of equal rights
and self-determination of peoples, and to take
other appropriate measures to strengthen univer-
sal peace;
</p>
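With the tags in place, the paragraphs of this scanned example can be rebuilt with a few regex steps. The sketch below is only an illustration (the name `rebuild_paragraphs` is made up): it assumes that a block ending in a hyphen was cut mid-word, like "se-" / "curity" above, and that a hyphen at the end of a line always marks a split word. Both assumptions will occasionally be wrong for words that really contain a hyphen.

```python
import re

PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
# A word split across a line break, e.g. "collec-\ntive".
HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*(\w)")


def rebuild_paragraphs(html):
    """Rebuild paragraphs from <p> blocks, rejoining hyphenated words."""
    merged = []
    for body in PARAGRAPH.findall(html):
        text = TAG.sub("", body).strip()
        if not text:
            continue
        if merged and merged[-1].endswith("-"):
            # Previous block stopped mid-word: glue the two halves together.
            merged[-1] = merged[-1][:-1] + text
        else:
            merged.append(text)

    cleaned = []
    for text in merged:
        text = HYPHEN_BREAK.sub(r"\1\2", text)   # "collec-\ntive" becomes "collective"
        cleaned.append(" ".join(text.split()))   # collapse remaining line breaks
    return cleaned


if __name__ == "__main__":
    sample = (
        "<p>1. To maintain international peace and se-\n</p>\n"
        "<p>curity, and to that end: to take effective collec-\n"
        "tive measures\n</p>\n"
        "<p>2. To develop friendly relations among nations\n</p>\n"
    )
    for paragraph in rebuild_paragraphs(sample):
        print(paragraph)
```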
If the files are produced outside KNIME using the pdfbox-app, it is then necessary to upload the files one per row (the Vernalis extension has ONE node, but you need to install ALL the others).