Would it be possible to add an option to get the output as HTML?
Since the node is based on PDFBox (the pdfbox-app already includes this option: ExtractText -html), it would be beneficial for reconstructing “paragraphs”.
Tika and PDF Parser use the same library.
<p><b>United Nations </b></p> <p><b>Report of the Special Committee on the Charter of the United Nations and on the Strengthening of the Role of the Organization </b> </p>
Plain-text output, NOT processable with regex:
United Nations Report of the Special Committee on the Charter of the United Nations and on the Strengthening of the Role of the Organization
Recreating paragraphs is complicated without those guides.
It is almost impossible without the <p></p> tags, for example in this output from a scanned document (UN Charter):
<p>CHAPTER I PURPOSES AND PRINCIPLES </p> <p><i>Article 1 </i></p> <p>The^Purposes of the United Nations are: 1. To maintain international peace and se- </p> <p>curity, and to that end: to take effective collec- tive measures for the prevention and removal of threats to the peace, and for the suppression of acts of aggression or other breaches of the peace, and to bring about by peaceful means, and in con- formity with the principles of justice and inter- national law, adjustment or settlement of inter- national disputes or situations which might lead to a breach of the peace; </p> <p>2. To develop friendly relations among nations based on respect for the principle of equal rights and self-determination of peoples, and to take other appropriate measures to strengthen univer- sal peace; </p>
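To illustrate why the <p></p> guides make the output regex-processable, here is a minimal Python sketch. It assumes the HTML looks like the samples above (simple, non-nested <p> blocks with inline tags such as <b> and <i>). The function name and the patterns are hypothetical and only for illustration; they are not part of PDFBox or of the KNIME node.

```python
import re
from html import unescape

# One <p>...</p> block per paragraph (assumes paragraphs are not nested).
P_BLOCK = re.compile(r"<p>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Any remaining inline tag such as <b>, </b>, <i>, </i>.
INLINE_TAG = re.compile(r"</?[a-z][^>]*>", re.IGNORECASE)
# A letter, a hyphen, whitespace, then a lowercase letter: "collec- tive".
HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-\s+(?=[a-z])")


def extract_paragraphs(html_text):
    """Return the text of each <p> block as one cleaned-up string."""
    paragraphs = []
    for raw in P_BLOCK.findall(html_text):
        text = unescape(INLINE_TAG.sub("", raw))  # drop inline tags, decode entities
        text = HYPHEN_BREAK.sub("", text)         # rejoin words split at line ends
        text = " ".join(text.split())             # collapse runs of whitespace
        if text:
            paragraphs.append(text)
    return paragraphs


sample = (
    "<p><b>United Nations </b></p> "
    "<p><i>Article 1 </i></p> "
    "<p>to take effective collec- tive measures for the prevention "
    "and removal of threats to the peace </p>"
)

for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

Two limitations of this sketch: a word split across two <p> blocks (like "se- </p> <p>curity" above) is not rejoined, and a genuine compound that happens to break at a line end (e.g. "self- determination") would lose its hyphen.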
If the files are produced outside KNIME using the pdfbox-app, it is then necessary to upload the files one per row (the Vernalis extension has ONE node for this, but you need to install ALL the others).