PDF Parser - add option -html (Suggestion)

Suggestion:

Would it be possible to add an option to get the output as HTML?

Since the node is based on PDFBox (the pdfbox-app already includes this option: ExtractText -html), it could be beneficial for reconstructing “paragraphs”.

Tika and PDF Parser use the same library.

Regex processable (HTML output)

<p><b>United Nations 
</b></p>
<p><b>Report of the Special 
Committee on the Charter of 
the United Nations and on the 
Strengthening of the Role of 
the Organization 
</b> 
</p>

NOT regex processable (plain-text output)

  
United Nations 
Report of the Special 
Committee on the Charter of 
the United Nations and on the 
Strengthening of the Role of 
the Organization 
 

Recreating paragraphs is complicated without those guides.

It is almost impossible without the <p></p> tags for a scanned document such as the UN Charter:

<p>CHAPTER I
PURPOSES AND PRINCIPLES
</p>
<p><i>Article 1
</i></p>
<p>The^Purposes of the United Nations are:
1. To maintain international peace and se-
</p>
<p>curity, and to that end: to take effective collec-
tive measures for the prevention and removal of
threats to the peace, and for the suppression of
acts of aggression or other breaches of the peace,
and to bring about by peaceful means, and in con-
formity with the principles of justice and inter-
national law, adjustment or settlement of inter-
national disputes or situations which might lead
to a breach of the peace;
</p>
<p>2. To develop friendly relations among nations
based on respect for the principle of equal rights
and self-determination of peoples, and to take
other appropriate measures to strengthen univer-
sal peace;
</p>
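As a rough illustration of why the <p></p> guides make the output regex processable, here is a minimal Python sketch that rebuilds paragraphs from markup like the samples above. The function name `paragraphs_from_html` and the merging rule are my own assumptions, not part of PDFBox or KNIME. It joins words hyphenated across line breaks, and it merges a paragraph into the previous one when that one ends with a hyphen (as with "se-" / "curity" above).

```python
import html
import re

# Each <p>...</p> block is treated as one paragraph candidate.
P_BLOCK = re.compile(r"<p>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Any remaining inline tag such as <b> or <i>.
TAG = re.compile(r"<[^>]+>")
# A word split by a hyphen at the end of a line.
HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*(\w)")


def paragraphs_from_html(markup: str) -> list[str]:
    """Return cleaned paragraphs from ExtractText -html style markup (hypothetical helper)."""
    paragraphs: list[str] = []
    for block in P_BLOCK.findall(markup):
        text = html.unescape(TAG.sub("", block))
        text = HYPHEN_BREAK.sub(r"\1\2", text)  # "univer-\nsal" -> "universal"
        text = " ".join(text.split())            # collapse line breaks and spaces
        if not text:
            continue
        if paragraphs and paragraphs[-1].endswith("-"):
            # Previous block stopped mid-word: glue this block onto it.
            paragraphs[-1] = paragraphs[-1][:-1] + text
        else:
            paragraphs.append(text)
    return paragraphs
```

Note that this naive hyphen rule also removes genuine hyphens that happen to fall at a line end (for example "self-" followed by "determination"), so a dictionary check would be needed for cleaner results.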

If the files are produced outside KNIME using the pdfbox-app, then it is necessary to load the files one per row (the Vernalis extension has ONE node for this, but you need to install ALL the others).

Hi @ricknime, I’m curious whether you have found a way to parse the particular file you referred to, because it really has some inconsistencies in its page layout.

Using the pdfbox-app directly:

java -jar pdfbox-app-2.0.26.jar ExtractText -html D:\Projects\UN-TextMining\Documents\PDF\en\uncharter-en.pdf output\UNtest.html

Equivalent to the Tika and PDF Parser nodes (plain text, without -html):

java -jar pdfbox-app-2.0.26.jar ExtractText D:\Projects\UN-TextMining\Documents\PDF\en\uncharter-en.pdf output\UNtest.html
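
For converting a whole folder outside KNIME, the commands above could be wrapped in a small script. This is only a sketch: the helper names `build_extract_command` and `extract_folder` are hypothetical, it assumes `java` is on the PATH, and it uses only the `ExtractText` arguments shown above.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar, pdf, out, as_html=True):
    """Build the pdfbox-app ExtractText command line as a list of arguments."""
    cmd = ["java", "-jar", str(jar), "ExtractText"]
    if as_html:
        cmd.append("-html")  # same flag as in the command above
    cmd += [str(pdf), str(out)]
    return cmd


def extract_folder(jar, pdf_dir, out_dir, as_html=True):
    """Run ExtractText once per PDF in pdf_dir, writing results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".html" if as_html else ".txt"
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + suffix)
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(build_extract_command(jar, pdf, target, as_html), check=True)
```

The resulting files could then be read back into a workflow, one file per row.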