PDF Parser - add option -html (Suggestion)

ricknime · May 19, 2022, 9:30am

Suggestion:

Would it be possible to add an option to get the output as HTML?

Since the node is based on PDFBox (the pdfbox-app already includes this option: ExtractText -html), HTML output could help to reconstruct “paragraphs”.

Tika and PDF Parser use the same library.

Regex-processable (HTML output):

<p><b>United Nations 
</b></p>
<p><b>Report of the Special 
Committee on the Charter of 
the United Nations and on the 
Strengthening of the Role of 
the Organization 
</b> 
</p>

Not regex-processable (plain-text output):

  
United Nations 
Report of the Special 
Committee on the Charter of 
the United Nations and on the 
Strengthening of the Role of 
the Organization

Recreating paragraphs is complicated without those guides.
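To illustrate why the tags make the text regex-processable, here is a minimal Python sketch that pulls paragraphs out of HTML shaped like the sample above. It assumes the output only uses simple, non-nested <p> blocks with inline tags such as <b> and <i>; the function name extract_paragraphs is my own and not part of PDFBox or KNIME.

```python
import html
import re

# Matches one <p>...</p> block; DOTALL lets "." span the line breaks inside it.
PARAGRAPH_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Matches any remaining inline tag such as <b>, </b> or <i>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_paragraphs(html_text: str) -> list[str]:
    """Return one clean string per <p> block (hypothetical helper)."""
    paragraphs = []
    for block in PARAGRAPH_RE.findall(html_text):
        text = TAG_RE.sub("", block)    # drop inline markup
        text = html.unescape(text)      # decode entities such as &amp;
        text = " ".join(text.split())   # collapse line breaks and repeated spaces
        if text:
            paragraphs.append(text)
    return paragraphs


sample = """<p><b>United Nations 
</b></p>
<p><b>Report of the Special 
Committee on the Charter of 
the United Nations and on the 
Strengthening of the Role of 
the Organization 
</b> 
</p>"""

for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

With the plain-text output there is no such delimiter, so the title and the report name cannot be separated reliably by a pattern.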

It is almost impossible without the <p></p> tags. Example from a scanned document, the UN Charter:

<p>CHAPTER I
PURPOSES AND PRINCIPLES
</p>
<p><i>Article 1
</i></p>
<p>The^Purposes of the United Nations are:
1. To maintain international peace and se-
</p>
<p>curity, and to that end: to take effective collec-
tive measures for the prevention and removal of
threats to the peace, and for the suppression of
acts of aggression or other breaches of the peace,
and to bring about by peaceful means, and in con-
formity with the principles of justice and inter-
national law, adjustment or settlement of inter-
national disputes or situations which might lead
to a breach of the peace;
</p>
<p>2. To develop friendly relations among nations
based on respect for the principle of equal rights
and self-determination of peoples, and to take
other appropriate measures to strengthen univer-
sal peace;
</p>
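Even with the <p> guides, the Charter sample shows two leftovers: words hyphenated at line ends ("collec-" / "tive") and one paragraph cut mid-word across two blocks ("se-" / "curity"). A rough Python sketch of how these could be repaired once the paragraph texts have been extracted is below. The helper names are hypothetical, and the rule is naive: a genuine hyphen that happens to sit at a line end would be joined as well.

```python
import re

# A hyphen at the end of a line, followed by a lowercase continuation on the next line.
LINE_HYPHEN_RE = re.compile(r"(\w)-\s*\n\s*([a-z])")


def dehyphenate(text: str) -> str:
    """Join words split across line ends, then flatten the remaining line breaks."""
    joined = LINE_HYPHEN_RE.sub(r"\1\2", text)
    return " ".join(joined.split())


def merge_split_paragraphs(paragraphs: list[str]) -> list[str]:
    """Merge a paragraph that ends in a hyphen with the paragraph after it."""
    merged: list[str] = []
    carry = ""
    for para in paragraphs:
        para = carry + para.strip()
        if para.endswith("-"):
            carry = para[:-1]   # keep the word fragment, drop the hyphen
            continue
        carry = ""
        merged.append(dehyphenate(para))
    if carry:
        # Last paragraph ended in a hyphen with nothing to join; restore it.
        merged.append(dehyphenate(carry + "-"))
    return merged


blocks = [
    "1. To maintain international peace and se-\n",
    "curity, and to that end: to take effective collec-\ntive measures for the prevention",
]
print(merge_split_paragraphs(blocks))
```

Hyphens inside a line, as in "self-determination", are left alone because the pattern only fires at a line break.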

If the files are produced outside KNIME using the pdfbox-app, then it is necessary to load the files into the workflow one per row (the Vernalis extension has ONE node for this, but you need to install ALL the others).

badger101 · May 19, 2022, 12:38pm

Hi @ricknime, I’m curious whether you have found a way to parse the particular file you referred to, because its page layout really has some inconsistencies.

ricknime · May 19, 2022, 3:04pm

Directly, using the pdfbox-app:

java -jar pdfbox-app-2.0.26.jar ExtractText -html D:\Projects\UN-TextMining\Documents\PDF\en\uncharter-en.pdf output\UNtest.html

Equivalent to the Tika and PDF Parser nodes (plain-text output):

java -jar pdfbox-app-2.0.26.jar ExtractText D:\Projects\UN-TextMining\Documents\PDF\en\uncharter-en.pdf output\UNtest.html
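If many PDFs need converting outside KNIME, the same command can be scripted. The following is only a sketch in Python: it builds the ExtractText call shown above for every PDF in a folder. The function names, the folder layout and the output naming (.html or .txt next to the PDF's base name) are my own assumptions, and Java plus the pdfbox-app jar must be available.

```python
import subprocess
from pathlib import Path


def extract_text_command(jar: str, pdf: Path, out_dir: Path, as_html: bool = True) -> list[str]:
    """Build the pdfbox-app ExtractText command for one PDF (hypothetical helper)."""
    suffix = ".html" if as_html else ".txt"
    out_file = out_dir / (pdf.stem + suffix)
    cmd = ["java", "-jar", jar, "ExtractText"]
    if as_html:
        cmd.append("-html")   # same switch as in the command above
    cmd += [str(pdf), str(out_file)]
    return cmd


def convert_folder(jar: str, pdf_dir: Path, out_dir: Path, as_html: bool = True) -> None:
    """Run ExtractText once per PDF found in pdf_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if Java exits with an error.
        subprocess.run(extract_text_command(jar, pdf, out_dir, as_html), check=True)


if __name__ == "__main__":
    print(extract_text_command("pdfbox-app-2.0.26.jar", Path("uncharter-en.pdf"), Path("output")))
```

Calling convert_folder("pdfbox-app-2.0.26.jar", Path("PDF/en"), Path("output")) would then write one HTML file per PDF, which can be read back into the workflow.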
