Convert a PDF file to HTML

Stephane73 · October 16, 2020, 1:17pm

Hi everybody,

I would like to convert PDF files to HTML files, but I don’t know how to do it automatically.

Can you tell me whether this is possible and, if so, how to do it?

Thank you very much for your help.

wkhan · October 16, 2020, 2:16pm

I don’t see an example that does this on the Hub. You could use something like this, or the PDF Parser node to extract PDF information, and then use one of the HTML writer nodes:

Table to HTML Labs
Table to HTML String

Regards,
Wali

Stephane73 · October 16, 2020, 2:56pm

Thank you for your answer.

I’ve already tried the PDF Parser, but the result is not satisfactory.

So I would like to convert the document directly to an HTML document and retrieve the information from there.

ScottF · October 16, 2020, 3:29pm

You can also try the Tika Parser for ingesting PDF files (as well as lots of other types of files)… but conversion from PDF to HTML will probably be a bit messy regardless of the strategy you use.

Stephane73 · October 17, 2020, 11:07pm

Hi @ScottF

I’ve already used the Tika Parser, but for some files I have recognition issues.

For example, some words are interspersed with spaces, like F R E E D O M, and some words contain unrecognized characters (M�dical �mergencies).

So I tried converting the PDF to different formats to resolve this, and finally found that the HTML format doesn’t have this issue.

Instead of using the Tika Parser node to read a PDF file, I have in mind to convert the PDF to an HTML document and extract the text from the HTML.

This is why I need a method to automatically convert PDF files to HTML.
But I don’t know if we can do it in KNIME.

Thank you for your help.
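
The second half of this plan, pulling the text back out of an already converted HTML file, can be sketched with Python's standard library alone. This is only an illustration: the names `TextExtractor` and `html_to_text` are made up here, and real converter output may need extra handling (for example, text that is split into many small positioned `<span>` elements).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Reading the HTML file with an explicit encoding (for example `open(path, encoding="utf-8")`) before passing its content to `html_to_text` is one way to keep accented characters intact.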

ScottF · October 19, 2020, 3:30pm

I don’t think there is a way to do this natively with KNIME nodes aside from what Wali posted above, and a quick Google search indicates there isn’t a package in R to do this either.

What you might try instead is to call a Python package like pdfminer, which you could run from a Python Snippet node. Alternately, you might install a command line tool like Pandoc, which you could then call directly with KNIME using the External Tool node.
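
As a rough sketch of the pdfminer route, assuming the maintained fork pdfminer.six is installed (`pip install pdfminer.six`): its `extract_text_to_fp` function can emit HTML when `output_type="html"` is passed. The helper names `html_path_for` and `pdf_to_html` are made up for illustration, and the resulting HTML will probably still need cleanup.

```python
from pathlib import Path


def html_path_for(pdf_path):
    """Return the output path: same folder and name, with an .html suffix."""
    return Path(pdf_path).with_suffix(".html")


def pdf_to_html(pdf_path):
    """Convert one PDF to HTML with pdfminer.six and return the output path."""
    # Imported here so this module still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out_path = html_path_for(pdf_path)
    # Both files are opened in binary mode: with a codec set, pdfminer
    # writes encoded bytes to the output stream.
    with open(pdf_path, "rb") as src, open(out_path, "wb") as dst:
        extract_text_to_fp(
            src,
            dst,
            output_type="html",
            codec="utf-8",
            laparams=LAParams(),  # turn on layout analysis for the HTML output
        )
    return out_path

# Example call (the file name is a placeholder):
# pdf_to_html("report.pdf")
```

The same function could be called from a Python node in a workflow, with the file path supplied by the workflow instead of being hard-coded.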

Stephane73 · October 20, 2020, 10:48am

Thank you for your suggestion.

Unfortunately, I don’t have any experience coding in Python, but I’ll do my best to achieve this.

Daniel_Weikert · October 20, 2020, 6:00pm

Do you have an example for testing?

Stephane73 · October 31, 2020, 8:27pm

So I have converted a PDF to an HTML file, and I’m then using the HTML Parser to read the file.

But unfortunately the result is not good.

I used a Python library to convert the PDF file to the HTML file. You can find the files in the attachments.
Nestlé.txt (36.6 KB) Nestlé.xml (36.9 KB)

PS: I can’t upload PDF and HTML files in this post, so I changed the extensions to txt and xml.

mlauber71 · November 2, 2020, 4:53am

Do you also have the workflow or code that did this?

Stephane73 · November 2, 2020, 1:45pm

I did the conversion separately in Python, without using KNIME.

I used the pd2html program.

mlauber71 · November 2, 2020, 1:47pm

Maybe it could be an idea to integrate the Python code into a KNIME Python node?
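
One possible shape for that integration, sketched in plain Python: a small batch wrapper that applies a conversion function to every file path and records failures instead of stopping. The name `convert_all` and the row layout are assumptions for illustration. Inside a KNIME Python node, the paths would come from the input table and the rows would be written to the output table; that wiring depends on the KNIME version and is not shown here.

```python
def convert_all(pdf_paths, convert):
    """Run `convert` on each path and collect one result row per file.

    A failure on one file is recorded rather than raised, so a single
    bad PDF does not abort the whole batch.
    """
    rows = []
    for path in pdf_paths:
        try:
            rows.append({"pdf": path, "html": str(convert(path)), "error": ""})
        except Exception as exc:  # keep going; report the problem in the row
            rows.append({"pdf": path, "html": "", "error": str(exc)})
    return rows
```

Keeping the converter as a parameter means the same wrapper works whichever PDF-to-HTML tool ends up doing the conversion.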
