Convert a PDF file to HTML

Hi everybody,

I would like to convert PDF files to HTML files, but I don’t know how to do it automatically.

Can you explain whether that is possible, and how to do it?

Thank you very much for your help.

Hi @Stephane73,

I don’t see an example that does this on the Hub. You could use something like this, or the PDF Parser node to extract the PDF information, and then use one of the HTML writer nodes:

Table to HTML Labs
Table to HTML String

Regards,
Wali

Thank you for your answer.

I’ve already tried the PDF Parser, but the result is not satisfactory.

So I would like to convert the document directly to an HTML document and retrieve the information from there.

You can also try the Tika Parser for ingesting PDF files (as well as lots of other types of files)… but conversion from PDF to HTML will probably be a bit messy regardless of the strategy you use.

Hi @ScottF

I’ve already used the Tika Parser, but for some files I have recognition issues.

For example, some words are interspersed with spaces, like F R E E D O M, and others contain unrecognized characters (M�dical �mergencies).

So I tried converting the PDF to different formats to resolve this issue, and finally found that the HTML format doesn’t have it.

Instead of using the Tika Parser node to read the PDF file, I have in mind to convert the PDF to an HTML document and extract the text from the HTML.

This is why I need a method to convert PDF files to HTML automatically.
But I don’t know whether we can do it in KNIME.

Thank you for your help.
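
For the second half of that plan, extracting the text once an HTML file exists, here is a minimal sketch in plain Python that uses only the standard library’s html.parser module. The names TextExtractor and html_to_text are made up for illustration and are not part of KNIME or of any converter. How clean the result is will depend on the HTML that the converter produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        """Return the collected text, one chunk per line."""
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a file on disk, read it first (for example with open(path, encoding="utf-8"), assuming the converter wrote UTF-8) and pass the contents to html_to_text.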

I don’t think there is a way to do this natively with KNIME nodes aside from what Wali posted above, and a quick Google search indicates there’s not a package in R to do this either.

What you might try instead is to call a Python package like pdfminer, which you could run from a Python Snippet node. Alternately, you might install a command line tool like Pandoc, which you could then call directly with KNIME using the External Tool node.
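
As a starting point for the pdfminer route, here is a minimal sketch. It assumes the maintained fork of the package, pdfminer.six, is installed (pip install pdfminer.six) and uses its extract_text_to_fp function with output_type="html". The helper names html_path_for, convert_pdf_to_html and convert_folder are hypothetical, and the quality of the HTML will vary from one PDF to another.

```python
from pathlib import Path


def html_path_for(pdf_path):
    """Return the output path: same folder and name, with an .html extension."""
    return Path(pdf_path).with_suffix(".html")


def convert_pdf_to_html(pdf_path, html_path=None):
    """Convert one PDF to HTML with pdfminer.six and return the output path."""
    # Imported here so the path helper above works without the package installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    html_path = Path(html_path) if html_path else html_path_for(pdf_path)
    # The HTML converter writes encoded bytes, so both files are opened in binary mode.
    with open(pdf_path, "rb") as pdf_file, open(html_path, "wb") as html_file:
        extract_text_to_fp(
            pdf_file,
            html_file,
            output_type="html",
            codec="utf-8",
            laparams=LAParams(),  # enables layout analysis (grouping text into lines)
        )
    return html_path


def convert_folder(folder):
    """Convert every *.pdf directly inside `folder`; return the written paths."""
    return [convert_pdf_to_html(p) for p in sorted(Path(folder).glob("*.pdf"))]
```

Calling convert_folder on a directory of PDFs would write one .html file next to each PDF, which covers the "automatically" part of the question.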

Thank you for your suggestion.

Unfortunately, I don’t have any experience coding in Python, but I’ll do my best to achieve this task.

Do you have an example for testing?

So I have converted a PDF to an HTML file, and I’m then using the HTML Parser to read the file.

But unfortunately the result is not good.

I used a Python library to convert the PDF file to the HTML file. You can find the files in the attachments.
Nestlé.txt (36.6 KB) Nestlé.xml (36.9 KB)

PS: I can’t upload PDF and HTML files in this post, so I changed the extensions to .txt and .xml.

Do you also have the workflow or code that did this?

I did the conversion separately in Python, without using KNIME.

I used the pd2html program.

Maybe it could be an idea to integrate the Python code into a KNIME Python node?
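
A rough sketch of what that integration might look like in a Python Script node follows. It rests on several assumptions: a recent KNIME version where the node exposes the knime.scripting.io module, an input table with a string column of local PDF paths (the column name "Path" is hypothetical), and pdfminer.six standing in for the converter, since the program used above is not shown in this thread. The function names are made up for illustration.

```python
from pathlib import Path


def convert_all(pdf_paths, convert_one):
    """Run convert_one on each path; return parallel lists of outputs and errors."""
    outputs, errors = [], []
    for pdf_path in pdf_paths:
        try:
            outputs.append(str(convert_one(pdf_path)))
            errors.append("")
        except Exception as exc:  # one bad file should not stop the whole batch
            outputs.append("")
            errors.append(f"{type(exc).__name__}: {exc}")
    return outputs, errors


def pdfminer_convert(pdf_path):
    """Write <name>.html next to the PDF using pdfminer.six (assumed installed)."""
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    html_path = Path(pdf_path).with_suffix(".html")
    with open(pdf_path, "rb") as src, open(html_path, "wb") as dst:
        extract_text_to_fp(src, dst, output_type="html", laparams=LAParams())
    return html_path


try:
    # Only importable inside a KNIME Python Script node.
    import knime.scripting.io as knio
except ImportError:
    knio = None

if knio is not None:
    table = knio.input_tables[0].to_pandas()
    # "Path" is a hypothetical column name holding local PDF file paths.
    table["HTML Path"], table["Error"] = convert_all(table["Path"], pdfminer_convert)
    knio.output_tables[0] = knio.Table.from_pandas(table)
```

The output table would then carry one HTML path (or an error message) per input row, so an HTML Parser node or similar could pick up the files downstream.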