PDF file to image

kolster · November 27, 2019, 12:33pm

Dear forum,

I hope that somebody can give me a hint on how to proceed. I have looked through the various threads but can’t seem to find the nodes that I need. My task is rather simple (I think): I have a bunch of PDF files coming from automatically generated analysis reports. I would like to take the chromatogram itself from each file and keep it as an image, which I can crop etc. and put into a report.
I have tried Tika Parser and PDF Parser, but neither gives me any image to proceed with. The window that I eventually want is the same every time.

Attached is a zip archive with three of the typical pdf files. Hope you can give me a hint.
Best,
Kolster
pdf test.zip (85.8 KB)

Martyna · November 29, 2019, 3:24pm

Hi @kolster

In principle this is possible, yes.
The diagrams in your PDFs might be tricky, as the labels/numbers/title are saved there as text and are not part of the image. Can you play with some parameters while saving or generating the reports?
I tested the Tika Parser with simple publications from PubMed and it worked very well, whereas your PDFs do not seem to contain any images.
Maybe you can save the reports as a doc, then manually convert them to PDF and see if there is a difference?

Best,
Martyna

kolster · December 2, 2019, 11:13am

Hi Martyna,
Thank you very much for your notes and for looking into my challenge here. I think I pretty much reached the same conclusion by also throwing some literature pdfs into the pool. These generated all the expected pictures.
I have the challenge that I have thousands of these reports, so playing around with their generation is not my first option…
I tried converting some of the PDFs to .png by manually uploading them to a conversion site on the web (https://pdf2png.com/). These I could read in using an image reader.
Is there a way of generating an image from the first page of the pdf - like printing the pdf to an image? Then cropping the image down to my desired chromatograms?
Best,
Lars

qqilihq · December 2, 2019, 11:42am

Hi Kolster,

I came up with one potential solution using the Selenium Nodes (yes, I’m biased).

The following workflow will load your PDF file in a web browser and then simply take a screenshot. I tested it with Chrome, Chromium, and Firefox on the Mac, as these all have an integrated PDF viewer. It’s currently just a proof of concept, as I was interested in whether this would be possible at all, and it works.

The workflow’s output is a table which contains the screenshot as a PNG.

From there you could perform some cropping steps within KNIME to get the desired area.
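
If you would rather do the cropping outside KNIME, here is a minimal Python sketch of the idea. Since the window you want is the same every time, a fixed box should be enough. This assumes the third-party Pillow package is installed. The function names and the box values in `CHROMATOGRAM_BOX` are made up for illustration, so you would have to measure the real region on one of your screenshots.

```python
def crop_box(width, height, rel_box):
    """Convert a relative (left, top, right, bottom) box, given as
    fractions of the image size, into pixel coordinates."""
    left, top, right, bottom = rel_box
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def crop_png(src_path, dst_path, rel_box):
    """Crop one screenshot to the given relative box and save the result.

    Assumption: Pillow (third-party) is installed; it is imported here
    so that crop_box() stays usable without it.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        box = crop_box(img.width, img.height, rel_box)
        img.crop(box).save(dst_path)


# Hypothetical region of the chromatogram (placeholder values only):
# 10%..90% of the width and 25%..75% of the height.
CHROMATOGRAM_BOX = (0.10, 0.25, 0.90, 0.75)
```

Working with fractions rather than pixels means the same box still fits if you later take the screenshots at a different window size, as long as the page layout stays the same.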

To get a higher resolution, I’d try to send the key commands for zooming to the browser before taking the screenshot (and probably increase the browser window’s size using the “Window” node). You could of course also try to switch to follow-up pages as necessary and take further screenshots there. But I haven’t tried this myself.
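
For anyone who wants to try the same screenshot idea in a script, here is a rough Python sketch using the open-source `selenium` package. To be clear, this is not the KNIME workflow itself, only my assumption of how the equivalent steps could look. It needs a local Chrome installation, the helper names and the window size are hypothetical, and I have not checked how reliably the built-in PDF viewer renders before the screenshot is taken. It only enlarges the window; it does not send any zoom key commands.

```python
from pathlib import Path


def pdf_file_url(pdf_path):
    """Turn a local PDF path into a file:// URL that a browser can open."""
    return Path(pdf_path).resolve().as_uri()


def screenshot_pdf(pdf_path, png_path, width=1600, height=2200):
    """Open the PDF in Chrome's integrated viewer and save a PNG screenshot.

    Assumptions: the third-party `selenium` package and Chrome are
    installed; width and height are placeholder values.
    """
    from selenium import webdriver  # imported lazily, third-party

    driver = webdriver.Chrome()
    try:
        # A larger window gives a larger screenshot of the page.
        driver.set_window_size(width, height)
        driver.get(pdf_file_url(pdf_path))
        driver.save_screenshot(str(png_path))
    finally:
        driver.quit()
```

Calling `screenshot_pdf("report.pdf", "report.png")` in a loop over your files would then give one PNG per report, which you could crop afterwards.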

You can find the workflow on my NodePit Space:

Best,
Philipp

Disclaimer: I’m the developer of the Selenium Nodes and the Selenium Nodes are a paid plugin to support their development. There’s a free 30-day trial for evaluation.
