Save a webpage as PDF

mir50531 · April 22, 2021, 2:50pm

So I’ve been at this for a long time now, and I’ve decided to come here for help. I’m trying to save a webpage as a PDF (and/or find some other way to extract the text), and I haven’t found a proper solution yet. Here’s the URL:

https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXFLXFW0000AD&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC

Here is a list of the things I’ve done and some outcomes.

Selenium
- Create PDF - using headless chrome/firefox…then binary to PDF…but the file that’s created is
  jumbled symbols/garbage
- Take screenshot - the screenshot is somewhat okay, but the Tess4J OCR doesn’t produce good results
  from it, and I’m guessing it’s because of the scale of the picture?
- Finding javascript code online to save as PDF somewhere - couldn’t make anything work
- Setting chrome/firefox to download PDFs instead of open in browser, but the PDF that shows up
  isn’t the same as what you get when you open the URL and hit save or print to PDF
KNIME
- GET Request/Https retriever…couldn’t get anything working here
- Python scripts…tried finding some scripts to navigate to the URL and save as PDF, but couldn’t get
  it working here either

Anyone have any solutions?

qqilihq · April 22, 2021, 2:55pm

Hi @mir50531,

Create PDF - using headless chrome/firefox…then binary to PDF…but the file that’s created is
jumbled symbols/garbage

Just out of interest: Does this still occur when using **non-**headless?

–Philipp

mir50531 · April 22, 2021, 3:25pm

Good question. It doesn’t work. The selenium nodes require headless to operate the Create PDF node.

qqilihq · April 22, 2021, 3:44pm

The selenium nodes require headless to operate the Create PDF node.

Ha. Indeed.

Just tried this, but when following your link I see a 403 Forbidden. Is this URL still working and/or is this URL publicly accessible?

mir50531 · April 22, 2021, 3:51pm

It’s working for me. Are you in the US? I was working w/ a python developer in Spain and he couldn’t access it.

qqilihq · April 22, 2021, 4:04pm

No, I’m in Germany. So it’s probably location-related.

mir50531 · April 22, 2021, 5:01pm

Thanks for trying. I’m hiring someone to build a python script that I can hopefully integrate.

aworker · April 23, 2021, 12:53pm

Hi @qqilihq

I tried the link using a VPN from France (configured as US) and I could access @mir50531’s PDF example. Just in case this could help.

Best,

Ael

zioludo · April 23, 2021, 2:43pm

Hi all!

Am I missing something?

I was able to access the file via a US VPN from Italy and download it. Then I was able to read it with the PDF Parser node…

What’s the problem automating this inside a workflow?

Ludovico

PS: I also tried to grab the file via the wget command, but I got a server error.

qqilihq · April 23, 2021, 7:03pm

Sounds like the file returned by the URL is already the PDF (which is also suggested by the PDFGenerator in the name). Thus simply saving the response as a file should suffice (no matter whether you do this via Selenium, the HTTP Retriever, etc.).
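
For anyone who wants to try this outside of KNIME, here is a minimal Python sketch of "just save the response", using only the standard library. The function names `looks_like_pdf` and `save_pdf`, the output filename, and the browser-like `User-Agent` header are assumptions made for illustration; nothing in this thread confirms what the server actually requires. It checks that the downloaded bytes start with the `%PDF-` signature, so an HTML error page (such as the 403 seen from outside the US) is not silently saved as a `.pdf`.

```python
import urllib.request
from pathlib import Path

# Every PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def save_pdf(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Download `url` and write the raw bytes to `dest` if they are a PDF."""
    # Assumption: a browser-like User-Agent might help with servers that
    # reject default client headers. This is not confirmed for this site.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()

    if not looks_like_pdf(data):
        raise ValueError("Response is not a PDF (possibly an error page)")

    path = Path(dest)
    # Write the bytes unchanged; decoding them as text would corrupt the file.
    path.write_bytes(data)
    return path


if __name__ == "__main__":
    # URL taken from the first post; it appears to be reachable only from the US.
    url = (
        "https://www.txu.com/Handlers/PDFGenerator.ashx"
        "?comProdId=TCXFLXFW0000AD&lang=en&formType=EnergyFactsLabel"
        "&custClass=3&tdsp=AEPTCC"
    )
    print(save_pdf(url, "energy_facts_label.pdf"))
```

The important detail is that the bytes are written as-is. If the response is treated as page text and then converted "to PDF" again, jumbled symbols like the ones described in the first post would be a plausible result, although that is only a guess about what happened there.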

– Philipp
