Save a webpage as PDF

So I’ve been at this for a long time now, and I’ve decided to come here for help. I’m trying to save a webpage as a PDF (and/or find some other way to extract the text), and I haven’t found a proper solution yet. Here’s the URL:

https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXFLXFW0000AD&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC

Here is a list of the things I’ve done and some outcomes.

  • Selenium

    • Create PDF - using headless chrome/firefox…then binary to PDF…but the file that’s created is
      jumbled symbols/garbage
    • Take screenshot - the screenshot is somewhat okay, but the Tess4J OCR doesn’t produce good results
      from it, and I’m guessing it’s because of the scale of the picture?
    • Finding javascript code online to save as PDF somewhere - couldn’t make anything work
    • Setting chrome/firefox to download PDFs instead of open in browser, but the PDF that shows up
      isn’t the same as what you get when you open the URL and hit save or print to PDF
  • KNIME

    • GET Request/Https retriever…couldn’t get anything working here
    • Python scripts…tried finding some scripts to navigate to the URL and save as PDF, but couldn’t get
      it here either

Anyone have any solutions?

Hi @mir50531,

Create PDF - using headless chrome/firefox…then binary to PDF…but the file that’s created is
jumbled symbols/garbage

Just out of interest: Does this still occur when using **non-**headless?

–Philipp

Good question. It doesn’t work. The selenium nodes require headless to operate the Create PDF node.

The selenium nodes require headless to operate the Create PDF node.

Ha. Indeed.

Just tried this, but when following your link I see a 403 Forbidden. Is this URL still working and/or is this URL publicly accessible?

It’s working for me. Are you in the US? I was working w/ a python developer in Spain and he couldn’t access it.

No, I’m in Germany. So it’s probably location-related.

Thanks for trying. I’m hiring someone to build a python script that I can hopefully integrate.

Hi @qqilihq

I tried the link using a VPN from France (configured as US) and I could access @mir50531’s PDF example. Just in case this could help :wink:

Best,

Ael

Hi all!

Am I missing something?

I was able to access the file via a US VPN from Italy and download it. Then I was able to read it with the PDF Parser node…

What’s the problem automating this inside a workflow?

Ludovico

PS: I also tried to grab the file via the wget command, but I got a server error.

Sounds like the result file returned by the URL is already the PDF (which is also suggested by the PDFGenerator in the name). Thus simply saving the file should suffice (no matter whether you do this via Selenium or the HTTP Retriever, etc.)

– Philipp
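
For anyone who wants to try the "just save the file" route from a Python script, here is a minimal sketch using only the standard library. It is not a tested solution for this particular server: the browser-like `User-Agent` header is an assumption (the thread does not say which headers the site expects, only that plain wget got a server error), the helper names `looks_like_pdf` and `save_pdf` are made up for illustration, and the request will presumably still fail with a 403 from outside the US unless it goes through a US connection.

```python
import urllib.request
from pathlib import Path

# Every PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def save_pdf(url: str, target: Path, timeout: float = 30.0) -> Path:
    """Download `url` and write the raw bytes to `target` if they are a PDF."""
    # Assumption: a browser-like User-Agent; the required headers are unknown.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # urlopen raises urllib.error.HTTPError on responses such as 403 Forbidden.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError("Response does not start with a PDF signature")
    # Write the bytes unchanged; decoding them as text would corrupt the file.
    target.write_bytes(data)
    return target


# Example call (needs network access, so it is left commented out):
# save_pdf("https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXFLXFW0000AD"
#          "&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC",
#          Path("energy_facts_label.pdf"))
```

The signature check is there so that an HTML error page does not get silently saved under a `.pdf` name. The saved file could then be handed to the PDF Parser node mentioned above.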
