I’ve been at this for a long time now, so I’ve decided to come here for help. I’m trying to save a webpage as a PDF (and/or extract the text some other way) and I haven’t found a proper solution yet. Here’s the URL
Here is a list of the things I’ve done and some outcomes.
Selenium
Create PDF - using headless Chrome/Firefox…then binary to PDF…but the file that’s created is
jumbled symbols/garbage
Take screenshot - the screenshot is somewhat okay, but the Tess4J OCR doesn’t produce good results
from it, and I’m guessing that’s because of the scale of the picture?
Finding JavaScript code online to save the page as a PDF somewhere - couldn’t make anything work
Setting Chrome/Firefox to download PDFs instead of opening them in the browser, but the PDF that shows up
isn’t the same as what you get when you open the URL and hit save or print to PDF
KNIME
GET Request/Https retriever…couldn’t get anything working here
Python scripts…tried finding some scripts to navigate to the URL and save as PDF, but couldn’t get
it working here either
Sounds like the file returned by the URL is already the PDF (which is also suggested by the PDFGenerator in the name). The "jumbled symbols" you saw would be consistent with raw PDF bytes being treated as text. Thus simply saving the response unchanged as a binary file should suffice (no matter whether you do this via Selenium, the HTTP Retriever, etc.).
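If you want to try the "just save it" route from Python, here is a minimal sketch using only the standard library. It assumes the URL returns the PDF bytes directly and needs no login or cookies; the function names (`looks_like_pdf`, `save_response_as_pdf`, `download_pdf`), the User-Agent header, and the example URL in the comment are all made up for illustration. The check for `%PDF-` near the start of the data is just a sanity test, so you find out early if the server sent an HTML page instead.

```python
from pathlib import Path
from urllib.request import Request, urlopen

# Every PDF file begins with this marker (some producers put a few bytes before it).
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header appears within the first 1024 bytes."""
    return PDF_MAGIC in data[:1024]


def save_response_as_pdf(data: bytes, target: Path) -> Path:
    """Write the bytes to disk unchanged; refuse data that is clearly not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF (maybe an HTML page?)")
    # write_bytes keeps the data binary; decoding it as text is what corrupts PDFs.
    target.write_bytes(data)
    return target


def download_pdf(url: str, target: str, timeout: float = 30.0) -> Path:
    """Fetch the URL and save the body as a PDF file (assumes no authentication)."""
    # Assumption: a browser-like User-Agent is enough for the server to respond.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        data = response.read()
    return save_response_as_pdf(data, Path(target))


# Hypothetical usage (replace with the real URL):
# download_pdf("https://example.com/PDFGenerator?id=123", "page.pdf")
```

If this raises the ValueError, the server is probably returning something other than the PDF to a plain request (for example a page that needs a session), and the download has to happen inside the browser session instead.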