Read PDF files from Google storage without downloading the files

Hi,

I need to read PDF files from Google storage without downloading them. It works with the download node, but there are too many files, so I need to avoid downloading them.

image

In a Python Script node I use this code to read the PDFs and extract the information. For the location I have tried the value from the Google Cloud Storage File Picker node, and also the value from listing the remote files on GCP, but when I execute the code I always get the error “Can’t find the specified directory”. Does anyone know how I can read PDF files remotely?

Hi @daviddelos - sorry for the delayed response here.

I may be misunderstanding the problem, but I don’t think there’s a way to do this without downloading the files in some way. Even if you were able to read the PDFs via a Python script node, that data would still be downloaded locally for Python to ingest, right?
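To make that point concrete, here is a minimal sketch of reading a PDF into memory instead of onto disk. It assumes the `google-cloud-storage` and `pypdf` packages are installed in the Python environment the node uses; neither is mentioned in this thread, and `read_remote_pdf`, `blob_to_stream`, and `extract_text` are hypothetical helper names. The bytes still travel over the network, they just never become a local file.

```python
import io


def blob_to_stream(blob):
    """Fetch a blob's bytes into RAM and wrap them in a seekable stream.

    `blob` is assumed to be a google-cloud-storage Blob (anything with a
    download_as_bytes() method works). This is still a download, just one
    that never touches the local disk.
    """
    return io.BytesIO(blob.download_as_bytes())


def extract_text(stream):
    """Extract text from a PDF held in a file-like object (assumes pypdf)."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(stream)
    # extract_text() can return an empty value for pages without a text layer
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_remote_pdf(bucket_name, blob_name):
    """Hypothetical helper: read one PDF from a bucket without a local file."""
    from google.cloud import storage  # third-party, imported lazily

    client = storage.Client()  # uses the default credentials of the environment
    blob = client.bucket(bucket_name).blob(blob_name)
    return extract_text(blob_to_stream(blob))
```

With this approach only one PDF is held in memory at a time, which may or may not be acceptable depending on how large the individual files are.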

Maybe a workaround is some sort of loop where you only download and process a few PDFs at a time, then delete them to save space. I’m assuming disk space is your concern here?
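A rough sketch of that loop in Python, under stated assumptions: `process_in_batches` is a hypothetical helper, and the `download` and `extract` callables are placeholders for whatever actually fetches a file and pulls the text out of it. Each batch is written to a temporary directory that is deleted before the next batch starts, so at most `batch_size` files are on disk at any time.

```python
import os
import tempfile


def process_in_batches(names, download, extract, batch_size=10):
    """Download, process, and delete remote files a few at a time.

    names:    iterable of remote object names
    download: callable (name, local_path) that writes the object to local_path
    extract:  callable (local_path) that returns the extracted text
    Returns a dict mapping each remote name to its extracted text.
    """
    names = list(names)
    results = {}
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # The temporary directory and every file in it are removed
        # as soon as the with-block exits.
        with tempfile.TemporaryDirectory() as tmp:
            for i, name in enumerate(batch):
                local_path = os.path.join(tmp, f"{i}.pdf")
                download(name, local_path)
                results[name] = extract(local_path)
    return results
```

If the `google-cloud-storage` package is used (an assumption, it is not mentioned in this thread), `download` could be something like `lambda name, path: bucket.blob(name).download_to_filename(path)`.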

(I could be way off base, and maybe there’s an easy solution I don’t know about - if that’s the case I hope someone else will chime in! I just haven’t seen anyone try to do remote GCP processing using a Python Script node before…)

Hi @ScottF,
Thanks for answering,
Yes, you understood my problem well.
And yes, I should not download the PDFs, because there are too many files and when I download them all at once my memory fills up. But your idea can help me: I will include a loop that downloads a PDF, extracts the text from it, and deletes the file once the text has been extracted, before continuing.
Thanks for the idea.

Hi @daviddelos,

I think that is the direction (no file downloading) the new file handling nodes are heading in. See this topic from the previous release: excel reader 4.1 file connector port

And see here for today’s release:
https://www.knime.com/whats-new-in-knime-42#file-handling-framework

Br,
Ivan
