Help Needed: Looping Through AWS S3 Folders to Extract CSVs

Dear KNIMErs,

I am trying to build a workflow to loop through multiple AWS S3 folders daily and extract CSV files. The folder structure has both fixed and dynamic components, which I’ll explain in detail below. Despite my efforts, I am struggling to dynamically feed the file path into the CSV Reader node.

Here’s the folder structure inside the S3 bucket:

a/b/c/product-name-a/MONTHLY/2024-10-01/2024-10-31/xycabz345aaa-2024-10-31.csv

Breakdown of the Folder and File Structure:

  1. a/b/c: Fixed path defining the main subfolder for daily snapshot files.
  2. product-name-a: Semi-dynamic :wink:. This represents one of ~20-30 products. The product names are fixed for now but may change over time.
  3. MONTHLY: A subfolder present in each product folder.
  4. 2024-10-01: The month-level subfolder. It reflects the date the folder was created (e.g., October 1st) and contains all files for that month.
  5. 2024-10-31: The daily snapshot subfolder, labeled with the creation date (e.g., October 31st).
  6. xycabz345aaa-2024-10-31.csv: The dynamic CSV file name, combining a random hash and the creation date.
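To make the structure concrete, here is a minimal Python sketch (outside KNIME) of how the folder prefix could be derived for one product and snapshot date. It assumes, as described above, that the month-level folder is always the first day of the snapshot's month; the function name and product value are illustrative only. Since the file name contains a random hash, only the prefix, not the full key, can be built in advance.

```python
from datetime import date

def build_prefix(product: str, snapshot: date) -> str:
    """Build the S3 key prefix for one product's daily snapshot folder.

    Mirrors the structure described above:
    a/b/c/<product>/MONTHLY/<first-of-month>/<snapshot-date>/
    """
    month_folder = snapshot.replace(day=1).isoformat()  # e.g. "2024-10-01"
    return f"a/b/c/{product}/MONTHLY/{month_folder}/{snapshot.isoformat()}/"

print(build_prefix("product-name-a", date(2024, 10, 31)))
# → a/b/c/product-name-a/MONTHLY/2024-10-01/2024-10-31/
```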

What I Have Tried:

To handle the dynamic parts, I planned to:

  • Use loops and String Manipulation nodes to create the required folder structure.
  • Dynamically generate the file paths using List Files/Folders, String Manipulation, and String to Path nodes.
  • Feed the resulting file path as a flow variable into the CSV Reader node.
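The listing-and-filtering step above can be sketched in plain Python for clarity: given a listing of keys (as List Files/Folders or any S3 client would return), keep only the CSVs whose name ends with the snapshot date. The helper name and the example key are illustrative, not part of the actual workflow.

```python
import fnmatch

def match_snapshot_csvs(keys, snapshot_iso):
    """Filter a list of S3 keys down to CSVs for one snapshot date.

    Matches file names like <random-hash>-<snapshot-date>.csv,
    ignoring the folder part of each key.
    """
    pattern = f"*-{snapshot_iso}.csv"
    return [k for k in keys if fnmatch.fnmatch(k.rsplit("/", 1)[-1], pattern)]

keys = ["a/b/c/product-name-a/MONTHLY/2024-10-01/2024-10-31/xycabz345aaa-2024-10-31.csv"]
print(match_snapshot_csvs(keys, "2024-10-31"))
```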

The Issue:

When I pass the dynamically created path as a flow variable to the CSV Reader, I get an error saying the file does not exist. This happens despite verifying that the path is correct when checked manually.

My Question:

Is there a way to dynamically pass a path flow variable into the CSV Reader node to successfully read data from AWS S3 buckets? If not, is there an alternative approach for handling such dynamic S3 folder structures in KNIME?

I appreciate any guidance or ideas to resolve this issue. Thank you in advance for your help!

Hi @kowisoft

If you try to read a single file in KNIME it works, but the issue occurs when looping and dynamically changing the path?

What I mean by the above is: are there definitely no access-rights issues when accessing those S3 buckets via KNIME? I had similar issues a while ago trying to access SharePoint files - it turned out I had the right access as a user, but the KNIME app itself had limited access privileges.

What about using the below node instead of the Amazon S3 connector?


@kowisoft besides the question of access rights, have you tried combining this with a listing of the existing files, to see whether the paths work there and KNIME does 'see' the files?

Are all files failing, or does it sometimes work? In the latter case a retry construct might help.

You might also check the exact error message in debug mode to see if there is further information.

Thank you both for your kind help @mlauber71 and @Add94 -

I finally got it working. The error I encountered came down to me not understanding what the String to Path node can do.

I was struggling with the fact that I could only create LOCAL or RELATIVE TO paths, but not CONNECTED paths. The latter, however, is what the List Files/Folders node requires if you want to read from an AWS S3 bucket dynamically.

But then it was simply a matter of adding an additional File System Connection Port to it and connecting it with the incoming connection from my Amazon S3 Connector node.

After that, I only needed to loop through the different subfolders intelligently to get the files imported.

But both your comments somehow pushed me in the right direction, thanks again for that!

