Data blending by reading several files in several folders by iterating over an Excel column

Hi KNIMERS,

I’m a beginner and have recently started learning KNIME. I want to build a workflow that automates the following process. These are the tasks I need to carry out:

  • While reading the CSV files into a system, the system displayed errors for some of the CSV files’ rows. These error messages were exported to an Excel spreadsheet.

  • The Excel file has several columns, one of which contains the error messages about those CSV files. Another column contains the folder name where the CSV file is located, another contains the file name and the row where the error occurred, and some other columns have the same headers as the CSV files’ columns.

Only the files listed in the Excel file need to be processed. So we need to iterate over the columns where the folder name and the file name are given, append the error-message column to the corresponding CSV files, and create new CSV files that include the corresponding error messages.
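To make this concrete, here is a minimal, artificial sketch in Python/pandas of how I picture the error-protocol table and the selection of files to process; the column names folder_name, file_name, row and error_message are only placeholders, not the real headers:

```python
import pandas as pd

# Artificial example of an error-protocol table; the column names are placeholders
error_protocol = pd.DataFrame({
    "folder_name":   ["folder_1", "folder_2", "folder_2"],
    "file_name":     ["file_1.csv", "file_2.csv", "file_3.csv"],
    "row":           [3, 7, 12],
    "error_message": ["missing value", "bad date format", "duplicate key"],
})

# Only the folder/file combinations listed here need to be processed
files_to_process = error_protocol[["folder_name", "file_name"]].drop_duplicates()
print(files_to_process)
```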

Can anyone help me to create this workflow?
I will be grateful for any advice.

@abekchanov if you could give an example, it would be good to discuss that. It should contain (artificial) examples of how your data looks.

From what you say, you will have to combine several techniques that are all present in KNIME. If you want to understand the whole path-and-file business, you might read the lengthy KNIME File Handling Guide. It is long, but afterwards you will know everything about the handling of paths and files. In your case you will have to convert the paths from the Excel column, loop through them, import the CSV files, edit the data, and save them again.
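Outside of KNIME, that loop logic could be sketched in plain Python roughly like this; the base folder, the sheet layout, the column names and the 0-based row indices are assumptions taken from the description above, not from your real files:

```python
import pandas as pd
from pathlib import Path

BASE_DIR = Path("data")  # assumed root folder that contains folder_1, folder_2, ...

# Read the error protocol; sheet layout and column names are assumptions
protocol = pd.read_excel(BASE_DIR / "Error_protocol.xlsx")

# Loop over every folder/file combination listed in the protocol
for (folder, file_name), errors in protocol.groupby(["folder_name", "file_name"]):
    csv_path = BASE_DIR / folder / file_name
    data = pd.read_csv(csv_path)

    # Append the error message to the rows referenced in the protocol
    # (assumes the 'row' column holds the CSV's 0-based row indices)
    data["error_message"] = ""
    data.loc[errors["row"], "error_message"] = errors["error_message"].values

    # Write a new CSV next to the original, prefixed with 'error_report_'
    data.to_csv(csv_path.with_name(f"error_report_{file_name}"), index=False)
```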

A smaller example in action is this one, with dynamic reading of files and handling of loops.

Then you will have to import the CSV files, where there could be an issue, since they failed once already and might be a problem to import.
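A defensive read along these lines might help; this is just a sketch, and the pandas option shown is only one possible way to tolerate bad lines:

```python
import pandas as pd

def read_csv_safely(path):
    """Try to read a possibly malformed CSV, skipping lines that cannot be parsed."""
    try:
        # on_bad_lines="skip" drops unparseable rows (available in pandas >= 1.3)
        return pd.read_csv(path, on_bad_lines="skip")
    except (FileNotFoundError, UnicodeDecodeError) as exc:
        print(f"Could not read {path}: {exc}")
        return None
```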

1 Like

Hi @mlauber71, thank you for your support.
I have manually created dummy files to build a similar workflow. By looking at these files, one should understand the point of the task.
Here are the files. They would need to be converted to CSV; I couldn’t upload the CSV versions, so they are in Excel format. More specifically, imagine that file_1 is from folder_1, file_2 and file_3 are from folder_2, and file_4 and file_5 are from folder_3, as in our example.

file_1.xlsx (8.8 KB)
file_2.xlsx (8.8 KB)
file_3.xlsx (8.8 KB)

file_4.xlsx (8.9 KB)
file_5.xlsx (9.0 KB)

The Error_protocol file is the Excel file that shows in which file, and in which row, an error message was detected. According to this Excel file, we need to iterate over the file names and process only those files that are listed in the file_name column. I’m also uploading files that show how the final result files should look (error_report_file_1, error_report_file_2, and error_report_file_3).
Error_protocol.xlsx (9.2 KB)
Here are the result files:
error_report_file_1.xlsx (9.9 KB)
error_report_file_2.xlsx (9.9 KB)
error_report_file_3.xlsx (9.9 KB)

Thank you once again!

Hi @mlauber71, have you seen the example files that I uploaded?

@abekchanov yes, I got them. Thank you for the samples. I put together an example that you can check out.

What I noticed is that your example is not consistent: file_3 and file_2 are not meant to be in folder_3, so I inserted a Try/Catch block to deal with the error.

Also, you might want to check how to construct file variables.
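In plain Python, constructing such file variables from the folder and file name columns could be sketched like this; the base folder and column names are again assumptions:

```python
import pandas as pd
from pathlib import Path

protocol = pd.read_excel("Error_protocol.xlsx")  # column names assumed as before

# Build one full path per row from the folder and file name columns,
# similar to constructing a path flow variable in KNIME
protocol["csv_path"] = protocol.apply(
    lambda r: str(Path("data") / r["folder_name"] / r["file_name"]), axis=1
)
```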

2 Likes

@mlauber71, thank you very much for your thorough response. You are of great assistance to me! I’ve learned a lot; however, there are still some nodes whose functionality I need to master.

@abekchanov glad you like it. Yes, it does make sense to check out the various nodes. I have tried to add descriptions. If you have questions about certain functions, feel free to ask. One benefit of the KNIME approach, with a lot of small tasks, is that you can always check what has been done at every step.

Please note: some functions rely on the structure of your data staying the same, like the structure of the row column (.)(row)(.). This rule requires the data to be like that, but you can easily switch that out if you must.
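If the row column really follows a fixed pattern such as 'file_1.row12.' (that exact format is only an assumption here), the row number could be extracted with a small regular expression, for example:

```python
import re

def extract_row_number(value):
    """Pull the number that follows 'row' out of a value like 'file_1.row12.' (assumed format)."""
    match = re.search(r"row(\d+)", str(value))
    return int(match.group(1)) if match else None

print(extract_row_number("file_1.row12."))  # -> 12
```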

1 Like

@mlauber71, once more, many thanks. Here, I need to make a clarification. In this workflow, you added the paths of the result files (error_report_file_1, error_report_file_2, error_report_file_3) too. I uploaded these files to have a clear picture of the end results.
However, they shouldn’t be required to participate in the process; only the workflow’s outcomes ought to match those files. The workflow needs to be created based on the Error_protocol and the files file_1, file_2, file_3, file_4, and file_5. When we write the results back with the writer node, we need to get those results (error_report_file_1, error_report_file_2, error_report_file_3). How would I go about removing those nodes? Could you please advise which nodes have to be modified or eliminated? Thank you in advance for your time and effort.

@abekchanov I added the folder name to the file name in order to keep it consistent in case a file with the same name should appear in another folder, but you can always remove that part.
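The naming itself is only string concatenation; as a sketch (the exact pattern is an assumption):

```python
folder_name, file_name = "folder_2", "file_3.csv"

# Prefix the output with the folder name so identical file names
# from different folders cannot overwrite each other
output_name = f"error_report_{folder_name}_{file_name}"
print(output_name)  # -> error_report_folder_2_file_3.csv
```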

The whole thing is based on the Error_protocol file.

Maybe you could use the example and describe again, at the level of the files and folders, what you want to do, or maybe try it for yourself. My impression is often that having a clear and complete definition is the first step to a solution.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.