I would like to create an app in KNIME Analytics Platform that reads data, lets me review it, corrects some columns, and then stores the verified data in a database.
Let me explain the steps.
1. Read data from CSV files. All CSV files have the same structure: a header section and a data section. The header section is of the form ‘parameter : value’. The data section consists of multiple rows, one measurement per row. This data is put into a table: each data row becomes a row in the table, and the header data is added to each row as columns, with the parameter names as column names. As there are a lot of files and I process them in a loop, reading each file by passing the file name as a path variable to the CSV Reader, this part of the process takes a lot of time when many files (say > 1000) are processed. Re-executing this part is a time-consuming process. (A small parsing sketch follows these steps.)
2. The header data is checked and corrected if needed. Fortunately the headers are the same for all files in a batch of CSV files. When a parameter is missing or wrong, it must be corrected. This checking and correction is a manual operation.
3. Store the corrected table in a database (with some additional metadata); see the second sketch after these steps.
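To make step 1 concrete, here is a minimal sketch of how one such file could be split into its header and data parts (it assumes a hypothetical layout where the ‘parameter : value’ lines come first and a blank line separates them from the data section; in KNIME this could live in a Python Script node inside the loop):

```python
import io
import pandas as pd

def read_measurement_file(path: str) -> pd.DataFrame:
    """Split one file into its 'parameter : value' header and its data section,
    then attach the header parameters as extra columns on every data row."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()

    # Assumption: a blank line separates the header section from the data section
    header_part, data_part = text.split("\n\n", 1)

    # Parse the 'parameter : value' lines into a dict
    params = {}
    for line in header_part.splitlines():
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()

    # Read the data section (one measurement per row)
    df = pd.read_csv(io.StringIO(data_part))

    # Add the header parameters as columns; the parameter names become column names
    for key, value in params.items():
        df[key] = value
    return df
```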
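And for step 3, a sketch of appending the verified table to a database with one extra metadata column (SQLite and the table/column names are assumptions; in a KNIME workflow this would normally be a DB Writer node instead):

```python
import sqlite3
import pandas as pd

def store_table(df: pd.DataFrame, db_path: str = "measurements.db") -> None:
    """Append the verified table to the database with some additional metadata."""
    df = df.copy()
    df["loaded_at"] = pd.Timestamp.now().isoformat()  # assumed metadata column
    with sqlite3.connect(db_path) as conn:
        df.to_sql("measurements", conn, if_exists="append", index=False)
```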
How can I make this workflow into an app with components in the Analytics Platform? In step 2 I must be able to update the table with some user input. When this input is ready, the table must be updated and stored in the database.
Do I need to use multiple components, one component for each step?
What are best practices for this process?
I hope this is already sufficient information for guiding me to possible solutions.
Hello @Johnny_Gel
Reading through the challenge description, you are requesting advice on a broad set of topics; we (the people helping in the forum) are willing to help, but our time is limited too. I would recommend splitting the topics so that the relevant topic champions can give attention to your posts, and, besides, providing some more detail with examples.
As I see it:
FIRST TOPIC: Read data from CSV files and performance.
I understood that you already have an ETL workflow; however, from the description I assume that your workflow processes all the data each time (“Re-executing of this part is a time consuming process.”). Have you tried processing only the incremental data and appending it to the existing data from the previous run? And, regarding your step 3, how is your storage system / DB set up?
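Just to illustrate the incremental idea, here is a sketch of selecting only the files that are not yet in the database (the `source_file` column, table name, and paths are assumptions; in KNIME this would map to a List Files/Folders node, a lookup against the DB, and then your existing reading loop):

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("measurements")   # hypothetical folder containing the CSV files
DB_PATH = "measurements.db"       # hypothetical database

with sqlite3.connect(DB_PATH) as conn:
    # File names that were already loaded in a previous run
    already_loaded = {
        row[0] for row in conn.execute("SELECT DISTINCT source_file FROM measurements")
    }

# Only the new files go through the (slow) reading loop
new_files = [p for p in sorted(DATA_DIR.glob("*.csv")) if p.name not in already_loaded]
print(f"{len(new_files)} new files to process")
```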
SECOND TOPIC: Unpaired header templates from batch files.
You commented that you need to check these manually. If you can expand on this topic:
- What is the root problem here: typing issues, missing columns…?
- As long as there is a logic behind these corrections, you can translate that logic into the workflow and then automate the corrections (see the sketch below).
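For instance, if the corrections follow fixed rules, something like the sketch below could be applied to every parsed header automatically (the parameter names and rules are made up; in KNIME this could be a Rule Engine, a Column Renamer, or a small script):

```python
# Hypothetical batch-wide corrections
RENAME_RULES = {"Progam": "Program", "Batchname": "Batch name"}   # fix known typos
DEFAULTS = {"Operator": "unknown"}                                 # fill in missing parameters

def fix_header(params: dict) -> dict:
    """Apply the batch-wide corrections to one parsed 'parameter : value' header."""
    fixed = {RENAME_RULES.get(key, key): value for key, value in params.items()}
    for key, value in DEFAULTS.items():
        fixed.setdefault(key, value)
    return fixed
```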
THIRD TOPIC (not to be confused with your step 3): app components
When you talk about components, I think you are really talking about ‘Interactive Visualizations (dashboards)’. They are related but not necessarily the same. Your questions about them are quite generic. What are the sketched steps in your workflow? What is the user interaction at every step?
I have a manual header re-map component that is based on the interactive view, but I have only tested / used my log file and backup approach on the desktop. If you test it on a server, then I would love to know how it works. It is drop-down driven based on matching the headers in another table or list. It is overbuilt by quite a bit for small jobs since I am often remapping hundreds of headers at once, but you can also just turn the filters off for simplicity. It can also re-order, ensure missing header names when no re-map is present, etc. Already used names are removed from the drop-down options. There is an option to remove a remap, which opens it back up as an option on the drop-down list. It is a pretty heavy component which takes some time to execute, but worth it on my end for the manual UI.
This is header names only. Converting types is too error prone and can require data manipulation, so I split that into another step.
Here is a workflow that you can use to see how it works.
You can use a Table Difference Finder or custom rule test to check if header re-mapping is required. You could then use a case switch to bypass it if the headers match already.
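As a minimal sketch of such a check (the expected column names here are assumptions), the boolean result could then drive the case switch:

```python
EXPECTED_COLUMNS = {"Program", "Batch name", "Operator", "Timestamp", "Value"}  # assumed

def needs_remapping(columns) -> bool:
    """True if the incoming table's headers differ from the expected set."""
    return set(columns) != EXPECTED_COLUMNS
```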
Thanks for taking time to answer. I really appreciate it.
Your interpretation of my challenge was broader than I meant it to be, sorry for that. Almost all of these topics I have already solved in another workflow. The main topic I want to discuss is stopping/pausing and resuming a workflow.
Reading the CSV files and putting them into a table after some manipulation is done in a loop. I cannot read all CSV files at once because I have to split each file into a header part and a data part. But maybe I’m wrong.
When the header is incorrect, I have to correct it, so the header needs to be checked first. Fortunately I only process CSV files from one batch, and the batch-type metadata (things such as the program used, batch name, etc.) is the same, and the error is the same for the complete batch. So I can insert correct values for some of the header parameters. The decision to change those parameters can be made after reading the first file or when all files have been read. The correction can be done either in the CSV-reading loop or once all CSV files are in one table. So this is not an issue that I can’t solve.
When I have checked that the data is correct and not yet in the database, I store it in the database. Further analysis of the data is done with the database as input.
My main issue is how to stop the workflow, check and correct some data, and then continue with storing it in the database.
Currently my solution is to store the data in KNIME tables, create components, and run and show these components separately.
It could be that what I want to achieve is not possible.
Connect the variable output port to a Wait… node, configuring it with the ‘Wait for file…’ / Modification option. It will wait until the file’s ‘Date modified’ metadata changes.
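If you ever need the same behaviour in a script rather than the Wait… node, a simple polling sketch could look like this (the file name and poll interval are placeholders):

```python
import time
from pathlib import Path

def wait_for_modification(path: str, poll_seconds: float = 1.0) -> None:
    """Block until the file's 'Date modified' timestamp changes."""
    original_mtime = Path(path).stat().st_mtime
    while Path(path).stat().st_mtime == original_mtime:
        time.sleep(poll_seconds)

wait_for_modification("corrections.xlsx")   # hypothetical file edited by the user
```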
I hope this helps.
BR
P.S. Avoid MS AutoSave mode if that applies to your case: if you want to turn AutoSave off by default for all files, go to File > Options > Save and uncheck the box next to AutoSave.