handle row based on value in first field - kind of hierarchical

Adrian73 · May 11, 2022, 7:30am

Hello, very new Knime. We are looking to move away from our text based script files for handling data to something visual & are working through various examples to see how we can handle the different types of jobs we currently have working.

Now for this type I’m not even sure what you call this type of file so found it hard to search for an answer.
Basically each row has a record type indicator as the first field & that indicates what fields are expected for that record.

Types & handling could be:
100: header row. it contains the creation date that I will need for all rows.
200: premise details. I will need the premise id for the next related rows until another 200 row is read.
300: meter details. i will need the meter id for the next related rows until another 300 (or 200) row is read.
400: register read. register id, read date, interval length, read columns (1440/interval length = how many read columns after this there will be.)
500: event record. only appears when there are reading events & has various columns for identifying issues & reasons.
900: footer record. number of premises read, number of register readings
So as an example the file could look like:

100,20220511,FromParty,ToParty
200,321654987,ConfigType,DepotName
300,10005000,SerialNumber,Make,Model
400,A,20220510,144,54,67,89,56,45,64,89,68,48,58
400,B,20220510,144,64,65,65,85,45,61,34,87,84,51
300,100050203,SerialNumber,Make,Model
400,A,20220510,144.56,84,51,56,57,18,85,45,85,15
200,654456654,ConfigType,DepotName
300,100068484,SerialNumber,Make,Model
400,A,20220510,288,554,567,859,566,457
500,Z29,035
900,2,4

Then I need to put this data into 2 separate tables basically for the 400 & 500 rows with the appropriate data from their related previous rows.

Like these as an example:
Table 400:

PREMISE_ID 	 METER_ID 	 REGISTER_ID 	 READ_DT 	 INTERVAL_LENGTH 	 I01 	 I02 	 ... 	 I10	CREATE_DT
321654987	10005000	A	10/05/2022	144	54	67	48	58	11/05/2022
321654987	10005000	B	10/05/2022	144	64	65	84	51	11/05/2022
321654987	100050203	A	10/05/2022	144	56	84	85	15	11/05/2022
654456654	100068484	A	10/05/2022	288	554	567	null	null	11/05/2022

and table 500:

PREMISE_ID 	 METER_ID 	 REGISTER_ID 	 READ_DT 	ERROR_CD	REASON_CD
654456654	100068484	A	10/05/2022	Z29	`035

I have been trying to find how to perform the above. I believe I would to move through a file sequentially storing the various data from the 100, 200, & 300 into variables. Then attach it to the 400 & 500 rows in the correct order. But I can’t work out which nodes I would achieve this with. It seems all the “splitter” types would handle each row by its type but there wouldn’t be a relationship to the correct other rows.

Can somebody please assist in pointing me in the right direction.

Cheers
Adrian

victor_palacios · May 12, 2022, 5:47pm

Hello,

I tried working through an example I made up using your data as a txt file:

Special Formats.knar.knwf (66.8 KB)

I didn’t get to the level of Meter_ID but with this you should be able to understand the logic and different nodes you can use to construct what you need.

mlauber71 · May 12, 2022, 7:50pm

@Adrian73 welcome to the KNIME forum. Since I am on a mobile device I could not have a look at the solution by @victor_palacios. But it seems your example and the resulting data might not be fully consistent.

There have been similar data preparation tasks in the pastel you might want to take a look. And if you do not have a solution yet maybe amend your sample data.

Adrian73 · May 12, 2022, 11:51pm

WOW. I suspected it was going to be involved but that is crazy. It is kind of disappointing that it is this complicated. I think this makes it harder to justify using Knime as a solution to replace our current scripts.

Having said that I do like how you made it on the surface look simple. But I’m not sure the rest of the team would want to move to it if that is the type of solution we need to go into for each of our scripts.

Thank you very much for your time & effort.

Cheers

mlauber71 · May 13, 2022, 5:01am

@Adrian73 out of curiosity what tool and script would you use? My impression is KNIME can handle such scenarios but as a data analytics platform has much more to offer.

In think with your kind of data some individual programming will always be necessary. Other ETL tasks will be much easier if there is a database or a standard structure like CSV or JSON.

Adrian73 · May 13, 2022, 5:22am

Currently we are doing all by scripting in Python. Very flexible & has a lot of scientific & analytic modules. We were hoping that we could move to a visual application to cut down on the training requirement & programming knowledge required.

basically for this it would be a simple case of:
read row
is it 100 - store create date in variable
is it 200 - update premise variables
is it 300 - update the meter variables
is it 400 - record premise + meter variables + 400 row data + create date
is it 500 - record premise + meter variables + 500 row data + create date
loop til end

Oh well. I can report that I tried.

Cheers
Adrian

mlauber71 · May 13, 2022, 5:39am

@Adrian73 could it be worth a try to use your import python script within KNIME and then do the rest with visual programming? You can even run jupyter notebooks or your own python modules from KNIME (and KNIME from jupyter).

The python script could be set up once by an expert and then just be used.

This is how several large companies I know operate. If there is a great R or python script: just integrate it into KNIME. So programming experts and maybe sales people can work together.

Maybe you could share a python script for your import task and we can build a KNIME workflow around.

duristef · May 13, 2022, 1:27pm

@Adrian73 @mlauber71 @victor_palacios
In this workflow I tried to exploit the hierarchical structure of the file and the possibility of replacing the missing values of a row with the corresponding values of the upper one. I must admit that it is not very elegant, but it exemplifies a set of techniques that can be exploited in Knime

Results (columns can be reordered)

hierarchical_dataset.knwf (53.8 KB)

victor_palacios · May 13, 2022, 8:58pm

I think justifying KNIME overall is quite simple:

Price
Traditional programmers will be more expensive than a visual programmer that can build something with similar capability.
Scarcity/Sustainability
Is it easier to hire a talented traditional programmer (and keep them) or to train someone to use KNIME?
Time
Building the Minimal Viable Product is less time consuming in KNIME. Also consider the time it takes to code in any traditional language and get the insights vs. building in KNIME. Not to mention when you need to review KNIME workflows vs reviewing lines and lines of code.

In this particular case my solution was what I could come up quickly (less then 2 hours - I wonder how long the script solution took?) but others have submitted solutions as well (@duristef 's solution seems quite elegant), so it may be worth it to check if their solutions are more stream-lined and what you might need to convince others. As well, I find it easier to say “KNIME with traditional programming” vs “KNIME replacing traditional programming” since KNIME has several scripting nodes for Python, R, Java(Script), Regex.

Adrian73 · May 16, 2022, 3:24am

Thank you Duristef. You have provided me with the node that makes my answer as easy as I thought it should be.

The Missing Value node is exactly the thing that I was after to allow me to get the previous values populated for all rows. In my search to solve this I was looking for a node that specifically populated the previous value. But the only node I found was the lag node except it duplicated blank the fields. This was the big part of why I struggled to solve this myself. I found the node I thought I needed & it didn’t do what I hoped it would.

My next issue was that I thought because I had a series of “If this do that” type processes that I needed to create loops to handle it. You Knime pros know that isn’t true.

The answer really is as simple i thought it should be.

Hierarchical File read.knwf (46.7 KB)

I think this keeps Knime in consideration again. Thank you all for your assistance & time.

Cheers
Adrian

Adrian73 · May 17, 2022, 12:21am

Hey guys, further to my initial query, I did forget one critical part. I mentioned for row type 400 that the interval length determines how many columns of readings follow it. What I forgot when I constructed my question was that there are always columns after these reads. So for my example lets assume there are 2 columns; Quality Method & Load Datetime.

The issue is that for the 400 row if it’s got less reads than the 10 we need to fill the other intervals up to 10 with blanks then have those extra 2 columns after them. (These are a set amount - in the real data it is 5 columns so the number of additional columns is fixed & known.)

So it should be:
Premise_id, Meter_id, Register_id, Read_dt, Interval_length, Intervals 1 … 10, Quality_method, Load_dt, Create_dt

Example data:
400,A,20220510,144.56,84,51,56,57,18,85,45,85,15,S2,202205110012
400,A,20220510,288,554,567,859,566,457,A,202205110012

In python it’s a matter of adding all the 400 except the last 2 fields, filling the fields (up to 10) with blanks, then adding the last 2 columns of the row. (Top & tailing with the 200, 300 & 100 applicable data of course.)

I don’t even know where to begin my thoughts on this in Knime. Can you please provide some further direction.

Cheers
Adrian

duristef · May 17, 2022, 11:19am

@Adrian73
maybe this workflow can help

Input

Result (part of table 400)

Hierarchical File read 2.knwf (33.2 KB)

Adrian73 · May 18, 2022, 11:26pm

@duristef you certainly have a handle on all the different nodes. I learn more from your flows than I have elsewhere. There are quite a few instances where I can see simpler steps than what you have done. I think we would work well together. Thank you for all your assistance & time here.

Cheers
Adrian

system · May 25, 2022, 11:27pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.