row comparison

You probably have “None” values in column “ETID”

in ETID column some values having empty

Hi @Abhiram ,

OK, then that explains it. Please add these conditions too:
input_table_1['ETID'][i] is not None and input_table_1['ETID'][j] is not None

Add them before the conditions that use input_table_1['ETID']. For example, add it there:
and input_table_1[‘Amount’][i] == input_table_1[‘Amount’][j] and input_table_1[‘ETID’][i] is not None and input_table_1[‘ETID’][j] is not None and input_table_1[‘ETID’][i] != input_table_1[‘ETID’][j] and input_table_1[‘ETID’][i][0:2] in [“CS”, “CT”] and input_table_1[‘ETID’][j][0:2] in [“CS”, “CT”] and

So, it should integrate like this:
... and input_table_1['Amount'][i] == input_table_1['Amount'][j] and input_table_1['ETID'][i] is not None and input_table_1['ETID'][j] is not None and input_table_1['ETID'][i] != input_table_1['ETID'][j] and input_table_1['ETID'][i][0:2] in ["CS", "CT"] and input_table_1['ETID'][j][0:2] in ["CS", "CT"] and ...

I was able to reproduce the error message. It definitely happens when 1 of the 2 ETID is empty. If they’re both empty, it’s not a problem since this rule would be evaluated before the substring:
input_table_1['ETID'][i] != input_table_1['ETID'][j]

But once one of them is empty, it will then skip this rule and evaluate the substring, and that’s when it fails.

The above modifications solve this issue.

EDIT: BTW, you are not using the latest workflow that I gave, which was

You are using an older version that I gave. It’s hard to troubleshoot if you are using something different from the latest version.

You should also not paste the error message and then give us a modified script - like when your error message said the issue was on line 15, but lines 14, 15 and 16 were empty lines. All these make it hard to troubleshoot properly.

When I reproduced the error, it told me exactly the correct line of the code where the issue was.

1 Like

and what about row 16

    if output_table_1['Cleared?(Yes/No)'][i] == "Yes": #Row being compared to is already flagged, skip and move on

is it a [i] or a [j]? (not really important, it’s only a matter of performance)

You need to set both [i] and [j] to “Yes” as per my workflows (all versions of it do this).

That’s done based on your rules, which basically is to set both the Current and the Matching rows to “Yes”, where [i] is the current row, and [j] is the matching row

EDIT: Sorry, I thought it was @Abhiram who asked the question :slight_smile:
Just realized that it was @duristef who asked that after re-reading it.

Yes, so based on @Abhiram 's requirements, we need to flag both the current row and the matched row

Yes, that’s ok with me, but I’m talking about another line. As you can see, the same condition is checked in two lines (those with the arrows)

for i in range(num_of_rows):
  if i == num_of_rows: #Last row, do nothing
    break
  ==> if output_table_1['Cleared?(Yes/No)'][i] == "Yes": #Row is already flagged, skip and move on
    continue

  for j in range(i+1, num_of_rows): #On to the next rows
    ==> if output_table_1['Cleared?(Yes/No)'][i] == "Yes": #Row being compared to is already flagged, skip and move on
      continue

but if the first condition is met, then the inner loop doesn’t even start. On the contrary, if it starts there’s no need to check the same cell again. Instead, I think that in the inner loop you want to check if output_table_1['Cleared?(Yes/No)'][** j **] == "Yes". But it’s really not important, the script works the same, it’s only less efficient

Hi @duristef , yes this is because it’s going through the whole table.

It starts with a row, and then compares with the rest of the rows, and for the sake relating to the code, let’s call the starting row as [i], and then it compares with [i+1] until the end of the table, or until it finds the row that match with some conditions (it only needs to find 1 matched row, hence the “continue” when the if statement is true).

So, let’s say [j] is matched. The other rule is that if a row is already matched, you cannot match it again with another row, so that is why both [i] and [j] are set to “Yes”.

Of course, as the loop is going through the table, at some point [i] will reach [j], meaning that it would have already been flagged previously, and we want to skip that row, so yes, we don’t want the inner loop to even start for this, skip to next [i]

Ok, you set output_table_1['Cleared?(Yes/No)'][i]="Yes" in the inner loop, but even then the second (inner) condition is useless, because you break the loop after you set that cell, so you won’t meet it no more. Instead, it would make sense to check output_table_1[‘Cleared?(Yes/No)’][ j ]: if it’s a Yes, that means the row has already been processed and can be skipped (but maybe I don’t get the ratio of the application)
Sorry for being so stubborn and good night :slightly_smiling_face:

1 Like

Hi @duristef , sorry, now I see what you mean. At first, I thought you were asking about the assignment, and then I thought you were asking why we were doing the validation on [i].

Yes, you are correct! The for loop will only run if output_table_1['Cleared?(Yes/No)'][i] != "Yes", hereby rendering the check in the loop unnecessary.

And you are correct. This was meant to be output_table_1[‘Cleared?(Yes/No)’][j] == “Yes”

EDIT: @Abhiram , here’s the updated version: Row Comparison.knwf (21.7 KB)

You still haven’t answered why you are separating Case I into another Python Script node? The problem with doing this is that now you have 2 nodes to maintain and also execution-wise, it’s inefficient as both share some base conditions, meaning they have to be evaluated twice.

2 Likes

Thanks @bruno29a & @duristef the script got working,

just now completed the work and sent to my team for the final confirmation.

1 Like

I have converted this WF to a non-scripting version, which by happenstance also reduced the complexity from 2 nested loops down to a single loop. As others have suspected, it’s not as pretty though. The key ingredient was using an in-memory H2 database to enable both random row access and recursive table access.
This WF is a direct port of the Python logic shared by @bruno29a in post #116. I’d like to mention that the two nodes in his WF aren’t logically equivalent. In the rearranged snippet, rows that enter the last IF, right before cases A-H are tested, cannot get tested for case I. Replacing the last ELIF to an IF should fix that though.

Rough description of how it works:

  1. input table is stored in temporary DB
  2. loop over original rows
    current row is stored in Flow Variables
    rows to compare against are stored in table (after step 3)
  3. DB Query to get current state of table
    optimisations: queries rows that have not been cleared or checked before; skips on empty table; skips if current row is cleared; (same as Python script)
  4. checks cases (this is the ugly part; hidden in the Metanode)
  5. skips iteration if no match is found
  6. matching rows are sorted to original order, then filtered down to 1 row to make sure only pairs are changed
  7. current row is added to table, both rows are set to cleared="Yes" via Rule Engine
  8. updates status of the pair
  9. after Loop End, final table is queried from the DB



row_comparison.knwf (142.9 KB)

5 Likes

Nice one @Thyme . Using a DB Join is definitely an alternative when we want to compare data with multiple rows. The join is doing the loop for you. I just did not have the “Thyme” to explore this. It’s a 116-posts trying to get this to work for 1 method :smiley: Rules were changed, rules were added, rules needed clarifications, sample data was changed… Just did not have the energy to try something else.

I’m glad that someone did it that way too.

2 Likes

Yeah, I’ve been following this thread loosely since the beginning. It’s been quite a ride. :sweat_smile:

We could definitely let the DB do the case checking for us, but I didn’t want to node golf this. Other people need to be able to wrap their head around it after all!
Using an H2 DB is a neat big trick, which been a lot easier to learn than recursive loops. Thank you for showing me that one @bruno29a =)

4 Likes

thanks a lot @Thyme.

But as of now it is working fine with python script node. if required i will use the above workflow also.

Thanks.

So @Abhiram … is it resolved?

Hi @bruno29a ,

As of now i am following your approach, it is working.

As i am new to data analyst,
Can you help me that which things could need to be learn so that i can be strong in Data analyst. like i need to learn more on python as well as other tools(Knime etc) so can you suggest any ideas?

Hi @Abhiram , you can mark it as resolved by marking a post as solution. That way other users can easily find the solution, and the thread can be closed.

Regarding ways how you can learn Knime, what @iCFO suggested in your other thread is more or less what I would suggest too. @iCFO really did a good job there by giving different ways to do this

At least you are coming from Alteryx. I’ve not used any of these tools before. I’ve used Knime for about 1.5 years so far, and been active on the forum for just over a year (exactly 13 months). But Knime allowed me to do things that I have been doing manually for years, so I got very interested in it, and consequently, I used it heavily when I found out about it. In 2-3 weeks of using it, I learned about 90% of what I currently know. Like anything else in life, “practice makes perfect”.

2 Likes

@Abhiram Please select Brunos post as solution if you’re using his Python script. :slight_smile:

Speaking of solutions, here’s another one, using Recursive Loops:
It uses the same “kernel” for finding matching rows, with a minor modification. It uses fewer nodes and I can see it being faster than both the H2 version and the Python script, but for maintenance reasons, the Python solution is the one to go with.

How it works:

  1. split off first row (+turn it into Flow Variables to compare against other rows)
  2. find matches
    if there’s a match:
    a) first matching row scoots over to first row
    b) both rows status is set to “Yes”
  3. first row (and its match) are collected in the Loop End
  4. remaining rows go to step 1


row_comparison.knwf (282.4 KB)

1 Like

Thanks @Thyme for another solution or process

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.