row comparison

Hi @Abhiram most of the things can be done without Python, including this. However, in this case, Python (or Java) is the more efficient way to tackle this. The challenge here is that Knime processes a table row-wise, meaning that you can easily access each of the column values across a row, but it’s not as direct if you want to access different rows at the same time. You have to build different combinations of nodes in order to do this.

In your case, it’s even more complex in that you want to do this for different rows for different columns. It will take quite some steps to reach what you are trying to do, not to mention that you will then have to apply the rules on top of that.

What is the issue with using Python?

ok ok @bruno29a ,

Because my team currently was not supporting me to do python script node in between the knime workflow

where they telling that in future this workflow need to deploy in Production at that time may be encountered some issues.

Hi Team (@bruno29a) ,

one of the person in this community had given the solution without using python script node in the workflow,

Set value on lagged row multi-column-conditions with adjacent rows (2).knwf (43.2 KB)

but the above work flow is working fine for certain rows, ex: i mean if ROW 1 == ROW 2 ( not including 9th condition) then Cleared column able to get YES for both rows,

but here one challenge is if ROW 1 == ROW 4 it is not comparing and also not able to print the values as YES (WHere if it met the 8 conditions) for both ROW 1 and ROW 4.

Is there any chance to modify the above workflow to solve the issue without python.

As i am very new in Knime, if i able to solve these type of issues, so that i can able to learn from all of you. It may help me a lot if you provide some solution

for using python i am still waiting for the team confirmation to whether it’s fine to use python or not
?
if they accept to use python then no issues

Hi @Abhiram , as I mentioned already, it probably is possible to do this without Python, but it will be much more complex and longer to implement.

The Lag Column is definitely one of the tools that you can use. However, the limitation with the Lag Column is that you can only compare 2 specific rows at a time. If your case, you need to compare one row to ALL the rows under it, and then repeat the same procedure for each rows after. I can’t even count how many Lag Column nodes to do this, and of course, the more rows you have, the more nodes you have to use. And this is not even implementing a dynamic solution (as in for varying number of rows).

In addition of the challenge of reading multiple rows at a time, you need to apply different rules/conditions while doing this reading.

Python script it the only direct way to access various rows at a time and where you can apply the rules. Not even the Java Snippet allows you to do that (to read various rows at a time).

1 Like

Hi @bruno29a , Thanks for the solution which you provided through python script.

it is working on my original data, but small error was coming while executing the script on my original data where it almost contains 1300 rows:

Row Comparison_Python.knwf (27.6 KB)
error was getting on node 6(python script) on my original data

the error was on Case I i.e : “NoneType” object is not subscriptable **
** Traceback (most recent call last):

** file “”, line 15, in **
** TypeError: ‘NoneType’ object is not subscriptable**

Hi @bruno29a,

Can u help on this?

Hi @Abhiram , can you share the data that it’s complaining on? It’s not in the workflow.

I ran your node 6 with the sample data that came with the workflow, and it did not complain. I need the file that you used to see what’s the issue.

EDIT: As you can see, node 6 ran without any issue:

Is it the same sample file, and is it the same python script? There’s nothing on line 15 in the script. In fact, the lines before and after are also empty (14, 15, 16).

thanks for the reply,

yes for the sample data the workflow was running fine, but while checking on my orginal data facing issue on below line

if input_table_1[‘Cashlocation’][i] == input_table_1[‘Cashlocation’][j] and input_table_1[‘Currency’][i] == input_table_1[‘Currency’][j] and input_table_1[‘Amount’][i] == input_table_1[‘Amount’][j] and input_table_1[‘ETID’][i] != input_table_1[‘ETID’][j] and input_table_1[‘ETID’][i][0:2] in [“CS”, “CT”] and input_table_1[‘ETID’][j][0:2] in [“CS”, “CT”] and ((input_table_1[‘Transaction SubType’][i] == “BUY” and input_table_1[‘Transaction SubType’][j] == “SELL”) or (input_table_1[‘Transaction SubType’][i] == “SELL” and input_table_1[‘Transaction SubType’][j] == “BUY”)):

where in line 17 on node 6,

error was:

“NoneType object is not subscriptable”
“Traceback (most recent call last)”

i am thinking the code was working fine for sample data but it is not working for the original data (just case I)
May be we need to modify anything.

—>apologies i cannot able to send the original data here

Hi @Abhiram
Looks like one or more columns are missing. Maybe neither input_table_1 (in the original data ) nor output_table_1 (which is a copy of the former, when the loop starts) have the “Cleared?(Yes/No)” column?

Hi @duristef ,

i had cross checked with the data where none columns missing, if any columns missing, but while checking on sample data the workflow was working.

but while i checked with original data only not working getting error

I looked more carefully at your script in Node 6 and I think that a problem comes from the index of the dataframe, which is not an integer, but a string (“Row0”, “Row1”,…).
I think that there’s another error here

  for j in range(i+1, num_of_rows): #On to the next rows
    if output_table_1['Cleared?(Yes/No)'][i] == "Yes":

You probably mean “output_table_1[‘Cleared?(Yes/No)’][j]”.

I’ve rewritten the script. Check it carefully, I hope it suits your needs.

import pandas as pd

num_of_rows = len(input_table_1)

# reindex input_table_1 with integers
s = pd.Series([x for x in range(num_of_rows)])
input_table_1.set_index(s, inplace=True)

# Copy input to output
output_table_1 = input_table_1.copy()

for i in range(num_of_rows):
	''' 
        NO NEED: LOOP WILL STOP AT num_of_rows-1
	if i == num_of_rows: #Last row, do nothing
		break
	'''
	
	if output_table_1.at[i,'Cleared?(Yes/No)'] == "Yes": #Row is already flagged, skip and move on
		continue

	for j in range(i+1, num_of_rows): #On to the next rows
		if output_table_1.at[j,'Cleared?(Yes/No)'] == "Yes": #Row being compared to is already flagged, skip and move on
			continue
		
		if input_table_1.at[i, 'Cashlocation'] == input_table_1.at[j, 'Cashlocation'] and input_table_1.at[i, 'Currency'] == input_table_1.at[j, 'Currency'] and input_table_1.at[i, 'Amount'] == input_table_1.at[j, 'Amount'] and input_table_1.at[i, 'ETID'] != input_table_1.at[j, 'ETID'] and input_table_1.at[i, 'ETID'][0:2] in ["CS", "CT"] and input_table_1.at[j,'ETID'][0:2] in ["CS", "CT"] and ((input_table_1.at[i, 'Transaction SubType'] == "BUY" and input_table_1.at[j, 'Transaction SubType'] == "SELL") or (input_table_1.at[i, 'Transaction SubType'] == "SELL" and input_table_1.at[j, 'Transaction SubType'] == "BUY")):
			# output_table_1['Cleared?(Yes/No)'][i] = "Yes" ################# NO NEED, JUST CHECKED
			output_table_1.at[j, 'Cleared?(Yes/No)'] = "Yes"
			print(i, j, "Case I")
			break

I don’t know which line is 17 but the None Error points to a referenced row/column is None
If you can only debug yourself and the hint line 17 does not help, then split the formula into pieces and test it step by step

Hi @Abhiram , I think I figured out the issue, and that’s thanks to @duristef 's hint of:

        NO NEED: LOOP WILL STOP AT num_of_rows-1
	if i == num_of_rows: #Last row, do nothing
		break

@Abhiram @duristef , we do need the condition of skipping if it’s at the last row, it’s just that I forgot that the range would go from 0 to (num_of_rows - 1)

@Abhiram , change this line
from:
if i == num_of_rows: #Last row, do nothing
to
if i == (num_of_rows - 1): #Last row, do nothing

And you should do this in both of your Python scripts - any reason why you separated my original script into 2 scripts?

EDIT: @duristef

Both the one at index [i] and index [j] need to be set to “Yes”.

I’m not a Python developer, so I have a question for you:
What is the difference between:
output_table_1.at[i,'Cleared?(Yes/No)']
and:
output_table_1['Cleared?(Yes/No)'][i]

A) You’re right, output_table_1.at[i, 'Cleared?(Yes/No)'] = "Yes" must remain. My mistake.

B) There’s no need to check for the value of i: when it becomes = num_of_rows-1, the start value of the inner loop (i+1=num_of_rows) becomes greater than the stop value (num_of_rows-1) and the inner loop is skipped. The problem’s not there.

C) Both lines you quote access the same cell of the dataframe. You access the values using an array-like index, e.g.
output_table_1['Cleared?(Yes/No)'][0]

while I must use the dataframe index
output_table_1.at["Row0", 'Cleared?(Yes/No)']

so I need to reindex the dataframe in order to use the loops. I thought that the non-numerical indexing of the dataframe would cause errors, but that’s not true if one uses your syntax.
The difference between the two methods becomes relevant when you set the cell value. Your method raises a warning, and this page explains why

D) When I say

You probably mean “output_table_1[‘Cleared?(Yes/No)’][j]”

I mean “after the start of the inner loop”, because I assume you don’t want to check the same value you’ve checked in the outer loop to continue without executing the inner one

import pandas as pd

# Copy input to output
num_of_rows = len(input_table_1)
s = pd.Series([x for x in range(num_of_rows)])
input_table_1.set_index(s, inplace=True)
output_table_1 = input_table_1.copy()

for i in range(num_of_rows):
	''' NO NEED: LOOP WILL STOP AT num_of_rows-1 
        AND NEXT LOOP WON'T START BECAUSE start value = i+1 = num_of_rows > stop value = num_of_rows-1
	if i == num_of_rows: #Last row, do nothing
		break
	'''
	
	if output_table_1.at[i,'Cleared?(Yes/No)'] == "Yes": #Row is already flagged, skip and move on
		continue

	for j in range(i+1, num_of_rows): #On to the next rows
		if output_table_1.at[j,'Cleared?(Yes/No)'] == "Yes": #Row being compared to is already flagged, skip and move on
			continue
		
		if (
			input_table_1.at[i, 'Cashlocation'] == input_table_1.at[j, 'Cashlocation']
			and input_table_1.at[i, 'Currency'] == input_table_1.at[j, 'Currency'] 
			and input_table_1.at[i, 'Amount'] == input_table_1.at[j, 'Amount'] 
			and input_table_1.at[i, 'ETID'] != input_table_1.at[j, 'ETID'] 
			and input_table_1.at[i, 'ETID'][0:2] in ["CS", "CT"] 
			and input_table_1.at[j,'ETID'][0:2] in ["CS", "CT"] 
			and (
					(input_table_1.at[i, 'Transaction SubType'] == "BUY" and input_table_1.at[j, 'Transaction SubType'] == "SELL") 
					or
					(input_table_1.at[i, 'Transaction SubType'] == "SELL" and input_table_1.at[j, 'Transaction SubType'] == "BUY")
				)
			):
			output_table_1.at[i, 'Cleared?(Yes/No)'] = "Yes"
			output_table_1.at[j, 'Cleared?(Yes/No)'] = "Yes"
			print(i, j, "Case I")
			break

i just now tried this,

getting an error as Attribute error: module ‘pandas’ has no attribute 'series’

whether this issue because of knime version

hi @bruno29a ,

i modified what you suggested still getting nonetype object was not subscriptable

Correct, it does raise a warning, and I’ve been pulling my hair to try to understand why. I assumed that the .at[] would not, and that is why I wanted to know the difference. Thanks for sharing the link.

In the meantime, I had suppressed the warning as a “workaround”:

import pandas as pd
pd.options.mode.chained_assignment = None #This is to disable the SettingWithCopyWarning warning

Hi @Abhiram , yes, following what @duristef explained, it’s not the issue:

Indeed, it has no attribute “series”. The attribute is Series. I hope the issue doesn’t raise because you changed the case of the script