Hi @san_98, you are right that a Column List Loop processing a spreadsheet of 200,000 rows × 200+ columns may be slow. Now that you have given an indication of the spreadsheet dimensions, I have looked at performance, and for an Excel spreadsheet of 200,000 rows by 220 columns this was the breakdown of processing times:
| Process | Seconds |
| --- | --- |
| Excel Reader | 81 |
| Pre-loop processing | 41 |
| Looping all columns | 128 |

(This was for a table with no missing values, so it had to loop all columns.)
This shows why it is good to have information about data volumes and other factors that could influence performance, as they affect how much effort is worth expending on optimisation. There is little point optimising something that is already “good enough”.
So, with this new information, there is one very easy optimisation: terminate the loop with “Variable Loop End” instead of “Loop End”.
This can be done here because we don’t actually need to collect the data from the loop itself, and doing so wastes time, as the Loop End has to collect and combine the data from every iteration.
This one change reduced the loop time to:
| Process | Seconds |
| --- | --- |
| Excel Reader | 81 |
| Pre-loop processing | 41 |
| Looping all columns (Variable Loop End) | 14 |
Yes really… on my system 128 seconds became 14 seconds with that one change.
The big learning from this: use Variable Loop End unless you actually need to collect the data from the loop!
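If it helps to see why that one change matters so much, here is a plain-Java analogy (nothing KNIME-specific, and the sizes are scaled down so it runs anywhere): the only difference between the two loops below is whether each iteration’s output gets collected and combined into one big result.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopCollectionCost {

    private static final int COLUMNS = 220;
    private static final int ROWS = 10_000; // scaled down from 200,000 so the demo runs anywhere

    // Stand-in for the work one loop iteration does on a single column.
    static List<String> processColumn(int columnIndex) {
        List<String> result = new ArrayList<>(ROWS);
        for (int row = 0; row < ROWS; row++) {
            result.add("col" + columnIndex + "-row" + row);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Loop End" style: every iteration's output is collected and combined
        // into one big table, even though we never actually use it.
        List<String> combined = new ArrayList<>();
        for (int col = 0; col < COLUMNS; col++) {
            combined.addAll(processColumn(col));
        }
        System.out.println("Collected " + combined.size() + " rows we never needed");

        // "Variable Loop End" style: each iteration still runs its checks,
        // but nothing is collected, so all the copying and combining disappears.
        for (int col = 0; col < COLUMNS; col++) {
            processColumn(col); // result discarded
        }
        System.out.println("Same work, nothing collected");
    }
}
```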
I have updated my original workflow on the hub with this simple modification.
It got me thinking though about how to reduce the loop requirements further.
My thinking: it would be great not to loop at all if we could know in advance whether there are any mandatory columns, and whether any of those mandatory columns contain missing values.
I have uploaded an enhanced version of the original flow to the hub that attempts to “pre-know” this by first filtering down to the “mandatory” columns and then counting the number of non-missing values in each of them. This can be done quite quickly with a GroupBy, and much faster than with the Column List Loop.
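To make the logic concrete, here is a plain-Java sketch of the check that this pre-filtering performs; the column names and data are made up, and in the workflow this is of course done by nodes (GroupBy plus filtering) rather than code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class MandatoryColumnCheck {

    public static void main(String[] args) {
        // Made-up example data: each mandatory column name mapped to its values
        // (null standing in for a missing cell).
        Map<String, List<String>> mandatoryColumns = new LinkedHashMap<>();
        mandatoryColumns.put("CustomerId", Arrays.asList("A1", "A2", "A3"));
        mandatoryColumns.put("OrderDate", Arrays.asList("2023-01-01", null, "2023-01-03"));

        long totalRows = 3;

        // Equivalent of the GroupBy step: count the non-missing values per column,
        // then flag any column whose count falls short of the total row count.
        for (Map.Entry<String, List<String>> column : mandatoryColumns.entrySet()) {
            long nonMissing = column.getValue().stream().filter(Objects::nonNull).count();
            if (nonMissing < totalRows) {
                System.out.println("Mandatory column '" + column.getKey() + "' has "
                        + (totalRows - nonMissing) + " missing value(s)");
            }
        }
    }
}
```

The key point is that a single counting pass per column replaces looping over every column individually.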
The Column List Loop then only acts on columns that are mandatory and contain missing values. (In this instance it will of course always terminate with the Java Snippet “abort” on the first such column encountered, so it doesn’t strictly require a loop at all, but it was more convenient to leave that part of the processing alone.)
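For completeness, a rough sketch of the kind of check that “abort” snippet performs. The column bindings (`c_ColumnName`, `c_MissingCount`) are placeholder names for whatever inputs your snippet actually receives, and it assumes the Abort exception the Java Snippet node offers for stopping execution, so treat it as illustrative rather than drop-in:

```java
// Inside the Java Snippet node (illustrative only - c_ColumnName and
// c_MissingCount are placeholder names for the snippet's input column bindings).
if (c_MissingCount > 0) {
    // Throwing Abort stops the node, and therefore the workflow, with a message.
    throw new Abort("Mandatory column '" + c_ColumnName
            + "' contains " + c_MissingCount + " missing value(s)");
}
```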
I also looked at additional optimisations in the initial processing, because previously the “transpose” used to determine the mandatory columns was being performed on the entire (200,000 row) table. It only needs to be performed on the top (non-data) rows: the full table is 200,000 × 220 ≈ 44 million cells, whereas the top rows are only a few hundred, so that makes a big difference to the amount of data being shifted around, and therefore to the time taken.
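Purely as an illustration of that last point (again plain Java rather than KNIME nodes, with made-up data), transposing only the top rows touches a tiny fraction of the cells:

```java
public class TransposeHeaderOnly {

    // Transpose only the first headerRows rows of the table; the 200,000 data
    // rows never get shifted around at all.
    static String[][] transposeTopRows(String[][] table, int headerRows) {
        int columns = table[0].length;
        String[][] transposed = new String[columns][headerRows];
        for (int row = 0; row < headerRows; row++) {
            for (int col = 0; col < columns; col++) {
                transposed[col][row] = table[row][col];
            }
        }
        return transposed;
    }

    public static void main(String[] args) {
        // Tiny made-up table: row 0 is the "mandatory" marker row, the rest is data.
        String[][] table = {
                {"mandatory", "optional", "mandatory"},
                {"A1", "x", "2023-01-01"},
                {"A2", null, "2023-01-02"}
        };
        String[][] byColumn = transposeTopRows(table, 1);
        for (String[] column : byColumn) {
            System.out.println(column[0]); // the mandatory/optional flag for each column
        }
    }
}
```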
For the same Excel file as above containing no missing mandatory values, the timings beyond the Excel Reader (which hasn’t changed) were:
| Process | Seconds |
| --- | --- |
| Filtering only columns containing missing values | 2 |
| Looping all columns | 0 (didn’t need to execute) |
And how does the timing compare when there is missing data?
Running the same process on an Excel file of the same dimensions, with missing data in at least one column, resulted in these timings:
| Process | Seconds |
| --- | --- |
| Excel Reader | 78 (no actual processing change, just a fluctuation – still around 80 seconds) |
| Filtering only columns containing missing values | 2 |
| Looping columns | 0 (now always terminates on the first iteration) |
So basically, almost all of your processing time will now be spent reading the Excel file, rather than on the mandatory-value validation itself.
Here is a link to the enhanced version. It contains more complexity up-front. The original version with the Variable Loop End modification is reasonably performant, so whether you need the additional complexity depends on how many files you have to process and your overall timings. You might also choose to take just some of the enhancements from the newer version (the Variable Loop End, and the table transpose optimisation I mentioned) and leave the rest.