How can I avoid having to open and close a database connection twice when working with a sub-workflow that is only called when a certain condition is met?

I would like to export the number of faulty rows of a SELECT statement, no matter if the result is = 0 or <> 0, to an Excel file. If the result is <> 0, I would also like to export the faulty rows themselves to a separate Excel file. How can I do this in a way that is better than this one?

(I am using KNIME 4.6.3 and not the most recent version because of these restrictions.)

Main workflow:


As you can see, the output port of the Call Workflow (Table Based) node is crossed out. When the cursor hovers over it, it says Output (Inactive Port Object).

What bothers me is that a database connection is opened and closed twice, which I deem unnecessary, but maybe there is no alternative for this use case?

What do you think? Can this problem be solved in a better way?

You can send the info over to the sub-workflow via JSON instead of hitting the database a second time. Use a Table to JSON node before the call and a JSON to Table node in the sub-workflow. You will likely need an Auto Type Cast afterwards (and possibly some manual Field Type adjustments) to get your data types back.
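Outside KNIME, the Table to JSON / JSON to Table handoff can be pictured with plain Python. This is only a sketch with made-up sample rows, not the nodes' actual implementation:

```python
import json

# Hypothetical stand-in for the Table to JSON step: serialize the rows
# before calling the sub-workflow. The rows here are made-up sample data.
rows = [{"id": 1, "year_of_birth": None}, {"id": 2, "year_of_birth": 1990}]
payload = json.dumps(rows)

# Hypothetical stand-in for the JSON to Table step inside the sub-workflow.
restored = json.loads(payload)

# JSON only carries a few primitive types, which is why KNIME may need an
# Auto Type Cast (and manual Field Type tweaks) to recover the original types.
```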

@MgntMoe one option might be to just add a Row Filter and export the results without a switch. If you absolutely must have a sub-workflow, you might as well store the data in a table and read it back in without JSON.

Good question. What is the reason behind the sub-workflow?


@iCFO @mlauber71 @Daniel_Weikert I think I did not describe my workflow in enough detail. The Count faulty rows node SELECTs a table with a single element: a number that is either = 0 or <> 0. That is why I have to use another DB Query Reader node in the sub-workflow (or before calling the sub-workflow, if that is possible): the table that comes out of the bottom output port of the IF Switch is useless for further processing. It only contains the number of faulty rows, but I also need to SELECT the detailed faulty rows themselves, which I would like to export to an Excel file as well if the number of faulty rows is <> 0.

Is the DB Query Reader really the only way you can dependably count “faulty rows”? My gut is that this is a needlessly complex approach. My preferred approach would be to read all of the necessary DB data, identify or count the “faulty rows” using KNIME nodes, then write to Excel all within the same workflow.


@MgntMoe my guess would be that you can do more than one query from the database. Why does this have to be in a sub-workflow and not in another query behind the first one? Maybe controlled by flow variables to make sure it happens after the first one (if that is necessary).


@iCFO @mlauber71 Your approaches seem to be better. I have replaced the Count faulty rows SQL query with the sub-workflow SQL query that SELECTs the detailed faulty rows. Then, I replaced the Table Row to Variable node with the Row Count node.

I installed the Row Count node by downloading and installing it locally in the following manner:

  1. I went to File -> Import KNIME Workflow….
  2. Where it says Select file:, I clicked on Browse… and selected the Row Count.knwf file.
  3. The Select folder: is set to LOCAL:/.

After doing this and clicking on Finish, the Row Count node appeared in my KNIME Explorer and I was able to use it.

You may ask yourselves why I have installed the node in such a way. Well, other approaches like dragging and dropping did not work. When I tried to open File -> Install KNIME Extensions…, I received the following error message:

Problem Occurred

'Contacting Software Sites' has encountered a problem.

No repository found at

I am, however, able to open this window by clicking on Help -> Install New Software…. But I do not know how to properly proceed from there in order to install the node so that it appears in the Node Repository.

Please keep in mind the special circumstances that I have to deal with when I am using KNIME. They seem to be responsible for these cumbersome approaches that I have to take.

Coming back to the workflow, I would now like to export the number of faulty rows to an Excel file following the top branch of the IF Switch and export the detailed faulty rows to an Excel file following the bottom branch. The bottom branch can be easily constructed by replacing the Call Workflow (Table Based) node with a simple Excel Writer node. But how can I do what I would like to do when it comes to the top branch?

In order to solve this problem, I would like to perform another SQL query on the IF Switch top branch table and move this result to an Excel Writer node afterwards. When I was thinking about how this can be done, I came across this post where it is suggested that it is possible to write a table into a local SQLite database, perform SQL queries on it and use it again. This could solve my problem if I can get it to run properly.
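The write-to-SQLite-and-requery idea from that post can be sketched with Python's built-in sqlite3 module. Table and column names below are made up for illustration:

```python
import sqlite3

# Stage the branch's table in a local SQLite database (in-memory here for
# the sketch; a file path would persist it) and query it again locally.
# Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faulty_rows (id INTEGER, year_of_birth INTEGER)")
conn.executemany(
    "INSERT INTO faulty_rows VALUES (?, ?)",
    [(1, None), (2, None), (3, None)],
)

# The follow-up SQL query runs against the staged copy, not the source database.
count = conn.execute("SELECT COUNT(*) FROM faulty_rows").fetchone()[0]
conn.close()
```

The point of the staging step is that the follow-up query never touches the original database connection, which can be closed as soon as the first SELECT has been read in.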

Do you know how this can be handled? Can this page help me to solve my problem?

Right now, my workflow looks like this:

Not sure what the two queries are comprised of, but what about performing both of them at the beginning and joining (or concatenating) the results so that you have all of the necessary data for the workflow from the start? You can always filter out rows or remove columns depending on the IF Switch branch afterwards to produce your desired Excel documents.

@MgntMoe I still think you would have to reconsider your approach. I built an example where, in a loop, every iteration checks whether the chunk contains a row with fault=1. In that case, the faulty lines are exported to a separate sheet in an Excel file. Otherwise, just the number of faulty lines is shipped.

Maybe you can take a look. It also uses SQLite as a sample database.

kn_example_switch_case_store_file_sql.knwf (158.4 KB)


@iCFO @mlauber71 I think I might have found a proper solution. This is how my workflow looks right now:

The second DB Query Reader node called Count faulty rows is only called after the IF Switch node is executed. No matter what the result of the IF Switch node is, however, it tells the DB Query Reader node Count faulty rows to run.

As you can see, I created an example that is false for the bottom output port of the IF Switch node but the number of faulty rows is still exported properly.

If both output ports of the IF Switch node are activated, the result looks as follows:

Can an error possibly occur with this workflow?

When I executed the second workflow, I observed that the Excel Writer node Faulty rows was executed successfully before the DB Query Reader node Count faulty rows was finished.

Assume that the DB Query Reader node Count faulty rows, the Excel Writer node Number of faulty rows and the DB Connection Closer node Close the database connection would finish before the Excel Writer node Faulty rows would even start to run: Would this be a problem or not?

I assume that the result of the DB Query Reader node Select faulty rows is stored locally when the IF Switch is finished so that, at this point, it does not matter if the database connection is closed or not when it comes to exporting the table from the bottom output port to the Excel Writer node Faulty rows. Is this assumption correct? If it is not correct, I would have to add some precautionary measures to this workflow in order to prevent errors from happening.

(Please note that the DB Query Reader node in my last example is named Count faulty rows. This is incorrect. It should have been named Select faulty rows.)

The flow pattern looks sound timing-wise to me. Once the data is read into the workflow, the DB connection can be closed and additional nodes like the Excel Writer can continue to execute downstream.

It is really hard to say if it is optimal without understanding the second query “Count faulty rows”. Does this yield results that can’t be obtained by performing calculations in KNIME based on the Select faulty rows query?


@iCFO Suppose there is a table that looks as follows:

DESCRIBE people_table;

Name          Null?    Type
------------- -------- ------------------
ID            NOT NULL NUMBER(10)

Every row where the year_of_birth entry is null is considered to be invalid. In order to find all of the invalid rows, I write the following Select faulty rows statement:

  SELECT id AS "ID",
         name AS "Name",
         year_of_birth AS "Year of birth"
    FROM people_table
   WHERE year_of_birth IS NULL

The result of this query is the Faulty rows table.

Because I also want the Number of faulty rows table to look a certain way, my Count faulty rows query looks as follows:

SELECT COUNT(*) AS "Number of faulty rows"
  FROM people_table
 WHERE year_of_birth IS NULL;

This is the reason for my two separate SQL queries. Is my approach reasonable considering these circumstances? What do you think? Is there a better and/or more efficient way to achieve the same goal?

It definitely appears that you are underutilizing KNIME by performing a second DB query which is effectively just performing a calculation on the primary table that has already been read in.

Once you have read in the list of names with a null birth year, you can count the rows in that table by using the GroupBy node. Just doing a Count aggregation on "ID" with no group column should get you the same result in a much cleaner, more efficient manner. You should be able to achieve your desired count table structure via ETL in KNIME as long as the necessary columns are included in your original query.
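The core of this suggestion, sketched outside KNIME with made-up data: once the faulty rows have been read in, the second database query reduces to a local calculation.

```python
# Made-up sample of what the single "Select faulty rows" query might return.
faulty_rows = [
    {"id": 7, "name": "A", "year_of_birth": None},
    {"id": 9, "name": "B", "year_of_birth": None},
]

# The second database query becomes a local calculation, analogous to
# a GroupBy or Row Count on the table that was already read in.
number_of_faulty_rows = len(faulty_rows)
```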


@MgntMoe have you considered testing my suggested solution and maybe comment on why this would not (or would) be sufficient for your task? My impression is still that you are overcomplicating your approach - but maybe I am wrong.

Okay, my table is a little bit more complex than the one I presented as an example.

@iCFO You are right that the DB Query Reader node Count faulty rows can be replaced with a GroupBy node that can output the Number of faulty rows table. But I would like to format this table in a specific way.

My real table consists of two tables that are joined in the DB Query Reader node Select faulty rows. Both include a year column (think year_1 in table 1 and year_2 in table 2). Whenever at least one of the two is null, this is considered to be a faulty row.

I can configure the new GroupBy node by selecting the tab Settings -> Manual Aggregation, adding both relevant columns from the Available columns box to the To change multiple columns use the right click for context menu. box, and selecting Missing value count in the Aggregation (click to change) column for both columns.

The table that is output by the GroupBy node is a table that looks as follows:

| Row ID | [I] Missing value count(YEAR_1) | [I] Missing value count(YEAR_2) |
| Row0   |                             300 |                             100 |

I, however, do not want my output to look like this table. I want it to look like this:

| Number of faulty rows |
|                   400 |

What can I do in order to achieve this?

@mlauber71 I downloaded your workflow and took a look at it. It seems to be an elegant approach to a problem that is very similar to mine, but I, unfortunately, cannot make it work for my use case. I have never worked with a SQLite database before and do not know how to set one up locally. I suppose that it is quite easy, but is this approach too complicated for my use case? I don't know. If I can resolve the last problem that I described in this post, my goal is reached and I will have prevented querying the database twice. But if your approach is better, please let me know what I can do in order to use it. Is it sufficient to replace the SQLite Connector node with an Oracle Connector node and only change some details in your solution?

It seems that your solution might be able to solve my problem of formatting my outputs properly. But does it also work when I have to join two tables? The DB Table Selector node seems to be able to select one table. But can it also select two or more and join them?


The SQLite database is just there to represent a database. It can be any SQL database; I just tried to break down your problem to a local workflow that one can discuss and edit in order to solve the core problem (only one query?).

The KNIME SQL nodes can do a lot of things, including joins, either through the prepared DB nodes or with 'open' SQL code. By the point of download, a lot of manipulations could already have happened; the selector node, for example, can hold a larger query. So maybe you can adapt and test the approach on your database, or create an example with a local database that represents your problem. I still think you might have to reconsider your concept. The elements in KNIME (switch, DB nodes) are all there.



Here are two ways to sum those columns and get your desired format internally with KNIME base nodes.

forum sum columns.knwf (10.5 KB)
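The reshaping can also be sketched outside KNIME in plain Python (column names from the post above; data made up). One caveat worth checking against your data is included as a comment:

```python
# Per-column missing-value counts as produced by the GroupBy node (made-up data).
grouped = {"Missing value count(YEAR_1)": 300, "Missing value count(YEAR_2)": 100}

# Collapse the two counts into the single desired column.
summed = {"Number of faulty rows": sum(grouped.values())}

# Caveat: a row missing both years is counted twice by this sum. Counting
# rows with at least one missing year avoids the double count:
rows = [{"YEAR_1": None, "YEAR_2": 1990}, {"YEAR_1": 2001, "YEAR_2": None}]
at_least_one_missing = sum(
    1 for r in rows if r["YEAR_1"] is None or r["YEAR_2"] is None
)
```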


Case 1: Faulty rows do exist

The Math Formula node Use $${IRow Count}$$ uses the Row Count variable from the Flow Variable List. So, in the Expression box, my expression is just $${IRow Count}$$. Append Column: is selected with Number of faulty rows entered as the name. Also, Convert to Int is checked.

The Column Filter node Filter “Number of faulty rows” column only includes the Number of faulty rows column and excludes all the others.

Let us assume that the number of faulty rows is 400. The Duplicate Row Filter node Remove duplicates is set to Remove duplicate rows because, at this point, the table consists of a single column containing the number 400 in each of its 400 rows.
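The three node steps above can be sketched outside KNIME like this (a hedged Python sketch with made-up data, not the nodes' actual implementation):

```python
# 400 made-up faulty rows, as read in by the single SELECT.
faulty_rows = [{"id": i} for i in range(400)]
row_count = len(faulty_rows)

# Math Formula: append the row count to every row.
with_count = [dict(r, **{"Number of faulty rows": row_count}) for r in faulty_rows]

# Column Filter: keep only the count column.
only_count = [{"Number of faulty rows": r["Number of faulty rows"]} for r in with_count]

# Duplicate Row Filter: 400 identical rows collapse into a single row.
deduplicated = [dict(t) for t in {tuple(r.items()) for r in only_count}]
```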

Case 2: Faulty rows do not exist

If the table does not contain any faulty rows, just the Number of faulty rows table is exported. But it does not state that the number of faulty rows is 0; it represents the number of faulty rows as a null value, that is, as an empty cell in the Excel file.

@mlauber71 I was able to easily integrate @iCFO’s last answer into my current workflow which now almost completely satisfies what I would like to achieve. Now, the solution contains one DB Query Reader node instead of two. I also found out that the connection from the Excel Writer node Number of faulty rows to the DB Connection Closer node Close the database connection can also be drawn from the DB Query Reader node Select faulty rows.

Your approach contains two DB Query Reader nodes: One for selecting the faulty rows and one for selecting the number of faulty rows. But I think that I can use your proposed nodes as well in order to solve the problem!

This is what my alternative workflow looks like:

The DB Table Selector node Select faulty rows contains my SQL statement that selects the faulty rows.

Top branch:

  1. The DB GroupBy node Count faulty rows groups by an ID column. First, I went to Settings -> Manual Aggregation, added the column from Available columns to To change multiple columns use right mouse click for context menu., and changed the Aggregation (click to change) to COUNT. In the Advanced settings section, I checked the Add COUNT(*) box and set Number of faulty rows as the column name.
  2. The DB Reader node Convert table is used in order to operate on a locally stored table.
  3. The Column Filter node Filter “Number of faulty rows” is used in order to include the Number of faulty rows column.
  4. The Excel Writer node Number of faulty rows outputs the number of faulty rows to an Excel file and tells the DB Reader node Convert table to start its operation.

Bottom branch:

  1. The DB Reader node Convert table is used like in the top branch.
  2. The Row Count node counts the number of faulty rows.
  3. The Rule Engine Variable node Condition for case discrimination contains the two lines $${IRow Count}$$ >= 1 => "top" and $${IRow Count}$$ <= 1 => "bottom". The New flow variable name is prediction.
  4. In the IF Switch node At least one faulty row?, Flow Variables -> PortChoice is set to prediction. After the node is finished, the database connection is closed. (I suppose that it can already be closed after the DB Reader node Convert table is finished.)
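The branch choice that the Rule Engine Variable node computes in step 3 can be sketched as a small function (a hypothetical illustration, not KNIME code):

```python
def port_choice(row_count: int) -> str:
    """Mirror the Rule Engine Variable rules: rules are evaluated top-down
    and the first matching rule wins, so a count of exactly 1 is caught by
    the first rule ("top") even though it also satisfies the second."""
    if row_count >= 1:
        return "top"
    return "bottom"
```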

This solution does what I want, even when the number of faulty rows is 0:

What do you think of the different approaches?

By looking at the second solution, I could replace the DB Reader node Convert table in the top branch with a DB Column Filter node Filter “Number of faulty rows” and place the DB Reader node Convert table afterwards. But are there any other improvements that can be done? Or is the second approach still too complicated?

(Please note that I was not able to set up the Rule Engine Variable node in a way that would have allowed me to leave out $${IRow Count}$$ <= 1 => "bottom". If I had left it out, the IF Switch node would not have worked properly! Is there a better way to solve this particular sub-problem?)

@MgntMoe with KNIME there is often more than one way to do things. Judging from the screenshots, I would remark on these things:

  • I personally would prefer to always explicitly connect all the nodes in a workflow that might be part of a larger operation, where you might just execute the last node and all the others would follow. Your construct with the Excel Writer of faulty lines would only execute if you execute the whole workflow - if this is what you are doing, it should be fine. My example was assuming (simulating) a constant or repeating process.
  • Then, I do not fully understand your concern about the SQL connections. Your construct might also initiate more than one query to the database, which should be OK.