Extract data from db using value from other db

chiashin · May 25, 2023, 7:08am

Hi people,

Database B, Table A:
User_ID
001
002
003
004
005

Table A is the result I got from Database B. I want to use value in column User_ID to retrieve data from Database A using DB Table Selector (it’s connected to DB Connector).
The data would be too large if I just extract everything, so I will need to specify which value to retrieve only in Database A. For now, I do the work manually like custom the query in the DB Table Selector node myself:

SELECT * FROM #table# WHERE User_ID IN (‘001’, ‘002’, ‘003’, ‘004’, ‘005’)

I wonder if there is any method to automate the process. Like every time the result is generated as Table A from Database B. It will automatically just retrieve the value needed from the Database A.

takbb · May 25, 2023, 8:23am

Hi @chiashin , and welcome to the KNIME community.

Yes this is possible.

Here is a demonstration workflow that reads one database and then uses data from that to form an IN query to be used against a different database.

Querying database based on values from another database.knwf (52.0 KB)

In the above workflow, the creation of the delimited items for the IN query uses a component that I wrote specifically for this type of scenario, but this can also be achieved using a combination of GroupBy and String manipulation nodes, or possibly other methods (see the posts below for more info). I find it convenient to just use a component for it though

You may find the following forum topics useful as they discuss the creation of an IN clause, and other comma-delimited clauses for use against a database, providing different ideas and approaches:

takbb · May 25, 2023, 8:58am

I just missed the 30min deadline to edit the above, but there is mention in the second of those posts of the “DB Looping” node, which I’d forgotten about, and which can be used for limiting the query to a specific set of query values too.

see also:

chiashin · May 26, 2023, 6:37am

Hi @takbb, thank you for your answer.

May I know when you extract the data from Database A, do you assume to extract the whole table first, then only filter using the SQL in DB Query Reader?

Not sure if I’m mistaking your approach, can you explain further for the steps in above image? Like what do the flow variables do? Can I replace the DB Writer part with other node like DB Table Selector maybe?

To clarify my question:

(Connector) → (DB Table Selector) → (DB Reader)

Above is the nodes I used to extract data from database. I custom the SQL in (DB Table Selector) to obtain only the value I need. But now I try to automate the process. I don’t want to extract the whole table since the beginning because the dataset would be too large.

Do you know any nodes or approaches to replace (DB Table Selector)?

takbb · May 26, 2023, 9:06am

Hi @chiashin ,

In the example, the DB Writers are only there to populate a sample in-memory database and provide some sample data, so apologies if they are causing some confusion. The purpose of the red flow variable lines between the DB Writers to the subsequent nodes are there purely to enforce the correct sequence of processing so that the demo works (They ensure that the sample data has been written to the tables before attempting to query the data using the DB Query Readers).

In place of the H2 Connectors, you have a DB connector to your two databases and the DB Writers would not exist, as your tables are already populated in the database.

I used DB Query Readers containing specific queries so that at no time does it pull back ALL the data from any table, but instead only pull back the data selected by the queries.

OK, so let’s look at how this would work with (Connector) → (DB Table Selector) → (DB Reader)

The DB Table Selector works collaboratively with other “in database” db nodes (those with the brown square connectors) to filter the data being returned. The data is only returned to KNIME when you use the DB Reader which has the black triangular output.

If your DB Table Selector is configured to reference a specific table (not a custom query), and is then attached directly to a DB Reader, it will return ALL rows from your table to KNIME, which is clearly what you are trying to avoid.

This configuration simply returns all rows from both tables

So what do you do to filter the rows being returned?

What you can do though is place other “in database” connectors between the DB Table Selector and the DB Reader to modify the query that will be executed by the DB Reader. (It is the execution of the DB Reader that actually runs the query against the database)

I don’t know how you are choosing your data from your first table, but in this demo, the LogEntry table is a table consisting of 6 rows, but we could imagine it being 6 million rows. I want to return Log Entries that are on or after 1 March 2023. In my original demo, the DB Query Reader was used to simulate this, and it was given a sql select statement

select * from "LogEntry" where "Date" >= '2023-03-01'

Now, this could be replicated directly in the DB Table selector, by re-configuring it and providing this as a custom query, instead of simply telling it a table name, but we can also do this by including a DB Row Filter

The DB Row Filter doesn’t actual select any rows, and when it is executed, no data is returned to KNIME. It’s purpose is to modify the query that will be executed by the DB Reader.

You can inspect what is happening. Open the output port on the DB Table Selector

And then click on the DB Query tab

You will see the select statement that would be executed if a DB Query node is attached and executed.

Now do the same on the DB Row Filter:

You can see that the query has been modified so that it contains a condition.
Whether that generated SQL statement is particularly efficient will probably depend on your particular database and one reason why I prefer to write custom queries instead… However, that discussion is not the purpose of this demo

Note that at this point the query has still not been executed. The brown-square nodes are chained together and modify the query until such time as the DB Reader is executed. Only then is the data returned to KNIME.

When the DB Reader is executed, the select statement returns these rows to KNIME:

In this demo, I then want to find the Users for those particular Log Entries, and my Users table is (slightly weirdly maybe ) in a different database. So we want to select only those rows from Database B that apply to the specific users.

If we were writing the sql manually we would use

select * from "Users" where "UserId" in ('005','003')

The mechanism that KNIME provides for this is the DB Looping node (the node I forgot about when writing my initial response, but is actually perfectly suited for this job even though I’ve actually hardly ever used it)

Like the DB Reader, the DB Looping node also connects to a database using the brown connectors, and when executed, returns data to KNIME, but unlike the DB Reader node, it executes a specific SQL Select statement defined within its own configuration.

It takes an input data feed which can come from anywhere, but in this case will use the output of the “Log Entries” DB Reader.

Within its configuration, you specify the select statement you wish to use, and modify it with the specific conditions, based on the incoming data port.

The upper panel “Database Column List” is the list of columns available on the database connection (i.e. the “Users” table in this case).

The centre panel “Column List” is the list of columns on the incoming data port.

So we modify the SQL Statement to include the condition as shown here which causes the DB Looping node to supply all the values of UserId which are on the incoming data feed, as a comma-delimited list which can be used inside the IN clause:

A couple of other things to note:
By ticking “Retain all columns” and “Append Input Columns”, the output will contain data from both the incoming data feed (the “LogEntry”), and the augmented data from the “Users” table.

In this case, you should Untick “Read All”, because otherwise you will find rows are matched incorrectly. I haven’t experimented to see under what circumstances “Read All” should be ticked or unticked. According to the help it is ticked to “Execute a single query with all values of the selected columns from the input table”. Anyway, for this demo, unticking it is the required option!

Additionally I’ve been lazy with using “Select *” whereas of course you may choose to select the specific columns to be returned.

The result was this data table, where you can see the effect of the DB Looping node was to generate “collection” elements.

These can be resolved by adding an Ungroup node

Attached is an updated version of the previous workflow with this example. When you look at the workflow, ignore the red flow variables attached to the DB Table Selector nodes. (I left them out of the above pictures as they aren’t relevant.) As mentioned before, the red flows are purely to make the demo work, as it has to pre-populate some database tables before it can then query them, and they serve here purely to ensure the correct order of execution of the nodes.

Querying database based on values from another database v2.knwf (64.0 KB)

I hope that helps.

chiashin · May 29, 2023, 2:42am

That’s exactly what I need.
Thank you @takbb !

system · June 5, 2023, 2:43am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.