Help with DB Query node

Hello, I am attempting to migrate one of my workflows to a database to improve efficiency.

To do so, I am using a text-to-SQL engine to try to produce SQL code matching what I otherwise do with Math Formula nodes.
I have several questions:

1/ From a general standpoint, am I right to assume that, if I want to produce the same result as a Math Formula node with a DB Query node, I need my SQL to
(a) create a new column and (b) fill this newly created column with the results of a given computation?

2/ I have received numerous times the error: “database view place holder (#Table#) must not be replaced.”
Shall I conclude that every DB Query node must absolutely start with this statement?
I would appreciate it if someone could explain, or point me to non-specialist documentation about, this “place holder” thing.

3/ The specific text-to-SQL query I am trying to run is the following:
SELECT * FROM #Table# AS table; ALTER TABLE table ADD XYZ DECIMAL(10,2); UPDATE table SET XYZ = Column1 + (Column2 * Column3);

Can someone point out to me what the error(s) are?

Many thanks
NB

Hi @nba, I confess that from your post I’m not entirely sure what you mean by migrating a workflow to a database. It feels like you are trying to treat KNIME like it is a database, but presumably you already have a database available that you are connecting KNIME to.

Yes you can create database tables “on the fly” and alter them (add columns as you go) etc, but that isn’t generally considered good practice.

I didn’t understand question 1. You wouldn’t generally create columns in a database as you are processing. The database is there as a source or destination for your data. Yes you can update, but it would be very inefficient if you had a workflow that created a database table, and then tried to create and populate whole new columns with the results of calculations as it went along. I’d have to better understand your use case to be able to give any kind of satisfactory answer to what you are thinking of here.

Re Q2, the DB Query node is used in conjunction with a DB Table Selector node. The DB Table Selector node allows you to choose a pre-existing table from the database (or can take a “custom query”), and the resultant table from that node is referenced using the #table# placeholder. DB Query is one of the “in database” nodes, which have the brown connectors. “In-database” is a slightly misleading term, as these nodes are not actually running “inside” the database, but it is a useful analogy because they work together with other “in database” nodes, and don’t return data to KNIME until they are asked to by a node such as “DB Reader”, which then has a black-triangle connector to return the results.

So in simple terms, to use a “DB Query” node, you put a “DB Table Selector” in front of it and a “DB Reader” after it.
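For example, a DB Query node that computes a derived column would contain a single SELECT built around the placeholder, rather than any ALTER or UPDATE statements. This is just a sketch with made-up column names (KNIME wraps the query as a sub-query, so it should be one SELECT with no trailing semicolon):

```sql
-- Inside a DB Query node: #table# stands for the result of the
-- upstream DB Table Selector. Column names here are hypothetical.
SELECT *,
       Column1 + (Column2 * Column3) AS XYZ
FROM #table#
```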

The non-“in database” equivalent is the DB Query Reader which just runs a query against the database and returns a result, which sounds more like what you are wanting to do at this point.

As for Q3, the ALTER TABLE statement you use to add a new column to a database table is what is known as “DDL” (Data Definition Language), in contrast to “DML”, which is “Data Manipulation Language”.

DML in general database terms covers statements such as Select, Update, Insert and Delete. But actually, it is better to think of “Select” as DQL (“Data Query Language”), as it doesn’t modify anything, and then treat Update, Insert and Delete as DML, because they do modify your data.

DDL pretty much covers everything else: “Create Table”, “Alter Table”, “Drop Table” etc.
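To make the three categories concrete, here is one statement of each kind (table and column names are invented for illustration):

```sql
-- DQL: reads data, modifies nothing
SELECT Column1, Column2 FROM EXAMPLE;

-- DML: modifies the data held in an existing table
UPDATE EXAMPLE SET XYZ = Column1 + (Column2 * Column3);
INSERT INTO EXAMPLE (Column1, Column2) VALUES (1.0, 2.0);
DELETE FROM EXAMPLE WHERE Column1 IS NULL;

-- DDL: modifies the structure of the database itself
CREATE TABLE EXAMPLE2 (Column1 DECIMAL(10,2));
ALTER TABLE EXAMPLE ADD COLUMN XYZ DECIMAL(10,2);
DROP TABLE EXAMPLE2;
```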

If you think of it in those terms, then DQL (“select”) is what you can put in a DB Query or DB Query Reader node.

If you are running anything else (DDL/DML), you will want to put such statements into a DB SQL Executor node rather than into a query node.
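So, taking the DDL/DML half of your Q3 query as an example, it would go into a DB SQL Executor along these lines (a sketch only; the table name is hypothetical, and depending on your database and node settings you may need to run the statements in separate nodes):

```sql
-- In a DB SQL Executor node there is no #table# placeholder;
-- you reference the real database table by name.
ALTER TABLE my_table ADD COLUMN XYZ DECIMAL(10,2);
UPDATE my_table SET XYZ = Column1 + (Column2 * Column3);
```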

I have attached a workflow to demonstrate some of the different uses of various DB nodes mentioned above. I hope you find it useful to gain a basic understanding of how those nodes operate.

I’ve adjusted it to include flow variable connectors from the DB Insert statement. This ensures that subsequent nodes can only be executed after the demo database table “EXAMPLE” has been created and populated with some sample data.

I think though that from your questions you have a few misconceptions about how you would work with a database. Feel free to ask further questions but I feel you need to take a step back and better explain your use case because it doesn’t feel to me like the solution you are trying to achieve is a good answer to whatever the question is :wink:

DB demo.knwf (29.9 KB)


Hi Takbb,

Thank you for your thorough and insightful answer.
I appreciate the opportunity, through our exchange, to take a step back and question the foundations of my attempt to use MySQL with Knime.

I have finished building a workflow that performs a lot of data generation (tens of columns created) and wrangling before I get to the data analysis.
The main problem I face now is, as often, performance.
I have identified that one of the main bottlenecks is (a) a series of column creations using math formula nodes or column combiner nodes and (b) a Groupby performing several aggregations on a 7 to 10 million-row table.
This is for a single entity, and the workflow should run for multiple entities. For components before and after that, the size of tables is somewhat manageable (100k-400K), and the performance is acceptable.

After reviewing my options (which seem to be fairly limited:-) ), I thought that redesigning the workflow using a DB (MySQL) could help, especially to perform the Groupby (here comes my infamous “migrating the workflow to a DB” comment).
I face several problems with that option:
(1) I have never used DB or SQL. Actually, the reason I fell in love with Knime several years ago is because I am a manager/ entrepreneur, not an IT or data scientist, and there is so much that can be done in Knime with my background.
(2) The DB nodes available seem to cover the non-DB KNIME nodes only partially, meaning that I cannot easily redesign my workflow. Altogether, for a non-DB specialist, it is pretty difficult to understand what the DB nodes are doing and to map them to non-DB ones.
(3) I realized that loading 7–10 million rows in a database in order to perform a groupby is not at all a solution. This led me to the conclusion that the loading in DB should be done much earlier, when loading time is not a limiting factor.
This approach seems OK, but it led me to the current problem I am facing and for which I am posting: I am now required to create tens of columns in the DB before aggregating with the GroupBy, and you, just like me, must have come to the realization that I have no clue how to do it :-).

From your answer,

(1) I understand that the idea of updating the table by creating additional columns is not very orthodox.
(2) (a) I appreciate your explanations regarding the “in-database” operations and how to return the data to Knime with “db reader” as well as the distinction between DML, DQL, DDL.
It helps make differences between all these DB nodes which seemed similar at first.
(b) Concerning the use of the DB Query Reader, I am not sure it is my solution, as its output is a KNIME table and I have to perform multiple column creations using DB nodes.
Otherwise, it seems like I would have to load the output KNIME table back into the DB.

Where I am left now, and the new questions I need to answer:
(1) Thanks to your workflow, I think DB SQL executor is the most suitable option for me, though not really orthodox.
I could try to create all my columns by mimicking your approach, perform the Groupby and then use the DB query reader at the end to return the data in Knime only when its size is manageable.
Will it be efficient? I guess until I try, I will never know. However, your thoughts are more than welcome.
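In SQL terms, I imagine the chain would look roughly like this (table, entity and column names invented for illustration; I don’t know if this is right):

```sql
-- Step 1 (DB SQL Executor): create and fill the derived columns
ALTER TABLE big_table ADD COLUMN XYZ DECIMAL(10,2);
UPDATE big_table SET XYZ = Column1 + (Column2 * Column3);

-- Step 2 (DB Query Reader): aggregate in the database and return
-- only the small result table to KNIME
SELECT entity_id, SUM(XYZ) AS total_xyz, AVG(Column2) AS avg_col2
FROM big_table
GROUP BY entity_id;
```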
Q1: Can I create the column with an “alter table” and update it with calculations in a single DB SQL executor node?
Q2: Do you know of any KNIME documentation that would provide the kind of beginner introduction you have started with me and would help explain the function/use case of the available DB nodes?
After reading the descriptions available for most of the DB nodes, I must confess that I don’t understand most of it, as it assumes an understanding of basic SQL statements.
It would be great to have documentation mapping KNIME DB node functions to non-DB nodes (what is the DB equivalent of the Column Appender, Missing Value, Column Combiner, Rule Engine nodes, etc.?), but I could not find this information.

Altogether, many thanks for your insights. On top of bringing some practical insight, this exchange has helped me realize that using a DB with KNIME is not a quick fix, but I think it is worth pursuing.
NB


@nba please keep in mind that the performance of an SQL database will also depend on the power of the machine it is running on.

In general it will make sense to employ a database system and use KNIME as your platform.

You could also try to improve the performance of your KNIME workflows:

About KNIME and SQL, I can offer this collection, which also links to an SQL introduction, and this sample workflow.


@mlauber71 , this is great, thanks!

