DB Update won't progress - stuck on 'executing....' status

Hi there,

Can anyone help me figure out why my DB Update is always stuck on ‘Executing - copying input object at port 2’? I can leave it for hours and it doesn’t progress.

I have tried closing and restarting Knime, and also restarting my computer. I am trying to update over 100,000 records, so I even used a ‘Row Filter’ to write just 3 records, and I still get the same problem.

I have also checked my VPN connection to the database and it works just fine.

Has anyone ever experienced this? It is very frustrating, as it seems like a very simple thing I am trying to do.

Hi @summer_le , it’s kinda hard to know without seeing what you are doing.

The DB Update is actually running on the DB server side. Knime is just waiting for a response from the DB server that the update was completed. What kind of updates are you doing?

Hi @bruno29a

I’m just updating one column in the database from my results. Very simple in my opinion.

Capture10

In my DB update, I just set it to update this one particular column if the ID of these 3 other columns match.

Does it have anything to do with my connection at home or anything?

Hi @bruno29a again, I tried something different and put in a Row Filter to update only about 100 records at a time. It seems to work if I do small batches like that, but since I have over 120,000 records, it is not feasible for me to sit and watch it update, change the Row Filter, and repeat 1,200 times. It seems like the number of rows is what is causing this issue.

Is there a way to mitigate update time-outs when updating a large number of records?

Hi @summer_le

How big is the data that you are transferring from Knime (your computer) to the DB server? A few MB? It should not take too long to transfer. That’s about the only thing that’s affected by your connection speed.

The rest of the operation is happening on the DB server.

That all depends on what that one column contains. A single column containing a video is much bigger than 100 columns of small int combined, so it’s not about the fact that it’s one column. And how many rows are you updating? Also, if you have WHERE conditions, they would be the bottleneck rather than the fact that you are updating one column, especially if you have a lot of rows.

Another factor to consider is the amount of traffic on the DB server when you are performing the UPDATE. Are any tables locked at that time?

Breaking the UPDATE into batches is generally a better idea. There are real advantages to doing this, though there are also disadvantages to going in batches (you hit the DB server more often).

That’s not how you want to spend your weekend :smiley:

You can use a Chunk Loop, which is made especially for creating batches.

That’s how you would define the number of rows per batch:
image

Internally though, the DB Update is also sending the updates in batches. You can set the batch size here:

The only difference is that you have to wait for the whole operation on the 120,000 rows, as opposed to batching it manually with the Loop.
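To make the batching idea concrete, here is a minimal sketch in Python, using SQLite as a stand-in for Oracle; the table `t` and the columns `ID1`, `ID2`, `ID3`, `FLAG` are made-up names for illustration, not from the actual workflow:

```python
import sqlite3

# Stand-in for the database table: 1,000 rows, FLAG to be updated to 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ID1 INT, ID2 INT, ID3 INT, FLAG INT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, 0)",
                 [(i, i % 10, i % 7) for i in range(1000)])

# The rows KNIME would feed in: (new_value, id1, id2, id3) per update.
updates = [(1, i, i % 10, i % 7) for i in range(1000)]

BATCH = 200  # the "chunk size"; tune it like the Chunk Loop's rows-per-chunk
for start in range(0, len(updates), BATCH):
    chunk = updates[start:start + BATCH]
    conn.executemany(
        "UPDATE t SET FLAG = ? WHERE ID1 = ? AND ID2 = ? AND ID3 = ?",
        chunk)
    conn.commit()  # one commit per chunk keeps each transaction short

updated = conn.execute("SELECT COUNT(*) FROM t WHERE FLAG = 1").fetchone()[0]
print(updated)  # 1000
```

Each chunk is one short transaction, which is exactly what makes the manual batches less likely to hit a time-out than one giant UPDATE over everything.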

Hi @bruno29a The column I am updating just gets set to 1 if the 3 IDs match, so it’s just an integer. I am doing that to about 120,000 records, but there are about 800,000 records in that table to check.

Which option is more efficient: setting the DB Update to smaller batches, or doing the Chunk Loop?

And if I were to do the Chunk Loop, what would that look like?

Hi @summer_le , 800k records is not a lot, and in fact, if the columns that contain these IDs are indexed, the DB won’t scan/check the whole table; it will target only those 120k records (hence why I said that the performance can also depend on your WHERE conditions).
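To illustrate why an index on the WHERE columns matters, here is a small sketch using SQLite as a stand-in (the table `big`, its columns, and the index name are made up; Oracle’s optimizer behaves analogously):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (ID1 INT, ID2 INT, ID3 INT, FLAG INT)")
conn.executemany("INSERT INTO big VALUES (?, ?, ?, 0)",
                 [(i, i % 100, i % 50) for i in range(5000)])

query = ("EXPLAIN QUERY PLAN UPDATE big SET FLAG = 1 "
         "WHERE ID1 = 42 AND ID2 = 42 AND ID3 = 42")

# Without an index, the WHERE clause forces a full-table scan.
plan_before = conn.execute(query).fetchall()
print(plan_before)  # plan mentions a SCAN of the table

# A composite index on the three WHERE columns lets the engine jump
# straight to the matching rows instead of checking all 5,000.
conn.execute("CREATE INDEX idx_ids ON big (ID1, ID2, ID3)")
plan_after = conn.execute(query).fetchall()
print(plan_after)  # plan now mentions USING INDEX idx_ids
```

The same principle applies to the DB Update node: with a composite index on the three ID columns, each of the 120k WHERE lookups is an index seek rather than a pass over 800k rows.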

Quite frankly, I have never used the DB Update node. I usually execute an SQL with the DB SQL Executor node, and if I have to do data manipulation beforehand in Knime, I would load the manipulated data into a temporary table on the DB server, and then run the UPDATE from that temporary table.

So, I’m not entirely sure how the DB Update node behaves in the background. I also wonder whether your connection expired because the DB Update took so long.

Since you were able to run the DB Update with small batches (Row Filter), it’s probably better to keep doing it that way, meaning using the Loop instead of batching inside the DB Update.

With the Loop, it would look something like this:
image

So, instead of the Row Filter, you would use a Chunk Loop.

An additional note: it’s also better to connect your DB Connector node (Oracle Connector) to the last node of the data processing (basically the node before your Row Filter). That way, the DB connection is initiated as late as possible, which is basically once your data is processed and ready to be used with the DB. If you start your DB connection too early, it will remain idle while Knime is processing your data and can expire if the data processing takes too long. There is no advantage to establishing the DB connection early.

Also, in order to optimize the batch size, you may first try with the Row Filter as you did and adjust, maybe pushing from 100 to 200, then to 500, to see if 500 is OK per batch. Once you find that magic number, you can configure it as the batch size in the Chunk Loop. With a size of 100, you will be hitting the DB 5 times more often than with a size of 500.

Ultimately, if you do have SQL knowledge, it might be better to load your processed data into a temporary table on the DB server and then run the UPDATE statement manually via the DB SQL Executor, OR load the processed data into a physical table on the DB server, connect to your DB via the Oracle DB client, and run your UPDATE statement there.
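As a rough sketch of that staging-table approach, again with SQLite standing in for Oracle and made-up names (`target`, `staging`, `NEWVAL`); on Oracle the equivalent would be a similar correlated UPDATE or a MERGE statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the big table on the server (800 rows instead of 800k).
conn.execute("CREATE TABLE target (ID1 INT, ID2 INT, ID3 INT, FLAG INT)")
conn.executemany("INSERT INTO target VALUES (?, ?, ?, 0)",
                 [(i, i, i) for i in range(800)])

# Load the processed KNIME output into a temporary staging table.
conn.execute("CREATE TEMP TABLE staging (ID1 INT, ID2 INT, ID3 INT, NEWVAL INT)")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?, 23)",
                 [(i, i, i) for i in range(120)])

# One server-side UPDATE driven by the staging table: no per-row
# round trips between KNIME and the database.
conn.execute("""
    UPDATE target
       SET FLAG = (SELECT s.NEWVAL FROM staging s
                    WHERE s.ID1 = target.ID1
                      AND s.ID2 = target.ID2
                      AND s.ID3 = target.ID3)
     WHERE EXISTS (SELECT 1 FROM staging s
                    WHERE s.ID1 = target.ID1
                      AND s.ID2 = target.ID2
                      AND s.ID3 = target.ID3)
""")
conn.commit()

changed = conn.execute(
    "SELECT COUNT(*) FROM target WHERE FLAG = 23").fetchone()[0]
print(changed)  # 120
```

The whole match-and-update then happens in one statement on the server, which is why this tends to beat row-by-row updates pushed over the connection.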


@summer_le how large is the data that is used to check whether the IDs match? Are these a few dozen, hundreds, or thousands of lines? Is it the 120,000 you try to match against the 800,000?

Hi @bruno29a I am in the process of finding that magic number now. 100 - 500 works well; I am trying 1,000 now and it seems to be taking a long time.

Thank you for the explanation of the DB connections. I’m too much of a novice to have known that was an option, so I will try that next and see if it helps the performance.

My WHERE statement in the DB Update has 3 IDs it needs to match to find the unique record.

With your last suggestion, do you mean that I could try to write my final dataset into a different table in my Oracle database and then use the DB SQL Executor and write something like:

Update table name
set desired column = ‘integer’
WHERE COLUMNID_1 = COLUMNID_1
AND COLUMNID_2 = COLUMNID_2
AND COLUMNID_3 = COLUMNID_3 ?

Something like that?

Hi @mlauber71 the table I am trying to update has about 1 million records, and among those, there are only about 125,000 records I need to update with the same number, using a combination of 3 IDs to get the unique record. I explained to Bruno that if I use a Row Filter and only update maybe 100 - 500 records at a time, then it is okay, but any more than that and it seems to time out and gives me an error.

@summer_le just to be clear: you have 125,000 individual records that you are comparing against 3 columns on your database table that contains 800,000 records?

How large are these 125,000 records, and what type are the columns (maybe write the table to disk and check)? Integer, string?

@mlauber71 yes that is right. The dataset with 125,000 records has 33 columns, but I really only need 4 columns of that whole table: 1) the column that needs updating, and 2) the 3 columns that, combined, make the unique ID needed to find which records to update in the database with 800k records.

The column I want to update is just an integer. I am not sure whether using a Column Filter to keep only the 4 columns I need before doing the DB Update would help performance.

The dataset with the 125k records, the column types are integers, strings and date fields.

@summer_le I still think I need to understand more about the setting. Would it be possible to create a small sample of the relevant columns and lines?

The 125,000 would represent a DISTINCT set on the three columns?

Another thing you could try is to (left) join the 125,000 file by the three columns, rename the column with the new value, and use a CASE WHEN. So again, the main task would be done on the server.
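Sketched in SQL terms (using SQLite as a stand-in, with hypothetical table names `main_t` and `new_vals`), the join-plus-CASE idea looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_t (ID1 INT, ID2 INT, ID3 INT, VAL INT)")
conn.executemany("INSERT INTO main_t VALUES (?, ?, ?, 0)",
                 [(i, i, i) for i in range(10)])
conn.execute("CREATE TABLE new_vals (ID1 INT, ID2 INT, ID3 INT, NEW_VAL INT)")
conn.executemany("INSERT INTO new_vals VALUES (?, ?, ?, 23)",
                 [(i, i, i) for i in range(4)])

# LEFT JOIN keeps every row of the main table; CASE WHEN picks the new
# value where a match exists and keeps the old value otherwise.
rows = conn.execute("""
    SELECT m.ID1, m.ID2, m.ID3,
           CASE WHEN n.NEW_VAL IS NOT NULL THEN n.NEW_VAL
                ELSE m.VAL END AS VAL
      FROM main_t m
      LEFT JOIN new_vals n
        ON m.ID1 = n.ID1 AND m.ID2 = n.ID2 AND m.ID3 = n.ID3
""").fetchall()
print(sum(1 for r in rows if r[3] == 23))  # 4
```

The SELECT produces the fully corrected data set in one pass, which can then be written back instead of updating rows in place.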

That is always a good idea I would say.

The next thing to explore would be the Oracle connection settings: whether there are limitations, timeouts, etc. that might hinder the operation.


Hi @mlauber71 this is a table sample of what is in the ‘Column Filter’ node (where there are 125k lines to be updated):

And this is my workflow part where I am trying to update the Oracle database with:

Capture12

The columns I have in green are the 3 WHERE columns I am using in the ‘DB Update’ node, and the column in yellow is what I am trying to update in the Oracle database (which has the 800k records). That particular column currently has other values in it that are incorrect for those 125k records, which is what I am trying to update.

That workflow of mine is taking a really long time and most of the time errors out because it takes so long that the connection is lost.

I am trying to find a better way to update it that doesn’t constantly lose the connection. Could it be because I am using 3 different columns in the WHERE condition of the DB Update node?

Because I was doing a similar thing before, and it was able to update instantly, but it was using only one unique ID.

Hi @summer_le , are these 3 columns indexed? The Branch_ID has weird numbers as ID of Branch.

And are all the Schedule_Update values to be updated to 23?

@bruno29a Yes, they are indexed, and the Branch_ID was just sample data; they simply contain numerics. And yes, that is right, the Schedule_Update values are to be updated to 23.

@bruno29a @mlauber71 hi both, thank you so much for helping me. You gave me a lot of new knowledge that I didn’t have before.

There is no clear explanation for why the performance is so slow, so in the end I did a combination of both your suggestions. I manipulated my data, then read in the original table, joined on the data, and used a Rule Engine to replace the final column with my manipulated data. I then wrote the whole thing back into the original table, using the DB SQL Executor to truncate the original table first. It works much faster than trying to update records based on a WHERE clause.
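The approach described above (join, replace the column, truncate, reload) can be sketched as follows; this is a minimal SQLite stand-in with made-up names (`orig`, `fixes`), and `COALESCE` plays the role of the Rule Engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orig (ID1 INT, ID2 INT, ID3 INT, VAL INT)")
conn.executemany("INSERT INTO orig VALUES (?, ?, ?, 99)",
                 [(i, i, i) for i in range(8)])
conn.execute("CREATE TABLE fixes (ID1 INT, ID2 INT, ID3 INT, NEW_VAL INT)")
conn.executemany("INSERT INTO fixes VALUES (?, ?, ?, 23)",
                 [(i, i, i) for i in range(3)])

# 1. Build the corrected rows: original table LEFT JOINed to the fixes,
#    taking the new value where one exists, the old value otherwise.
corrected = conn.execute("""
    SELECT o.ID1, o.ID2, o.ID3, COALESCE(f.NEW_VAL, o.VAL)
      FROM orig o
      LEFT JOIN fixes f
        ON o.ID1 = f.ID1 AND o.ID2 = f.ID2 AND o.ID3 = f.ID3
""").fetchall()

# 2. Truncate and reload (SQLite has no TRUNCATE; DELETE is the stand-in).
conn.execute("DELETE FROM orig")
conn.executemany("INSERT INTO orig VALUES (?, ?, ?, ?)", corrected)
conn.commit()

print(conn.execute(
    "SELECT COUNT(*) FROM orig WHERE VAL = 23").fetchone()[0])  # 3
```

Rewriting the whole table avoids the per-row WHERE lookups entirely, which matches the speed-up observed in the thread, at the cost of briefly holding the full table contents outside the database.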

Thank you both, and I hope we have a chance to chat again in the future (especially since I am new to Knime and will most likely be asking a million more things).


Hi @summer_le , if truncate is an option, then it’s OK. The INSERT is always much faster, since there are no WHERE conditions, compared to the UPDATE.

This was a great alternative


Thank you @bruno29a and also thanks to you, I know about chunk loops and will definitely find a use for it. Have a wonderful week!


@summer_le I created an example of how to UPDATE data within the database, in this case an H2 DB. The exact syntax might differ if you use it on another DB system.

In this case, in order for the UPDATE … SET syntax to work, a VIEW is created that does a LEFT JOIN between the old and the new data (of course, you could also just use that directly).
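A rough equivalent of that view-based approach, sketched with SQLite standing in for H2 and made-up names (`old_t`, `new_t`, `v_join`); H2’s UPDATE-against-a-join syntax differs slightly, so the correlated form below is used for portability:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_t (K INT, V INT)")
conn.execute("CREATE TABLE new_t (K INT, V INT)")
conn.executemany("INSERT INTO old_t VALUES (?, ?)", [(i, 0) for i in range(5)])
conn.executemany("INSERT INTO new_t VALUES (?, ?)", [(i, 23) for i in range(3)])

# The VIEW performs the LEFT JOIN between old and new data once.
conn.execute("""
    CREATE VIEW v_join AS
    SELECT o.K AS K, n.V AS NEW_V
      FROM old_t o
      LEFT JOIN new_t n ON o.K = n.K
""")

# UPDATE the old table from the view: only rows with a non-NULL match
# in the new data are touched.
conn.execute("""
    UPDATE old_t
       SET V = (SELECT NEW_V FROM v_join WHERE v_join.K = old_t.K)
     WHERE EXISTS (SELECT 1 FROM v_join
                    WHERE v_join.K = old_t.K AND v_join.NEW_V IS NOT NULL)
""")
conn.commit()

print(conn.execute("SELECT K, V FROM old_t ORDER BY K").fetchall())
```

The view keeps the join logic in one place, so the UPDATE itself stays short; on H2 or Oracle the same view could equally back an UPDATE … SET (…) = (SELECT …) statement.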