Bug Report - SQLite DB Insert/Writer node failing to handle UNIQUE constraint in batches

I have been working on a workflow which uses a SQLite database to handle some data. I’ve isolated the problem in a separate workflow and have made a simple table structure (just 3 INTEGER columns with a UNIQUE constraint specifying that the combination of the 3 columns must be unique) so anyone may run it and hopefully better understand the problem I’m about to describe.

When using the DB Insert/DB Writer node to insert new rows into a table with a UNIQUE constraint, the node fails to properly show a message such as “…UNIQUE constraint failed…” when at least one row already exists in the table and the node’s Batch Size has been configured to be greater than 1. For example, if the table already contains the tuple (5, 10, 5) in columns (int1, int2, int3) and you try to INSERT that same (5, 10, 5) along with other rows whose values are not yet in the table (such as, say, (5, 10, 6)), then the InsertError output column shows the value “Something happened while batch execution. Some of the rows could be executed.” for every row in that batch, instead of giving specific information for each row, such as InsertStatus=0 and InsertError=“…UNIQUE constraint failed…” for rows which were already in the table, and InsertStatus=1 and InsertError=NULL for rows which were yet to be added.

The behaviour I just described is undesirable, yet understandable to some degree (there may be an explanation at the JDBC/SQLite level that I don’t know about). What is really worrying is that, at the same time, when you try to INSERT rows which are not yet in the table (new rows) along with rows which are already in the table (old rows), which is the same scenario described above, the new rows that happen to be in the same batch as some old rows will also fail (let’s remember that the old rows are indeed expected to fail because of the UNIQUE constraint). For example, if you try to INSERT a set of 6 rows which happen to be in the same batch…

  • (10, 15, 10)
  • (10, 15, 11)
  • (10, 15, 12)
  • (10, 15, 13)
  • (10, 15, 14)
  • (10, 15, 15)
    … and rows (10, 15, 10), (10, 15, 11) and (10, 15, 12) were already in the table, you will get the error “Something happened while batch execution. Some of the rows could be executed.” and, in addition, rows (10, 15, 13), (10, 15, 14) and (10, 15, 15) will NOT be inserted, even though the error message claims that “some of the rows COULD be executed”, and you are NOT warned about this data loss.

The data loss can be avoided by setting Batch Size=1 in the node’s configuration, as I’ve tested in my example workflow, but this may make the workflow a lot slower, especially with large tables.

I’m using KNIME 4.6.3 and for this case I’m just using the base KNIME DB nodes. As I said, I’ve isolated the problem in a workflow which you can find here (the workflow uses DB Insert, but the same happens if the DB Writer node is used):
Bug_DBINSERT.knwf (26.9 KB)

The workflow uses the SQLite Connector node to create a temporary (in-memory) database, but in my original workflow I was working with a SQLite database file (where the same thing happens).

Please let me know if you know this to be a JDBC problem rather than a KNIME problem (I don’t have the time and knowledge right now to test in pure Java with JDBC), or if you know why this is happening and how to approach a solution.

It was indeed a problem in how batches are handled in JDBC, as reported here 7 years ago.
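
For anyone who wants to reproduce this outside of KNIME, here is a minimal pure-JDBC sketch of the scenario (I have not run this myself; it assumes the Xerial sqlite-jdbc driver on the classpath, and the table and column names are simply the ones from my example):

```java
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;

public class BatchUniqueRepro {
    public static void main(String[] args) throws Exception {
        // In-memory SQLite database, like the SQLite Connector node in my workflow
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE t (int1 INTEGER, int2 INTEGER, int3 INTEGER,"
                        + " UNIQUE (int1, int2, int3))");
                // Rows that will collide with the first half of the batch below
                st.execute("INSERT INTO t VALUES (10,15,10), (10,15,11), (10,15,12)");
            }
            try (PreparedStatement ps =
                         con.prepareStatement("INSERT INTO t VALUES (?, ?, ?)")) {
                for (int int3 = 10; int3 <= 15; int3++) { // 10..12 already exist
                    ps.setInt(1, 10);
                    ps.setInt(2, 15);
                    ps.setInt(3, int3);
                    ps.addBatch();
                }
                try {
                    ps.executeBatch();
                } catch (BatchUpdateException e) {
                    // The JDBC spec allows a driver to either continue after a
                    // failed statement or stop the whole batch; the update counts
                    // show which rows (if any) were actually executed.
                    System.out.println("Batch failed: " + e.getMessage());
                    System.out.println("Update counts: "
                            + Arrays.toString(e.getUpdateCounts()));
                }
            }
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM t")) {
                rs.next();
                // Stays at 3 if the new rows (10,15,13..15) were lost with the batch
                System.out.println("Rows in table: " + rs.getInt(1));
            }
        }
    }
}
```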

I tried to implement a solution with the DB Merge node (UPSERT) as suggested by some comments in that post, but to no avail: DB Merge asks for a key (the WHERE columns) different from the SET columns, which would make it necessary to know some key beforehand. It would be nice to have some kind of “ON CONFLICT DO NOTHING” option for DB Insert, or maybe an option for DB Merge that allows no WHERE columns to be selected.
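
For what it’s worth, plain SQLite can already express “skip duplicates” at the SQL level; the option just isn’t exposed by the node. A quick sketch, under the same driver assumption as above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OnConflictSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE t (int1 INTEGER, int2 INTEGER, int3 INTEGER,"
                    + " UNIQUE (int1, int2, int3))");
            st.execute("INSERT INTO t VALUES (5, 10, 5)");
            // Classic SQLite form: the duplicate is silently skipped, no error raised
            st.execute("INSERT OR IGNORE INTO t VALUES (5, 10, 5)");
            // Standard upsert clause, available in SQLite 3.24.0 and later
            st.execute("INSERT INTO t VALUES (5, 10, 5) ON CONFLICT DO NOTHING");
        }
    }
}
```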

Hi @dagomezma,

I too was looking at whether DB Merge would be a potential solution, but you are right that it won’t work in your case because your entire set of columns is your key. I agree that in theory it ought to be the solution, except for the restriction that you mention.

If you were able to add one extra “dummy” (non-key) column to your table, it would possibly then work.

The SET column would then be just the dummy column, and the WHERE columns would be your three key columns. Can you try that as an experiment? Does it then work?
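
In case it helps to see the idea outside of KNIME, here is a rough sketch of what an update-else-insert merge with the dummy column boils down to. This is illustrative only: I don’t know the exact statements the DB Merge node generates, and the “dummy” column name is made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class MergeWithDummyColumn {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE t (int1 INTEGER, int2 INTEGER, int3 INTEGER,"
                        + " dummy INTEGER, UNIQUE (int1, int2, int3))");
                st.execute("INSERT INTO t (int1, int2, int3) VALUES (10, 15, 10)");
            }
            int[][] rows = {{10, 15, 10}, {10, 15, 13}}; // one old row, one new row
            try (PreparedStatement upd = con.prepareStatement(
                         "UPDATE t SET dummy = NULL"
                         + " WHERE int1 = ? AND int2 = ? AND int3 = ?");
                 PreparedStatement ins = con.prepareStatement(
                         "INSERT INTO t (int1, int2, int3) VALUES (?, ?, ?)")) {
                for (int[] r : rows) {
                    for (int i = 0; i < 3; i++) upd.setInt(i + 1, r[i]);
                    if (upd.executeUpdate() == 0) {
                        // No existing row matched the key, so it is safe to insert
                        for (int i = 0; i < 3; i++) ins.setInt(i + 1, r[i]);
                        ins.executeUpdate();
                    }
                }
            }
        }
    }
}
```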

Other than that, the only solutions I can think of are:

(1) A prior read from the database of the current rows in the table, which you then join with your “new data” table; then take the non-matching rows from the Joiner and send them to DB Insert.

Whether that is practical depends on the size of the table, and it also doesn’t preclude the table being affected by external updates occurring at the same time as your read/join/insert, so it could still fail in a multi-user/multi-process environment.

(2) Use DB Insert to write to an empty “staging” table on the database, and then use an additional SQL call to insert into the real destination table, selecting from the staging table only the rows “where not exists(…)” in the destination.
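
For option (2), the additional SQL call (e.g. issued from a DB SQL Executor node) would be something along these lines. The table and column names are placeholders, and the final DELETE simply resets the staging table for the next run:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE target (int1 INTEGER, int2 INTEGER, int3 INTEGER,"
                    + " UNIQUE (int1, int2, int3))");
            st.execute("CREATE TABLE staging (int1 INTEGER, int2 INTEGER, int3 INTEGER)");
            st.execute("INSERT INTO target VALUES (10, 15, 10)");              // old row
            st.execute("INSERT INTO staging VALUES (10, 15, 10), (10, 15, 13)");
            // The extra SQL call: copy only rows not already in the destination
            st.execute("INSERT INTO target"
                    + " SELECT s.int1, s.int2, s.int3 FROM staging s"
                    + " WHERE NOT EXISTS (SELECT 1 FROM target t"
                    + "   WHERE t.int1 = s.int1 AND t.int2 = s.int2 AND t.int3 = s.int3)");
            st.execute("DELETE FROM staging"); // empty the staging table for the next run
        }
    }
}
```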


@dagomezma not sure if this is a KNIME or a driver problem. Nonetheless, I’ll give you these examples in case you are looking for another solution, though I assume you already have one.


Thanks a lot @takbb. I was aware of point (1) you proposed, which was to first read from the database, but this wasn’t really convenient in my case because I would expect it to limit performance when reading and when filtering by reference on large tables, so it was preferable to let the database worry about such a comparison (i.e. DB Merge).

Your solution of adding a dummy column works perfectly, and it even works if I leave the dummy column permanently empty, which is nice. So although I still think a real solution is needed (JDBC should handle batches in a better way / DB Merge should have an option of having no SET columns or no WHERE columns), I think yours is the best workaround right now.

