DB Row Inserter - Slow

Hello friends,

I am performing some tests to insert rows into a table in Databricks.
The workflow is very simple.

I just connect to the table and perform the row insertion.
My table has approximately 83 columns and 300,000 rows.

I am testing it, and inserting 20 rows took 3 minutes.
This duration is too long. :turtle:

I searched the forums and found people commenting on the same issue, but I did not understand if there is a faster solution.

My question is: Is there something that works faster, specifically for Databricks?

Not necessarily a Databricks-specific topic:

  1. Are you really talking about an insert, or do you mean an update/upsert? If the index is fragmented or missing, these operations are expected to be slow when primary keys or check constraints need to be evaluated.
  2. What's your batch size? A higher batch size may be suitable if your rows are slim.
  3. What's the logging configuration on the database? This can impact write speeds significantly.
  4. KNIME has a DB Loader node which does bulk inserts (usually without type checking). I don't know if that's available for Databricks, but it should offer the best performance.
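Not Databricks-specific, but the batch-size point above can be illustrated with a tiny, self-contained sketch using SQLite from the Python standard library. The table and data here are made up for illustration; the idea is that committing row by row pays per-statement overhead, while one batched `executemany` amortizes it:

```python
import sqlite3
import time

# Illustrative data: 5000 small rows (hypothetical table, not the 83-column one).
rows = [(i, f"name_{i}") for i in range(5000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")

# Variant 1: one INSERT and one commit per row (batch size 1).
t0 = time.perf_counter()
for r in rows:
    conn.execute("INSERT INTO t VALUES (?, ?)", r)
    conn.commit()
row_by_row = time.perf_counter() - t0

conn.execute("DELETE FROM t")
conn.commit()

# Variant 2: all rows in a single executemany call, single commit.
t0 = time.perf_counter()
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
conn.commit()
batched = time.perf_counter() - t0

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)
```

The same effect is usually much more pronounced against a remote warehouse, where every statement also pays a network round trip.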

OLAP warehouses are usually not ideal for OLTP (insert) transactions.

tl;dr: talk to your database admins

So many difficult questions, lol.

  1. I don’t have an index. It is a table without primary key rules. I just need to populate the table. The columns are organized by data type (string with string, etc.).
  2. I have no idea. Is this configured in the node or in Databricks?
  3. I have no idea. I am not the Databricks DBA. I only have some table-creation access, but I am not an admin, so I cannot change settings.
  4. I checked some forum answers and saw comments about this DB Loader node, but I didn’t understand the settings. I downloaded an example and tried to use it, but when I connected the node to the Databricks connector, it wasn’t accepted. I don’t know if I did something wrong.

A pure insert without any keys should be decently fast.
The batch size can be set in the DB Insert node; if you didn't change the default, I suspect it's set to 1000. Still, check this in the node.

DB Loader KNIME → Python may not be supported.
You may have better success going a KNIME → Spark → Databricks route, but this will become even more complex.
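To make the bulk-load idea concrete: the typical pattern behind DB Loader-style nodes is to stage the data as a file and then issue a single set-based load statement, such as Databricks' `COPY INTO`. The sketch below only runs the local part (writing a CSV and building the SQL text); the table name and staged path are hypothetical placeholders, and actually executing the statement would require a Databricks connection:

```python
import csv
import os
import tempfile

def stage_csv(rows, header):
    # Write the rows to a temporary CSV file, simulating the "stage to file" step.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
    return path

def copy_into_sql(table, staged_path):
    # One set-based COPY INTO statement instead of thousands of INSERTs.
    return (
        f"COPY INTO {table} "
        f"FROM '{staged_path}' "
        "FILEFORMAT = CSV "
        "FORMAT_OPTIONS ('header' = 'true')"
    )

# Hypothetical example data and table name.
path = stage_csv([(1, "a"), (2, "b")], ["id", "name"])
sql = copy_into_sql("my_catalog.my_schema.my_table", path)
print(sql)
```

The file would normally be staged somewhere the warehouse can read (e.g. a Volume or cloud storage path) before the statement is run.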

Yes, Batch size is default. I didn’t change anything.

Maybe the node just is what it is.
It is slow by nature.

Even if the node is slow, inserting 20 rows shouldn’t take 3 minutes.

@Felipereis50 maybe take a look at this example. Here the actual data is being loaded via a file connection, not the SQL connection.


@mlauber71

:heart_eyes: :heart_eyes: :heart_eyes: :heart_eyes:
Annotated workflow for study.
I didn’t know how to configure the DB Loader, and now it has become easier.
I hope it helps others.
Very simple and straightforward.

Thank you very much.
Your workflow worked perfectly.

Your example loaded 5,400 rows quickly.

Great stuff.

