Big Data ETL Problems

Hello friends
I am working in the field of ETL and my input data volume is large (for example 400 million records). When I remove duplicate data etc., my system TEMP folder fills up and the workflow fails with an error. I have already tried doing this through a chunk loop, but the same problem (TEMP folder gets full) still occurs. Please guide me on this.

Hi @alex1368 , would you be able to upload the workflow so we can see if there are any critical areas where optimization would help the most? Generally speaking, 400M rows is a lot if you do not have much disk space to work with, because each step (using KNIME native nodes) is materialized to disk. I would lean towards querying the data in an environment that allows for lazy evaluation, e.g. using DB nodes against an OLAP database (like DuckDB via its JDBC driver) or the Polars package via Python nodes. Alternatively, you can try Spark nodes, since Spark also has lazy evaluation, but it is less optimal locally compared to e.g. DuckDB.
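
To make the Polars route a bit more concrete, here is a minimal sketch of what lazy, streaming de-duplication could look like inside a Python Script node. The file path and the column names are placeholders for your own data, and you would adapt the read/write steps to however your workflow hands data to the node:

```python
import polars as pl

# Lazily scan the source file: nothing is loaded yet, Polars only builds a query plan.
# "big_input.parquet" and the column names below are placeholders.
lf = pl.scan_parquet("big_input.parquet")

deduped = (
    lf
    .select(["customer_id", "order_date", "amount"])  # keep only the columns you need
    .unique(subset=["customer_id", "order_date"])     # drop duplicate rows by key
)

# Stream the result straight to a file instead of materializing
# every intermediate table in memory or in TEMP.
deduped.sink_parquet("deduped.parquet")
```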


Hi,
yeah, 400M is quite big but doable :slight_smile: Your "chunk-wise" approach is really good.
In addition to Add94's comment, I suggest checking the following:

  • If you want to remove duplicates in a table that is stored in a database like MySQL or MSSQL, do not download the tables to your machine. Let the work be done by the database. "ChatGPT-like" helpers can assist you with writing the SQL.
  • If you do need to download the data, query only the columns you really need. Do not "select * from table" :slight_smile:
  • Check your knime.ini settings. Maybe you can give KNIME more RAM via the -Xmx option.
  • And if this does not help, try a Python Script node to work with the data. Sometimes (not always) things are easier done in Python; see the sketch after this list.
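
As an example of the last point, here is a rough sketch of how DuckDB could do the de-duplication inside a Python Script node with out-of-core execution, so spill files go to a drive of your choosing rather than the system TEMP. The file names, column names, and temp path are assumptions you would replace with your own:

```python
import duckdb

# In-memory connection; DuckDB will still spill large operations to disk.
con = duckdb.connect()

# Point DuckDB's spill files at a drive with plenty of free space
# (placeholder path, adjust to your environment).
con.execute("SET temp_directory = 'D:/duckdb_tmp'")

# Select only the needed columns, de-duplicate, and write the result
# directly to Parquet without materializing intermediate tables.
con.execute("""
    COPY (
        SELECT DISTINCT customer_id, order_date, amount
        FROM read_csv_auto('big_input.csv')
    ) TO 'deduped.parquet' (FORMAT PARQUET)
""")
```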