Reading a large number of lines from an Excel file

Hi there! I need to read a large number of lines from an Excel file (more than 80k) and use each one as input to an INSERT statement, in order to populate a database table. However, when I read the entire spreadsheet and try to populate the table, I always get a Java heap error.

I observed that if I reduce the amount to 4k lines, I can persist the data. But in this case, I always have to change the “read rows from” and “to” parameters in the Excel Reader node. For instance: read from line 1 to 4000; wait for the whole process to finish; then change it to lines 4001 to 8000, and so on.

Example: (workflow screenshot)
Does anyone know how to make this more practical? Maybe I could use another loop to feed the parameters of the existing loop?

Thanks in advance!

Testing insert loop.knwf (16.5 KB)

It is not really clear what you want to do, and the structure of the loop might cause the Loop End to collect all the data again on every iteration, which might result in slow performance. The Variable Loop End node might be the better choice.

Also, the example workflow does not contain any data, and the DB SQL Executor node does not seem to contain any SQL code.

If you want to insert chunks of data into a DB, you would not need to do that with a loop; you could just INSERT them directly. And if the Excel file is too large, you might consider storing your input data as CSV or TSV instead of the full Excel file.
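A minimal sketch of what a set-based insert could look like, assuming the rows were first bulk-loaded into a staging table (e.g. with DB Writer); the table and column names here are illustrative:

-- one statement moves all staged rows into the target table
INSERT INTO TARGET_TABLE (NAME, ATIV, EMP)
SELECT NAME, ATIV, EMP
FROM STAGING_TABLE;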

You could also try R or Python packages to import the Excel data.

And then there are a lot of performance and optimisation tips you could check:

https://www.knime.com/blog/optimizing-knime-workflows-for-performance
(there are more - I could provide you with a collection of links)

2 Likes

Can you not use the DB Writer nodes?

Hello, let me try to improve the explanation of my workflow here. mlauber71, thanks for the advice: I changed the Loop End to a Variable Loop End and that worked fine, even with the Excel file as the data input. Now I don’t need to break the input into chunks of lines; it worked for the whole spreadsheet (about 80k rows) at once.

I’ll try CSV and TSV later.

I have also changed the workflow, which is depicted below (and attached as well - version 1, now with some data and SQL), in order to explain why I use the SQL Executor instead of DB Writer (now also referring to what EvanB asked): the simple reason is that I need to use a SQL sequence to generate the primary key values, like the one below (from the SQL Executor):

INSERT INTO TABLE_SEG ("SEG_PK","NAME","ATIV","EMP") VALUES
(SQ_SEG.NEXTVAL,'$${SNAME}$$','$${SATIV}$$',$${IEMP}$$)
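-- ($${S...}$$ and $${I...}$$ above are KNIME flow variable placeholders;
-- the S/I prefix marks the variable type (string/integer), and the values
-- are substituted on each loop iteration)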

If someone has a better idea, I would be very grateful.

Thanks!

Testing insert loop v1.knwf (16.4 KB)

Ok, makes sense. I use Snowflake, and it just takes whatever you throw into it. I don’t know much about creating keys. :sweat_smile:

1 Like

Thank you EvanB! Regards

It depends on the database, but since it doesn’t have an auto-increment id, I assume it’s Oracle. Create the id with a trigger; then you can insert without this crutch. MySQL and Postgres AFAIK have auto-increment ids, so if it is one of these, use the proper type for the primary key.
Or, more hacky: alter your sequence to increment by n rows, get one value, then alter it back and generate the remaining values in KNIME.
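
For the trigger idea, a minimal sketch of what it could look like on Oracle, using the sequence and table names from your post above (untested; on Oracle versions before 11g you would need a SELECT ... INTO instead of the direct assignment):

CREATE OR REPLACE TRIGGER TRG_SEG_PK
  BEFORE INSERT ON TABLE_SEG
  FOR EACH ROW
BEGIN
  -- assign the next sequence value to the primary key column
  :NEW."SEG_PK" := SQ_SEG.NEXTVAL;
END;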

Anyway, the DB Writer will be a lot faster than this one-statement-per-row insert.

Hi beginner, I agree, but unfortunately I cannot change the way the DB was designed. Getting the last value and incrementing it could be a good idea; I’ll check whether it is viable for me or not.

Thanks!

Ok, since you don’t have that much control over the DB, this suggestion might also not work, because it would require you to be able to issue ALTER statements on the sequence. Plus, it’s hacky and might cause problems if other applications are expected to write to the DB at the same time.

If you’re on Oracle, there would be this solution:

SELECT SEQ_TEST.NEXTVAL FROM DUAL CONNECT BY LEVEL <= 200

This gives you a table of the next 200 values in one call, and the sequence is updated correctly. That seems like the correct way to go. If you are not on Oracle, maybe your DB has something similar.

Or for PostgreSQL:

select nextval('id_seq')
from generate_series(1,3);
1 Like

Hi, using the presented workflow, where data is read from an Excel file and written to a DB through a loop that inserts the lines, I wonder how I can display output on the KNIME console. This output could be a string constructed from the variables available within the loop.

Example: I read the Excel file and, for each line, I have a bunch of variables that I could display; so, at any step of the loop, I would write a line to the console to troubleshoot the contents being written to the database, without needing to run a SELECT on the DB each time the data causes an error. Some kind of log, to be more precise.

Note: If writing to the console is not possible, maybe writing to a text file could be an alternative.

Thanks in advance.

You could convert your variables to a table and store the result in a file, or insert it into a small local database like H2 or SQLite.
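
A minimal sketch of such a log table, e.g. for SQLite or H2 (all names are illustrative):

-- one row per loop iteration, recording what was sent to the DB
CREATE TABLE IF NOT EXISTS INSERT_LOG (
  LOG_TS VARCHAR(30),   -- timestamp of the loop iteration
  SNAME  VARCHAR(100),  -- flow variable values being inserted
  SATIV  VARCHAR(100),
  IEMP   INTEGER
);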

1 Like

The Java Snippet node can write to the KNIME console.

1 Like

Hi,

Using the NodeLogger in this case? Or is there another way?

Thanks

There are methods already available from the snippet class. Type this.log and autocomplete should make suggestions like logWarn. You can then simply build a String from your flow variables and use it as the argument for that method.
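
For example, a single line inside the Java Snippet could look like this (assuming the flow variables SNAME, SATIV and IEMP were bound to Java fields named v_SNAME, v_SATIV and v_IEMP in the snippet’s configuration; the field names are illustrative):

// write the values of the current loop iteration to the KNIME console
logWarn("Inserting row: SNAME=" + v_SNAME + ", SATIV=" + v_SATIV + ", IEMP=" + v_IEMP);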

Hi beginner,

Really good advice! It worked fine, with a single line of Java code.

Thank you so much!

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.