BigData in Knime

Hello to everyone!

Please tell me how to work for knime with big data (300 gb+ = 36 csv files).
We need to make regular pivots, but loading the files themselves and the pivoting takes a very long time. How can this be optimized?

Thanks for the helping!

@greatvarona these things com to my mind:

4 Likes

Hello!

I want to ask you about the benefits of working with databases in Knime.

What will be faster for working with huge csv-files: to load in DB large csv in advance and make pivots and grouby with “databases‘ nodes” or to make these transformations with standards nodes, but in a “simple streaming” mode?

Now my workflow is working with standard nodes with the help of “Cache-node” (thank you for that :blush:), but I want to reduce the time of executing.

Glad that it brought some improvement. Maybe you could check out the other hints about storage too. Performance always much depend on the overall performance of your machine/computer.

Then: while it is nice to have everything in one large KNIME workflow: if you run into performance issues it might be one approach to shirk the task in one workflow to just the data and tasks you need, save the result and continue in another workflow that would be opened afterwards.

The processing of larger amounts of data will always have to do with your system and what the task is. Databases are often preferred for large data operations and KNIME does support that (KNIME Database Extension Guide), where KNIME ist the front end and hold the logic while the majority of the work is done on a (powerful) database. If you have really large tasks you might have to use Big Data technology (but we are not there yet).

The question if streaming can help (again) very much depends on your task. If the process would require stats over the whole file it might not help - if your task can be applied for single rows it can be a good idea. One thing to consider would be if there is a subgroup that you might be able to iterate over.

2 Likes

@mlauber71 Hello!
I’m trying to upload a 28GB csv-file into H2-base.
In advance I placed the file with an empty H2 database on the SSD of our server.
Loading data into the database was successful, but queries are not being processed.
I attach the workflow. But when I loading the data into the SQLite-base, such queries were executed (queries before the nodes “GroupBy”). What’s my mistake?


H2_base.knwf (128.0 KB)

The first things I would check are database limitations and KNIME logs when you execute your flow.
br

@greatvarona question is where the H2 database resides. On the server and are you going to access it via a local KNIME? In that case the data would have to travel thru the network connection between the server and the local machine and since this is a one file database it might be possible that the whole file might have to be loaded. So not an optimal scenario.

Then the question how much memory would you have allocated fror KNIME?

I think we would need more context. Then: my formula for the two standalone databases is that H2 has much more functions while SQLite is more accessible and robust but with less functions. So choose your fighter.

Then the question what it is that you want to do with the data once you have it loaded into KNMIE or SQLite. Which functions. Might they be performed within KNIME thru streaming or column-wise or in chunks? You might have to plan ahead. A local KNIME with enough memory and optimised storage is a very powerful tool.

Will you work with the large dataset the whole time or is it some initial load and then continue with a subset? In that case I could try to see if there are some very efficient data loading packages for Python that might help with the initial setup. Or you try to employ the help of a local Hive environment via the big data nodes.

Typically this would ne the job of a real database - if you do not have one you will have to try a few things …

1 Like

@greatvarona

The difference in speed will depend upon the power of the machine running KNIME and that running the database.

If you are using a database then you will also need to determine which columns should be indexed. Columns defining the categories (rows) in group-by and (rows x columns) in pivot should be indexed to improve the speed of the grouping and pivoting processes. If they are not indexed then processing on the database server can be extremely slow.

You can also group-by on the database / using database nodes and then pivot in KNIME which also works.

Finally, @mlauber71 didn’t ask which table back-end you are using. Switching to a columnar back-end may give you a storage / performance benefit if you have categorical columns with few values that are repeated often (typical when producing group-by and pivots).

DiaAzul

1 Like