BigData in Knime

greatvarona · November 8, 2022, 6:11am

Hello to everyone!

Please tell me how to work for knime with big data (300 gb+ = 36 csv files).
We need to make regular pivots, but loading the files themselves and the pivoting takes a very long time. How can this be optimized?

Thanks for the helping!

mlauber71 · November 8, 2022, 8:48am

@greatvarona these things com to my mind:

check out my collection about KNIME and performance (A meta collection about KNIME and performance and performance tuning and some problems – KNIME Hub) namely the data storage settings and the memory allocated to KNIME
see if you can do the task in chunks or only keep the necessary columns (Problem with disk space during workflow execution. - #6 by mlauber71)
tell us more about your data and your system (memory, version …)
see if the (local) use of Big Data techniques could make a difference (This workflow demonstrates several methods to import one or many CSV file into Hive – KNIME Hub) (Read and process HUGE csv File - #12 by mlauber71) - it will not increase the memory but maybe Hive can process the files on disk (DB Pivot – KNIME Hub). Tough I have never seriously tried that on a local machine

greatvarona · November 14, 2022, 1:56pm

Hello!

I want to ask you about the benefits of working with databases in Knime.

What will be faster for working with huge csv-files: to load in DB large csv in advance and make pivots and grouby with “databases‘ nodes” or to make these transformations with standards nodes, but in a “simple streaming” mode?

Now my workflow is working with standard nodes with the help of “Cache-node” (thank you for that ), but I want to reduce the time of executing.

mlauber71 · November 14, 2022, 10:35pm

Glad that it brought some improvement. Maybe you could check out the other hints about storage too. Performance always much depend on the overall performance of your machine/computer.

Then: while it is nice to have everything in one large KNIME workflow: if you run into performance issues it might be one approach to shirk the task in one workflow to just the data and tasks you need, save the result and continue in another workflow that would be opened afterwards.

The processing of larger amounts of data will always have to do with your system and what the task is. Databases are often preferred for large data operations and KNIME does support that (KNIME Database Extension Guide), where KNIME ist the front end and hold the logic while the majority of the work is done on a (powerful) database. If you have really large tasks you might have to use Big Data technology (but we are not there yet).

The question if streaming can help (again) very much depends on your task. If the process would require stats over the whole file it might not help - if your task can be applied for single rows it can be a good idea. One thing to consider would be if there is a subgroup that you might be able to iterate over.

greatvarona · November 17, 2022, 6:52am

@mlauber71 Hello!
I’m trying to upload a 28GB csv-file into H2-base.
In advance I placed the file with an empty H2 database on the SSD of our server.
Loading data into the database was successful, but queries are not being processed.
I attach the workflow. But when I loading the data into the SQLite-base, such queries were executed (queries before the nodes “GroupBy”). What’s my mistake?

H2_base.knwf (128.0 KB)

Daniel_Weikert · November 17, 2022, 6:55pm

The first things I would check are database limitations and KNIME logs when you execute your flow.
br

mlauber71 · November 17, 2022, 8:16pm

@greatvarona question is where the H2 database resides. On the server and are you going to access it via a local KNIME? In that case the data would have to travel thru the network connection between the server and the local machine and since this is a one file database it might be possible that the whole file might have to be loaded. So not an optimal scenario.

Then the question how much memory would you have allocated fror KNIME?

I think we would need more context. Then: my formula for the two standalone databases is that H2 has much more functions while SQLite is more accessible and robust but with less functions. So choose your fighter.

Then the question what it is that you want to do with the data once you have it loaded into KNMIE or SQLite. Which functions. Might they be performed within KNIME thru streaming or column-wise or in chunks? You might have to plan ahead. A local KNIME with enough memory and optimised storage is a very powerful tool.

Will you work with the large dataset the whole time or is it some initial load and then continue with a subset? In that case I could try to see if there are some very efficient data loading packages for Python that might help with the initial setup. Or you try to employ the help of a local Hive environment via the big data nodes.

Typically this would ne the job of a real database - if you do not have one you will have to try a few things …

DiaAzul · November 17, 2022, 9:18pm

@greatvarona

The difference in speed will depend upon the power of the machine running KNIME and that running the database.

If you are using a database then you will also need to determine which columns should be indexed. Columns defining the categories (rows) in group-by and (rows x columns) in pivot should be indexed to improve the speed of the grouping and pivoting processes. If they are not indexed then processing on the database server can be extremely slow.

You can also group-by on the database / using database nodes and then pivot in KNIME which also works.

Finally, @mlauber71 didn’t ask which table back-end you are using. Switching to a columnar back-end may give you a storage / performance benefit if you have categorical columns with few values that are repeated often (typical when producing group-by and pivots).

DiaAzul

system · February 15, 2023, 9:19pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.