I am new to KNIME. I am running version 5.1.2 on a Windows laptop.
Background:
I have a statistics workflow node that appeared to stop or hang at ~22%. It sat at the same level for several hours. I canceled the execution and it stopped after another ~two hours.
Questions:
- While executing, the statistics node reported that some fields exceed the max number of unique values. If this warning occurs does the node stop executing, or should it continue in spite of the exceeded number of unique values?
- Is there a way to tell the statistics node to continue if that max is exceeded vs. just guessing the maximum number of unique values?
Thanks!
Hi,
The Statistics node should not hang in any case. If you need more than the 60 unique values counted, it may work to prepend a Domain Calculator node and in the bottom right of its config dialog choose not to limit the maximum number of unique values. However, this by itself may have a performance impact. It’s worth a try, though.
Kind regards,
Alexander
Thanks. Just to clarify, I am using the 3-output Statistics node - same name as the simpler one. I do not see an option to disable the max number of unique values. It may just be taking a very long time to complete. I have restarted it and I will let it run for 24-48 hours and see if it completes.
Hi,
The option to disable the max number of unique values is in the Domain Calculator node. In KNIME, every column has a “column spec” attached, consisting of the column name, column type, and possible values for nominal columns and min and max values for numeric columns. The possible values are only stored for up to 60 values, though. If you want more, you can recalculate the spec with the Domain Calculator. Maybe then the Statistics node will make use of the spec and finish faster.
Can you let me know how many rows and numeric and nominal columns you have? Then I can make some experiments on my own.
Kind regards,
Alexander
Thanks! The Statistics node did complete in ~18 hours. The file has over 100M rows and over 300 columns. The occurrence output table had close to 10K rows (I set the max to 10K). So, the workflow appears to be working.
Wow, that is a lot. Which statistics were you interested in? As far as I know, the node does not work in parallel. If you are interested in statistics that can be computed by first aggregating chunks in parallel and then computing an aggregate of those aggregates, you may be able to speed up your processing by using the Parallel Chunk Start and Parallel Chunk End nodes. They run the loop body in parallel on chunks of the input data. That way, you could calculate, for example, Sum, Count, and possible values in parallel, then calculate the total sum as the sum of sums, the total count as the sum of counts, the mean as the sum of sums divided by the sum of counts, the possible values as the union of possible values, and so on. That could easily halve or even quarter your processing time, depending on the number of CPU cores you have.
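To make the aggregate-of-aggregates idea concrete, here is a minimal sketch in plain Python rather than KNIME. It does not use any KNIME API; the names `Partial`, `summarize`, `merge`, and `chunked` are made up for illustration, and the chunks are processed one after another here, although each `summarize` call is independent and could run in its own worker.

```python
from dataclasses import dataclass, field


@dataclass
class Partial:
    """Per-chunk aggregate (hypothetical helper, not a KNIME type)."""
    total: float = 0.0
    count: int = 0
    values: set = field(default_factory=set)


def summarize(chunk):
    """Aggregate one chunk: sum, count, and the set of distinct values."""
    return Partial(sum(chunk), len(chunk), set(chunk))


def merge(partials):
    """Combine per-chunk aggregates: sum of sums, sum of counts, union of values."""
    out = Partial()
    for p in partials:
        out.total += p.total
        out.count += p.count
        out.values |= p.values
    return out


def chunked(data, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


data = [3, 1, 4, 1, 5, 9, 2, 6]
# Each chunk is summarized independently; this is the part that can be parallelized.
partials = [summarize(c) for c in chunked(data, 3)]
combined = merge(partials)
# The overall mean is the sum of sums divided by the sum of counts.
mean = combined.total / combined.count
```

Note that this only works for statistics that decompose this way. A median, for instance, cannot be obtained by combining per-chunk medians.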
Kind regards,
Alexander
The primary interest is in getting a list of all values for each field. Having all of the values helps to verify that the data is valid and complete.
I can calculate stats separately in a spreadsheet.
I think the best approach is to run a separate program/script to read the CSV data and produce a list of all values for all fields. If I can’t find one that already exists, I’ll probably write one.
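A minimal sketch of such a script in Python, using only the standard library, might look like the following. It assumes the CSV has a header row with distinct column names, is comma-delimited, and is UTF-8 encoded; the function name `unique_values` and the `max_unique` parameter are made up for illustration. The file is streamed row by row, so memory use depends on the number of distinct values kept, not on the number of rows.

```python
import csv


def unique_values(path, max_unique=None, encoding="utf-8"):
    """Collect the distinct values of every column in a CSV file.

    Returns {column_name: (sorted_values, truncated)}, where `truncated`
    is True if the column had more than `max_unique` distinct values and
    the extra ones were dropped. Assumes a header row with distinct names.
    """
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)
        seen = [set() for _ in header]
        truncated = [False] * len(header)
        for row in reader:
            # Ignore any cells beyond the header width.
            for i, value in enumerate(row[:len(header)]):
                column = seen[i]
                if value in column:
                    continue
                if max_unique is not None and len(column) >= max_unique:
                    # Cap reached: remember that values were dropped.
                    truncated[i] = True
                    continue
                column.add(value)
    return {
        name: (sorted(seen[i]), truncated[i])
        for i, name in enumerate(header)
    }
```

Calling `unique_values("data.csv", max_unique=10000)` (with a placeholder file name) would keep at most 10,000 distinct values per column and flag the columns that exceeded the cap. With over 300 columns, leaving `max_unique` unset could use a lot of memory if some columns hold mostly unique values such as IDs.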