Statistics Node Maximum Number of Values Per Column

exceluser · October 23, 2023, 6:41am

I am new to KNIME. I am running version 5.1.2 on a Windows laptop.

Background:
I have a statistics workflow node that appeared to stop or hang at ~22%. It sat at the same level for several hours. I canceled the execution and it stopped after another ~two hours.

Questions:

1. While executing, the statistics node reported that some fields exceed the max number of unique values. If this warning occurs does the node stop executing, or should it continue in spite of the exceeded number of unique values?
2. Is there a way to tell the statistics node to continue if that max is exceeded vs. just guessing the maximum number of unique values?

Thanks!

AlexanderFillbrunn · October 23, 2023, 7:30am

Hi,
The Statistics node should not hang in any case. If you need more than the 60 unique values counted, it may work to prepend a Domain Calculator node and in the bottom right of its config dialog choose not to limit the maximum number of unique values. However, this by itself may have a performance impact. It’s worth a try, though.
Kind regards,
Alexander

exceluser · October 23, 2023, 5:27pm

Thanks. Just to clarify, I am using the 3-output Statistics node - same name as the simpler one. I do not see an option to disable the max number of unique values. It may just be taking a very long time to complete. I have restarted it and I will let it run for 24-48 hours and see if it completes.

AlexanderFillbrunn · October 24, 2023, 7:13am

Hi,
The option to disable the max number of unique values is in the Domain Calculator node. In KNIME, every column has a “column spec” attached, consisting of the column name, column type, and possible values for nominal columns and min and max values for numeric columns. The possible values are only stored for up to 60 values, though. If you want more, you can recalculate the spec with the Domain Calculator. Maybe then the Statistics node will make use of the spec and finish faster.

Can you let me know how many rows and how many numeric and nominal columns you have? Then I can run some experiments on my own.

Kind regards,
Alexander

exceluser · October 24, 2023, 8:19am

Thanks! The Statistics node did complete in ~18 hours. The file has over 100M rows and over 300 columns. The occurrence output table had close to 10K rows (I set the max to 10K). So, the workflow appears to be working.

AlexanderFillbrunn · October 24, 2023, 8:52am

Wow, that is a lot. Which statistics were you interested in? As far as I know the node does not work in parallel, so if you are interested in statistics that are easy to calculate by first aggregating in parallel and then computing an aggregate of those aggregates, you may be able to speed up your processing by using the Parallel Chunk Start and Parallel Chunk End nodes. They run the loop body in parallel on chunks of the input data. That way, you could calculate, for example, the sum, count, and possible values in parallel, then calculate the total sum as the sum of sums, the total count as the sum of counts, the mean as the sum of sums divided by the sum of counts, the possible values as the union of possible values, and so on. That could easily halve or even quarter your processing time, depending on the number of CPU cores you have.
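To make the "aggregate of aggregates" idea concrete, here is a minimal Python sketch of the arithmetic only. It does not use any KNIME API; the function names (`partial_stats`, `combine`, `parallel_stats`) and the chunk size are made up for illustration, and the thread pool merely stands in for the parallel loop body.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Aggregate one chunk of a numeric column: sum, count, distinct values."""
    return {"sum": sum(chunk), "count": len(chunk), "values": set(chunk)}


def combine(partials):
    """Merge per-chunk aggregates into totals (the aggregate of aggregates)."""
    total_sum = sum(p["sum"] for p in partials)        # sum of sums
    total_count = sum(p["count"] for p in partials)    # sum of counts
    values = set().union(*(p["values"] for p in partials))  # union of values
    mean = total_sum / total_count if total_count else None
    return {"sum": total_sum, "count": total_count, "mean": mean, "values": values}


def chunked(rows, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_stats(rows, chunk_size=3, workers=4):
    """Compute per-chunk aggregates concurrently, then combine them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunked(rows, chunk_size)))
    return combine(partials)


result = parallel_stats([1, 2, 2, 3, 4, 4, 5])
print(result)
```

The combine step is the part that carries over to any chunked setup. In plain CPython, threads will not speed up CPU-bound work like this, so a real speedup would need separate processes or, in KNIME, the parallel chunk loop itself.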
Kind regards,
Alexander

exceluser · October 25, 2023, 12:17am

The primary interest is in getting a list of all values for each field. Having all of the values helps to verify that the data is valid and complete.

I can calculate stats separately in a spreadsheet.

I think the best approach is to run a separate program/script to read the CSV data and produce a list of all values for all fields. If I can’t find one that already exists, I’ll probably write one.
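A minimal Python sketch of such a script, under some assumptions: the CSV has a header row, values are compared as raw strings, and the function name `distinct_values_per_column` and the 10,000-value cap per column (mirroring the limit used above) are illustrative choices, not anything taken from KNIME.

```python
import csv
import io


def distinct_values_per_column(lines, max_per_column=10_000):
    """Stream CSV rows and collect the distinct values seen in each column.

    `lines` is any iterable of text lines, such as an open file.
    Returns (values, truncated): a dict of column name -> set of values,
    and the set of columns that hit the cap and are therefore incomplete.
    """
    reader = csv.DictReader(lines)
    values = {name: set() for name in (reader.fieldnames or [])}
    truncated = set()
    for row in reader:
        for name, seen in values.items():
            value = row.get(name)
            # Skip cells missing from short rows and values already recorded.
            if value is None or value in seen:
                continue
            if len(seen) >= max_per_column:
                truncated.add(name)  # a new value arrived after the cap was hit
            else:
                seen.add(value)
    return values, truncated


# Small in-memory example; a real file would be opened with
# open("data.csv", newline="") where "data.csv" is a placeholder path.
sample = io.StringIO("city,code\nOslo,1\nBergen,2\nOslo,3\n")
values, truncated = distinct_values_per_column(sample)
for column, seen in values.items():
    print(column, sorted(seen))
```

Because rows are read one at a time, memory use depends on the number of distinct values kept rather than on the row count, which matters for a file with over 100M rows. Columns listed in `truncated` would need a higher cap or a separate pass.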

system · January 23, 2024, 12:18am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.