I am a new knime user and not familiar with all the nodes and their function. I have a question regarding the GroupBy node.
I have a table with over 1000+ columns which contain string values (alphabets, that can be converted to numeric if needed). I want to use GroupBy (or any other) node to calculate the value with maximum occurence in each column and it's percentage occurence in that particular column.
Now, I can identify the value with maximum occurence and the number of of times it occurs in a column, but I am not able to calculate its percentage occurance.
For example if I have the following column:
Here using GroupBy node (Aggregation method: Mode) I can calculate that 'A' occurs maximum times and that it occurs 5 times. How do I tell that its percentage occurance is 50%?
Can anyone suggest a workflow for that, please?
Isn't there a percentage aggregation count in the groupby node?
There is but but how do I make sure that the percentage of the value with maximum occurrence is calculated?
Is there any way to do it that I may not know about?
Hope the attached workflow helps. It was not clear whether it was a requirement to compute the initial groups too if not, the inner loop can be replaced with a simpler part using TableRow To Variable.
PS.: It is easy to use strings instead of the numeric columns, but the data generator is easier to use numerics. (Just change the row filter and the snippets.)
attached you can find another flow without loops which should be faster for tables with many columns.
Quite nice. :) Thanks for sharing.
Would it not be a problem to select the columns with count in the _Count suffix? With 3 columns that is easy though.
I guess I misunderstood the original problem, silly me (in my example the highest value's percentage is selected not the most common value's).
Do you have an idea how should the multiple mode handled? The GroupBy Mode selects only a single value (the first most common). I think the better idea would be not using that at all, but using the occurrences table as you do, except keeping all of the most common values. (But in this case the loop is necessary again.)
by using "_Column" as search string and selecting the "Select all search hits" option you can select all necessary columns in one go in the column loop. We already have a bug to include regular expression support in the column filter which would make it easier.
You are right. If the column contains more than one most frequent value the mode operation returns only one of the values. In order to find all these values you would need to use the occurrence table as you said. Thanks for your comment.