Counting occurrences over String columns

beginner · July 30, 2014, 7:05am

I have, say, 10 String columns. These columns can contain the same or different values. I want to find the value(s) that occur most often across these 10 columns. If multiple values share the same highest count, all of them should be returned.

I can do that with a rather complex workflow including Column Aggregation, cell splitting and a Java Snippet. My question is whether there is a node I don't know about that could achieve this more easily.

Aaron_Hart · July 30, 2014, 5:34pm

It sounds like a good case for unpivot followed by a groupby. Can you post an example workflow?

beginner · August 13, 2014, 9:03am

This would be for ranking based on occurrence.

See attachment. This works but looks overly complicated.

rank_test.zip

Aaron_Hart · August 13, 2014, 11:12am

I have a solution using unpivot into an R snippet, but there doesn't appear to be an elegant way to handle ties.

After completely unpivoting the table, an R snippet with the following code will give you the most frequent entry for each row. Unfortunately, "which.max" returns only the first entry in a tie. Alternatively, "which.is.max" from the nnet package will return a random winner, but if you need all 3, I think your current method is best.

library(plyr)

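myFun <- function(x){
  # Count how often each value occurs within this row's group
  tbl <- table(x$ColumnValues)
  # which.max returns only the first maximum, so ties are not reported
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}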

knime.out = ddply(knime.in,.(RowIDs),.fun=myFun)
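For comparison, the tie handling that "which.max" lacks can be sketched in plain Java. This is only a minimal sketch, not part of the workflow above: it assumes the unpivoted values of one row are already available as a list, and the class name TieAwareMode and method name mostFrequent are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: returns all values that share the highest count.
final class TieAwareMode {

    /** Returns every value that shares the highest count, in first-seen order. */
    static List<String> mostFrequent(List<String> values) {
        // LinkedHashMap keeps first-seen order, so the output order is stable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            if (value != null) { // skip missing cells
                counts.merge(value, 1, Integer::sum);
            }
        }

        // Find the highest count across all distinct values.
        int highest = 0;
        for (int count : counts.values()) {
            highest = Math.max(highest, count);
        }

        // Collect every value whose count equals the highest count (ties included).
        List<String> winners = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == highest) {
                winners.add(entry.getKey());
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("A", "B", "A", "C", "B", "D");
        System.out.println(mostFrequent(row)); // prints [A, B]
    }
}
```

An empty or all-missing row yields an empty list here, which may or may not be what a given workflow needs.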

Regards,

Aaron

unknown_user · May 5, 2017, 10:38am

Hi colleagues, in particular @beginner_

I know it's an old thread, but I have taken a look at the example workflow, in particular the Java Snippet node in which you rank the occurrences.

int highestCount = -1;

for (String occurence : c_Uniqueconcatenatewithcount_SplitResultList) {

    final Matcher matcher = pattern.matcher(occurence);
    if (matcher.matches()) {
        final String compound = matcher.group(1);
        final int count = Integer.parseInt(matcher.group(2));
        if (count > highestCount) {
            out_MostCommon = compound;
            highestCount = count;
        } else if (count == highestCount) {
            out_MostCommon += ", " + compound;
        }
    }
}

My question is: what if I want the result to be the top 5 most frequent unique values instead of only the most frequent one? How should the Java code be structured in this example?

Thanks in advance.

-Giulio
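
Regarding the top-5 question above, one possible shape is sketched below. It is only a sketch under assumptions: the compound/count pairs are taken to be already parsed into a map (the regex pattern from the snippet is not shown in this thread, so it is not reproduced), "top 5" is read as the five distinct values with the highest counts, and the names TopOccurrences and topN are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the n values with the highest counts.
final class TopOccurrences {

    /** Returns up to n values ordered by descending count; equal counts keep map order. */
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so entries with equal counts stay in map iteration order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Illustrative counts only; in the workflow these would come from the parsed list.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("A", 3);
        counts.put("B", 7);
        counts.put("C", 1);
        counts.put("D", 7);
        counts.put("E", 2);
        counts.put("F", 5);
        counts.put("G", 4);
        System.out.println(String.join(", ", topN(counts, 5))); // prints B, D, F, G, A
    }
}
```

In a Java Snippet node, the joined string could be assigned to an output field such as out_MostCommon. If values tied with the fifth place should also be included, the loop would need to keep going while the count equals that of the fifth entry.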