Identifying Intersecting Clusters

Dear Knimers, I need help with the task mentioned in the title.

Here’s what my data looks like:

As an example, I would like Cluster 8, Cluster 14, Cluster 17, Cluster 19, Cluster 23 & Cluster 22 to be identified as one group because they are linked to one another by ‘Alpaca’ & ‘Blueberries’, either directly or indirectly. The intersecting elements must be exact matches, not fuzzy matches. Another example would be Cluster 24 & Cluster 25, which are linked by the element ‘Blues’.

I am looking for a fully automated solution. The outcome table can be as simple as a row reporting the names of the linked clusters separated by commas.

Here’s the dummy data:
Clusters.knwf (9.0 KB)

Many thanks!

Hi @badger101 , what would the output look like?

Let’s take your example of “I would like Cluster 8, Cluster 14, Cluster 17, Cluster 19, Cluster 23 & Cluster 22 to be identified as one group”. What would be the output of that? An additional column called “Group” with a group number/ID?

Hi @bruno29a , the new group can take the name of one of its clusters, since the cluster names are unique. I would prefer a separate table with 2 columns: New Group Name & Group Members.

For example,

New Name | Members
Cluster 24 | Cluster 24, Cluster 25
Cluster 8 | Cluster 8, Cluster 14, Cluster 17, Cluster 19, Cluster 23, Cluster 22
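For anyone who wants to see the grouping logic outside KNIME, here is a minimal Python sketch, assuming the data can be read as (cluster, element) pairs. The sample rows and the function name `group_clusters` are made up for illustration and are not the contents of Clusters.knwf. It uses a union-find structure so that clusters linked directly or indirectly through exact-match elements land in one group, and it names each group after its first-seen member.

```python
from collections import defaultdict

def group_clusters(rows):
    """Group clusters that share at least one exact-match element,
    directly or through a chain of other clusters."""
    parent = {}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_with = {}  # element -> first cluster seen containing it
    order = []       # clusters in first-seen order
    for cluster, element in rows:
        if cluster not in parent:
            parent[cluster] = cluster
            order.append(cluster)
        if element in first_with:
            ra, rb = find(cluster), find(first_with[element])
            if ra != rb:
                parent[ra] = rb  # merge the two groups
        else:
            first_with[element] = cluster

    groups = defaultdict(list)
    for cluster in order:
        groups[find(cluster)].append(cluster)
    # Name each group after its first-seen member.
    return {members[0]: members for members in groups.values()}

# Hypothetical sample rows (not the real dummy data).
rows = [
    ("Cluster 8", "Alpaca"),
    ("Cluster 14", "Alpaca"),
    ("Cluster 14", "Blueberries"),
    ("Cluster 17", "Blueberries"),
    ("Cluster 24", "Blues"),
    ("Cluster 25", "Blues"),
    ("Cluster 30", "Cherry"),
]

result = group_clusters(rows)
for name, members in result.items():
    print(name, "|", ", ".join(members))
```

Because elements are compared as exact dictionary keys, ‘Blues’ and ‘Blueberries’ never link to each other, and a cluster that shares nothing stays in a group of its own.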

Hi @badger101 , thanks for the sample output.

I think I got something for you.

Does this result look good?

Here’s the workflow:

Thanks for that attempt @bruno29a . The table format is okay, but the outcome table still contains a lot of intersecting groups. For example, Row5_Row2, Row8_Row3, Row10_Row4 and some others should be grouped together as one.

Meanwhile, Row16_Row8 and Row17_Row9 should be in one group.

The number of new groups should be smaller than the original number of clusters. In this case it should be fewer than 31, not the same or more.

Well, that is what I originally thought, but your sample output made it unclear.

I mean, why choose Cluster 24 or Cluster 8 as New Name? I thought the results were by cluster.

This can easily be fixed.

Results after the modifications:
Workflow refreshed

Note: Sorry it took a bit of time to modify it. I had already closed KNIME, and it’s slow to start. I also had to reconnect to the Hub, and it’s past 2am here.

Hopefully this should do the trick.

EDIT: I chose the new name based on the sorted values. The sort is alphanumeric, meaning Cluster 14 is sorted before Cluster 8, which is why Cluster 14 was chosen. If you want Cluster 8, you can tweak this by extracting 8, 14, etc. as numbers into a new column, then sorting by that new column.
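To illustrate the sorting difference, here is a small Python sketch, assuming cluster names always end in an integer. The helper name `cluster_number` is made up for this example.

```python
import re

def cluster_number(name):
    """Return the trailing integer of a name like 'Cluster 14'.
    Names without a trailing number sort last."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else float("inf")

members = ["Cluster 8", "Cluster 14", "Cluster 23", "Cluster 17"]

# Plain string sort: the character '1' comes before '8'.
alphanumeric = sorted(members)

# Sort by the extracted number instead.
numeric = sorted(members, key=cluster_number)

print(alphanumeric[0])  # first name under string ordering
print(numeric[0])       # first name under numeric ordering
```

With string ordering the first name is Cluster 14; with the extracted number as the sort key it is Cluster 8.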

Thank you @bruno29a , I can still see intersections there. I’ll try to make an Excel table showing what the result should look like for this data. Given that it’s already past 12 AM your time, I’m okay if you take another look on another day (unless someone else wants to step in, of course).

Really appreciate the time you took!

I’ll update this post maybe in a few hours with the Excel file.

Found the mistake @badger101 . Somehow the values in column2 interfered with the deduplication. That column turned out not to be needed, so once I excluded it from the deduplication, the workflow deduplicated properly.

The hint was that I saw Cluster 23 listed twice for Cluster 23, and similarly for Cluster 3.

I corrected it.

Here’s the results now:
Workflow refreshed:

Thanks a lot @bruno29a !!! Exactly the same as my Excel result :grin:

I’ve marked it as a solution. How many hours did it take you to solve it? I probably spent at least 3 hours on it before going to bed with no solution last night. :man_shrugging: I tried many of the nodes you’ve used here, plus the Cross Joiner.

Hi @badger101 , no problem.

I don’t know how long it took me, probably less than 30 mins… It’s not a big workflow and there’s not a lot of data.

I had a clear idea of what approach I wanted to take from the beginning. Once this is set, the rest is pretty easy. That is why it took me just a few seconds to work out what to modify (Node 12 + Node 13) after your additional clarification. Similarly for the duplicates, I knew exactly where to fix this (Node 8 had to be adjusted).

Of course, you can uncover some challenges as you start implementing, but in most cases the challenges are solvable. For example, I was trying to find a way to “merge” my results after the 2nd join. I started going towards Rule Engine in my mind, but then realized that a Concatenate would do this for me.

I can see why you used the Cross Joiner node - it’s basically a 2-way relation, but using the Cross Joiner will break that relation, or rather create a relation between every record.

The proper approach is using 2 inner joins, or at least that’s the approach I used.

Because these 2 joins are inner joins, they include only the clusters that have a relation, which is why I needed a 3rd join, a left join, to include those that do not have a relation.

I think this can still be optimized by eliminating the left join. I can see 2 ways to do this:

  1. Convert the inner joins to left joins, with additional transformation
    or
  2. Modify the data at the beginning by adding a relation between each Cluster to itself

This was more of a “lazy” way since it was quite late.
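As a rough illustration of these points (not a reproduction of the workflow, whose nodes are not shown in this thread), here is a Python sketch. The sample rows and the helper name `linked_pairs` are made up. It shows that a self-join on the shared element drops a cluster with no relation, that relating every cluster to itself (option 2) keeps it, and that a cross join pairs everything with everything.

```python
from collections import defaultdict
from itertools import product

def linked_pairs(rows):
    """Self-join (cluster, element) rows on element, like an inner join:
    only clusters sharing an element with a different cluster are paired."""
    by_element = defaultdict(set)
    for cluster, element in rows:
        by_element[element].add(cluster)
    pairs = set()
    for clusters in by_element.values():
        for a, b in product(clusters, repeat=2):
            if a != b:
                pairs.add((a, b))
    return pairs

# Hypothetical sample rows.
rows = [
    ("Cluster 24", "Blues"),
    ("Cluster 25", "Blues"),
    ("Cluster 30", "Cherry"),  # shares nothing with any other cluster
]

pairs = linked_pairs(rows)
joined = {a for a, _ in pairs}  # Cluster 30 is missing here

# Option 2: relate every cluster to itself before joining.
all_clusters = {cluster for cluster, _ in rows}
with_self = pairs | {(c, c) for c in all_clusters}
joined_with_self = {a for a, _ in with_self}  # Cluster 30 is kept

# A cross join, by contrast, relates every cluster to every cluster.
cross = set(product(all_clusters, repeat=2))

print(sorted(joined))
print(sorted(joined_with_self))
print(len(pairs), len(cross))
```

In this toy data the join yields 2 real relations while the cross join yields 9 pairs, which is the sense in which a cross join loses the relation.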

The work itself took about 30 mins (Analyzing, thinking, implementing), but that spanned probably over an hour or so with the back and forth for clarification and my Knime and system being very slow (I’m running with about 2% space capacity :rofl: )

2% space capacity? :rofl: Time for a new computer!

Hi @badger101 and @bruno29a,
in addition to the accepted solution: the Network Component Splitter node was built for use cases like this. It can detect intersections like these over an arbitrary number of “jumps”, and it also handles large data volumes. If this is of interest and you need further assistance, please let me know by tagging me.
Best regards
Arne

Thank you @arbe , I’ll test it out if I face the same task again in the future.

Hello @arbe,

I hope you are doing fine. I just wanted to say that Network Component Splitter is a phenomenal node! And, to my surprise, extremely fast. My approach (for the task described here) used two recursive loops and ran for around 5 hours. With Network Component Splitter, the execution time is around 10 seconds :sweat_smile: And that includes some manipulation nodes, as my data is not a node-node relation but rather a node-value/attribute relation. If this node had configuration options to handle those cases as well, that would be awesome :wink:

Br,
Ivan

Hi @ipazin,

Thanks a lot for the feedback. I really appreciate you using the node and deeming it valuable. I also love it and it is in my standard toolset for data wrangling.
Regarding your idea: that sounds good, since many users will have node-attribute relations (as does the thread we are writing in). What exactly would be the desired functionality? Avoiding coincidental clashes of node and attribute names, plus removing the attribute-to-cluster assignment from the result table?

Best regards
Arne

Hello @arbe,

Correct. However, the workaround is not too complex: add a prefix to the Node column that does not occur in the Attribute column, then filter on that prefix so that only nodes are left inside the clusters.

I also have a couple more ideas for what this node could feature, but maybe those are too specific to my use case. Anyway, if I think they are worth sharing, I’ll open a new topic.
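Here is a minimal Python sketch of that prefix workaround, using a plain breadth-first search for the connected components. It is not how the Network Component Splitter is implemented; the `PREFIX` value, the function name `components` and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict, deque

# Assumed marker; it must not occur at the start of any attribute value.
PREFIX = "node::"

def components(edges):
    """Return the connected components of an undirected graph given as edge pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result

# Hypothetical node-attribute rows.
rows = [
    ("Cluster 24", "Blues"),
    ("Cluster 25", "Blues"),
    ("Blues", "Jazz"),  # a node whose name collides with an attribute value
]

# Prefix the node side so node names cannot clash with attribute values.
edges = [(PREFIX + node, attribute) for node, attribute in rows]

# Keep only prefixed entries (the nodes) and strip the prefix again.
groups = [
    sorted(v[len(PREFIX):] for v in component if v.startswith(PREFIX))
    for component in components(edges)
]
print(groups)
```

Without the prefix, the node named Blues would be treated as the same vertex as the attribute value Blues, and all three nodes would collapse into a single component.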

Br,
Ivan

If possible, please do that. The workflow on the KNIME Hub does not help me fully understand this (for me) new node.
br and have a great weekend
