Group by Chunks of pre-determined delimiters/keywords

@takbb Also, I am confused. Why are these keywords matching domain-name in the beginning?

@jarviscampbell ,

Logically it should be doing the same thing. There is always a chance that something is slightly wrong in what I supplied, but it is more likely that something went awry in the configuration because of column name changes in the data source or something like that.

[I’ve deleted the last bit; I was misreading what you said]

@takbb I think those are correct though.

They should not be matching the very first line of the code, right? The first line, domain-name, isn’t under any of these keywords.

In the next String Manipulation node, I have added my own join("^",$keyword$,"\s.+"), meaning it should only match keywords that are at the beginning of the line.
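To illustrate what an anchored pattern of that shape accepts, here is a minimal Python sketch rather than KNIME itself. The sample lines and the `keyword_pattern` helper are hypothetical, `re.escape` is an extra safety step not present in the original expression, and `re.fullmatch` is used on the assumption that the rule compares the pattern against the whole line.

```python
import re

# Hypothetical keywords and lines, modelled on the snippets quoted in this thread.
keywords = ["object network", "object service"]
lines = ["domain-name example.local", "object network xyzNetwork"]


def keyword_pattern(keyword: str) -> str:
    # Same shape as join("^", $keyword$, "\s.+"): the keyword at the start of
    # the line, one whitespace character, then at least one more character.
    # re.escape is an addition so special characters in a keyword stay literal.
    return "^" + re.escape(keyword) + r"\s.+"


# For each line, list the keywords whose pattern matches the whole line.
matches = {
    line: [k for k in keywords if re.fullmatch(keyword_pattern(k), line)]
    for line in lines
}

for line, found in matches.items():
    print(f"{line!r} -> {found}")
```

With a pattern like this, a line such as domain-name only ends up under a keyword if it actually starts with that keyword.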

Are you able to upload the workflow as it is at the moment?

Sure, I can. It’s about the same as yours.

I meant so I can see the data. When you were asking why it is matching the first lines, do you mean after the Cross Joiner? It matches everything to everything at that point, and then the Rule Engine sees what really matches based on the regex pattern.

In terms of the workflow, I wanted to be sure I’m seeing exactly what you’re seeing, i.e. exactly the data that is being fed in.

Yes, here you go, I was exporting to a new workflow. Mine is a mess at this point :slight_smile:

GroupBy-v4.knwf (154.0 KB)

Nah, something is missing, I think… check it out

object network xyzNetwork
host 4.4.4.4
description 4.4.4.4

host 4.4.4.4 should only match < object network >, right?

At the GroupBy, the column name changed between the original data and my demo, so it needs to be reconfigured. That’s why you get the empty table later.

If the image you posted was the output of the Cross Joiner (which I think it is), then that is doing exactly what it’s supposed to. It simply “joins” every row to every other row. You will see the total number of rows on the Cross Joiner output is the product of the rows from the two tables. (Hence why it takes up a huge amount of memory if the tables get very large).

There is no “matching” as such at that point. Its job is simply to put every keyword next to every col1 value.

Then the Rule Engine comes along and looks to see in which rows col1 matches keyword_pattern. It’s the KNIME equivalent of throwing mud at a wall and then seeing what sticks… :wink:
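The two steps can be sketched outside KNIME in a few lines of Python. This is only an illustration under assumptions: the sample values are hypothetical, and `re.fullmatch` stands in for the Rule Engine's pattern test.

```python
import re
from itertools import product

# Hypothetical sample values; the column names follow the ones used in this thread.
col1 = [
    "domain-name example.local",
    "object network xyzNetwork",
    "host 4.4.4.4",
    "description 4.4.4.4",
]
keyword_pattern = [r"^object network\s.+", r"^object service\s.+"]

# Cross Joiner step: every col1 value is placed next to every pattern.
# No matching happens here, so the row count is the product of both sizes.
cross = list(product(col1, keyword_pattern))
print(len(cross), "rows after the cross join")

# Rule Engine step: keep only the pairs where the line matches the pattern.
matched = [(line, pat) for line, pat in cross if re.fullmatch(pat, line)]
for line, pat in matched:
    print(line, "<-", pat)
```

Seeing domain-name next to every keyword after the first step is therefore expected; only the second step decides which pairs survive.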

That’s awesome, @takbb.

I really appreciate all the time you spent on this, and I apologize that I didn’t catch the ‘GroupBy’ change at the end. Thanks a lot!

See you next time.

Cheers
J.

No worries, and there is another minor duplication I saw. I was trying to work out why 56 input rows became 58 at the end of the Sorter. It was because you have “object service” twice in the keyword list, so if you remove that it should be right.
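A small Python sketch of why a repeated keyword inflates the row count: after a cross join, each line that matches the repeated keyword is kept once per copy. The lines and the `count_matches` helper are hypothetical; only the duplicated “object service” entry comes from the thread.

```python
import re
from itertools import product

# Hypothetical lines; two of them start with the duplicated keyword.
lines = ["object service http", "object network xyzNetwork", "object service https"]
keywords = ["object network", "object service", "object service"]  # duplicate entry


def count_matches(keyword_list):
    # Cross join lines with keywords, then count the pairs that match.
    return sum(
        1
        for line, kw in product(lines, keyword_list)
        if re.fullmatch("^" + re.escape(kw) + r"\s.+", line)
    )


with_dupes = count_matches(keywords)
# dict.fromkeys removes duplicates while keeping the original order.
deduped = count_matches(list(dict.fromkeys(keywords)))
print(with_dupes, deduped)
```

Each line that starts with the repeated keyword contributes one extra row, which fits the two-row difference described above if two lines begin with “object service”.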

@takbb Indeed I had. Took care of it. Thanks for looking out! Cheers.

Hi @jarviscampbell ,

I thought I’d take this opportunity to showcase a new component of mine. One of the less intuitive aspects of the solution for this workflow is the mechanism to join using regex patterns. The standard KNIME joiner nodes do not provide a facility for non-equi-joins (e.g. wildcards, regular expressions and ranges).

After assisting with your workflow, I decided to create a set of components which use the built-in H2 database behind the scenes to help bridge this gap.
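The component's internals are not shown in this thread, so the following is only a rough sketch of the general idea of pushing a regex join into an embedded SQL database. It uses Python's built-in `sqlite3` as a stand-in for H2 (an assumption for illustration), and the table names, column names, and sample rows are hypothetical.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X),
    # so the pattern arrives first and the tested value second.
    return int(value is not None and re.fullmatch(pattern, value) is not None)


conn.create_function("REGEXP", 2, regexp)

conn.execute("CREATE TABLE lines (col1 TEXT)")
conn.execute("CREATE TABLE keywords (keyword_pattern TEXT)")
conn.executemany(
    "INSERT INTO lines VALUES (?)",
    [("domain-name example.local",), ("object network xyzNetwork",), ("object service web",)],
)
conn.executemany(
    "INSERT INTO keywords VALUES (?)",
    [(r"^object network\s.+",), (r"^object service\s.+",)],
)

# Non-equi-join: the join condition is a regex test instead of equality.
rows = conn.execute(
    "SELECT l.col1, k.keyword_pattern "
    "FROM lines l JOIN keywords k ON l.col1 REGEXP k.keyword_pattern "
    "ORDER BY l.rowid"
).fetchall()
for row in rows:
    print(row)
```

The join condition replaces the separate cross join and filter steps, which is the readability gain a component like this aims for.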

The attached shows how the Join Regexp_Like component can simplify that part of the workflow, making it more readable and intuitive.

I have left the old mechanism at the top, and added the new variation at the bottom.

GroupBy-alternative with regex join component.knwf (165.1 KB)

There are some other minor changes as a result but hopefully you may find such a component useful in future. You can see the full set in my other post on the subject:

Hi @takbb, I will take a look at your next post. However, I think I jumped the gun. I thought you did such a beautiful job displaying your solution that I got overwhelmed about the first requirement/use case :slight_smile: The solution must group all the keywords under one single column. I should only have around 16ish columns instead of hundreds :slight_smile: I bet that is an easy fix during the ‘group-mark’ stage, isn’t it?

Hi @jarviscampbell , I don’t think it would be too difficult, but can you give an example of the output? If it helps, this is my understanding of what you’ve said (using my original simple data with a few additional lines). Is this right?

col1
z1
z2
z3
z4
A
x5
x4
x3
x1
B
x9
x4
x5
C
x2
x3
A
x4
x5
x10
C
x2
B
x5
x3

And your keywords were

A
B
C

Would the expected result be something like this (does the heading get repeated, or do you just want all the “data” lines?):

col1  col2  col3
A     B     C
x5    x9    x2
x4    x4    x3
x3    B     C
x1    x5    x2
A     x3
x4
x5
x10
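One possible reading of that layout, sketched in Python rather than KNIME: every line is appended to the column of the most recent keyword, so each chunk is stacked under its keyword with the heading repeated. The assumptions here are that keywords match by exact equality (as in this simple data) and that lines before the first keyword are dropped. Every line of a chunk is kept, so the B column also carries the x5 from its first chunk.

```python
from itertools import zip_longest

col1 = [
    "z1", "z2", "z3", "z4",
    "A", "x5", "x4", "x3", "x1",
    "B", "x9", "x4", "x5",
    "C", "x2", "x3",
    "A", "x4", "x5", "x10",
    "C", "x2",
    "B", "x5", "x3",
]
keywords = ["A", "B", "C"]

# One output column per keyword; each line joins the column of the
# keyword that most recently appeared above it.
columns = {k: [] for k in keywords}
current = None
for value in col1:
    if value in columns:
        current = value
    if current is not None:  # lines before the first keyword are skipped
        columns[current].append(value)

# Transpose the columns into rows, padding shorter columns with blanks.
rows = list(zip_longest(*columns.values(), fillvalue=""))
for row in rows:
    print("\t".join(row))
```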

(I might not be able to come back to you right away as I’ll be away from my computer for the rest of the day - yes even I have time away from KNIME :wink: )

lol, of course I understand. Whenever you have the chance. But yes, I think it looks like that, @takbb:

col1 col2 col3
A B C
x5 x9 x2
x4 x4 x3
x3 B C
x1 x5 x2
A x3
x4
x5
x10

Man, @takbb, that is awesome, but I must admit it is a bit above my level to understand fully. I do see that we went from 4 nodes to 3, so that all got composed within a single new component, which is great. I can see you put a lot of work into that. KNIME should hire an expert like you. :slight_smile: Thank you for your contribution to the community. I am sure a lot of us will benefit from your efforts.

Hopefully this is a step in the right direction…

I just added some more stuff on the end, making use of the joiner regex component.

It currently leaves gaps, which means that a new “grouping” always starts on the same line as other new groupings. This is just the way the method I used ends up doing it. That may be what you are after. If you want it “squashed up” so the blank cells disappear, I’ll have to think about that. There are forum posts on the subject, such as this one, but I haven’t thought this through any further to say if there are alternative approaches:

And now I must dash! :wink:

GroupBy-alternative with regex join component 2.knwf (221.9 KB)
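As a rough illustration of the “gaps” layout described in this post, here is a Python sketch based on my reading of that description, not on the workflow's actual nodes. It reuses the simple A/B/C data from earlier in the thread and assumes keywords match by exact equality. The n-th chunk of every keyword starts on the same row, and shorter chunks are padded with blank cells.

```python
from itertools import zip_longest

col1 = [
    "z1", "z2", "z3", "z4",
    "A", "x5", "x4", "x3", "x1",
    "B", "x9", "x4", "x5",
    "C", "x2", "x3",
    "A", "x4", "x5", "x10",
    "C", "x2",
    "B", "x5", "x3",
]
keywords = ["A", "B", "C"]

# Collect the chunks per keyword: a keyword line opens a new chunk,
# and the following lines belong to it until the next keyword.
chunks = {k: [] for k in keywords}
current = None
for value in col1:
    if value in chunks:
        chunks[value].append([])
        current = value
    if current is not None:
        chunks[current][-1].append(value)

# Line up the n-th chunk of each keyword side by side. Each block is as
# tall as its longest chunk, and the other columns are padded with blanks.
rows = []
for block in zip_longest(*chunks.values(), fillvalue=[]):
    rows.extend(zip_longest(*block, fillvalue=""))

for row in rows:
    print("\t".join(row))
```

Squashing the blanks away would amount to concatenating each keyword's chunks before transposing, at the cost of the groupings no longer starting on a common row.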
