Group by Chunks of pre-determined delimiters/keywords

Hello all,

I have a column/config with several hundred rows of configuration. Only certain chunks interest me, but since it can be thousands of lines long, I need to automate this.

I also have a list of keywords to match against the beginning of each row. (The sample config currently has spaces at the beginning of rows that are sub-grouped. If there's no need to keep those, a trim can be performed at the beginning of all rows and they can all be treated the same, matching only on the words at the beginning of each row. In that case, if a keyword is not at the beginning of a row, it should not match and the row should be discarded.)

The chunks will then be placed in different columns.

I am having a hard time using Variables and IndexQuery. As mentioned, I am thinking about using a table for the keywords and treating it as a dictionary list variable, then applying some sort of query from there. I have tried, but the query does not like wildcards.

Having learned from @takbb that the Rule Engine takes a different flavor of regex, I have created two sets of input. This is a sample; my original input file is an Excel spreadsheet, so I am able to build these concatenated keywords quickly to feed in here. Sorry I didn't supply a table, but cleaning newlines out of multiple columns seems very hard and unnecessary.

Thanks a lot,
J.

PS: looking at the breakDownInChunksSampleData spreadsheet, columns B-D are the sample expected outcome of the ‘Group by Chunks’ step.

GroupBy and Separate by Chunks based on List of Strings 2.knwf (42.2 KB)
configKeywords.xlsx (11.9 KB)
breakDownInChunksSampleData.xlsx (10.0 KB)

I don’t know IndexQuery so can’t really help with that but:

The regex is not suitable in this case. With "^object network\s\w+", for example, you will see that object network xyzNetwork is the only full match (which is what the Rule Engine needs), while the others are only partial matches.


This is because \w is not appropriate here: it only covers letters, digits and the underscore, which is an insufficient range of characters for these values.

The first record contains a dash and the third contains periods, so those strings are not recognized. If you take the equivalent of \w and add those characters, for example:

$column1$ MATCHES "^object network\s[A-Za-z0-9-_.]+" => $$ROWINDEX$$

You’ll get the proper row index now for all of them.
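(If it helps to see the behaviour outside of KNIME, here's a quick Python sketch of the same full-match logic; the sample values are hypothetical stand-ins for the rows in the screenshot, and re.fullmatch mirrors the Rule Engine's full-string MATCHES.)

import re

# Hypothetical stand-ins for the rows containing a dash and periods
lines = ["object network xyz-Network",
         "object network xyzNetwork",
         "object network xyz.Net.work"]

for line in lines:
    # \w only covers [A-Za-z0-9_], so the dash and periods stop the match short
    narrow = re.fullmatch(r"object network\s\w+", line)
    # the widened character class also accepts '-', '_' and '.'
    wide = re.fullmatch(r"object network\s[A-Za-z0-9-_.]+", line)
    print(line, bool(narrow), bool(wide))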


Hope this helps!


Thank you @ArjenEX , I found my mistake a bit ago and deleted the post, but I appreciate your reply. :slight_smile: I am now trying to get the GroupBy function to work even without the query, based on a single column with keywords to look for.

Hello again @jarviscampbell :slight_smile:

I’m just ignoring all the regex etc at the moment, and just want to clarify the requirement. Does a “chunk” (i.e. a new column) begin every time one of your “keywords” is found?
You said there are lines you aren’t interested in, but once you have found the first keyword, are there some lines you want to discard, or do you just keep everything else?

For example, suppose we had the following list:

col1
z1
z2
z3
z4
A
x5
x4
x3
x1
B
x9
x4
x5
C
x2
x3

And your keywords were

A
B
C

Your workflow so far does the clean-up stuff (which we can kind of ignore here), but I'm assuming it removes the initial lines z1-z4, since they come before the first keyword you are interested in (based on my understanding of what we discussed on the previous thread about that part of this task).

So here we are at the point where you want to process the rest of the lines beginning with the “A” in “chunks” or “groups”.

Would the above end up as:

col1  col2  col3
A     B     C
x5    x9    x2
x4    x4    x3
x3    x5
x1

Is that basically it, or is there more to it than that? (And I'm deliberately ignoring pattern matching, as that is just an “implementation detail” as far as I'm concerned :wink: )
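(For anyone following along, the transformation I'm describing can be sketched in a few lines of plain Python; exact keyword matching stands in here for the real pattern matching.)

lines = ["z1", "z2", "z3", "z4", "A", "x5", "x4", "x3", "x1",
         "B", "x9", "x4", "x5", "C", "x2", "x3"]
keywords = {"A", "B", "C"}

chunks = []           # one list per keyword "chunk", i.e. one per output column
current = None
for line in lines:
    if line in keywords:        # a keyword starts a new chunk
        current = [line]
        chunks.append(current)
    elif current is not None:   # keep every line after the first keyword
        current.append(line)
    # lines before the first keyword (z1-z4) are simply dropped

print(chunks)
# [['A', 'x5', 'x4', 'x3', 'x1'], ['B', 'x9', 'x4', 'x5'], ['C', 'x2', 'x3']]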


I think I am making very slow progress here, but it's something. @takbb @ArjenEX do you guys know why the hard-coded regex works in the Rule Engine, but not when brought in as a variable?

any ideas?

Hi @jarviscampbell , it looks like the value of your flow variable includes the double-quotes. You shouldn’t include the double-quotes in the data if you are building the values from a table, as they are implicit.
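(To illustrate in Python terms, assuming the Rule Engine's MATCHES behaves like a full-string match: once the quotes are part of the variable's value, they become literal characters the pattern expects to find.)

import re

line = "object network xyzNetwork"

good = r"^object network\s.+"       # what the flow variable should contain
bad = "\"^object network\\s.+\""    # the same pattern with quotes baked into the value

print(bool(re.fullmatch(good, line)))  # True
print(bool(re.fullmatch(bad, line)))   # False: the pattern now expects literal quotes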


Hi @takbb Not exactly. Once I find the first keyword match, a new ‘chunk’/column should start. I need to keep everything below it so I can keep analyzing/filtering; I don't want to keep the earlier rows that didn't match. Sorry if I wasn't clear. I apparently have trouble expressing myself lol

That is it! Exactly! How do we get that done? Again, after this portion gets figured out, we'll need to add a ‘dictionary’ for the words that need matching, because there are hundreds if not thousands of lines (a lot of chunks) that this column will be broken into.

@takbb Perfect! That was it for that portion. :slight_smile:

@takbb Also, I have tried something like this, but given the number of keywords I would end up with a workflow of hundreds and hundreds of unnecessary, repeated nodes.

I also can't seem to keep all the generated chunks. I have to keep going back to the original Rule Engine and re-filter from there.

Hi @jarviscampbell , there are some tricks with certain nodes that don't seem to do much on their own, but together they become quite powerful for specific purposes.

This workflow contains a couple of those different tricks, or “patterns”.

On the left, I used my own sample data, as it was simple to work with.

After the data, I added a Counter Generation node. Counters are good for restoring rows to their initial order after you've done a load of transformations, so we give each row a number.


The other input is the set of “keywords” (initial parts of phrases to be found)

A String Manipulation node turns all of these into whatever pattern we deem appropriate. You might change this regex if it doesn't perfectly match your data.


The regex I have here is similar to, but not quite the same as, yours: it requires that a match contain some whitespace and at least one other character after the keyword. That may not be right for you, and you can adjust it accordingly.
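(In rough Python/pandas terms, this step amounts to the sketch below; the keyword values and column names are made up, and the pattern shape is one plausible variant.)

import pandas as pd

# Hypothetical keyword list; the real one comes from the keywords input table
keywords = pd.DataFrame({"keyword": ["object network", "object service"]})

# Anchor each keyword at the start of the line, then require some whitespace
# and at least one more character after it (adjust to taste)
keywords["pattern"] = "^" + keywords["keyword"] + r"\s.+"
print(keywords["pattern"].tolist())
# ['^object network\\s.+', '^object service\\s.+']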

The seldom-used Cross Joiner is up next; it joins every row of one table to every row of the other. If your keyword dataset is small, this should be fine. If you cross-join tables with thousands of rows each, it starts to get inefficient and memory-hungry.
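(A rough pandas equivalent of the cross join, just to show the row blow-up; the sample data is made up.)

import pandas as pd

data = pd.DataFrame({"Counter": [0, 1, 2], "col1": ["A", "x5", "B"]})
patterns = pd.DataFrame({"pattern": ["^A$", "^B$"]})

# Cross Joiner equivalent: every data row paired with every pattern row
crossed = data.merge(patterns, how="cross")
print(len(crossed))  # 3 rows x 2 patterns = 6 rows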

Next, a Rule Engine looks at each joined row, finds the rows that actually match a keyword, and places a 1 in a new column.
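(Continuing the pandas sketch: the Rule Engine step amounts to a row-wise full match of each value against its paired pattern, writing a 1 on a hit; treating non-hits as missing here is an assumption.)

import re
import pandas as pd

crossed = pd.DataFrame({
    "col1":    ["A", "A", "x5", "x5"],
    "pattern": ["^A$", "^B$", "^A$", "^B$"],
})

# 1 where the row fully matches its paired pattern, otherwise missing
crossed["marker"] = [1 if re.fullmatch(p, v) else None
                     for v, p in zip(crossed["col1"], crossed["pattern"])]
print(crossed)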


It has to be a 1. The reason for that comes in a minute, but first we want to get rid of all those extra rows we generated:

First off, we want to split the rows that are “markers” (they have a 1 next to them) from the others.


The upper set contains no duplicates, because each of those rows matched a pattern, but the lower set has a load of duplicates, because those are the remaining rows that didn't.

So a Duplicate Row Filter gets rid of the duplication created by the Cross Joiner. But we still need to get rid of the copies of marker rows that were tested against the wrong keyword.

We can identify them by joining this lower list back with the set of marker rows correctly identified on the top port. If we right-join with it, we get back only the rows that are not “marker” words.
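(In pandas terms, the dedup-plus-join step boils down to an anti-join on the counter, roughly like this; the Counter values are made up.)

import pandas as pd

# Marker rows from the top port, and the lower (non-matching) rows
markers = pd.DataFrame({"Counter": [4, 9], "col1": ["A", "B"]})
lower = pd.DataFrame({"Counter": [4, 5, 5, 9],
                      "col1": ["A", "x5", "x5", "B"]})

# Duplicate Row Filter equivalent: keep one row per Counter value
lower = lower.drop_duplicates(subset="Counter")

# Join trick equivalent: drop rows whose Counter belongs to a marker row
lower = lower[~lower["Counter"].isin(markers["Counter"])]
print(lower)  # only the Counter 5 row (x5) survives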


So we now have the data, but we need to bring the marker and non-marker rows back together with a Concatenate node.

And we make use of the counter from the start to re-sort everything into the original order.

Now, you remember those 1 values that we got the Rule Engine to put in?

This is where we make use of them. A Moving Aggregation node can be configured to cumulatively sum this column down the table.

A simple trick, but we have now given all rows associated with each marker a “group identifier”.
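(The cumulative-sum trick in pandas form, for anyone who wants to play with it; treating non-marker rows as 0 is an assumption.)

import pandas as pd

df = pd.DataFrame({
    "col1":   ["z1", "A", "x5", "x4", "B", "x9"],
    "marker": [0, 1, 0, 0, 1, 0],   # 1 on marker rows, 0 elsewhere
})

# Moving Aggregation equivalent: a running total of the marker column gives
# each marker row, and every row beneath it, the same group number
df["GroupID"] = df["marker"].cumsum()
print(df)
# rows before the first marker get GroupID 0, ready to be filtered out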

A row filter can now discard those “unloved” rows that appeared before the first marker, and a column filter followed by a Column Renamer can tidy up a little.

And at this point, we are practically done. It leaves just one more trick involving a GroupBy, Column Filter (to remove the Group Number because it will now get in the way), a Transpose, and an Ungroup.

The GroupBy collects the rows for each group ID into a list, the Transpose puts each of these lists into its own column, and the Ungroup expands them back into rows.
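(The same trick in pandas, where the Row0/Row1 names stand in for what the Transpose produces:)

import pandas as pd

df = pd.DataFrame({
    "GroupID": [1, 1, 1, 2, 2],
    "col1":    ["A", "x5", "x4", "B", "x9"],
})

# GroupBy: collect each group's values into a list
lists = df.groupby("GroupID")["col1"].apply(list)

# Transpose + Ungroup: each list becomes its own column, padded with misses
wide = pd.DataFrame({f"Row{i}": pd.Series(v) for i, v in enumerate(lists)})
print(wide)
#   Row0 Row1
# 0    A    B
# 1   x5   x9
# 2   x4  NaN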


And that gives us the end result.


Breaking rows by keywords and breaking into column groups.knwf (101.5 KB)

I could have added a RowId node to tidy the rowids, and possibly a Column Rename (Regex) to turn Row0, Row1, Row2 into Col0, Col1, Col2 or something like that… but I am told you should always leave the audience wanting more :wink:


Hi @takbb , thank you so much for all the hard work and explanation you've put into these. I really appreciate it. I will go back several times to try to understand the workflow better and may have questions. However, when running it with my data, right before we get to your ‘magic’ :slight_smile: I see a slight mix-up in the GroupID numbers. In the Row_x_Rowy column, things seem to be in order, keeping around 3 rows together, at least as far as I could tell across the thousands of lines. But on the GroupingID, check it out:

Is that how it's supposed to be? Will the other nodes do something else with the misplaced GroupIDs later? If so, I can try to sanitize my data and post the incorrect results here. Let me know, thanks.

EDIT:
I need to add some important information I found around row 5000 of my data; apologies for that. I do not think it explains the mix-up above. However, I know I said in the beginning that each group's data would all be together, but in the middle of the config I found around 350 lines in the group < object network >, then suddenly some in the group < object service >, then back to < object network >. I believe the same might happen for some other keywords I have not yet entered. Again, apologies that I didn't catch this before.

Hi @jarviscampbell , can you double-check the config of the Duplicate Row Filter? It is supposed to be as below:


So it should just include “Counter” and be set to Enforce inclusion. I think I may have accidentally left it as Enforce exclusion, which means if you have any column name changes on your input data, it will not be filtering duplicates correctly. That may be partly a cause of what you are seeing. Just an initial thought.


Hi @takbb I changed it as it should be, and this is what changed at the same spot as before:

I notice now that the Transpose right after the Column Filter at the ‘magic’ spot has a warning and an empty data table.

The Column Filter is also empty; here's a sample of the last warning lines:

WARN  GroupBy              3:18       No aggregation column defined
WARN  GroupBy              3:18       1 invalid aggregation column(s) found.
WARN  GroupBy              3:18       No aggregation column defined
WARN  GroupBy              3:18       No aggregation column defined
WARN  GroupBy              3:18       1 invalid aggregation column(s) found.
WARN  GroupBy              3:18       No aggregation column defined
WARN  GroupBy              3:18       No aggregation column defined
WARN  GroupBy              3:18       1 invalid aggregation column(s) found.
WARN  Transpose            3:19       Node created an empty data table.
WARN  Ungroup              3:20       No ungroup column selected. Node returns input table.
WARN  Ungroup              3:20       No ungroup column selected. Node returns input table.

@takbb Question: you are feeding in a simple text sample of a few rows, while I am feeding in the output directly from where I cleaned up the beginning lines in the last thread. Could that be an issue?

@takbb Also, I am confused. Why are these keywords matching domain-name at the beginning?

@jarviscampbell ,

Logically it should be doing the same thing. There is always a chance that something is slightly wrong in what I supplied, but it is more likely that something went awry in the configuration because of column name changes in the data source or something like that.

[I've deleted the last bit; I was misreading what you said]

@takbb I think those are correct though.

They should not be matching the very first line of the config, right? The first line, domain-name, isn't under any of these keywords.

In the next String Manipulation, I added my own join("^",$keyword$,"\s.+"), meaning it should only match keywords that are at the beginning of the line.

Are you able to upload the workflow as it is at the moment?

Sure, I can. It’s about the same as yours.

I meant so that I can see the data. When you were asking why it is matching the first lines, do you mean after the Cross Joiner? It matches everything to everything at that point, and then the Rule Engine determines what really matches based on the regex pattern.

In terms of the workflow, I wanted to be sure I'm seeing exactly what you're seeing regarding the data being fed in.
