Combining multiple rows into single row

Hi ,
I want to combine multiple rows into 1 based on some defined combinations.I have tried to explian the same with below sample data

cart1=test1,test2,test3,test4,test5,test6,test7,test8
cart2=test3,test4,test5,test6,test7,test8
cart3=test1,test5
cart4=test3
group by productid

test1,test2,test3,test4,test5,test6,test7,test8 are different rows for any productid and now I want to merge these rows into one row as per combinations defined.
eg 1)if for any product id only test1 and test5 are performed then these 2 rows should be merged into one row and column value becomes cart3(instead of test 1 and test5)
eg 2) if for any product id only test1,test2,test3,test4,test5,test6,test7,test8 are performed then these 8 rows should be merged into one row and column value becomes cart1

I thought of using groupby node but not sure how can I put different combinations since the values are repeating- like test1 is available in two combinations cart1 and cart3.
If anyone can help with the worklow, that would be a great help.

Can you use separate groupby nodes with your groupings and then append them back together?
br

1 Like

Hi Daniel,

I didn’t get you completely , can you help me with example workflow if possible?

TIA

Hi @analytics_sharma , I don’t quite fully understand what you are trying to do. Can you show what the expected results of the sample data you provided so we can get a better idea of what the output it should generate?

2 Likes

Hi @analytics_sharma,

as bruno mentioned, providing sample data and the expected result is really helpful for us to understand your problem. I think I was able to guess your intended algorithm though, please confirm:

If, and only if, some specific tests are run on a productID, they should be grouped together in “carts”. If a productID has all tests of one cart, plus extra tests, it should not be grouped.

I made up my own sample data. Here’s what I did:

  1. sort the table to order the output of the GroupBy
  2. for each ProductID: collapse all rows into a single row using unique concatenate
  3. Rule Engine to replace matching items with carts
  4. Cell Splitter to create list
  5. Ungroup list back into multiple rows


carts.knwf (27.4 KB)

2 Likes

Hi Thyme,

Thank you so much , yes this is similar to what I am trying to do, I am good to start from here will get back to you if I need more clarifications or changes in the workflow.:slight_smile:

1 Like

Hi @bruno29a ,

Sorry for replying late, I was looking something similar to what @Thyme has proposed, I am good to start from here, will get back in 2-3 days after implementing same in my actual data.
Thanks a lot for your response :slight_smile:

Hi @Thyme ,

I have a query ,there can be more than one combination for any productid. So in above output there should be 4 rows for productid1 and 2 rows for product 2 . Sample expected output is

Productid Test_sample
1 cart1
1 cart2
1 cart3
1 cart4
2 cart2
2 cart4
3 cart3
4 cart4
5 test3
5 test6
![image 447x500](upload://cpgFy5m4VvQpZFqIOFON8FYGvsu.png)

Ok, so the logic is different than I thought. Is this right?

Group tests in carts. Tests can be used multiple thymes, so that listed carts may overlap. Tests covered in carts should be dropped from the list, tests not covered should be kept.

I see 3 ways to solve this: Some convoluted Node logic (not very Node Golfy), something involving advanced RegEx (only know basics) or something with bit manipulation. I never did bit manipulation, but always wanted to! Now I get the chance.

Step 0: Hardcode everything!
Step 1: encode existing tests bitwise into an integer
Step 2: find all matching carts using bit-magic
Step 3: drop redundant tests with more bit-magic
Step 4: write output


carts2.knwf (52.4 KB)

PS: If you like bit manipulation, you might be interested in Nemeans YT video on the Fast Inverse Square Root Algorithm.

Yes , you got it almost correct. but technically we should not have tests which are not covered in any carts(in a real data) so in our sample example like test 2 is extra in product id 2 and product id 6 has test 2, test4 and test 6 which does not have any defined combinations so in real data this is not expected but I feel it is good practice to keep such data to know if we get some corrupt data.

Thanks a lot for the workflow and the learning reference .Will gonna read about bit manipulation :slight_smile:

Your are right, silently dropping errors is dangerous, and this code snippet provides an easy way of checking the inputs. A row splitter to split into carts and rows maybe? It shouldn’t be difficult to change to code to not write any tests in the output if you change your mind.

Thanks a lot for the question, it was interesting finding a solution. Do you mind marking my post as solution? thx

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Hi ,

I am looking for something to group tests column based on some groupings. So in my data if product id has some tests , in output we want to group them based on defined grouping and where grouping has some repeated tests and if some smaller group is covered in other group so there should not be the row for smaller group. Like in below data set
Input

Id Test
1 apple
1 mango
1 guava
1 grape
1 banana
1 plum
1 lemon
1 tomato
1 avocado
1 kiwi
1 berry
1 cherry
1 potato %
1 onion %
1 ginger %
1 bean %
1 chilli %
1 potato
1 onion
1 ginger
1 bean
1 chilli
1 garlic %
1 garlic
1 apple
1 potato
1 butterscotch
2 butterscotch
2 grape
2 banana
2 berry
3 apple
3 mango
3 guava
3 grape
3 banana
3 plum
3 lemon
3 tomato
3 avocado
3 kiwi
3 berry
3 cherry
![image 87x500](upload://fOc6JJ0PnfZRN2yt4Wenj4YmFfs.png)

Expcted Ouput

productid Test
1 fruit
1 vegetable
1 apple-pie#
1 cake
2 cake
2 sugar
2 berry
3 fruit
![image 337x433](upload://pQFQcAZucPiaIw1srJC649baAOl.png)

Grouping of tests
fruit=apple,mango,guava,grape,banana,plum,lemon,tomato,avocado,kiwi,berry,cherry
vegetable=potato %,onion %,ginger %,bean %,chilli %,potato ,onion,ginger,bean,chilli,garlic %,garlic
chocolate=chocolates %,chocolates,milk,dark
butter=cheese,guava
sugar=grape,banana
cottage=apple,mango,grape,berry
berry=berry
apple=apple,mango
apple-pie#=apple,potato
cake= butterscotch

so in the output for product id 1 , first tests are grouped into "fruit’ but there is no group for “berry” and “sugar”. Since these smaller groups are already covered in "fruit "group but there is group created for “apple-pie#” since the combination of tests apple and potato is unique and not covered in other groups-“fruit”,“vegetable”.

PS: The ask in this query is different from previous solution provided and test data is in text not in “test1”,“test2”…so not sure how I use previous solution to solve the problem.

I have a few questions about this:

  • is % a placeholder for something, or is “potato %” different to “potato”? This makes a difference for the grouping, for example in apple-pie#
  • why is there no cottage, sugar, berry in the output of id1? Can items be used only once per group?
  • id1 has duplicate apple and potato. Are those duplicates necessary for the apple-pie# group? (see previous question).
  • If items can be used only once, in what order should the groups be tested? For example apple, berry, sugar together have (almost) the same items as cottage.

Hi @Thyme ,

My answers below:

  1. “potato % " is different from “potato”
    2)Items can be used only once per group, if suppose berry has been used in fruits then berry as a separate group is not needed but yes if along with berry and similarly cottage is also not possible since the the combination -apple,mango,grape,berry is already been covered in fruits group.
  2. No duplicates are not necessary, if those duplicate values are not available , but since apple comes in group “fruit” and potato comes in group"vegetable" so group “apple-pie#” is needed in output
  3. From Top to bottom , like first we can check for fruits, then vegetable and so on

Let me know if you need any more clarifiaction
Thanks in advance

Ok, so this is a bit different then the previous ask, since we now have duplicate items. Each item is “single use”. If overlaping groups should be in the output, the intersecting elements need to appear more than once. That means bit trickery is off the table.

Step 0: More hardcoding
Step 1: for each group, check if all group-items exist in the input
Step 2: if that’s the case, delete those items from the input (delete once, don’t touch eventual duplicates)
Step 3: also append the group name to the output
Step 4: append all unused items to the output. assuming this didn’t change, there should be no items left. but that makes it easy to catch errors

I tried to do this with normal nodes, but KNIME is missing a loop start with one recursive and one group loop input :sweat_smile: The groups are again hardcoded. This is not optimal, they could instead be passed via a Flow Variable. Also I renamed the groups apple and berry to avoid confusion with the items of the same name.


carts (fruit basket version).knwf (58.2 KB)

Thanks a lot @Thyme

So in the workflow you have grouped by only product id since that;s how I had provided the data, but in my real I have to use many self join as well to get the data so my input to Java snippet node would be an array of items and not the not simple collection of strings like in the workflow after group By node we you get data like


in my workflow I get the data in the form of array like below

I am little novice to java hence if you can help in modifying your java snippet code according to array input , it would be of great help

Hi @Thyme ,

One more point is for product id 1 even if we do not have duplicate for apple and potato , we still expect the group - “apple-pie#” in output along with other groups - fruit,vegetable,cake since the combination of apple and potato is unique as this combination( as a whole) is not been covered in groups fruit and vegetable.
But when I removed the duplicate values for potato and apple in the product id 1 , I am not getting the “apple-pie#” group in output.
Attached is the executed workflow
carts (fruit basket version)_without duplicates.knwf 1.knwf (30.1 KB)

We can expect that there won’t be any duplicate values in the data so we really should be worried about duplicate values for any product id

I am sorry if this creates too much confusion for this :slight_smile:

I see, the task itself didn’t change at all, but the input data isn’t in a nice pseudo-numerical format?
So instead of test1, test2, test3, etc. we have normal strings (apples, oranges, etc.)?

My initial solution would still work, if we can transform the input from normal strings into a numerical format. How many tests does your actual input have? Just the 30 unique tests from the “Grouping of tests”? (and 42 total)

How does the input look before those self joins? Ideally it would be in a format similar to what I used in the table creator: 1 row per test, multiple rows per group; (and 1 column for the group name and 1 column for the test name)