I have a csv data set with hundreds of columns, where each column header is repeated several times (e.g., there is a column titled “J14” at columns #20, #64, #110, …) . I would like to combine each set of columns sharing the same header into one, where the value in each row of that combined column be the sum of the values in that row across all corresponding columns.
Any advice would be much appreciated.
Hi @behrooz12, can you provide some example data and your desired output?
It would be helpful for us to know the exact format of the duplicate column names.
Hi @elsamuel ,
Sure. Here is a sample input format (comma delimited):
B, K, D, J, K, B, J, D
1, 0, 0, 1, 0, 0, 1, 0
0, 0, 1, 1, 1, 0, 1, 0
While the expected output is:
B, K, D, J
1, 0, 0, 2
0, 1, 1, 2
This loop should get you started:
This is the original data:
This is the output:
I’ve uploaded the workflow here:
It worked! Thank you very much for your prompt response.
Hi @elsamuel, thanks for sharing this clever solution to the problem. My only thought though is on the use of the wildcards which might prevent this from working correctly in the edge case where one complete column name is the initial subset of another (unrelated) column. For example, if we had
In this situation, column B would collect the values from B, BZ, B (#1) and BZ (#1) while BZ would still collect from BZ and BZ (#1).
One possible way could be perhaps to use regex instead of wildcards and then collect the set of column names as alternatives for the regex match, so maybe B would be represented by the pattern B|B (#1) whilst BZ would be represented by BZ|BZ (#1).
I decided to try using your solution without regex and pattern matching, and instead adapted it to collect the set of column groups as string
[ ] arrays which could then be passed to the Column Aggregator as include lists.
Anyway, thanks for doing all the hard work. This is simply a refinement to it to handle a specific case that I’d noticed, and no doubt my additions could be refined further. (e.g. I don’t like the way my Regex Split gives a warning all the time! ) [edit: In fact I can’t now remember why I even changed away from your Cell Splitter and changed to Regex Split and Column Merger!!. Probably I was thinking of some other approach at the time!]
CSV Duplicate Column Aggregation 2.knwf (30.7 KB)
Nice work. I had initially thought about using a similar column list approach but I couldn’t quite get it to work the way I wanted.
Nice point @takbb . Thank you!
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.