Row splitting and table dividing or other data cleaning ideas and resources

andyvanhout · October 7, 2021, 7:59pm

I know there’s a solution here, that it is probably a simple solution, but my brain is old and inflexible and apparently not able to come up with one. My apologies in advance.

I’m working with phone metrics data using end of day intervals. The data itself comes in ugly formatted spreadsheets, and I figured out how to clean that up to a point. Once the bulk of the cleaning is done and extraneous rows and columns jettisoned, I’m left with a split table that effectively consists of “column headers” and “the data.”

The “column headers” bit consists of two rows with some strings and some empty cells. I’d like to merge the rows and combine/concatenate the strings where there are strings. So if A1 is “ANS” and A2 is “%” I’d like the new A1 to be “ANS %”. The Concatenate node seems to just re-combine split rows, and my old inelastic brain hasn’t been able to properly google a solution.

“The data” potion of the of the tables also has me stumped. This table has a summary row, followed by a sub-summary row for a media type, followed by dates with call data. I’d like to remove the summary and sub-summary rows (I could probably do that with a row-splitter node), but add the media type as a new column for each date to each row it applies to (since it’ll only reply to rows that follow the sub-summary row in sequence). That last bit is where I’m stuck.

I’ve copied my a workflow here with some data if people want to poke around. But even more than a solution (which I’ll need eventually), I’ve been poking around resources for data cleaning and transformation and maybe I just haven’t happened on the right example or tutorial yet.

Again, apologies, this should be something I can figure out and maybe after typing it all out my brain will decide to start working.

ICBM-Queue-Data-Clean.knwf (83.2 KB)

bruno29a · October 7, 2021, 9:30pm

Hi @andyvanhout , and welcome to the Knime Community.

I’m not sure I fully understand what you are saying.

First of all, a few comments about your workflow.
Based on the original data, and based on what you did with the Row Filter, it looks like you want to start reading from the 10th line of the Excel file:

And also to read all the columns except column Q:

All these can be done directly in the Excel Reader node.

Basically, all the nodes that I am highlighting from your workflow can be configured directly in the Excel Reader node:

can be configured as:

Looking at your final data (Column Filter):

you don’t exactly have this:

if A1 is “ANS” and A2 is “%” I’d like the new A1 to be “ANS %”

But I can understand that column E and F represent ANS # and ANS % respectively, similarly for G and H representing ABD # and ABD % respectively.

I put something together for this part.
First, I’ve reduced this (your workflow):

to this:

And then added other processes. The whole workflow looks like this:

And the table looks like this:

Note: I had to re-create your Excel file to be able to play with the Excel Reader’s settings. I simply just wrote to an Excel file after the Excel Reader, which is not part of the workflow. It was a preparation for the workflow and was removed once the Excel was created.

Here’s the workflow (with the Excel file included in the workflow’s data folder): ICBM-Queue-Data-Clean_Bruno.knwf (53.7 KB)

The next part, I don’t understand what you mean:

I can see the summary row, but what’s the column/row that represents the media type? I’m not understanding what needs to be done.

Can you give an example of the expected output for a few lines? What’s supposed to go in the new column?

andyvanhout · October 7, 2021, 10:12pm

First, thank you so much for the reply. I haven’t fully read or made the changes to my workflow yet, but at at quick glance everything you said makes sense and I’ll be able to sort through it. In my rush to answer the bits you didn’t understand though:

Absolutely, this ask/question is easier to illustrate than it is to articulate (I realize that “media type” is never explicitly defined anywhere, not by the sheet or by me):

Start Data Upload and Picture:

sample data start.xlsx (14.7 KB)

Finished Data Upload and Picture:

sample data finish.xlsx (14.2 KB)

bruno29a · October 7, 2021, 11:50pm

Hi @andyvanhout , ok no problem, I understand for the Media Type now.

Just to confirm another thing, you don’t need the Summary row, correct? I have removed the line in the workflow via a Row Filter.

I added 3 more nodes to the workflow to add the Media type. The workflow looks like this now:

Results:

…

Here’s the updated workflow: ICBM-Queue-Data-Clean_Bruno.knwf (71.3 KB)

EDIT: Just some additional notes. The way I “recognized” a Media Type is that if the value starts and ends with a number, then it assumes it’s a date and not a Media type, otherwise it’s a Media type.

I did not want to evaluate the exact format of x/x/xxxx or xx/xx/xxxx in case your date format changes, so I only evaluate if the value starts with a number and ends with a number.

This means if you have a Media type name as 1abc2, it will not be detected as a Media type. Of course you can adjust this eventually. It’s part of the Column Expressions node (node 18) of the workflow.

andyvanhout · October 8, 2021, 1:24am

That’ll all work perfectly, I’ll tinker with the workflow a little bit if I need to and go from there. Thanks again!

bruno29a · October 8, 2021, 1:54am

You are very welcome @andyvanhout

system · October 15, 2021, 1:54am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.