Data Manipulation - Get single column data in multiple columns and rows

Hi Team,

I am trying to manipulate some data. The data is like this

image

and I Need to convert it into

I have been able to split the data in the column but I am unable to combine them into single row.

I tried using the missing value and Group by node but the results are incorrect as some of the data have outliers for e.g. this property does not have Area, when I use group by node and select the option to take value from next row or previous row. It takes value from other property as it’s area is missing. It’s also not showing Bed and baths for this property. what could be the best and most accurate to combines these row?

image

Below is the excel file and the workflow that I created.
Winston Hill data.xlsx (34.8 KB) Domain Data.knwf (32.2 KB)

Thanks

Hello @Ankit_smart,

can’t open your workflow. Give it another try uploading it or add it to KNIME Hub and link it.

Br,
Ivan

Hi @Ankit_smart -

Here’s a solution I came up with. I found it easier to restructure the data with a loop first, then extract the relevant bits and cleanup. Hopefully it helps you.

Domain Data SF Revised.knwf (59.1 KB)

3 Likes

@ScottF I came to a similar conclusion using the chunk loop but instead of removing the duplicate address row I was trying to figure out how to index backwards from “House” Kind of like setting the window but based on iterating when coming across “House”.

Is there a way to set a new iteration based on a variable? Similar to conditional variable loop end?

Thanks,
Jason

Thanks Scott, You made it look so easy. Learning a lot from you.

1 Like

What should be the best approach to get the data in the same format for this one(Added more data)? I have now added the remaining data. If I am using the windows loop to spread it in rows. It is not working smoothly as the data is now spread on a random basis?

Domain Data.knwf (57.2 KB) Winston Hill data.xlsx (36.0 KB)

@ScottF

I have finally solved the new bigger data using Reference row filter and few more row filter. Do you think it’s the best approach or was there something I could have implemented to more automate the process?

Winston Hill data.xlsx (33.4 KB) Domain Data - Final.knwf (59.3 KB)

Given how messy the original dataset is, I think what you have is working well. You might have to implement some fancier RegEx to deal with the variability in the format of the Beds/Baths/Parking field, but from what I can tell you are well on your way :+1:

Hello @Ankit_smart,

not sure this is the best approach as there can always show something in your data which is not on your filtering list and thus will be included. So instead of specifying what shouldn’t be on the list specify how cell structure of rows to be included should look like. For example using this expressions in Rule-based Row Filter node will give you same results:
$Data$ MATCHES "Sold [by|pior|at].*" OR $Data$ LIKE "$*" OR $Data$ MATCHES "\d+.*WINSTON HILLS NSW 2153|WINSTON HILLS NSW 2153" OR $Data$ LIKE "*Bed*Bath*" => TRUE

Also I suggest to create somewhat of a validation logic which can detect that filtering went wrong and send email for example.

Here is modified workflow for rest of processing as well without using loop but rather creating Group column using Math Formula and then GroupBy node to get desired structure. Also if you put your data in data folder which is located in workflow directory you can reference it easier from Excel Reader, don’t need to include on Forum as attachment and workflow can obviously be in state reset :wink:
Domain Data - Final_ipazin.knwf (100.4 KB)

Br,
Ivan

2 Likes

Thanks Ipazin. This is perfect and a very good improvement over the workflow that I have.

Can you tell me what does "\d+ is doing in MATCHES “\d+.*WINSTON HILLS NSW 2153|WINSTON HILLS NSW 2153”.

Also how do you send email from Knime once you have your validation model?

Hello @Ankit_smart,

MATCHES operator is based on regular expression and the one you are wondering means either find a value starting with one or more digits followed by/ending with WINSTON HILLS NSW 2153 either find only WINSTON HILLS NSW 2153. This is created based on your input table as you have certain addresses having only WINSTON HILLS NSW 2153 value. (I’m not a regex expert so probably this could/should be improved to be more bullet proof.)

For sending email you should use Send Email node.

Br,
Ivan

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.