Collection Column expansion

nxfxcom · June 17, 2021, 1:41pm

Hello,

I have a dataset with a few thousand rows. One of them is a collection column with values. some of them will have some in common and some not. So example:
[Old, Blue, Tall, Thin]
[Blue, Thin, Yellow]
and so on

what I want in the end is
Headers: Old | Blue | Tall | Thin | Yellow
Values: True, True, True, True, False
Values: False, False, False, True, True

How do I do that?

bruno29a · June 17, 2021, 2:04pm

Hi @nxfxcom , there are a few ways to do this. I kept it simple by doing a Rule-Engine for each of the conditions:

And using the data from your sample:
[Old, Blue, Tall, Thin]
[Blue, Thin, Yellow]

It yields this result, which I believe is what is expected:

And here’s the workflow: Collection Column expansion.knwf (14.0 KB)

nxfxcom · June 17, 2021, 2:20pm

Thanks Bruno,

that sounds good, unfortunately, I don’t know the values before. Here is an actual example. There might be new values I don’t know before I see them. Its for an employee alerting system!

Below is example of actual data

|2016-12-16|[Email addresses, Passwords]|
|2018-04-28|[Email addresses, Names, Passwords, Usernames]|
|2019-01-07|[Email addresses, Passwords]|
|2016-10-20|[Email addresses, Passwords, Usernames]|
|2018-11-14|[Email addresses, Employers, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2018-06-01|[Credit status information, Dates of birth, Education levels, Email addresses, Ethnicities, Family structure, Financial investments, Genders, Home ownership statuses, Income levels, IP addresses, Marital statuses, Names, Net worths, Occupations, Personal interests, Phone numbers, Physical addresses, Religions, Spoken languages]|
|2016-10-13|[Email addresses, Passwords]|
|2012-03-22|[Email addresses, Passwords, Usernames, Website activity]|
|2020-03-04|[Email addresses, Genders, IP addresses, Names, Phone numbers, Physical addresses]|
|2014-09-10|[Email addresses, Passwords]|
|2020-01-13|[Device information, Email addresses, Names, Passwords, Social media profiles]|
|2019-03-18|[Dates of birth, Email addresses, Names]|
|2016-10-08|[Dates of birth, Email addresses, Genders, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2018-02-01|[Email addresses, IP addresses, Passwords, Usernames]|
|2017-10-26|[Email addresses, Passwords]|
|2008-07-01|[Email addresses, Passwords, Usernames]|
|2017-08-28|[Email addresses, Passwords]|
|2018-04-02|[Email addresses, Passwords]|
|2017-01-01|[Email addresses, IP addresses, Names, Physical addresses]|
|2018-09-18|[Email addresses, Genders, Names, Physical addresses]|
|2015-04-14|[Dates of birth, Email addresses, Genders, IP addresses, Names, Physical addresses]|
|2020-06-22|[Dates of birth, Email addresses, Genders, Names, Password strengths, Passwords]|
|2015-10-07|[Dates of birth, Email addresses, Genders, IP addresses, Names, Physical addresses]|
|2019-02-25|[Dates of birth, Email addresses, Employers, Genders, Geographic locations, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2016-06-27|[Email addresses, Names, Passwords]|
|2012-07-11|[Email addresses, Passwords]|
|2019-09-01|[Email addresses, Passwords, Phone numbers, Usernames]|

bruno29a · June 17, 2021, 2:34pm

Hi @nxfxcom , it is indeed more challenging in this case

Would these “Email addresses” be real email addresses in the real data, or does it literally say “Email addresses”? Similary for Genders, Password, etc…

And also, are these of type Collection, or just strings in square brackets with comma+space as delimiter?

bruno29a · June 17, 2021, 3:05pm

Hi @nxfxcom , I’ve come up with something while waiting for your confirmation. I’ve had to make some assumptions in the mean time.

Assumptions as follows:

The data is how you presented it, that is it literally has the words “Email addresses”, “Genders”, “Password” etc
We gonna convert to columns only what’s after the dates and in the square brackets
The “collection” is just a string as presented in your sample
I copied the data you provided as is, meaning a whole line is just 1 record, which I then split using pipe ("|") as delimiter.

As always, we we can’t determine what the columns/structure will be, and more so because not all records would have the same columns, I always go to something like JSON, and then have Knime convert the JSON to columns where Knime will dynamically create the necessary columns automatically.

Based on the data you provided, this is what the end results of my new workflow looks like, which I think is what you expect:

This is how the workflow looks like:

And here’s the workflow: Collection Column expansion.knwf (13.7 KB)

Edit: You can also add the Missing Value node to fill the Missing strings with a value like this:

This will replace all the Missing values (?) to “False”:

It’s just a bit harder to distinguish between True and False

nxfxcom · June 17, 2021, 3:36pm

You are a genius and lifesaver. If every company had one of you around Knime would be the biggest To close the loop, yes there was no value just true or false.

Thank you so much! I wish there was a tip button here

Daniel_Weikert · June 17, 2021, 6:01pm

kudos to @bruno29a
I really liked your solution

I tried a different way which seems to work as well (I “borrowed” your workflow

bruno29a · June 17, 2021, 7:47pm

Hi @Daniel_Weikert , I have other posts where we can’t know the structure as well, where I approached them the same way, via JSON, simply because Knime is able to create the columns as needed when converting JSON to table/columns.

I’d like to see your approach as I’d like to learn other ways

Could you share your workflow?

Thanks

Daniel_Weikert · June 18, 2021, 2:59pm

Collection Column expansion.knwf (68.0 KB)
best regards and enjoy your weekend

bruno29a · June 18, 2021, 3:11pm

Thanks @Daniel_Weikert . Have a great weekend too.

system · June 25, 2021, 3:12pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.