Collection Column expansion

Hello,

I have a dataset with a few thousand rows. One of them is a collection column with values. some of them will have some in common and some not. So example:
[Old, Blue, Tall, Thin]
[Blue, Thin, Yellow]
and so on

what I want in the end is
Headers: Old | Blue | Tall | Thin | Yellow
Values: True, True, True, True, False
Values: False, False, False, True, True

How do I do that?

Hi @nxfxcom , there are a few ways to do this. I kept it simple by doing a Rule-Engine for each of the conditions:
image

And using the data from your sample:
[Old, Blue, Tall, Thin]
[Blue, Thin, Yellow]

It yields this result, which I believe is what is expected:
image

And here’s the workflow: Collection Column expansion.knwf (14.0 KB)

Thanks Bruno,

that sounds good, unfortunately, I don’t know the values before. Here is an actual example. There might be new values I don’t know before I see them. Its for an employee alerting system!

Below is example of actual data

|2016-12-16|[Email addresses, Passwords]|
|2018-04-28|[Email addresses, Names, Passwords, Usernames]|
|2019-01-07|[Email addresses, Passwords]|
|2016-10-20|[Email addresses, Passwords, Usernames]|
|2018-11-14|[Email addresses, Employers, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2018-06-01|[Credit status information, Dates of birth, Education levels, Email addresses, Ethnicities, Family structure, Financial investments, Genders, Home ownership statuses, Income levels, IP addresses, Marital statuses, Names, Net worths, Occupations, Personal interests, Phone numbers, Physical addresses, Religions, Spoken languages]|
|2016-10-13|[Email addresses, Passwords]|
|2012-03-22|[Email addresses, Passwords, Usernames, Website activity]|
|2020-03-04|[Email addresses, Genders, IP addresses, Names, Phone numbers, Physical addresses]|
|2014-09-10|[Email addresses, Passwords]|
|2020-01-13|[Device information, Email addresses, Names, Passwords, Social media profiles]|
|2019-03-18|[Dates of birth, Email addresses, Names]|
|2016-10-08|[Dates of birth, Email addresses, Genders, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2018-02-01|[Email addresses, IP addresses, Passwords, Usernames]|
|2017-10-26|[Email addresses, Passwords]|
|2008-07-01|[Email addresses, Passwords, Usernames]|
|2017-08-28|[Email addresses, Passwords]|
|2018-04-02|[Email addresses, Passwords]|
|2017-01-01|[Email addresses, IP addresses, Names, Physical addresses]|
|2018-09-18|[Email addresses, Genders, Names, Physical addresses]|
|2015-04-14|[Dates of birth, Email addresses, Genders, IP addresses, Names, Physical addresses]|
|2020-06-22|[Dates of birth, Email addresses, Genders, Names, Password strengths, Passwords]|
|2015-10-07|[Dates of birth, Email addresses, Genders, IP addresses, Names, Physical addresses]|
|2019-02-25|[Dates of birth, Email addresses, Employers, Genders, Geographic locations, IP addresses, Job titles, Names, Phone numbers, Physical addresses]|
|2016-06-27|[Email addresses, Names, Passwords]|
|2012-07-11|[Email addresses, Passwords]|
|2019-09-01|[Email addresses, Passwords, Phone numbers, Usernames]|

Hi @nxfxcom , it is indeed more challenging in this case :smiley:

Would these “Email addresses” be real email addresses in the real data, or does it literally say “Email addresses”? Similary for Genders, Password, etc…

And also, are these of type Collection, or just strings in square brackets with comma+space as delimiter?

2 Likes

Hi @nxfxcom , I’ve come up with something while waiting for your confirmation. I’ve had to make some assumptions in the mean time.

Assumptions as follows:

  • The data is how you presented it, that is it literally has the words “Email addresses”, “Genders”, “Password” etc
  • We gonna convert to columns only what’s after the dates and in the square brackets
  • The “collection” is just a string as presented in your sample
  • I copied the data you provided as is, meaning a whole line is just 1 record, which I then split using pipe ("|") as delimiter.

As always, we we can’t determine what the columns/structure will be, and more so because not all records would have the same columns, I always go to something like JSON, and then have Knime convert the JSON to columns where Knime will dynamically create the necessary columns automatically.

Based on the data you provided, this is what the end results of my new workflow looks like, which I think is what you expect:

This is how the workflow looks like:
image

And here’s the workflow: Collection Column expansion.knwf (13.7 KB)

Edit: You can also add the Missing Value node to fill the Missing strings with a value like this:

This will replace all the Missing values (?) to “False”:

It’s just a bit harder to distinguish between True and False :slight_smile:

6 Likes

You are a genius and lifesaver. If every company had one of you around Knime would be the biggest :wink: To close the loop, yes there was no value just true or false.

Thank you so much! I wish there was a tip button here :wink:

2 Likes

kudos to @bruno29a
I really liked your solution

I tried a different way which seems to work as well (I “borrowed” your workflow :wink:

2 Likes

Hi @Daniel_Weikert , I have other posts where we can’t know the structure as well, where I approached them the same way, via JSON, simply because Knime is able to create the columns as needed when converting JSON to table/columns.

I’d like to see your approach as I’d like to learn other ways :slight_smile:

Could you share your workflow?

Thanks

1 Like

Collection Column expansion.knwf (68.0 KB)
best regards and enjoy your weekend

1 Like

Thanks @Daniel_Weikert . Have a great weekend too.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.