Assign names back to the related columns and work with new names

Hi guys,

Maybe someone can help me with my problem. I used the clustering workflow that you can find here (pattern analysis) to deal with typos in company names. The original Excel sheet contains more columns than just the company name, such as Product, Amount, Currency and Country. With the help of the workflow I was able to assign misspelled names to a cluster. Because certain nodes can only work with one column, the other columns get dropped along the way, and I would like to reconnect them. Once a company name has been assigned to a cluster, the parent name of that cluster should be used for all further processing. For example, suppose we had the following constellation before:
|Company name|Counterparty|Port of origin|Port of destination|
|Siemens AG|Altanz GmbH|Germany|Netherlands|
|SIEmens AG|Altanz GmbH|Germany|Netherlands|
|Simmens ag|Altant GmbH|Germany|Netherlands|

After the workflow used in pattern analysis the following should be created:
|Company name|Counterparty|Port of origin|Port of destination|
|Siemens AG|Altanz GmbH|Germany|Netherlands|
|Siemens AG|Altanz GmbH|Germany|Netherlands|
|Siemens AG|Altanz GmbH|Germany|Netherlands|

I would like to continue working with this table.
Is there a way to do this?

Thank you and kind regards,
canan

Hello @anon33357744,

if I understand your request correctly, you want to join information from earlier in the workflow to the transformed information later in the workflow, right? Just to make sure: did you know that joins can be performed not only on column values, but also on RowIDs? If your RowIDs have not been altered, you can just use those…
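To illustrate the idea outside of KNIME, here is a minimal Python sketch of a join on row IDs. The row IDs, column names and the helper `join_on_row_id` are made up for illustration; in the workflow itself this would be a Joiner node configured to match on RowID.

```python
# Original table, keyed by row ID (hypothetical data, analogous to KNIME RowIDs).
original = {
    "Row0": {"Company name": "Siemens AG", "Counterparty": "Altanz GmbH"},
    "Row1": {"Company name": "SIEmens AG", "Counterparty": "Altanz GmbH"},
    "Row2": {"Company name": "Simmens ag", "Counterparty": "Altant GmbH"},
}

# Result of a later processing step that kept the same row IDs.
cluster_names = {"Row0": "Siemens AG", "Row1": "Siemens AG", "Row2": "Siemens AG"}


def join_on_row_id(left, right, new_column):
    """Inner join: keep rows whose ID exists in both tables."""
    joined = {}
    for row_id, row in left.items():
        if row_id in right:
            joined[row_id] = {**row, new_column: right[row_id]}
    return joined


joined = join_on_row_id(original, cluster_names, "Cluster name")
for row_id, row in joined.items():
    print(row_id, row)
```

This only works as long as both tables still carry the same row IDs; if a node in between regenerates them, a column value has to serve as the join key instead.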

Best,
Alec


Hi all,

I hope you can help me with my problem :frowning:

Thanks and Kind regards,
Canan

Hi @anon33357744,

as I understand it, all you need to do is add the Beneficiary column to your GroupBy node, aggregated as a Set or as Unique Concatenate. That way the information is not lost, and when your loop ends you can expand the result again (Ungroup node if you used the Set, Cell Splitter and Unpivot if you used the Unique Concatenate method) and join the information back to the data.


Best,
Alec

Hi @Alec,

it is not working for me: I can’t use Beneficiary twice. Do you think I can just take the GroupBy node out of my workflow?

Best,
Canan

Hi @anon33357744,

no, I don’t think you can just remove it - you need this node for the loop to do its job.
Maybe somebody else has an idea then…

Best
Alec

Hi @Alec,

OK, thank you anyway. @Corey, you helped me with this workflow, so maybe you also know how to solve my problem.

Would be great if someone could help me :confused: :frowning:

Regards,
Canan

Hi @anon33357744, glad to see work is moving forward!

The reason you can’t use the Beneficiary column twice in your group by is because we had set it to use the original column name.

A way around this that stays consistent inside the recursive loop is to add a Column Aggregator node before the loop begins, creating a new column that holds the set @Alec mentioned. Then we can modify the GroupBy node inside the loop to union those sets whenever we combine clusters.

After the loop you can use the Ungroup node to get a table that you can join back to your original data, correcting the names without losing all your other columns.
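As a rough illustration of that flow in plain Python (not the actual workflow): each name starts as its own cluster carrying a set of original spellings, merging clusters unions the sets, and "ungrouping" turns the sets into a lookup table that is joined back on the original name. The data, the helper `merge` and the hard-coded merge decisions are assumptions for the sake of the example; in the workflow the clustering decides which clusters get combined.

```python
# Hypothetical rows: (Beneficiary, Applicant).
rows = [
    ("Siemens AG", "Altanz GmbH"),
    ("SIEmens AG", "Altanz GmbH"),
    ("Simmens ag", "Altant GmbH"),
]

# Step 1 (before the loop): every name is its own cluster and carries
# a set containing its original spelling.
clusters = {name: {name} for name, _ in rows}


def merge(clusters, keep, absorb):
    """Step 2 (inside the loop): combining two clusters unions their sets."""
    clusters[keep] |= clusters.pop(absorb)


# Hard-coded here; normally the clustering decides what to merge.
merge(clusters, "Siemens AG", "SIEmens AG")
merge(clusters, "Siemens AG", "Simmens ag")

# Step 3 (Ungroup analogue): one (original spelling -> cluster name) pair
# per set member.
lookup = {
    original: cluster_name
    for cluster_name, members in clusters.items()
    for original in members
}

# Step 4 (Joiner analogue): join on the original Beneficiary name, so the
# other columns are kept and no row ID is needed.
corrected = [(lookup[beneficiary], applicant) for beneficiary, applicant in rows]
print(corrected)
```

The point of carrying the set along is that the original spelling survives the loop and can serve as the join key afterwards.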

I put boxes around the changed nodes in your workflow. Hope this helps!

TradeFinance 2.knwf (82.8 KB)

p.s. Sorry for all the delayed responses. Was a bit tied up with travel last week.


Hi @Corey,

I hope your trip was nice.
I think that doesn’t solve my problem.
The loop creates new RowIDs, which means I can’t use the RowID as the identifier I need to join the data. I only wanted to use the hierarchical cluster model to map very similar misspelled names to one particular name, the one that occurs most often. After that I want to reassign the correctly spelled names (Beneficiary) to the other columns: Applicant, Port of Destination and TransactionID.
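To make clear what I mean by "the one that occurs most often", here is a small Python sketch. The list of names and the function `representative` are only an example, not part of the workflow.

```python
from collections import Counter


def representative(names):
    """Return the spelling that occurs most often within one cluster."""
    return Counter(names).most_common(1)[0][0]


# Hypothetical cluster of similar spellings, duplicates included.
cluster = ["Siemens AG", "SIEmens AG", "Siemens AG", "Simmens ag"]
rep = representative(cluster)
print(rep)
```

Every name in the cluster should then be replaced by this representative.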

Kind regards,
Canan

The idea here was you would be able to use the original Beneficiary name to join instead of the row id.

Does that strategy not work? I may have misunderstood the issue.

I’m sure we’ll figure something out though :grin:


Hi Corey,

I get the following result:

I don’t know how to link this to the data in the Joiner now. The point is that the names that were misspelled should now carry the correct name, or at least the name that occurred most frequently.

I was thinking a joiner like this:

which results in a table like this:

where Beneficiary is the original, (#1) is your pre-processed one, and (#2) is the result of the clustering.


hi @corey,

yes, this is exactly what I was searching for, thank you so much! But here I get duplicate rows, and the type of the set column is shown as a question mark, which is why I can’t connect it with the Joiner.

Did you use the Ungroup node on the set column? After that it should be type string.


yes

I would like to upload my workflow here, but it is too large. Is there a way to send it to you, for example by mail?

Thanks :slight_smile:

hi @Corey ,

I did it this way:

but I am not happy with the results. As you can see here, some names were replaced with names that aren’t similar.

What did I do wrong?

You may want to try playing with the distance threshold in the hierarchical clustering node. This determines how closely related values need to be in order to be joined into a cluster.
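To show the effect of the threshold, here is a small standalone Python sketch. It is not what the KNIME node does internally: the distance function (based on `difflib.SequenceMatcher`), the single-linkage merging and the threshold values are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """0.0 for identical strings (ignoring case), close to 1.0 for unrelated ones."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(names, threshold):
    """Merge clusters while the closest pair of members is within the threshold."""
    clusters = [[name] for name in names]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                closest = min(
                    distance(a, b) for a in clusters[i] for b in clusters[j]
                )
                if closest <= threshold:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


names = ["Siemens AG", "SIEmens AG", "Simmens ag", "Altanz GmbH"]

# A tight threshold keeps the unrelated name in its own cluster.
print(cluster(names, 0.2))

# A loose threshold merges everything, including names that are not similar.
print(cluster(names, 0.95))
```

If dissimilar names end up with the same corrected name, the threshold is probably too loose; if obvious typos stay separate, it is probably too tight.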


Great this is a good point I have to try it out thank you :blush:
