Convert Diacritics to ASCII equivalence

HanhDo · August 23, 2021, 1:21pm

Hi everyone,
I would like to convert any character containing diacritics to an ASCII equivalent for columns, e.g.
ä => ae, å => aa.

I found that string manipulation has a function named “removeDiacritic($$CURRENTCOLUMN$$)”, but it seems to only strip the diacritical marks from the string, e.g. Arvå => Arva and ä => a, rather than converting the characters to the equivalents above.
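
For readers outside KNIME, the stripping behaviour described above can be reproduced with a minimal Python sketch. It assumes that removing diacritics amounts to Unicode decomposition followed by dropping the combining marks. This is an illustration only, not KNIME's actual implementation of removeDiacritic, and strip_diacritics is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (hypothetical helper)."""
    # NFD splits a character such as "å" into the base letter "a"
    # followed by a combining mark (here U+030A, COMBINING RING ABOVE).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so filtering on it keeps only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Arvå"))  # Arva
```

The mark is simply discarded, which is why ä becomes a rather than ae.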

Could you please advise how I can convert these characters to their ASCII equivalents?

Thank you so much for your help.
BR
Hanh

elsamuel · August 23, 2021, 1:53pm

ä => ae, å => aa

Where are you getting this from?

It’s probably going to be easiest to use the Cell Replacer node or the String Replacement (Dictionary) node. You’d create a lookup table and use that to replace the various characters as they are found.
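
A lookup-table replacement of this kind can be sketched in Python as follows. The mapping holds only the two pairs mentioned in this thread, and the names ASCII_EQUIVALENTS and to_ascii are hypothetical. The KNIME nodes named above are configured in the GUI rather than in code, so this only illustrates the idea.

```python
# Hypothetical lookup table: each key is a single character,
# each value may be one or several ASCII characters.
ASCII_EQUIVALENTS = {"ä": "ae", "å": "aa"}

# str.maketrans accepts a dict whose keys are single characters
# and whose values are replacement strings of any length.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_ascii(text: str) -> str:
    """Replace every character found in the lookup table (hypothetical helper)."""
    return text.translate(_TABLE)


print(to_ascii("Arvå"))  # Arvaa
```

Characters that are not in the table pass through unchanged, so the table has to list every character that should be converted.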

HanhDo · August 23, 2021, 2:08pm

Hi @elsamuel ,
Thanks for your suggestion. I am working on a table containing those characters and found that those characters are usually written like that.

Maybe this is not relevant to KNIME…
You can find a reference here: Diacritical Character to ASCII Character Mapping
BR
Hanh

aworker · August 23, 2021, 2:24pm

Hi @HanhDo & @elsamuel

I was working in parallel on the same solution, using exactly the same source of equivalences, lol.

@HanhDo, here is the solution I came up with, in case you are still interested:

20210823 Pikairos Convert Diacritics to ASCII equivalence.knwf (35.0 KB)

Hope this helps

Best

Ael

HanhDo · August 23, 2021, 2:47pm

Hi @aworker ,
Thank you for your help. But when I want to replace Ä with AE, e.g. in SCHÄFERS, the result is SCHAFERS. I think it doesn't work with a two-letter replacement (A and E); only the first letter (A) is used.

Best
Hanh

aworker · August 23, 2021, 2:55pm

Hi @HanhDo

Indeed, you are right. This is because the dictionary table cannot convert one character into several characters. If you need this type of conversion, i.e. Ä => AE, you will need to handle those cases separately. My first thought would be to first apply the proposed solution to all the other conversions (those that map one character to one character). Once those are done, handle each conversion from one character to several characters individually with a “replace( …)” operator.

That is the rough idea; if it is not clear enough, just tell me and I’ll be happy to provide the modified workflow.

Hope this helps.

Best

Ael

aworker · August 23, 2021, 3:05pm

Hi @HanhDo

I was too excited to leave this half solved, so here is the improved version:

20210823 Pikairos Convert Diacritics to ASCII equivalence.knwf (42.3 KB)

Should you have other special conversions from one character to several, you would need to add them to the second String Manipulation node, as shown here for “Æ” and “æ”:

replace( replace($Text Example Without Diacritic$, "Æ", "AE"), "æ", "ae")

Please be aware that for every special case you need to modify the initial dictionary table in the first -Table Creator- node, so that the diacritic in question is not converted there (or is mapped to itself, which amounts to the same thing). For instance, in the dictionary I mapped “Æ” and “æ” to themselves.
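
The two-stage idea (a 1-to-1 dictionary that leaves the special cases alone, followed by explicit replacements for the 1-to-many cases) can be sketched in Python. The dictionaries below are small hypothetical samples, not the contents of the attached workflow, and convert is an illustrative name.

```python
# Stage 1 (hypothetical sample): one character maps to one character.
# Special cases such as "Æ" or "Ä" are deliberately absent, so they
# pass through this stage unchanged.
ONE_TO_ONE = {"É": "E", "é": "e", "Ñ": "N", "ñ": "n"}

# Stage 2 (hypothetical sample): one character maps to several characters.
ONE_TO_MANY = {"Æ": "AE", "æ": "ae", "Ä": "AE", "ä": "ae"}


def convert(text: str) -> str:
    """Apply the 1-to-1 dictionary, then the chained 1-to-many replacements."""
    # Stage 1: look up each character, keeping it as-is when it is not listed.
    result = "".join(ONE_TO_ONE.get(ch, ch) for ch in text)
    # Stage 2: equivalent to nesting replace(replace(...)) once per special case.
    for source, target in ONE_TO_MANY.items():
        result = result.replace(source, target)
    return result


print(convert("SCHÄFERS"))  # SCHAEFERS
```

If a special case were also listed in the first dictionary, it would already be reduced to a single letter before the second stage could see it, which is the SCHAFERS result reported earlier in the thread.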

Hope this helps.

Best Ael

HanhDo · August 23, 2021, 3:25pm

Thanks so much, @aworker. It helps a lot.
And the good news is that there are fewer than 10 special cases like this.
BR
Hanh

aworker · August 23, 2021, 3:27pm

Haha, yes indeed!

My pleasure.

Best

Ael

SamirAbida · August 23, 2021, 3:42pm

Hello @HanhDo,

There is already a wonderful, ready-to-use workflow made by @Iris:

I think it meets your expectations (i.e. ä => ae).

Br,
Samir
