I am trying to clean up some data and I need to remove some characters that almost always appear in some of my rows. Here you have a short example:
The column objective would be “Emoji Free”. For the removal of the emoji in “Name” I used this handy component: String Emoji Filter – KNIME Hub
I do not need, in theory, to remove emojis. So, the possible paterns to deal with are:
1- Name (I want to achieve this. As you can see, there are a couple of rows like this, so they should go untouched)
2- PB_Coordi_Name Emoji
3- PB Name Emoji
4- PB Name
5- BS Name Emoji
6- BS Name
7- ZH-ES_Name
8- Coordi Name
9- Name “:)” (the component does not recognize the smiley as a emoji, as it is in plain text)
Any help would be nice, since I will never figure out how regex works
And here, there’s a person who understood that there are characters, unicode scalars and glyphs:
And here, someone who tried to handle all the unicodes of the Emojis:
In the end, I went with: strip(regexReplace(column("NAME"), "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""))
which I got from:
Each of the expressions is explained there.
However, even after applying the above expression, there were a few cases left where a special character would still be left.
As always, I use hex values to be able to see what was that character as I could not see it in ASCII. After converting and comparing the results with your Emoji Free column, the extra characters that would appear were “fe0f”, which variations of “20fe0f” or “20fe0f20”, or “200dfe0f20”, etc…
hex “20” is basically a space, and hex “0d” is basically a CR (Carriage Return).
So, I basically removed the “fe0f” first, as I could use the strip() function after that to strip space and CR/NL (Carriage Return / New Line).
Note: We could get rid of the conversion to hex if we could remove hex"fe0f" directly in the Column Expressions. It is too bad that the Column Expressions does not offer functions to convert from and to hex string.
First of all, thank you both for your answers. @bruno29a, your answer is something to study. Secondly, I am sorry, as I think I did not explain myself all that good.
You stripped the emojis but what I really need is to get the “real” names, i.e: if the first row is “PB Andrea Ordóñez Brión”, I need “Andrea Ordóñez Brión”. If it says “PB_Coordi_Borja Vega Perez”, I need “Borja Vega Perez”. If it says “BS Betty Reinsberg”, I need “Betty Reinsberg”. If it says “ZH-ES_whoever”, I need “whoever”.
This is the main problem, since with the component by @takbb I achieve the removal of almost every emoji.
Again, thank you and I am sorry.
Best regards!
Hi @jorgemartcaam , no problem. I think the confusion was “The column objective would be “Emoji Free””, so I thought you were trying to figure out how to get to the values in the Emoji Free column.
Nevertheless, I understand what you want now.
“since I will never figure out how regex works”. It’s kind of the same for me. I usually go look for the Regex Expression online
Hi @jorgemartcaam , an alternative way without Regex if you are not comfortable with Regex is to just check the sub string using the substr() function:
Here, the order is important, as some of the strings you want to remove are a subset of another string. For example, “PB” and “Coordi” are a subset of “PB_Coordi”.
That would be almost perfect.
A couple of tweaks by my side (I used a combination of your column expressions without regex and string manipulation) and good to go, for now. I will have to check if with new rows it keeps the good behaviour.
Thanks!