Regex remove some characters from string

Hi there!

I am trying to clean up some data and I need to remove some characters that almost always appear in some of my rows. Here you have a short example:

image

The column objective would be “Emoji Free”. For the removal of the emoji in “Name” I used this handy component: String Emoji Filter – KNIME Hub
I do not need, in theory, to remove emojis. So, the possible paterns to deal with are:
1- Name (I want to achieve this. As you can see, there are a couple of rows like this, so they should go untouched)
2- PB_Coordi_Name Emoji
3- PB Name Emoji
4- PB Name
5- BS Name Emoji
6- BS Name
7- ZH-ES_Name
8- Coordi Name
9- Name “:)” (the component does not recognize the smiley as a emoji, as it is in plain text)

Any help would be nice, since I will never figure out how regex works :slight_smile:

Best regards

I guess it would be sth like [^a-zA-Z0-9\s] but you should provide a sample file so it’s easier for the community to help here.
br

2 Likes

Sure!

Here it is.
<<<<<<<<<<<< Edit by Admin, looked like clean text names of real persons >>>>>>>>>>>>

Not sure if accented characters qualify.

EDIT: Confirmed that accented characters do NOT qualify, so you’d also lose the accented characters

2 Likes

Hi @jorgemartcaam , I put something together that seems to work.

Final results:

Workflow looks like this:

Explanation:
First, I strip any leading and trailing spaces, this is a bit important because I am going to deal with regex and also hex values:
image

I then removed the Emojis via a Regex. This is where the main magic happens:
image

I looked at a few documentations and suggestions from other people. Mainly I looked at:
https://unicode.org/Public/emoji/13.1/emoji-sequences.txt

And here, there’s a person who understood that there are characters, unicode scalars and glyphs:

And here, someone who tried to handle all the unicodes of the Emojis:

In the end, I went with:
strip(regexReplace(column("NAME"), "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""))

which I got from:

Each of the expressions is explained there.

However, even after applying the above expression, there were a few cases left where a special character would still be left.

As always, I use hex values to be able to see what was that character as I could not see it in ASCII. After converting and comparing the results with your Emoji Free column, the extra characters that would appear were “fe0f”, which variations of “20fe0f” or “20fe0f20”, or “200dfe0f20”, etc…

hex “20” is basically a space, and hex “0d” is basically a CR (Carriage Return).

So, I basically removed the “fe0f” first, as I could use the strip() function after that to strip space and CR/NL (Carriage Return / New Line).
image

image

There you have it.

Here’s the workflow:
Remove Emojis from string.knwf (26.0 KB)

Note: We could get rid of the conversion to hex if we could remove hex"fe0f" directly in the Column Expressions. It is too bad that the Column Expressions does not offer functions to convert from and to hex string.

5 Likes

I am still not sure if I fully got the question, but I tried a python node and seem to get what’s needed
Flow
flow

code
code

result
res

br, enjoy your weekend guys

3 Likes

Hi!

First of all, thank you both for your answers. @bruno29a, your answer is something to study. Secondly, I am sorry, as I think I did not explain myself all that good.
You stripped the emojis but what I really need is to get the “real” names, i.e: if the first row is “PB Andrea Ordóñez Brión”, I need “Andrea Ordóñez Brión”. If it says “PB_Coordi_Borja Vega Perez”, I need “Borja Vega Perez”. If it says “BS Betty Reinsberg”, I need “Betty Reinsberg”. If it says “ZH-ES_whoever”, I need “whoever”.
This is the main problem, since with the component by @takbb I achieve the removal of almost every emoji.
Again, thank you and I am sorry.
Best regards!

Hi @jorgemartcaam , no problem. I think the confusion was “The column objective would be “Emoji Free””, so I thought you were trying to figure out how to get to the values in the Emoji Free column.

Nevertheless, I understand what you want now.

“since I will never figure out how regex works”. It’s kind of the same for me. I usually go look for the Regex Expression online :sweat_smile:

However, this time, I wrote the Regex Expression on my own! I’m reading this document if you are interested:
https://dl.icewarp.com/online_help/203030104.htm

So, based on the list in your first post, I came up with this:
regexReplace($Emoji Free$, "^PB_Coordi_|^PB |^BS |^ZH-ES_|^Coordi ", "")

It’s saying find whatever starts with “PB_Coordi_” or with "PB " or with "BS ", etc… and replace those matches with “”

Results:

Here’s the workflow:
Remove some characters at the beginning of string.knwf (17.1 KB)

4 Likes

Hi @jorgemartcaam , an alternative way without Regex if you are not comfortable with Regex is to just check the sub string using the substr() function:

if(substr(column("Emoji Free"), 0, 10).equals("PB_Coordi_")) {
  substr(column("Emoji Free"), 10);
} else if(substr(column("Emoji Free"), 0, 7).equals("Coordi ")){
  substr(column("Emoji Free"), 7);
} else if(substr(column("Emoji Free"), 0, 6).equals("ZH-ES_")){
  substr(column("Emoji Free"), 6);
} else if(substr(column("Emoji Free"), 0, 3).equals("PB ") || substr(column("Emoji Free"), 0, 3).equals("BS ")){
  substr(column("Emoji Free"), 3);
} else {
  column("Emoji Free");
}

Here, the order is important, as some of the strings you want to remove are a subset of another string. For example, “PB” and “Coordi” are a subset of “PB_Coordi”.

Same results:

Here’s the workflow containing both solutions:
Remove some characters at the beginning of string.knwf (19.4 KB)

4 Likes

That would be almost perfect.
A couple of tweaks by my side (I used a combination of your column expressions without regex and string manipulation) and good to go, for now. I will have to check if with new rows it keeps the good behaviour.
Thanks!

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.