Regex remove some characters from string

jorgemartcaam · August 13, 2021, 6:42am

Hi there!

I am trying to clean up some data and I need to remove some characters that almost always appear in some of my rows. Here you have a short example:

The column objective would be “Emoji Free”. For the removal of the emoji in “Name” I used this handy component: String Emoji Filter – KNIME Hub
I do not need, in theory, to remove emojis. So, the possible paterns to deal with are:
1- Name (I want to achieve this. As you can see, there are a couple of rows like this, so they should go untouched)
2- PB_Coordi_Name Emoji
3- PB Name Emoji
4- PB Name
5- BS Name Emoji
6- BS Name
7- ZH-ES_Name
8- Coordi Name
9- Name “:)” (the component does not recognize the smiley as a emoji, as it is in plain text)

Any help would be nice, since I will never figure out how regex works

Best regards

Daniel_Weikert · August 13, 2021, 9:05am

I guess it would be sth like [^a-zA-Z0-9\s] but you should provide a sample file so it’s easier for the community to help here.
br

jorgemartcaam · August 13, 2021, 9:12am

Sure!

Here it is.
<<<<<<<<<<<< Edit by Admin, looked like clean text names of real persons >>>>>>>>>>>>

bruno29a · August 13, 2021, 1:12pm

Not sure if accented characters qualify.

EDIT: Confirmed that accented characters do NOT qualify, so you’d also lose the accented characters

bruno29a · August 13, 2021, 2:45pm

Hi @jorgemartcaam , I put something together that seems to work.

Final results:

Workflow looks like this:

Explanation:
First, I strip any leading and trailing spaces, this is a bit important because I am going to deal with regex and also hex values:

I then removed the Emojis via a Regex. This is where the main magic happens:

I looked at a few documentations and suggestions from other people. Mainly I looked at:
https://unicode.org/Public/emoji/13.1/emoji-sequences.txt

And here, there’s a person who understood that there are characters, unicode scalars and glyphs:

And here, someone who tried to handle all the unicodes of the Emojis:

gist.github.com

https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b

remove-emoji.py

#!/usr/bin/env python
"""
Remove emoji from a text file and print it to stdout.

Usage
-----

    python remove-emoji.py input.txt > output.txt

"""

This file has been truncated. show original

In the end, I went with:
strip(regexReplace(column("NAME"), "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""))

which I got from:

Each of the expressions is explained there.

However, even after applying the above expression, there were a few cases left where a special character would still be left.

As always, I use hex values to be able to see what was that character as I could not see it in ASCII. After converting and comparing the results with your Emoji Free column, the extra characters that would appear were “fe0f”, which variations of “20fe0f” or “20fe0f20”, or “200dfe0f20”, etc…

hex “20” is basically a space, and hex “0d” is basically a CR (Carriage Return).

So, I basically removed the “fe0f” first, as I could use the strip() function after that to strip space and CR/NL (Carriage Return / New Line).

There you have it.

Here’s the workflow:
Remove Emojis from string.knwf (26.0 KB)

Note: We could get rid of the conversion to hex if we could remove hex"fe0f" directly in the Column Expressions. It is too bad that the Column Expressions does not offer functions to convert from and to hex string.

Daniel_Weikert · August 14, 2021, 10:49am

I am still not sure if I fully got the question, but I tried a python node and seem to get what’s needed
Flow
flow

code
code

result
res

br, enjoy your weekend guys

jorgemartcaam · August 15, 2021, 9:43am

Hi!

First of all, thank you both for your answers. @bruno29a, your answer is something to study. Secondly, I am sorry, as I think I did not explain myself all that good.
You stripped the emojis but what I really need is to get the “real” names, i.e: if the first row is “PB Andrea Ordóñez Brión”, I need “Andrea Ordóñez Brión”. If it says “PB_Coordi_Borja Vega Perez”, I need “Borja Vega Perez”. If it says “BS Betty Reinsberg”, I need “Betty Reinsberg”. If it says “ZH-ES_whoever”, I need “whoever”.
This is the main problem, since with the component by @takbb I achieve the removal of almost every emoji.
Again, thank you and I am sorry.
Best regards!

bruno29a · August 15, 2021, 1:22pm

Hi @jorgemartcaam , no problem. I think the confusion was “The column objective would be “Emoji Free””, so I thought you were trying to figure out how to get to the values in the Emoji Free column.

Nevertheless, I understand what you want now.

“since I will never figure out how regex works”. It’s kind of the same for me. I usually go look for the Regex Expression online

However, this time, I wrote the Regex Expression on my own! I’m reading this document if you are interested:
https://dl.icewarp.com/online_help/203030104.htm

So, based on the list in your first post, I came up with this:
regexReplace($Emoji Free$, "^PB_Coordi_|^PB |^BS |^ZH-ES_|^Coordi ", "")

It’s saying find whatever starts with “PB_Coordi_” or with "PB " or with "BS ", etc… and replace those matches with “”

Results:

Here’s the workflow:
Remove some characters at the beginning of string.knwf (17.1 KB)

bruno29a · August 15, 2021, 2:03pm

Hi @jorgemartcaam , an alternative way without Regex if you are not comfortable with Regex is to just check the sub string using the substr() function:

if(substr(column("Emoji Free"), 0, 10).equals("PB_Coordi_")) {
  substr(column("Emoji Free"), 10);
} else if(substr(column("Emoji Free"), 0, 7).equals("Coordi ")){
  substr(column("Emoji Free"), 7);
} else if(substr(column("Emoji Free"), 0, 6).equals("ZH-ES_")){
  substr(column("Emoji Free"), 6);
} else if(substr(column("Emoji Free"), 0, 3).equals("PB ") || substr(column("Emoji Free"), 0, 3).equals("BS ")){
  substr(column("Emoji Free"), 3);
} else {
  column("Emoji Free");
}

Here, the order is important, as some of the strings you want to remove are a subset of another string. For example, “PB” and “Coordi” are a subset of “PB_Coordi”.

Same results:

Here’s the workflow containing both solutions:
Remove some characters at the beginning of string.knwf (19.4 KB)

jorgemartcaam · August 16, 2021, 9:48am

That would be almost perfect.
A couple of tweaks by my side (I used a combination of your column expressions without regex and string manipulation) and good to go, for now. I will have to check if with new rows it keeps the good behaviour.
Thanks!

system · August 23, 2021, 9:48am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.