Icons & Emojis Removal (Continuation)

Hi @bruno29a, it seems that my previous topic was closed before I could try your alternative solution.

Could you take a look at the workflow attached here for me?

Remove Emojis from string 2.knwf (2.0 MB)

It’s basically your proposed workflow; I didn’t change any configuration except feeding it my own data as input.

Here are my comments:

  1. It works as expected. At first glance, I don’t see any emojis/emoticons left behind. :ok_hand:

  2. A rare case occurred for Row 12 where an empty row was created as a result.

Would you be kind enough to check what causes this?

  3. It appears that the workflow inserts a box icon into some lines:

Is there a way to prevent this from happening?

  4. Last but not least, is there a way to pass the column name from the preceding node as a variable and feed it into the Column Expressions node without having to alter the script inside?

columnA

This is not an urgent request, since I’m preoccupied with completing an online certificate program at the moment (which requires me to pause this project for a few weeks). So, please take as much time as you need!

Thank you in advance! :grin:

I also gave it a try.
Normally I would go with a Python node, but I have no clue what this token is. It was not removed in my attempt. But I am sure @bruno29a (if he is not on holiday, which he certainly deserves for all his support here) will help you out.

By the way, what is the use case of that project? Just curious.
Best regards, and good luck with your flow.

Thanks @Daniel_Weikert , the project is related to YouTube videos.

I am not a coder, so I can’t delve into Python-based solutions, at least not in the near future. But if you think other people might find it useful, feel free to post the solution here. The whole point of this post is to find alternatives to the solution I already got from the previous thread anyway.

“if he is not on holiday”

Oh, by the way, I think as much as @bruno29a might want a holiday, the next one for Canada falls on Sept 22 (Equinox). Unless you (@Daniel_Weikert ) meant to say (weekend) break. :sweat_smile:

I don’t think I’ve seen it mentioned in this or the previous thread, but has anyone tried @takbb’s String Emoji Filter component? I’ve used it a couple of times on some projects and it worked great.

Thanks @ScottF , I’ve tested that component for the first time just now after you mentioned it. It allows choosing 3 filter options, and the option that works the best for me was filter no. 2.

At first glance through the table, I believe it does everything needed, except for these “stonehenge-like” symbols:

And it seems that when I opt for No. 1, the stonehenge-like symbols are removed, but the No. 1 filter leaves many other symbols (icons/emojis) intact, unlike No. 2.

I’ll have to combine the 2 options if I am to proceed with this component. I guess I’ll have to tag @takbb for help, and I’m also curious why all the options weren’t combined in the first place. :smile:

I’m pretty sure those other “Stonehenge” symbols are extended ASCII box-drawing characters (cf. ASCII Codes — Full list of Characters, Letters, Symbols & Signs). People will tweet all sorts of weird stuff I guess :sweat_smile:

Hopefully someone better versed in regex than I am knows a filter method for those.

EDIT: Maybe this is useful? Regular expression to match any ASCII character – Bytefreaks.net

@ScottF Okay, what an interesting link you’ve attached there. I tried the regex suggestion from there, and with a few tweaks on my own to suit my dataset, here’s the script that works for me:

regexReplace($$CURRENTCOLUMN$$,"[^\x00-\x7F y.//]","")
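
For anyone who wants to try the same idea outside KNIME, here is a minimal Java sketch of the core of that pattern: keep only 7-bit ASCII and drop everything else. The class and method names are made up for illustration, and the extra characters from the script above are left out because they are already inside the ASCII range.

```java
final class AsciiOnly {
    // Removes every character outside the 7-bit ASCII range (U+0000 to U+007F).
    // Hypothetical helper, not part of any KNIME node.
    static String stripNonAscii(String input) {
        return input.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // The escapes below spell a thumbs-up emoji (U+1F44D) between "a" and "b".
        System.out.println(stripNonAscii("a\uD83D\uDC4Db"));
    }
}
```

Keep in mind that this kind of filter also removes accented letters and any non-Latin text, so it only suits data that is expected to be plain English.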

Since this is still a regex-based solution, which belongs to the same family as the solution from my previous thread, I am still inclined to adopt and “perfectify” (nope, that’s not a word! don’t look it up in the dictionary on me) @bruno29a’s unique method.

Hi @badger101 @Daniel_Weikert , I’m not on holidays lol, just busy with work.

FYI, the next holiday here is actually on Monday, Labour Day (Only in North America including Canada - It’s on May 1st in the rest of the world).

@badger101 , I’ll take a look some time soon. But off the top of my head, and taking a guess regarding your points:
2- A rare case occurred for Row 12 where an empty row was created as a result.
Comment: I don’t remember how the workflow works (I will when I look at it), but I’m guessing maybe some rules came back empty.

3- It appears that the workflow inserts a box icon to some lines
Comment: A “box” icon usually appears when your editor/viewer cannot properly display something, so this may still be a “special” character. You can check its hex value to find out what it is. (Again, I’ll see it when I look into it.) We can remove it if you don’t want that character/icon.

4- is there a way to transfer the column name from the preceding node as a variable and feed it into the Column Expressions Node without having to alter the script inside?
Comment: In the Column Expressions node, you can access variables via the variable() function, and columns via the column() function. That being said, you can also access columns dynamically using a variable, like this: column(variable("your variable"))
I have an old demo on this here:

Thank you @bruno29a . Whenever you’re ready to look into it. As I said, it’s not urgent on my side :grin:

Hi @badger101 , I’ve taken a look at the workflow.

Somehow, I’m not seeing the box icon for your point #3:

It’s how it is on my side, your version, and my modified version (see modified version below). Same results.

For point #2, Row12 looks quite particular. It’s the only row with an odd number of characters after converting to hex, which is weird because hex values usually come in pairs. Somehow, Row12’s length is 1653.

To get around this, you can modify the Java Snippet (the second one, the Hex to String) as follows. The change happens on line 32:
Old code (starting from line 31):

char[] charArray = c_stripped_Name.toCharArray();
for(int i = 0; i < charArray.length; i=i+2) {

New code (starting from line 31):

char[] charArray = c_stripped_Name.toCharArray();
int length = charArray.length;
if (1 == length%2) length = length - 1;
for(int i = 0; i < length; i=i+2) {
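
Since the snippet above only shows the loop header, here is a self-contained Java sketch of how such a guard might sit in a complete hex-to-text loop. The loop body (one character per two hex digits) and the class and method names are assumptions for illustration; the actual body of the Java Snippet may differ.

```java
final class HexToString {
    // Decodes a string of two-digit hex values into text, one char per pair.
    // The loop body is an assumption; the thread only shows the loop header.
    static String decode(String hex) {
        char[] charArray = hex.toCharArray();
        int length = charArray.length;
        // Drop a dangling last digit so every iteration sees a full pair.
        if (length % 2 == 1) length = length - 1;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < length; i = i + 2) {
            String pair = "" + charArray[i] + charArray[i + 1];
            out.append((char) Integer.parseInt(pair, 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("48696")); // prints "Hi"
    }
}
```

Note that the guard silently discards the last digit. If the odd length comes from a hex value earlier in the string that is missing a digit, the pairs after that point may be misaligned, so it is worth inspecting the affected row by eye.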

Row12 looks good now:

For your point #4, you can either use a variable (and check how I use the variable in the Column Expressions), or alternatively, you can just add a Column Rename right after the Excel Reader, and then rename back at the end if needed :wink:

EDIT: I went back to your point #3. As I recommended, the hex values will always show you what’s there. For Row574, there is the hex code 1c, which is the File Separator control character, and also the hex code 1d, which is the Group Separator. For Row575 and Row576, it’s the hex code 1d (Group Separator). Both are non-printable control characters, which would explain why a viewer may show them as a box.
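
If you want to inspect a suspicious string yourself, a small Java sketch like the one below lists each character as a Unicode code point. The class and method names are hypothetical, and this is only one way to do the inspection; it is not the hex conversion used in the workflow.

```java
final class CodePointDump {
    // Lists each code point of the input as a "U+XXXX" label, separated by spaces.
    static String dump(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (out.length() > 0) out.append(' ');
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "A" followed by File Separator (U+001C) and Group Separator (U+001D).
        System.out.println(dump("A\u001C\u001D"));
    }
}
```

Anything below U+0020 in the output is a control character and will usually not render as a visible glyph.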

If they annoy you and you want to remove them, you can modify the String Manipulation to this:
replace(replace(replace($stripped_Name$, "fe0f", ""), "1c", ""), "1d", "")
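
One thing to watch with a plain substring replace on a hex string: it can also match across the boundary between two neighbouring values (for example, 61 followed by c4 contains "1c"). Whether that can happen here depends on how the workflow formats its hex output. As a hedged alternative, the Java sketch below removes the same three characters from the decoded text instead of from the hex string. The class and method names are made up for illustration.

```java
final class ControlCharStripper {
    // Removes the variation selector U+FE0F, the File Separator U+001C
    // and the Group Separator U+001D from ordinary (non-hex) text.
    static String strip(String input) {
        return input.replaceAll("[\\uFE0F\\u001C\\u001D]", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("a\u001Cb\u001Dc\uFE0F")); // prints "abc"
    }
}
```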

With that, all your points should now be addressed.

Alright @bruno29a ,

I have modified the Java Snippet scripts and added a String Manipulation node, and I saw the desired outcome.

As for the Column Expressions, the workflow wasn’t annotated, and since I don’t use the node regularly, I’m not quite sure what’s happening with the configurations. But I went back to your earlier comment regarding the column(variable("your variable")) suggestion, which I followed, and it worked.

I’ll consider this as solved, and I hope this thread may help me and anyone else looking for alternative solutions to regex in the future, for emojis/icons removal.

Thank you so much!

Hi @ScottF and @badger101, thanks for trying out the emoji filter and the feedback and kind words.

The component was written to work out how to filter emoji, in response to a forum question on the subject. The reason for the 3 options is basically that I searched the web for inspiration and suggestions and then implemented three different solutions that I found.

Yes, it would probably make sense to combine approaches, and maybe that’ll be something to look at. I had a feeling, though, that some aspects of the regex used were not necessarily right in all circumstances and could end up filtering out more than required, so feedback on specific characters is always welcome.

As for the “Stonehenge” :slightly_smiling_face: stuff, yes, as per @ScottF’s link they were what counted as ASCII “graphics characters” for drawing 2D boxes and designing forms in the days of fixed-pitch fonts, so technically they aren’t emoji. Strictly speaking they shouldn’t really be filtered by an “emoji filter”, but it’s useful to know that option 1 does remove them.

Ideally I think the component should allow the user to select “classes” of characters to be removed and that might be something for me to look into on a “rainy day”.

Thanks again for the feedback

Thanks @takbb for clarifying. Yes, selecting characters by class sounds useful! :ok_hand:

Hi @badger101, talking of classes prompted me to take a renewed look at this subject. I have another component, “Regex String Stripper”, which is really just a convenience component for performing a regex removal (which you can do via other means in KNIME), but in this case it wraps a Java snippet.

The Java regex [\p{So}] will match the Unicode general category “Symbol, other” (So). I don’t have full details of which symbols are considered “other”, but it certainly seems to work for removal of emoji and your “stonehenge” characters. Maybe this will be something I could use in a future incarnation of the Emoji Filter.
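
To illustrate the idea in isolation, here is a minimal Java sketch that strips everything in the “Symbol, other” category from a string. It is not the code inside the Regex String Stripper component, and the class and method names are hypothetical.

```java
final class SymbolOtherStripper {
    // Removes every character whose Unicode general category is So (Symbol, other).
    static String strip(String input) {
        return input.replaceAll("\\p{So}", "");
    }

    public static void main(String[] args) {
        // A grinning face (U+1F600) and three box-drawing characters (U+250C, U+2500, U+2510).
        System.out.println(strip("ok \uD83D\uDE00 \u250C\u2500\u2510!"));
    }
}
```

Two caveats: So also covers everyday symbols such as © and °, so the pattern can remove more than emoji, and it does not cover the variation selector U+FE0F, zero-width joiners or skin-tone modifiers, which belong to other categories and would be left behind.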

In the meantime, take a look at this workflow, where I have taken the input data from your uploaded workflow and passed it to “Regex String Stripper”.

I would be interested in your thoughts on how well this works for your data compared with the other methods you’ve tried.

Remove Emojis from string using Regex String Stripper.knwf (28.9 KB)

Hi @takbb , based on the results I’m seeing from the screenshot, it works pretty well. My only concern is that the user has to know regex scripts (if I understand the configuration window correctly).

Yeah, I understand that concern. I was more thinking that if this seems to work well, I could modify the emoji filter to use it. :slight_smile:

Oh I see. Then yes, certainly! I look forward to it :grin:

Note: Just in case this thread is closed (due to solution being found), you can always introduce the component by making your own thread like I did with mine here: Pinterest Extensions: An Introduction

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.