String Emoji and Character Class Filter

Hi all,

Further to a number of forum posts about the removal of Emoji and other characters from strings, I wanted to let you know I have extended my original (prototype) “String Emoji Filter” based on some feedback received and created a new component “String Emoji and Character Class Filter” on the hub at

This new component makes use of a Java snippet and a number of regex “categories” to allow you to select the classes of character to be filtered. See the help documentation on the component for information about where you can read up on the categories used. I confess that I haven’t managed to test an example of every category covered, but I’m hoping that the Java regex works as described in the “Unicode Categories” section of the article Regex Tutorial - Unicode Characters and Properties (regular-expressions.info) which, along with the recent forum post Icons & Emojis Removal (Continuation), was my inspiration for reworking this component.
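
For readers curious how category-based filtering can work in Java, here is a minimal sketch. It is not the component's actual snippet: the class name, the `filter` method and the three display names in the map are hypothetical, chosen only for illustration. The `\p{..}` Unicode category syntax itself is standard `java.util.regex`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CategoryFilterSketch {
    // Hypothetical mapping of display names to Java regex Unicode categories.
    // The real component may use different names and more categories.
    static final Map<String, String> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("Control Characters", "\\p{Cc}");
        CATEGORIES.put("Connector Punctuation", "\\p{Pc}");
        CATEGORIES.put("Other Symbol", "\\p{So}");
    }

    // Removes every character that belongs to any of the selected categories.
    static String filter(String input, List<String> selected) {
        if (selected.isEmpty()) {
            return input; // nothing selected, nothing to remove
        }
        // Build one character class, e.g. [\p{Cc}\p{Pc}], from the selection.
        StringBuilder charClass = new StringBuilder("[");
        for (String name : selected) {
            String pattern = CATEGORIES.get(name);
            if (pattern == null) {
                throw new IllegalArgumentException("Unknown category: " + name);
            }
            charClass.append(pattern);
        }
        charClass.append("]");
        return input.replaceAll(charClass.toString(), "");
    }
}
```

Calling `CategoryFilterSketch.filter("snake_case", java.util.Arrays.asList("Connector Punctuation"))` would return `snakecase`. Note that emoji are spread over several Unicode categories, so a single category such as `\p{So}` will not catch all of them.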

Anybody who uses the previous String Emoji Filter may wish to take a look at this new component, and please do let me know if you discover anything that the String Emoji Filter managed to filter out that this new component doesn’t.

(It’s still not “perfect” but if you have any specific regex patterns (with a “category name”) that you think might be useful generally for inclusion, please let me know.)


Nice addition to KNIME components! Highly recommended.


Hi @takbb , for the ‘Connector Punctuation’ class, is it supposed to remove space characters? I found this effect accidentally while playing around with the component. If that’s expected behavior, that’s fine with me.

Here’s my 2 cents:

  1. Since there are so many classes, which is good by the way, I think you might want to define (i.e. describe) each class and provide some examples in a demonstration workflow, utilizing annotation boxes. Also, I would appreciate it if layman’s terms were used in the definitions to help people like me who are not so familiar with some of the jargon.

  2. I also recommend that you set up the data flow for the user. At the moment, the users have to manually connect the arrows to and from certain points. Took me a few minutes to figure out what goes where.

Hope that helps to improve the already helpful component of yours! Today I used it for the ‘Control Characters’ class. (P.S. I had to test each class one by one until I got the results, which is why I recommended proper instruction notes as explained above.) :ok_hand: :grin:

Hi @badger101 , thanks for the feedback.

The “Connector Punctuation” was a bug. Well spotted, and thanks for letting me know. A rogue space had crept into the regex string when I pasted it in. This has now been fixed.
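
To show how easily this kind of bug can happen, here is a small sketch. The two regex strings are assumptions made up for illustration (the component's real pattern is not shown in this thread), and the class and method names are hypothetical. Inside a character class a literal space is just another character to match, so one stray space makes the pattern strip spaces as well.

```java
class RogueSpaceSketch {
    // Assumed intended pattern: connector punctuation only (e.g. underscore).
    static final String INTENDED = "[\\p{Pc}]";
    // Same pattern with a stray space pasted inside the brackets.
    static final String WITH_ROGUE_SPACE = "[\\p{Pc} ]";

    // Removes everything the given regex matches.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }
}
```

With the input `first_name last_name`, the intended pattern gives `firstname lastname`, while the pattern with the stray space gives `firstnamelastname`.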

The classes… oh boy… well, I don’t actually know what they all do, so I’m still learning them too. Right now the “suck it and see” :slight_smile: approach is all I can suggest, but you are right: some information on what the different classes do would improve things.

In the help doc for the component it did mention a regex site but in itself that site did not explain what they are, so I’ve found a slightly better resource which might be of use:

This lists the various categories (I use a subset of them) and if you click on a category it gives examples. I have now learned that a couple of the categories I implemented have exactly zero examples, so are a little redundant at the moment. :wink:
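
If you would rather check locally which category a character falls into, the following sketch is one possible approach. It uses the standard `Character.getType` method; the class name, the helper names and the small selection of categories shown are just illustrative choices, and anything not listed is reported as `other`.

```java
class CategoryProbeSketch {
    // Maps a code point to a short Unicode general category code.
    // Only a handful of categories are covered in this sketch.
    static String categoryOf(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL: return "Cc";
            case Character.CONNECTOR_PUNCTUATION: return "Pc";
            case Character.OTHER_SYMBOL: return "So";
            case Character.SPACE_SEPARATOR: return "Zs";
            case Character.UPPERCASE_LETTER: return "Lu";
            case Character.LOWERCASE_LETTER: return "Ll";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            default: return "other";
        }
    }

    // Lists each code point of the text with its category, one per line.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp ->
            sb.append(String.format("U+%04X %s\n", cp, categoryOf(cp))));
        return sb.toString();
    }
}
```

For example, `CategoryProbeSketch.categoryOf('_')` returns `Pc` and `CategoryProbeSketch.categoryOf(' ')` returns `Zs`, which makes it clear that a plain space is not connector punctuation.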

When I get a chance though, I’ll try to put together a demo with the various categories. I’m also trying to think of a way to cut them down. Unfortunately, the fact that this is a component rather than an actual node limits my options in terms of the configuration dialog. Maybe we will discover which of the categories are generally useful and which aren’t, and then they could be “pruned”. I don’t know how this will go at the moment!

I wasn’t sure what you meant in item (2) about “the users have to manually connect the arrows to and from certain points. Took me a few minutes to figure out what goes where.” Can you elaborate on that?

There are a couple of issues with the component which I think are a limitation of the method I’m using for creating the list of classes. Sometimes if you open the component config it doesn’t show all the classes. This is generally resolved by ensuring it has a data source attached to it, and then a column configured. It is then sometimes necessary to come out of the config, execute the component and then go back into the config, at which point it finds all the classes. I’m not sure if there’s anything I can do about that. As I said, I think it may be a limitation, but it is a minor issue.

Thanks again for the feedback, and hopefully over time our joint knowledge will improve it!

Hi @takbb , regarding the part where I mentioned the manual work that users have to do, I just realized that it occurs only when users copy and paste the nodes from the component’s original workflow into their own workflow. In doing so, these connecting lines disappear:

So, I think you can disregard my comment, since what I could otherwise do was simply refer back to the original workflow to figure out which arrow points to which. My bad :sweat_smile:

I have looked at the website page you attached. As I was skimming through, I found that one of its links directs me to the official unicode site, and from there, I found this official PDF source:

It’s a big file, so it’ll take a few minutes to load in your browser (depending on your internet speed). I believe this official file may be able to help you in creating the examples and pruning. The table of contents in the first few pages can inspire you with categorization of the classes, too. :ok_hand:

P.S. I see that you updated the Component already. May I know how you did it without deleting the current component from your Space list and uploading a newer version? (i.e. auto replacement)

Wow, @badger101, that pdf is going to be some read! :slight_smile:

I suspect it will be useful for some (detailed) background info, although for this component I’m trying to utilise the set of pre-defined regex Unicode character classes available in Java so that I can keep myself (at least initially) from getting down to too low a level. But thanks. It will probably come in handy.

With respect to the connectors within the component, I wouldn’t normally expect any users to be looking “inside” the component (other than out of curiosity).

Normally you would just drag the component off the hub and onto your workflow, and then you see the component as in the attached sample.

String Emoji and Character Class Filter testing.knwf (520.3 KB)

In terms of updating the component: once you have put a component “up on the hub” by sharing it, and it is your own component, you have write access to your hub space. To update it on the hub, you simply edit the copy of the component you have locally in your copy of KNIME, and then share it to the hub again, keeping the name the same. It overwrites the copy that is already up there and KNIME does the rest in terms of making the update available to other users (provided their downloaded copy of the component is still linked to the hub).


Ah I see, that’s the trick. Okay, learned something new today, thanks!
