Removing Icons from GetRequest content

Hi KNIMERS, kindly help me to remove these icons from my cells (obtained from GetRequest results):

What I’ve tried:

Attempt (1): I manually copied the content to a .txt file ( tesstt.txt (32.0 KB) ), re-uploaded it to KNIME, and tried every available encoding option, but had no luck: the icons were simply replaced by other kinds of special characters.

Attempt (2): I tried the String Manipulation node’s URL decoding function after converting the cell content to dummy URL links, but I got this error:

Illegal hex characters in escape (%) pattern - Error at index 0 in: " o"

Thanks in advance!

@badger101 one idea could be to use a regex: define a set of acceptable characters and remove all others.

Thank you @mlauber71, I adjusted the expression a bit to my liking, though: [^a-zA-Z0-9-'@:/// ]. But basically, regex works!
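For anyone reading along outside KNIME, here is a minimal Python sketch of the same whitelist idea. The character class is a tidied version of the one above (the repeated slashes collapse to one and the hyphen is escaped); the function name and the sample text are made up for illustration and are not taken from the workflow.

```python
import re

# Negated character class: anything NOT listed here gets removed.
# Tidied from the expression in the post; the space before "]" is intentional.
NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9\-'@:/ ]")

def keep_whitelisted(text: str) -> str:
    """Remove every character that is not in the whitelist."""
    return NOT_ALLOWED.sub("", text)

# Hypothetical sample: a rocket icon (U+1F680) inside ordinary text.
sample = "Check https://example.com \U0001F680 now"
print(keep_whitelisted(sample))
```

Note that the dot in the URL is stripped along with the icon, because "." is not part of the whitelist. Any character you want to keep has to be listed explicitly.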

Hi @badger101 , I think what you are seeing is basically a result of wrong character encoding.

The way I tackled these kinds of issues in the past was to check the characters' hex values in order to identify them, and then remove them.
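As a rough illustration of the hex-value approach, the Python sketch below lists every distinct non-ASCII character in a string with its code point in hex and its Unicode name, then removes exactly those characters. The function names and the sample string are hypothetical; in practice you would review the list and remove only the characters you do not want.

```python
import unicodedata

def list_non_ascii(text: str) -> dict:
    """Map each distinct non-ASCII character to (hex code point, Unicode name)."""
    found = {}
    for ch in text:
        if ord(ch) > 127 and ch not in found:
            found[ch] = (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
    return found

def remove_chars(text: str, chars) -> str:
    """Delete every character in `chars` from `text`."""
    return text.translate({ord(ch): None for ch in chars})

# Hypothetical sample containing a rocket icon.
sample = "ok \U0001F680 ok"
found = list_non_ascii(sample)
for ch, (code_point, name) in found.items():
    print(code_point, name)
print(remove_chars(sample, found))
```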

Alternatively, you can go the other way around: identify what you want to keep (which @mlauber71 has already suggested and you have implemented) and remove the rest.

If you are interested in looking at some sample threads that used the hex values, here are some:

Hi @bruno29a, the first attachment seems to be a proper solution. I will try it on a new dataset in the future. Thank you so much! If that thread had a word like ‘icons’ or ‘emojis’ or ‘emoticons’ in it, I would have come across it when browsing through the forum yesterday. But hey, better late than never!

Update: Apologies. It seems it did have the word ‘emojis’ in it. I missed it somehow.

No problem @badger101 , indeed one of the threads that I provided dealt with removing emojis.

The issue with creating an “anything but” list (whitelist) is that you have to include a lot of the accepted characters, which is not easy.

When you applied it, you had to modify @mlauber71's expression to add “@” and a few other characters. It still does not cover accented characters. Your whitelist also appears to omit the “#” character, which your data seems to contain. The same goes for “|”, “$”, “,” (as in “$10,000” in the content), and the brackets “(” and “)”. All of these are in the data but not in your regular expression. There are a few others as well: quotes, dots, underscores, the question mark (?) that introduces a query string in a URL, and what about whitespace other than the plain space?
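One way to see what a whitelist silently drops is to count the characters in the data that fall outside it. The Python sketch below does that for a tidied version of the expression from this thread; the sample text is invented to mirror the characters mentioned above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Same allowed set as the whitelist in this thread (tidied, hyphen escaped).
ALLOWED = re.compile(r"[a-zA-Z0-9\-'@:/ ]")

def dropped_characters(text: str) -> Counter:
    """Count each character that the whitelist would remove."""
    return Counter(ch for ch in text if not ALLOWED.fullmatch(ch))

# Invented sample with the kinds of characters listed above.
sample = "Win $10,000 (today) #promo | caf\u00e9 \U0001F680"
for ch, count in dropped_characters(sample).items():
    print(repr(ch), f"U+{ord(ch):04X}", count)
```

Running this over a column before cleaning shows whether the whitelist removes only the icons or also legitimate punctuation and accented letters.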

In your case, because what you want to remove are exceptions (and exceptions are usually far fewer than the normal cases), it’s probably best to identify them and remove them (blacklist).
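A blacklist version might look like the Python sketch below. The Unicode ranges are an assumption chosen for illustration: they cover common emoji and symbol blocks but are not exhaustive, so the list should be adjusted to the characters actually found in the data. The function name and sample are hypothetical.

```python
import re

# Assumed, non-exhaustive set of "icon" ranges to remove.
ICONS = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero width joiner used in emoji sequences
    "]+"
)

def strip_icons(text: str) -> str:
    """Remove only the blacklisted characters; everything else is kept."""
    return ICONS.sub("", text)

# Invented sample: punctuation and accents survive, the icon does not.
sample = "Win $10,000 \U0001F680 (caf\u00e9) #1"
print(strip_icons(sample))
```

Unlike the whitelist, this keeps “$”, “,”, “#”, brackets, and accented letters without having to list them.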
