Filter out values with non-latin characters

stevelp · April 15, 2020, 6:49pm

Hi all,

I’m trying to filter a list of web URLs to exclude those that contain non-latin characters (e.g., Chinese, Japanese, Cyrillic, Arabic, etc.).

I tried using the Tika Language Detector, but since URLs aren’t written in full sentences, it didn’t work very well.

Thanks in advance for your suggestions!

izaychik63 · April 15, 2020, 7:02pm

You can use this as an idea

stevelp · April 15, 2020, 7:32pm

Thanks for the suggestion! I’ve been trying to learn more about regular expressions, and hadn’t realized that the MATCHES operator used them. I’ve tried it out with the syntax you suggested:
$URL$ MATCHES ".*[^0-9A-Za-z ].*" => TRUE

and it doesn’t seem to be filtering anything out. For example آموزش-رایگان/نامه-ی-انگیزشی-یا-motivation-letter-چیست-؟ wasn’t filtered out.

izaychik63 · April 15, 2020, 7:37pm

Please add * after the dots; the site ate some of the symbols.

stevelp · April 15, 2020, 7:48pm

Yeah, I did have those too. Here’s a screenshot of what I did:

izaychik63 · April 15, 2020, 7:51pm

You need to add to the current list all the symbols that appear in your URLs, such as . - % ? / and so on, leaving out only the symbols you want to exclude. For / use a double //.
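
To illustrate the point: the negated class `[^0-9A-Za-z ]` matches any character that is not an ASCII letter, digit, or space, and that includes ordinary URL punctuation such as `-`, `/`, and `.`. The rule is therefore true for essentially every URL and cannot separate Latin from non-Latin rows. A minimal Python sketch, assuming MATCHES requires the whole string to match the pattern (`re.fullmatch` stands in for it here; the function name and sample values are made up for illustration):

```python
import re

# Pattern from the first attempt: true for any string that contains at least
# one character other than an ASCII letter, digit, or space.
FIRST_ATTEMPT = re.compile(r".*[^0-9A-Za-z ].*")

def has_other_char(url: str) -> bool:
    # Assumption: re.fullmatch approximates KNIME's MATCHES operator.
    return FIRST_ATTEMPT.fullmatch(url) is not None

# Hypothetical sample values, not taken from the original data set.
latin_url = "blog/what-is-a-motivation-letter"
persian_url = "آموزش-رایگان/motivation-letter"
plain_text = "motivation letter"

print(has_other_char(latin_url))    # True: '/' and '-' are outside the class
print(has_other_char(persian_url))  # True
print(has_other_char(plain_text))   # False: only letters and a space
```

Both URLs come back true, which is why nothing appeared to be filtered.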

izaychik63 · April 15, 2020, 8:40pm

I have a correction here. Use it the way shown below:
MATCHES "[^-,#0-9]+"

stevelp · April 15, 2020, 8:43pm

This is great, @izaychik63! I’ve changed the syntax a little, and I’m adding in the characters that I want to allow from valid URLs. I’ll keep feeding more URL lists in here until I’m sure all valid URLs pass. Here’s what it looks like at the moment:

$Clean URL$ MATCHES "^([a-zA-Z0-9_?=&\-\./&=+~_ é\[\]’]+)" => TRUE

I’ll update this when I’m done, on the off chance someone wants to filter URLs exactly like I do.

Thanks for your help!

stevelp · April 15, 2020, 10:31pm

Alright, this is where I ended up. There may be a better way to do this, but this filters about 99% of the rows the way I would manually.

$Clean URL$ MATCHES "^[a-zA-Z0-9_?@&/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ\[\]\-\.]+" => TRUE

Thanks again @izaychik63 for your help!
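
For anyone who wants to try the same whitelist idea outside KNIME, here is a minimal Python sketch. It assumes MATCHES requires the whole string to match, so `re.fullmatch` is used; the function name `is_latin_url` and the sample values are illustrative only.

```python
import re

# Allowed characters, copied from the final rule above: ASCII letters, digits,
# some URL punctuation, a space, two apostrophe forms, and lowercase Latin-1
# accented letters.
ALLOWED = re.compile(
    r"[a-zA-Z0-9_?@&/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ\[\]\-\.]+"
)

def is_latin_url(url: str) -> bool:
    # True only if every character of the value is in the allowed set.
    return ALLOWED.fullmatch(url) is not None

# Hypothetical sample values, not taken from the original data set.
samples = [
    "products/café-menu?id=12&lang=fr",
    "آموزش-رایگان/motivation-letter",
    "новости/2020",
]
for url in samples:
    print(url, is_latin_url(url))
```

Note that `:` and `%` are not in the allowed set, so a value that still carries its scheme (`https://`) or percent-encoded characters would be rejected. Uppercase accented letters such as `É` are also missing. Extend the character class if your URLs need them.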

izaychik63 · April 15, 2020, 11:00pm

You’re welcome, @stevelp. Happy KNIMEing.
