Filter out values with non-Latin characters

Hi all,

I’m trying to filter a list of web URLs to exclude those that contain non-Latin characters (e.g., Chinese, Japanese, Cyrillic, Arabic, etc.).

I tried using the Tika Language Detector, but since URLs aren’t written in full sentences, it didn’t work very well.

Thanks in advance for your suggestions!

You can use this as an idea

Thanks for the suggestion! I’ve been trying to learn more about regular expressions, and hadn’t realized that the MATCHES operator used them. I’ve tried it out with the syntax you suggested:
$URL$ MATCHES ".*[^0-9A-Za-z ].*" => TRUE

and it doesn’t seem to be filtering anything out. For example آموزش-رایگان/نامه-ی-انگیزشی-یا-motivation-letter-چیست-؟ wasn’t filtered out.
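
One possible reason nothing was filtered: the class `[^0-9A-Za-z ]` also matches ordinary URL punctuation such as `/`, `.`, and `-`, so the rule is TRUE for nearly every URL, Latin or not. Below is a minimal Python sketch of that behavior. It assumes MATCHES requires a whole-string match, emulated here with `re.fullmatch`; the first sample URL is hypothetical, and the second is the example above.

```python
import re

# Pattern from the post: any character that is not an ASCII letter, digit, or space.
pattern = re.compile(r".*[^0-9A-Za-z ].*")

urls = [
    "example.com/motivation-letter",  # hypothetical Latin-only URL
    "آموزش-رایگان/نامه-ی-انگیزشی-یا-motivation-letter-چیست-؟",  # example from the post
]

# fullmatch is used on the assumption that MATCHES needs a whole-string match.
flags = [bool(pattern.fullmatch(u)) for u in urls]
print(flags)  # both URLs are flagged, because "." and "-" are also non-alphanumeric
```

Since both kinds of URL produce the same result, the rule cannot separate them.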

Please add * after the dots; the site strips some symbols.

Yeah, I did have those too. Here’s a screenshot of what I did: [image]

You need to add to the current list all symbols that appear in your URLs, such as . - % ? / and so on, except the symbols you want to exclude. For / use double //
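
The idea of listing every allowed symbol can be sketched as an allowlist character class: a URL passes only if all of its characters are in the list. This is a rough Python illustration, not the KNIME rule itself; the symbol set, the helper name `is_latin_only`, and the sample URLs are assumptions.

```python
import re

# Allowed characters: ASCII letters, digits, and a few common URL symbols.
# The exact symbol set is an assumption for illustration; extend it as needed.
ALLOWED = re.compile(r"[0-9A-Za-z.\-%?/:=&_]+")

def is_latin_only(url: str) -> bool:
    """Return True when every character of url is in the allowed set."""
    return ALLOWED.fullmatch(url) is not None

print(is_latin_only("example.com/path?id=1"))   # only allowed characters
print(is_latin_only("example.com/страница"))    # contains Cyrillic letters
```

With this approach, any character missing from the list causes the URL to be rejected, so the list has to grow until it covers every symbol that valid URLs use.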

I have a correction. Use it the following way:
MATCHES "[^-,#0-9]+"

This is great @izaychik63! I’ve changed the syntax a little, and I’m adding in the characters that I want to from valid URLs. I’ll just keep feeding more URL lists in here until I make sure I have all valid URLs. Here’s what it looks like at the moment:

$Clean URL$ MATCHES "^([a-zA-Z0-9_?=&\-\./&=+~_ é\[\]’]+)" => TRUE

I’ll update this when I’m done, on the off chance someone wants to filter URLs exactly like I do :smiley:

Thanks for your help!

Alright, this is where I ended up. There may be a better way to do this, but this filters 99% of the way I’d do it manually.

$Clean URL$ MATCHES "^[a-zA-Z0-9_?@&/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ\[\]\-\.]+" => TRUE
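
As one possible alternative to maintaining the accented-letter list by hand, a check based on Unicode character names could accept any letter whose name starts with LATIN. This is a different technique from the rule above, sketched in Python as a rough heuristic; the function name `has_non_latin_letter` and the sample strings are hypothetical.

```python
import unicodedata

def has_non_latin_letter(text: str) -> bool:
    """Return True if any alphabetic character is not a Latin-script letter."""
    for ch in text:
        # Only letters are checked; digits, punctuation, and spaces are ignored.
        if ch.isalpha():
            # Unicode names of Latin letters start with "LATIN", for example
            # "LATIN SMALL LETTER E WITH ACUTE".
            if not unicodedata.name(ch, "").startswith("LATIN"):
                return True
    return False

print(has_non_latin_letter("example.com/café-menu"))
print(has_non_latin_letter("example.com/страница"))
```

Unlike the character-class rule, this sketch does not restrict punctuation, so it would need to be combined with a symbol check if unusual symbols should also be rejected.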

Thanks again @izaychik63 for your help!

You're welcome, @stevelp. Happy KNIMEing.
