Filtering out Chinese, Korean, Japanese, Thai characters with Python Script Node or Rule Engine using regular expression

xli · February 8, 2023, 10:17pm

Hi all, I am trying to filter data records that contain any Asian characters so that I can later run a translator on them. I tried a Python Script node with exactly the same code I use in Spyder; it works in Spyder but not in the Python Script node in KNIME.

Here is the code:

output_table_1 = input_table_1.copy()

def contain_chinese(check_str):
    for ch in check_str:
        if '\u4e00' <= ch <= '\u9fa5':
            return True
    return False

def contain_korean(check_str):
    for ch in check_str:
        if '\uac00' <= ch <= '\ud7a3':
            return True
    return False

def contain_japanese(check_str):
    for ch in check_str:
        if '\u0800' <= ch <= '\u4e00':
            return True
    return False

output_table_1 = output_table_1[output_table_1['Customer Name'].apply(lambda x: contain_chinese(x))]
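One possible reason the same code behaves differently in KNIME (an assumption; the thread does not confirm it) is that the column contains missing values, which reach pandas as None or NaN and cannot be iterated character by character. A minimal sketch of a check that tolerates such cells is shown here. The names `SCRIPT_PATTERNS`, `contains_script`, and `contains_any_asian` are hypothetical, and the block boundaries are taken from the Unicode charts rather than from the post. Kanji in Japanese names fall into the "chinese" range, so only kana is listed separately.

```python
import re

# Unicode block ranges (assumed from the Unicode charts, not from the post):
# CJK Unified Ideographs, Hangul Syllables, Hiragana + Katakana, Thai.
SCRIPT_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fff]"),
    "korean": re.compile(r"[\uac00-\ud7a3]"),
    "japanese_kana": re.compile(r"[\u3040-\u30ff]"),
    "thai": re.compile(r"[\u0e00-\u0e7f]"),
}


def contains_script(value, script):
    """Return True if value is a string with at least one character of the script."""
    if not isinstance(value, str):
        # Missing cells often arrive as None or float NaN; treat them as no match.
        return False
    return SCRIPT_PATTERNS[script].search(value) is not None


def contains_any_asian(value):
    """Return True if value contains a character from any of the listed scripts."""
    return any(contains_script(value, name) for name in SCRIPT_PATTERNS)
```

In the script from the post, this could replace the lambda, for example `output_table_1['Customer Name'].apply(contains_any_asian)`.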

I also tried using a regular expression, but that doesn't seem to work well either.
For example, this pattern for Japanese returns records that contain no Japanese characters at all: ^.([\u0800-\u4e00]).
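A possible explanation, sketched here under assumptions: the range \u0800-\u4e00 spans far more than Japanese. It includes Thai (U+0E00 to U+0E7F) and the General Punctuation block, so a Latin name containing a curly apostrophe (U+2019) or an en dash (U+2013) also matches. The sample names are made up for illustration, and the kana range comes from the Unicode charts rather than from the post.

```python
import re

broad = re.compile(r"[\u0800-\u4e00]")  # range used in the post
kana = re.compile(r"[\u3040-\u30ff]")   # Hiragana + Katakana only (assumed narrower range)

# Hypothetical sample values, written with escapes so the code points are visible.
samples = {
    "latin_curly_apostrophe": "O\u2019Brien",    # U+2019 right single quotation mark
    "latin_en_dash": "Smith \u2013 Jones",       # U+2013 en dash
    "thai": "\u0e2a\u0e21\u0e0a\u0e32\u0e22",
    "katakana": "\u30bf\u30ca\u30ab",
    "plain_latin": "John Smith",
}

# For each sample: (matches the broad range, matches the kana range).
results = {
    name: (bool(broad.search(text)), bool(kana.search(text)))
    for name, text in samples.items()
}

for name, (hit_broad, hit_kana) in results.items():
    print(f"{name}: broad={hit_broad} kana={hit_kana}")
```

Only the katakana sample matches the narrower range, while the broad range also flags the two Latin names with typographic punctuation and the Thai name.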

Therefore I ended up trying to filter out all Latin-only records with this Rule Engine Row Filter, but the output still contains Latin-only records.
Here is the rule I used, with "exclude TRUE matches" selected: $Customer Name$ MATCHES "^[a-zA-Z0-9_?@&.,，~()（）^:;/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-]+" => TRUE
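An allow-list like this only excludes a row when every character of the value is in the list. The list above has lowercase accented letters but no uppercase ones, and it lacks punctuation such as `#`, `!`, or `"`, so an otherwise Latin name containing any of those would not match and would stay in the output. The thread does not show the actual data, so this is only a guess at the cause. A small Python sketch for finding the characters that break the match follows; `ALLOWED` copies the character class from the post, and `offending_chars` is a hypothetical helper name.

```python
import re

# Character class copied from the Rule Engine expression in the post.
ALLOWED = "a-zA-Z0-9_?@&.,，~()（）^:;/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-"

# Negated class: matches any single character the allow-list does not cover.
not_allowed = re.compile("[^" + ALLOWED + "]")


def offending_chars(name):
    """Return the distinct characters of name that the allow-list does not cover."""
    return sorted(set(not_allowed.findall(name)))
```

Running `offending_chars` over the Latin-only rows that slip through should show which characters need to be added to the class.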

If anyone can give some advice, it would be really appreciated!
Thank you in advance!

mlauber71 · February 8, 2023, 11:07pm

@xli you might want to check if you will have to double escape the Unicode codes.

There are several other threads and examples about RegEx that might help.
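For the Python side at least, single and double escaping end up equivalent, because the `re` module itself understands `\uXXXX` escapes in string patterns. The sketch below illustrates this with three spellings of the same character class. Whether the KNIME Rule Engine needs the doubled form is not confirmed in this thread, and the sample strings are made up.

```python
import re

# Three spellings of the same character class.
literal = re.compile("[\u4e00-\u9fff]")     # Python resolves the escape before re sees it
raw = re.compile(r"[\u4e00-\u9fff]")        # re resolves the escape itself
doubled = re.compile("[\\u4e00-\\u9fff]")   # same pattern text as the raw string

sample_cjk = "\u5f20\u4f1f"   # hypothetical name written in CJK ideographs
sample_latin = "John Smith"

matches_cjk = [bool(p.search(sample_cjk)) for p in (literal, raw, doubled)]
matches_latin = [bool(p.search(sample_latin)) for p in (literal, raw, doubled)]
```

All three patterns behave the same way here, so in a Python Script node the escaping style is unlikely to be what decides whether a row matches.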

xli · February 9, 2023, 8:06am

Hi, thanks for your reply.
I tried double escaping, but it still doesn't work. It's not that it fails completely: the Rule Engine Row Filter did pick up some records with Asian characters, but it also returned records that are entirely in English.
