String Split (Regex): Split Chinese Characters from String

Er3n · October 10, 2024, 7:57am

Hi everyone,

I was looking for geo data (longitude & latitude) for Chinese cities to create a map in Power BI. I found a dataset on GeoNames.

I’m interested in the field “alternatenames” which contains the city names in Chinese. The data there looks like this:

Kanbaduo,Oiser,Oisêr,Wosai,Wosai Xiang,kan ba duo,wo sai,wo sai xiang,坎巴多,窝塞乡,窝赛

I’ve been trying to use the String Splitter (Regex) node to separate the Chinese names from the rest, but none of the regex patterns I tried has worked so far. The desired output would be 2 columns, one for the Chinese names and one for the remaining alternate names:

Rest	Chinese
Kanbaduo,Oiser,Oisêr,Wosai,Wosai Xiang,kan ba duo,wo sai,wo sai xiang	坎巴多,窝塞乡,窝赛

Any tips on how to write the Regex code for this?

Best,
Eren

Er3n · October 11, 2024, 1:37am

The following regex code somewhat works, but it only returns the last Chinese character of every string in a new column:

^.*([\u4e00-\u9fa5]+)$

In the case of…

Kanbaduo,Oiser,Oisêr,Wosai,Wosai Xiang,kan ba duo,wo sai,wo sai xiang,坎巴多,窝塞乡,窝赛

…it only returns 赛, not 坎巴多,窝塞乡,窝赛

I seem to be missing something here.

takbb · October 11, 2024, 7:38am

Hi @Er3n, you are close with the regex, but you have two potential problems:

1. The initial .* (to match any characters) is “greedy”, which means it will try to consume as many characters as possible whilst still allowing the whole expression to succeed.
2. You have commas within the set of Chinese characters, and your character class does not match commas.

Let’s simplify things (you may already know much of what I write below, but a simple example may be of use to others).

If you had the string AAABB,BB,BBB and you wanted to collect the part containing the Bs, an equivalent to your current regex would be:

^.*([B]+)

But if you used this, it would collect only the final B, because everything up to the final B is consumed by the “.*”.

You can stop the .* from being “greedy” by following it with a question mark “?”, which makes it consume as few characters as possible.

You now have:

^.*?([B]+)

and this would collect the final three Bs (but not the others), because the capture group can only match a continuous run of Bs and the data contains commas.

So we can tell it to also match the commas that sit between the Bs:

^.*?([,B]+)

and now it will collect: BB,BB,BBB
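
For anyone who wants to reproduce the three steps outside KNIME, here is a minimal Python sketch. It assumes the node only accepts a pattern that matches the whole input string, which is what the results described above imply; that assumption is modelled with re.fullmatch. The variable names are just for illustration.

```python
import re

s = "AAABB,BB,BBB"

# fullmatch() requires the pattern to cover the entire string.
# This models the assumption that the node matches the whole cell value.
greedy = re.fullmatch(r"^.*([B]+)", s).group(1)
lazy = re.fullmatch(r"^.*?([B]+)", s).group(1)
lazy_commas = re.fullmatch(r"^.*?([,B]+)", s).group(1)

print(greedy)       # B          (greedy .* leaves only the last B)
print(lazy)         # BBB        (lazy .*?, but the group cannot cross a comma)
print(lazy_commas)  # BB,BB,BBB  (commas allowed inside the group)
```

With a search-style match instead of a whole-string match, the lazy versions would stop at the first run of Bs, so the results depend on how the tool applies the pattern.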

So, replacing the B in my regex with your Unicode range, the equivalent for your regex would be:

^.*?([,\u4e00-\u9fa5]+)$
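
As a quick check of this pattern on the sample row, here is a small Python sketch, again assuming whole-string matching (re.fullmatch). One thing it shows is that the captured text starts with a comma, because the comma in front of the first Chinese name also belongs to the character class. The second pattern is a hypothetical variant, not something posted in this thread, that keeps that comma outside both groups and returns the two columns asked for in the first post.

```python
import re

names = ("Kanbaduo,Oiser,Oisêr,Wosai,Wosai Xiang,kan ba duo,"
         "wo sai,wo sai xiang,坎巴多,窝塞乡,窝赛")

# Pattern from the post above, matched against the whole string.
captured = re.fullmatch(r"^.*?([,\u4e00-\u9fa5]+)$", names).group(1)
print(captured)  # ,坎巴多,窝塞乡,窝赛  (leading comma included)

# Hypothetical two-group variant (an assumption, not from the thread):
# group 1 = everything before the Chinese names, group 2 = the Chinese names.
# The second group must start with a Chinese character, so the separating
# comma is consumed between the groups instead of being captured.
two_cols = re.fullmatch(r"^(.*?),([\u4e00-\u9fa5][,\u4e00-\u9fa5]*)$", names)
rest, chinese = two_cols.groups()
print(rest)     # Kanbaduo,Oiser,Oisêr,Wosai,Wosai Xiang,kan ba duo,wo sai,wo sai xiang
print(chinese)  # 坎巴多,窝塞乡,窝赛
```

Both patterns rely on the Chinese names sitting at the end of the string, as they do in this sample; rows where they are mixed in between the Latin names would need a different approach.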

Er3n · October 11, 2024, 8:06am

@takbb thanks a lot for your help and the detailed explanation! I learned a lot about how to use regex from it.

In the meantime I had come up with this code:

[a-zA-Z0-9_?@&.,，~()（）^:;/=+~'’ àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-],([\u4e00-\u9fa5]+).

It would return the first group of Chinese characters, e.g.:

坎巴多

It wasn’t perfect, but I could work with it.

Best,
Eren

takbb · October 11, 2024, 8:38am

Glad you got something workable, and thanks for marking the solution.
