Tagging legal entities with regex

I hope you are having a nice day.

If you have a minute, could you please help me with a regex? I believe I made some progress with writing regular expressions last week, but I still can’t make this one work the way I need :blush:

The task seems to be rather common.

There’s a text like:
Japanese companies, including electronic component maker Ibiden Co (4062.T), will work with Taiwan Semiconductor Manufacturing Co (TSMC) (2330.TW) to develop chip manufacturing technology in Japan. Ibiden Co plans to start manufacturing later this year. (Just a sample to show what I’m trying to achieve.)

The NE Tagger sometimes fails to identify an ORGANIZATION, and frankly these failures are often rather basic. In the example above (I’m not sure, as I didn’t check this exact piece of text) it may fail to identify the full company name “Taiwan Semiconductor Manufacturing Co”.

I’m experimenting with the Wildcard Tagger, trying to select sequences of words that each start with a capital letter and that end with “Co”.

“\b[A-Z](?:[A-z]*)\sCo”: this regex returns “Ibiden Co” but not “Taiwan Semiconductor Manufacturing Co” (it returns only the first result). If I delete “Ibiden Co” from the sample text, the regex returns the entire string from the capital “T” in “(2330.TW)” to the end of the name “Taiwan Semiconductor Manufacturing Co”.

I hope this made you smile, but last night I was desperately searching for a solution and still failed :blush:

Also, it would be nice to be able to tag (at the end of the sample text) only “Ibiden Co”, not “Japan. Ibiden Co”.

If you have a minute, please help me with this regex. I’m trying to learn, but so far it isn’t going well :frowning:

Have a great day!

As far as I can tell, your RegEx is designed to find only a single capitalized word, followed by a space and then “Co”. It helps to use a RegEx tester so you can see what’s going on.
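If no RegEx tester is at hand, a few lines of Python can stand in for one. This is only a sketch using Python's `re` module, which is an assumption on my part: the Wildcard Tagger works on tokenized terms and may behave differently, so treat the result as a way to reason about the pattern rather than as what the node will output.

```python
import re

# Sample sentence from the question.
text = (
    "Japanese companies, including electronic component maker Ibiden Co (4062.T), "
    "will work with Taiwan Semiconductor Manufacturing Co (TSMC) (2330.TW) to develop "
    "chip manufacturing technology in Japan. Ibiden Co plans to start manufacturing "
    "later this year."
)

# Pattern from the question: one capitalized word, one whitespace character, then "Co".
original = re.compile(r"\b[A-Z](?:[A-z]*)\sCo")

matches = original.findall(text)
print(matches)  # ['Ibiden Co', 'Manufacturing Co', 'Ibiden Co']

# Side note: the range A-z also covers the six ASCII characters between "Z" and "a",
# so it accepts "_" (and "[", "^", ...) as a "letter".
underscore_is_letter = bool(re.fullmatch(r"[A-z]", "_"))
print(underscore_is_letter)  # True
```

In plain Python the pattern only ever captures the one word directly in front of “Co”, which is why the longer name comes back as “Manufacturing Co”. Writing `[A-Za-z]` instead of `[A-z]` avoids the stray punctuation characters.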

To really get this to work, you’d need to allow for matching sequential capitalized words by modifying and nesting the expression:

(?:\s[A-Z][a-z]+)+\sCo
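To check this expression against the sample text, here is a small Python sketch (again assuming Python's `re` module rather than the Wildcard Tagger itself). Because the repeated group begins with `\s`, every match starts with the whitespace before the first word, so the sketch strips it. The `anchored` pattern is a hypothetical variant of mine, not something from this thread: it starts at a word boundary instead, so the match contains no leading whitespace.

```python
import re

# Sample sentence from the question.
text = (
    "Japanese companies, including electronic component maker Ibiden Co (4062.T), "
    "will work with Taiwan Semiconductor Manufacturing Co (TSMC) (2330.TW) to develop "
    "chip manufacturing technology in Japan. Ibiden Co plans to start manufacturing "
    "later this year."
)

# Expression from the answer: one or more " Word" groups, followed by " Co".
nested = re.compile(r"(?:\s[A-Z][a-z]+)+\sCo")
raw_matches = [m.group() for m in nested.finditer(text)]
nested_matches = [m.strip() for m in raw_matches]  # drop the leading whitespace

# Hypothetical variant: first word anchored at a word boundary,
# then zero or more further capitalized words, then " Co".
anchored = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\sCo\b")
anchored_matches = anchored.findall(text)

print(nested_matches)
print(anchored_matches)
# Both print:
# ['Ibiden Co', 'Taiwan Semiconductor Manufacturing Co', 'Ibiden Co']
```

Neither pattern picks up “Japan. Ibiden Co”, because the full stop after “Japan” is not whitespace and so breaks the chain of capitalized words. Note that `[a-z]+` requires lowercase letters after the initial capital, so an all-caps word such as “TSMC” would not count as part of a name.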

Alternatively, you can select consecutive capitalized words (without specifying Co) using the following expression:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
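As a rough illustration of how this second expression differs, the Python sketch below runs it on the sample text and on one extra sentence. The extra sentence is made up for this example and is not from the question; Python's `re` module is likewise an assumption, and the Wildcard Tagger may split the text differently.

```python
import re

# Sample sentence from the question.
text = (
    "Japanese companies, including electronic component maker Ibiden Co (4062.T), "
    "will work with Taiwan Semiconductor Manufacturing Co (TSMC) (2330.TW) to develop "
    "chip manufacturing technology in Japan. Ibiden Co plans to start manufacturing "
    "later this year."
)

# Made-up sentence, used only to show matches that do not end in "Co".
extra = "Sony Group opened an office in New York."

# Expression from the answer: two or more consecutive capitalized words.
consecutive = re.compile(r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)")

# With a single capturing group, findall returns the group, here the whole match.
sample_matches = consecutive.findall(text)
extra_matches = consecutive.findall(extra)

print(sample_matches)
# ['Ibiden Co', 'Taiwan Semiconductor Manufacturing Co', 'Ibiden Co']
print(extra_matches)
# ['Sony Group', 'New York']
```

On the sample text the result is the same as with the “Co” version, but the second sentence shows the trade-off: without the “Co” suffix the expression also tags place names and any other run of capitalized words, so the output may need filtering afterwards.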

@elsamuel, thank you very much for the detailed explanation!

Your regex works!

Thanks again for your help!
