Split text data and transform into one list

Hi,

I have data with one column of codes and one of text. Example below:

J9031, J9271 | Bcg (intravesical) per instillation, Injection, pembrolizumab, 1 mg
J9030, J9060, J9206 | BCG live intravesical instillation, 1 mg, Injection, cisplatin, powder or solution, 10 mg, Injection, irinotecan, 20 mg|

I would like to find a way of splitting the text to get a list like this.
J9031 | Bcg (intravesical) per instillation
J9271 | Injection, pembrolizumab, 1 mg

etc.

Have tried using cell splitter and string manipulation with regex but have not have success.

Any help is appreciated, Thanks!

What does the wider uniformity of the data look like? Does the 1st split always end in “instillation” or break at the 1st comma? Do subsequent splits always break at “ mg”? The challenge is that commas are both used to list and as a delimiter, so you need to define more complex delimiter rules.

1 Like

The text does vary quite a lot, one pattern would be starts with “injection” and end with “mg”. But there is no uniformity in the text , it is only with certain buzzwords that would indicate a code. The instillation was only used as a example but would assume in regex could have that contained within “.*?”.

I’m afraid you are probably looking at complex positive/negative lookahead/lookbehind logic in Regex to get this. It requires a lot of samples with input and expected output to capture as much rules as possible to build a somewhat solid function.

1 Like

Step number one would be to review the data and ID the “buzzwords” that could be used as split delimiters or within a regex formula approach. You can post a workflow with sterilized data that is representative of your buzzword challenges if you would like more specific help.

Unfortunately with the multi use of commas and non-uniformity of text, you will not be able to get there using a simple “wildcard” regex approach like “.*?”.

Unless you can define clean buzzwords and an “OR” logic that defines the column break points, you will need a pretty complex regex approach as @ArjenEX points out. If you are just starting out in regex then I recommend doing this in a multi-step approach. If column one has a different buzzword approach then the remaining columns, then focus there first.

1 Like

I just like to point out that your sample data has one thing that’s consistent - the fact that each split text starts with a capital letter, followed by all lowercases for the following substrings. If this is true for the whole dataset, then @ArjenEX and @iCFO might be able to help with the corresponding regex. (comma, space, capital letter as the signal for possible delimiter maybe?)

3 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.