How to parse through a complicated string, eliminate unneeded text, and get a result in a new column

I have a set of product descriptions. I need to parse through everything but the “configuration” which is included in the description.

These usually look like #X#

So,

9X10
10X10
5X10

I need the whole number on the left in a separate column.

Here is some example data and the result I’m hoping for.

Divalproex Sodium DR Tablets, USP 500mg 9x10 = 9
Oxycodone/APAP 10/325mg 10x10 C-II = 10
Oxycodone HCL Tablets 5mg 10x10 C-ll (Amneal) = 10
FINASTERIDE 5MG UD TABLETS 5X10 = 5

Are you familiar with RegEx?

I’ve had success using the Regex Extractor node.

The Regex I used is (?<=\s)\d{1,}(?=[X|x])

This matches any number (with 1 or more digits) that is preceded by a space and followed by an uppercase or lowercase x.

image

4 Likes

Thank you! I am just beginning to learn Regex. I’m going to try to look up some guides to learn the syntax. I’ve installed the software package you mentioned. Your expression worked great, but there are a few outliers where it was unable to find a match.

Here’s some examples.

Benztropine Mesylate Tablets, USP 0.5mg 10x10
BACLOFEN 10MG TABS UD [MAJOR]10X10
CARB/LEVO 25/250MG TAB U/D [MAJOR]10X10
BACLOFEN 10MG TABS UD [MAJOR]10X10

The problem here is that I assumed that the number being extracted would always be preceded by a space, but it turns out that this is not the case.

Removing that condition gives the RegEx \d{1,}(?=[X|x]) which works for all 8 strings that you’ve posted here, but I’d be worried that it’s not specific enough. Is there a chance that there’d be multiple #X# sequences in a single string?

Thanks for the help!

It’s possible, but not super likely. I’m only testing with 500 rows right now, I’d have to pull more data to be absolutely sure.

Is there a way to make sure the data is valid? It should always result in the number 10 or less for this workflow.