Extract substring from pattern - RegEx perhaps?

I have records that contain a string of characters that is supposed to be a list of metadata with delimiters, where in each piece of metadata is separated by a different delimiter, a single delimiter character indicates the beginning of the metadata, a double delimiter indicates the end:

!PS_DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.O.1.2^^|02-102310-00029||<P.96<<>R.32>>[66]

Of course, not every string contains all of the same segments, nor are the always in the same order.
I think I have the RegEx worked out to isolate the different pieces, for example:
(?<=\^).*(?=\^\^) will return VA.912.O.1.2
(?<=\|).*(?=\|\|) will return 02-102310-00029
(?<=#).*(?=##) will return Three Dimensional Studio Art 2
etc.

But I’m struggling on how to use that information in Knime to extract that data into new columns. I understand the String Replacer implantation of RegEx is matching only, so I’m looking at the Regex Split node, but I can’t seem to figure out the syntax to use here and keep getting no results mixed with some console errors like:
WARN Cell Splitter 3:73 Specify the column to split.
WARN Regex Split 3:66 150971 input string(s) did not match the pattern or contained more groups than expected
WARN Regex Split 3:66 150971 input string(s) did not match the pattern or contained more groups than expected

Any guidance is greatly appreciated :slight_smile:

EDIT:
I found another thread that might shed some light on my issue

Which indicates that the Regex Split should account for the entire string, so I have been trying something like:
((?<=^).(?=^^))
((?<=|).
(?=||))
((?<=#).(?=##))
((?<=!).
(?=!!))
((?<=$).(?=$$))
((?<=<).
(?=<<))
((?<=>).*(?=>>))

Playing with groupings, adding .* at the end and/or beginning, but I’m still not getting results other than the errors mentioned above.

Hi @serendipitytech,

I assume you’d like to convert that into a proper table to make use out of the data. The challenge here, as you noted, is that not always all values are present which would result in values appearing in the wrong column.

I’d recommend to:

  1. Normalize the delimiter by applying a string replacer
  2. Split into columns using the cell split

https://hub.knime.com/mw/spaces/Public/latest/Knime%20Forum%20Support/Split%20variating%20string%20into%20table

Kind regards
Mike

2 Likes

Hi @serendipitytech,

Regarding the solution by @mw , you can use the String Manipulation node and this expression:
regexReplace($column1$, "[^a-zA-Z\\d. \\-_]([a-zA-Z\\d. \\-_]+)[^a-zA-Z\\d. \\-_]{2}", "$1___")

Then the Cell Splitter having “___” (3 underscores) as delimiter will split the string into separated columns.

:blush:

P.S. The regex pattern is created based on your current example. You can modify it later.

6 Likes

Hmm, thanks for the thought there.

I was thinking around the normalization of the delimiter but was having trouble accounting for the single delimiter at the start and double at the end pattern. The regex example from @armingrudd was helpful to get my mind going in that direction, but because the data values aren’t always in the same order, I can’t be sure what values are in what columns after splitting, or if a value is missing completely, then some rows will have a different number of columns.

But, the idea did get me thinking about the task from a different angle - I did find a long way around by stringing a [Cell Splitter-Cell Splitter-Column Filter-Column Rename] group 8 times to account for all the possible delimiters (I missed one in my original post [ and ]) to get to a point that was kinda workable.

So basically, I used that group of nodes to accomplish the normalization and splitting lie this:
Cell Splitter 1- Target the data field and find the instance of two delimiter
Cell Splitter 2- Target the result of Cell Splitter 1 and find the single instance of the delimiter
Column Filter - Filter out the extra split columns
Column Rename - Rename the desired split column to the data point it contains
Results: [original_data] [extraced_data_point]

Repeat for every pair of delimiters and I’ve gotten what I need. To check for any extra/missing, I can calculate the total length of the original data field and compare it to the length of the results columns, accounting for the delimiters. Any non 0 results I can check back in the original data column for what extra pieces I wasn’t aware of.

Thanks so much @armingrudd and @mw ! I needed that point of view and regEx example.

1 Like

Are the values inside the same delimiters in all rows of the same type? (do delimiters demonstrate columns?)

If yes, I think I can provide you with a more simple workaround than what you have already found.

:blush:

Hi @serendipitytech,

I had the same challenge in the past. Talking about the approach would take more time than working on some sample data. Can you provide some?

Cheers
Mike

@armingrudd I’m not sure I follow your question. Delimiters do indeed indicate where the columns should be broken.

Here is one example record:
!TAM!!$0101340$$#ThreeDimenStudioArt2##^VA.912.S.3.1^^>R.48>><P.8<<[81]|821389455a0c8cbc||

The result I’m looking to get is:
TAM 0101340 ThreeDimenStudioArt2 VA.912.S.3.1 821389455a0c8cbc P.8 R.48

Another example record where the elements are in a different order:
!PS_DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.O.1.2^^|02-102310-00029||<P.96<<>R.32>>[66]

@mw here is a set of a few records if that helps?

!TAM!!$0101340$$#ThreeDimenStudioArt2##^VA.912.S.3.1^^>R.48>><P.8<<[81]|821389455a0c8cbc||

!PS_DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.O.1.2^^|02-102310-00029||<P.96<<>R.32>>[66]

!FL DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.F.3.4^^|3530ba414b65c7e3||>R.25>><P.96<<[86]

!TA!!$0101340$$#ThreeDimenStudioArt2##^VA.912.S.3.11^^>R.2>><P.96<<[80]|00eb125b751337e2||

!FL DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.S.3.1^^|982fc9c191175510||>R.53>><P.83<<[81]

!TA!!$0101340$$#ThreeDimenStudioArt2##^VA.912.C.2.4^^>R.31>><P.98<<[35]|3f906e1ac94d4d8b||

!DDA!!$0101340$$#3DStArt##^VA.912.S.1.4^^|f4eb6e1b7e33b422||

How about this:

split_delimiter.knwf (35.1 KB)

I created a list of delimiters and then different regex patterns are created in each loop based on the delimiters. Only for the values inside the square brackets I had to create the expression separately since this one’s format is different from the other ones.

:blush:

3 Likes

@armingrudd - That is super clever, and much more future proof than the long solution I had!

I can’t wait to play with it, jut realized I’m still running 3.7.2, even though check updates doesn’t show anything I’ll get it all updated so I can try out your solution.

Thank you so much for taking the time to work that out!

2 Likes

For KNIME AP 4 you need a fresh installation.

Here is the expression inside the String Manipulation (Variable) which is then used as the expression for the first Column Expressions node:

join("if(regexMatcher(column(\"column1\"), \".*[\\\\",$${Sdelimiter}$$, "][a-zA-Z\\d. \\-_]+[\\\\",$${Sdelimiter}$$ ,"]{2}.*\")) regexReplace(column(\"column1\"), \".*[\\\\",$${Sdelimiter}$$, "]([a-zA-Z\\d. \\-_]+)[\\\\",$${Sdelimiter}$$ ,"]{2}.*\", \"$1\")")

And the second Column Expressions:

if(regexMatcher(column("column1"), ".*\\[[a-zA-Z\\d. \\-_]+\\].*"))regexReplace(column("column1"), ".*\\[([a-zA-Z\\d. \\-_]+)\\].*", "$1")

As you can see, I have created the expressions with regexMatcher and regexReplace functions. If we only use the regexReplace function in a String Manipulation, and if the string does not contain the delimiter, we would have the whole string instead of missing. So the matcher function checks if the string contains the delimiter and if it does not exist then a missing value will be returned.

For the output column name of the first Column Expressions node, I have used currenIteration variable from the loop and for the second Column Expressions I have entered it manually (it can be done automatically as well - use row count of the delimiter table as the column name).

:blush:

1 Like

Good morning @serendipitytech,

I updated the workflow in the hub

There are some unmatched values, numbers enclosed in “[\d+]”, that can not be mapped. What should be done with them?

Edit: Adding a little bit of background. The challenges are that you’ve got:

  1. Variating data, values may not always be present or appear in another order
  2. There are variating delimiters

So you first got to harmonize data. By using the patterns available you reduce complexity by converting it into an almost flat table. Categorize / Classify the data cells based on their delimiter and sanitize the data cells afterwards. Then a simple pivot is required to get what you need.

If you face a complex challenge, try to break it down :wink:

Cheers
Mike

2 Likes

Good morning @mw and @armingrudd

Just got Knime all updated so I can explore this solution more.

The regexMatch/Replace combo makes sense once I understood that the match only returns a 1/0 response.

I’m not very good at RegEx, funny enough I was just starting to explore those this week when this problem fell in my lap, so I only have a basic understanding of RegEx in general. These expressions are stretching my brain in that good way :slight_smile:

As for the enclosure, that is an outlier, and in the entire dataset I have at the moment it only occurs 34 times out of about 126,000 records. And of course this is the one that doesn’t end with a double delimiter.

1 Like

For the outlier you might simply use a column combiner and selecting the columns to combine via RegEx or wildcard as all start with a bracket. Unfortunately that “pollutes” the values slightly.

My favorite approach would be:

  1. Create Collection Column (Enables to skip missing values)
  2. Collection to String (gives full flexibility over separators)

Happy data cleansing
Mike

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.