I have records that contain a string of characters that is supposed to be a list of metadata with delimiters, where in each piece of metadata is separated by a different delimiter, a single delimiter character indicates the beginning of the metadata, a double delimiter indicates the end:
!PS_DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.O.1.2^^|02-102310-00029||<P.96<<>R.32>>[66]
Of course, not every string contains all of the same segments, nor are the always in the same order.
I think I have the RegEx worked out to isolate the different pieces, for example: (?<=\^).*(?=\^\^) will return VA.912.O.1.2 (?<=\|).*(?=\|\|) will return 02-102310-00029 (?<=#).*(?=##) will return Three Dimensional Studio Art 2
etc.
But I’m struggling on how to use that information in Knime to extract that data into new columns. I understand the String Replacer implantation of RegEx is matching only, so I’m looking at the Regex Split node, but I can’t seem to figure out the syntax to use here and keep getting no results mixed with some console errors like:
WARN Cell Splitter 3:73 Specify the column to split.
WARN Regex Split 3:66 150971 input string(s) did not match the pattern or contained more groups than expected
WARN Regex Split 3:66 150971 input string(s) did not match the pattern or contained more groups than expected
Any guidance is greatly appreciated
EDIT:
I found another thread that might shed some light on my issue
Which indicates that the Regex Split should account for the entire string, so I have been trying something like:
((?<=^).(?=^^))
((?<=|).(?=||))
((?<=#).(?=##))
((?<=!).(?=!!))
((?<=$).(?=$$))
((?<=<).(?=<<))
((?<=>).*(?=>>))
Playing with groupings, adding .* at the end and/or beginning, but I’m still not getting results other than the errors mentioned above.
I assume you’d like to convert that into a proper table to make use out of the data. The challenge here, as you noted, is that not always all values are present which would result in values appearing in the wrong column.
I’d recommend to:
Normalize the delimiter by applying a string replacer
Regarding the solution by @mw , you can use the String Manipulation node and this expression: regexReplace($column1$, "[^a-zA-Z\\d. \\-_]([a-zA-Z\\d. \\-_]+)[^a-zA-Z\\d. \\-_]{2}", "$1___")
Then the Cell Splitter having “___” (3 underscores) as delimiter will split the string into separated columns.
P.S. The regex pattern is created based on your current example. You can modify it later.
I was thinking around the normalization of the delimiter but was having trouble accounting for the single delimiter at the start and double at the end pattern. The regex example from @armingrudd was helpful to get my mind going in that direction, but because the data values aren’t always in the same order, I can’t be sure what values are in what columns after splitting, or if a value is missing completely, then some rows will have a different number of columns.
But, the idea did get me thinking about the task from a different angle - I did find a long way around by stringing a [Cell Splitter-Cell Splitter-Column Filter-Column Rename] group 8 times to account for all the possible delimiters (I missed one in my original post [ and ]) to get to a point that was kinda workable.
So basically, I used that group of nodes to accomplish the normalization and splitting lie this:
Cell Splitter 1- Target the data field and find the instance of two delimiter
Cell Splitter 2- Target the result of Cell Splitter 1 and find the single instance of the delimiter
Column Filter - Filter out the extra split columns
Column Rename - Rename the desired split column to the data point it contains
Results: [original_data] [extraced_data_point]
Repeat for every pair of delimiters and I’ve gotten what I need. To check for any extra/missing, I can calculate the total length of the original data field and compare it to the length of the results columns, accounting for the delimiters. Any non 0 results I can check back in the original data column for what extra pieces I wasn’t aware of.
Thanks so much @armingrudd and @mw ! I needed that point of view and regEx example.
@armingrudd I’m not sure I follow your question. Delimiters do indeed indicate where the columns should be broken.
Here is one example record:
!TAM!!$0101340$$#ThreeDimenStudioArt2##^VA.912.S.3.1^^>R.48>><P.8<<[81]|821389455a0c8cbc||
The result I’m looking to get is:
TAM 0101340 ThreeDimenStudioArt2 VA.912.S.3.1 821389455a0c8cbc P.8 R.48
Another example record where the elements are in a different order:
!PS_DDA!!$0101340$$#Three Dimensional Studio Art 2##^VA.912.O.1.2^^|02-102310-00029||<P.96<<>R.32>>[66]
I created a list of delimiters and then different regex patterns are created in each loop based on the delimiters. Only for the values inside the square brackets I had to create the expression separately since this one’s format is different from the other ones.
@armingrudd - That is super clever, and much more future proof than the long solution I had!
I can’t wait to play with it, jut realized I’m still running 3.7.2, even though check updates doesn’t show anything I’ll get it all updated so I can try out your solution.
Thank you so much for taking the time to work that out!
As you can see, I have created the expressions with regexMatcher and regexReplace functions. If we only use the regexReplace function in a String Manipulation, and if the string does not contain the delimiter, we would have the whole string instead of missing. So the matcher function checks if the string contains the delimiter and if it does not exist then a missing value will be returned.
For the output column name of the first Column Expressions node, I have used currenIteration variable from the loop and for the second Column Expressions I have entered it manually (it can be done automatically as well - use row count of the delimiter table as the column name).
There are some unmatched values, numbers enclosed in “[\d+]”, that can not be mapped. What should be done with them?
Edit: Adding a little bit of background. The challenges are that you’ve got:
Variating data, values may not always be present or appear in another order
There are variating delimiters
So you first got to harmonize data. By using the patterns available you reduce complexity by converting it into an almost flat table. Categorize / Classify the data cells based on their delimiter and sanitize the data cells afterwards. Then a simple pivot is required to get what you need.
If you face a complex challenge, try to break it down
Just got Knime all updated so I can explore this solution more.
The regexMatch/Replace combo makes sense once I understood that the match only returns a 1/0 response.
I’m not very good at RegEx, funny enough I was just starting to explore those this week when this problem fell in my lap, so I only have a basic understanding of RegEx in general. These expressions are stretching my brain in that good way
As for the enclosure, that is an outlier, and in the entire dataset I have at the moment it only occurs 34 times out of about 126,000 records. And of course this is the one that doesn’t end with a double delimiter.
For the outlier you might simply use a column combiner and selecting the columns to combine via RegEx or wildcard as all start with a bracket. Unfortunately that “pollutes” the values slightly.