Extract Data From Right/Middle of String Based on Spaces not Characters

This one always plagues me during my data manipulation. I have no problem extracting data from the left to right. However, my data that I want to extract is often on the end or in worse case in the middle. The real challenge is that the string is dynamic, so a cell splitter based on spaces just creates a jumbled mess to work through. In the case outlined below, I do know that I need the data from the right of the string based on spaces (always 4), but not character counts (too dynamic). I have tried the String Manipulation with a “-4” but it just removes the first 4 characters. I have played around with ReGex, but to no avail. Basically looking to find the 4th space from the right & extract only the characters. I know I am making this way too hard…

substr($Description_Arr[3]$, indexOf($Description_Arr[3]$," " ) -4)

image

hi @davehansen ,
would this be a correct result?
immagine
or this one

Awesome, yes the second one is the desired end state. Those values will later have calculations built off of them.

to extract the last 4 groups of non-blank chars you can use this regex in a String Manipulation node
regexReplace($Description_Arr[3]$, "^.+ ([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)$", "$1 $2 $3 $4")
and then you can split the output string in a Cell Splitter node using a blank delimiter
If the regex doesn’t match $Description_Arr[3]$, that is, if $Description_Arr[3]$ contains less than 4 blanks, it returns the entire $Description_Arr[3]$. You can filter out these cases after the split, because the rightmost column (the 4th group) will contain a null value
Hope this solution works for you

2 Likes

Great stuff here! For my learning what does the ^ denote?

Square brackets allow us to list several possible characters ([0-9] stands for any digit)
inside of square brackets, ^ negates that selection ([^äöüÄÖÜ] stands for any character except Umlaute)
outside of square brackets, ^ means at the beginning of a line; $ denotes the end of a line

Don’t forget to select @duristef’s post as solution :wink:

1 Like

Thanks @Thyme
@davehansen , in your case: ([^ ]+) is a capturing group of non-blank chars. I replace the original string with the four rightmost groups of non-blank chars separated by one or more blanks. It would be difficult to capture the same groups would they be placed in the middle of the string, unless they had a well-defined pattern. For example, you could capture a sequence like this

  • beginning of the string
  • zero or more chars
  • Group 1: only digit(s) or digits,dot,digits - preceded by a word delimiter or at the beginning of the string
  • blank(s)
  • Group 3: exactly 2 uppercase chars
  • blank(s)
  • Group 4: digit(s)
  • blank(s)
  • Group 5: only digit(s) or digits,dot,digits
  • zero or more chars
  • end of the string

using this regex (which works in the former case too)

regexReplace($column1$, "^.*?(?<=\\b)(\\d+(\\.\\d+)*) +([A-Z]{2}) +(\\d+) +(\\d+(\\.\\d+)*).*$", "$1 $3 $4 $5")

You can see the result in the Regex Extractor node

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.