extracting substring from searchstring

Dear Team,

I have to extract substring from string and when extract is numeric I have to store it into another column

Table
column value
5 abc
5abc
abc
def 5 abc
def d abc
def d5 abc

Searchstring = “abc”, result = 2 letters before searchstring startposotion.

Table with expected results anways searching for “abc”
5 abc gives 5
5abc gives 5
abc gives missing, no result
def 5 abc gives 5
def d abc gives missing, because result is not numeric
def d5 abc gives missing, because result is not numeric

I tried with string manipulation node / string manipulation multiple columns, but unfortuanlety without success.

Any idea high appreciated.

BR,

Heinz

Hi @Heinz , I think String Splitter (regex) should work for you, based on your sample data, with the following regex:

.*(?<!\S)(\d)(?=\s?abc).*

I uploaded a very slightly modified version of your question to chatgpt, and then had a conversation to refine the regex.

You can read the conversation here, which also gives an explanation of the regex.

What I’m not entirely sure about though is that you refer to the 2 characters before abc, but the examples you have are only for a single digit. Are you also wanting 2 digits to be found,

i.e. what would you want to return for each of the following?:

column value
xyz12abc
xyz 12 abc
xyz 12abc
xyz12 abc
1 Like

Thanks for your assistance,

xyz12abc should give 12
xyz 12 abc should give 2
xyz 12abc should give 12
xyz12 abc should give 2

I think I have to integrate searchstring, because I only want to have the digits infront searchstring to avoid running into trouble when

xyz 12 abc 23 def

occurs.

In VBA the solution was:
pos = instr(column_value,“abc”)
if pos > 0 then
result = trim(mid(column_value, pos-2,2))
end if
if isnumeric(result)=false then
result = 0
endif

Hi @Heinz, I’m not convinced that the VBA code you have given would return null for “def d5 abc”, since the trim of the two chars to the left of abc would give “5”, and there is nothing in that vba code to check if the 5 is preceded by a non-numeric.In your examples you say it should return “missing” rather than 5.

I would also never have guessed your expected outcomes for the “12” values based on your earlier example, but I can try to adapt for the required outcomes :wink:

If you are ok doing this with some code you can achieve it with String Manipulation as follows:

toInt(
regexMatcher(strip(substr($column value$, indexOf($column value$,"abc")-3,
		indexOf($column value$,"abc")>2?3:indexOf($column value$,"abc"))),
		".?\\d{2}|\\d{1}"
		)=="True"
?strip(substr($column value$, indexOf($column value$,"abc")-2,
		indexOf($column value$,"abc")>1?2:indexOf($column value$,"abc"))
		)
:toNull("")		

This applies all the rules that I’ve so far ascertained from your examples. It looks at the three characters prior to “abc” to determine that they are only either whitespace or numeric, and then if they are, it returns the trimmed two characters to the left of “abc”

If you want to try the new Expressions node (KNIME 5.3 onwards), the following expression is the equivalent of the above String Manipulation. You can see significant similarities in the structurem although the keywords have changed, and the other important difference is that the index of characters in a string start at 1 for Expressions, whereas they start at 0 for String Manipulation, so you will see there is a small adjustment:

parse_int(
    if(regex_match(strip(substring($["column value"], find($["column value"],"abc") - 3,
                            if(find($["column value"],"abc")>3, 3, find($["column value"],"abc")-1) ) ),
            ".?\\d{2}|\\d{1}"
            ) 
    ,strip(substring($["column value"], find($["column value"],"abc") -2,
            if(find($["column value"],"abc")>2, 2, 1) )
            )
    ,empty_to_missing("")	
    )
)

For a “no-code” version, an alternative is to use StringSplitter (Regex) with the following pattern:

.*?(\d{2})abc.*|.*?\d(\d)\sabc.*|.*\b(\d)\s?abc.*

This then returns three columns (one for each captured group of digits, according to the pattern found).

The three columns can then be merged using a Column Aggregator or a pair of Column Merger nodes.

In the attached workflow, I have created a flow variable called “pattern” in the Variable Creator, and this is then included within the code and the regex using the variable value rather than the “abc” literal, to make it more flexible.

Find numeric in regex.knwf (88.8 KB)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.