Regex based on blanks and forward slash

ecsdaehn · October 7, 2016, 5:35pm

Hi All,

I need advice on splitting out a number and "/" combination that is not uniformly formatted.

In a database I have cells that always contain two numbers, separated by a slash and always embedded in other text.

This number combination should be extracted to a separate column, ideally with a regex split.

Unfortunately, the number parts do not have a uniform format, i.e. their length varies.

Sample values:

OTHERTEXT 20/1156 OTHERTEXT

OTHERTEXT 1506/02 OTHERTEXT

OTHERTEXT 4000/00001 OTHERTEXT

The bits I want to extract are always surrounded by a " " (tab).

I tried lots of combinations with wildcards, something like .*(" ") .* ("/").*(" ").*, but each attempt either had a syntax error or did not work and returned no result.

Does anybody have an idea how to achieve this? (It can also be done with a different node.)

Thanks a lot,

Jürgen

ecsdaehn · October 11, 2016, 2:56pm

Hi All,

A quick update on this one: I was able to solve it, maybe not in the most elegant way, but it worked.

I used the Regex Split node as follows:

- Looked up the scenarios possible in the source data, e.g. 00/00, 000/00, 0000/00 etc.

- Defined the conditions in the regex node for all scenarios, using the symbol for "or" ("|").

For each scenario defined, KNIME appended one column "split_X".

The output data was then populated in the respective appended column wherever one of the conditions was met. E.g. data in the 00/00 format ended up in the appended column "split_1" and all the rest was empty; data in the 000/00 format ended up in "split_2", while "split_1" and all the rest were empty.

Then I used a Column Combiner to concatenate all columns named "split_n" into one (using the regex "split_.*" there, which automatically picks up all split columns), added a String Replacer as the next step to remove the "?" symbols that stand for blanks, and finally a Column Filter to get rid of the split columns again. All of this is wrapped into a metanode.

Jürgen

knime_number_split_out.png
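
For readers who want to see the idea of the workaround above outside of KNIME, here is a minimal sketch in Python. It is not the actual workflow: the three enumerated formats, the pattern and the function name `split_and_combine` are assumptions made for illustration, and Python's `re` module stands in for the regex engine of the Regex Split node. Each alternative gets its own capture group (comparable to one "split_X" column), groups that did not match stay empty, and the non-empty ones are joined into a single value.

```python
import re

# Hypothetical enumeration of formats, one capture group per scenario,
# joined with "|" (or). The formats are taken from the three sample values.
SCENARIOS = re.compile(
    r"\s(?:(\d{2}/\d{4})|(\d{4}/\d{2})|(\d{4}/\d{5}))\s"
)

def split_and_combine(text):
    """Return the matched number pair, or None if no scenario applies."""
    match = SCENARIOS.search(text)
    if match is None:
        return None
    # One slot per scenario; slots for scenarios that did not match are None,
    # similar to the empty "split_X" columns described above.
    groups = match.groups()
    # Combine the slots and drop the empty ones.
    return "".join(g for g in groups if g is not None)

for sample in (
    "OTHERTEXT 20/1156 OTHERTEXT",
    "OTHERTEXT 1506/02 OTHERTEXT",
    "OTHERTEXT 4000/00001 OTHERTEXT",
):
    print(split_and_combine(sample))
```

The drawback of this style is that every new length combination in the data needs another alternative in the pattern.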

thor · October 12, 2016, 10:36am

The "problem" with .+ and especially .* in regular expressions is that they are greedy: they try to match as many characters as possible. A .* eats up as many characters as it can, and if you have further .* in the regex, they will match whatever remains, which may be the empty string. Moreover, if you only need part of the string, I recommend using the String Replacer node and letting it create a new column instead of using the splitter. For your example, something like ".*?\s+(\d+/\d+)\s+.*" as search pattern and "$1" as replacement should do the job.
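
The suggested pattern can be tried outside of KNIME with a short Python sketch. This is only an illustration under some assumptions: Python's `re` module is used instead of the Java regex engine behind KNIME, so the captured group is read with `group(1)` rather than written as "$1", and the helper name `extract` and the extra input with two number pairs are made up for the demonstration. The second pattern is the same expression with a greedy `.*` at the start, to show the difference the `?` makes.

```python
import re

# Pattern from the reply above: lazy prefix, whitespace, digits/digits,
# whitespace, then the rest of the string.
LAZY = re.compile(r".*?\s+(\d+/\d+)\s+.*")
# Same pattern with a greedy prefix, for comparison only.
GREEDY = re.compile(r".*\s+(\d+/\d+)\s+.*")

def extract(pattern, text):
    """Return the captured 'digits/digits' part, or None if nothing matches."""
    match = pattern.fullmatch(text)
    return match.group(1) if match else None

samples = [
    "OTHERTEXT 20/1156 OTHERTEXT",
    "OTHERTEXT 1506/02 OTHERTEXT",
    "OTHERTEXT 4000/00001 OTHERTEXT",
]
for sample in samples:
    print(extract(LAZY, sample))

# Hypothetical input with two number pairs: the lazy prefix stops at the
# first pair, the greedy prefix swallows it and leaves only the last one.
two_pairs = "A 1/2 B 3/4 C"
print(extract(LAZY, two_pairs), extract(GREEDY, two_pairs))
```

On the three sample values both variants capture the same text, because each string contains only one number pair; the greedy and lazy prefixes only differ once a string offers more than one possible match.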

system · June 2, 2023, 9:48pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.