Extract Specific # of Characters Within String After Appearance of Specific Letter (inclusive)

wisemanleo · July 19, 2022, 4:15pm

I have a column that contains rows of mixed double/strings that looks like this:
950.00 ND046693R94660A
400.00 ND056293R94540D

From the first row, I’m hoping to extract R946, and then the last letter, A, to make R946A.
From the second row, I’m hoping to extract R945, and then the last letter, D, to make R945D.

The R is constant with every row, and the extracted quantity of characters is always 4 plus the last letter.

Have wrestled with string manipulation, cell splitter, and regex, but haven’t been able to crack this nut yet.

Any help would be appreciated.

Thank you.

lelloba · July 19, 2022, 4:30pm

Hello @wisemanleo ,

if the strings have this structure they are regular and this solution can work for you:

Have a nice evening,
RB

bruno29a · July 19, 2022, 4:35pm

Hi @wisemanleo with just 2 sample rows, and them being similar in structure, does not give any idea of how the data looks like.

There are a few ways to do this, but some would depend on the data structure.

If the positions are always fixed (same amount of characters, and the R is always at the same position), you can directly use the substr() function.

Otherwise, you could still use substr() but with combination of indexOf() function to get the position of R, that is as long as you only have ONE R per row.

Ultimately, Regex will do.

qqilihq · July 19, 2022, 4:36pm

Hi wiesemanleo,

As you’re already using the Palladian nodes, here’s a solution using the Regex Extractor

(?<start>R\w{3}) # letter R followed by three digits
\d*              # arbitrary number of digits
(?<end>\w)       # one letter

As there’s characters in between which are not relevant, I’m extracting the prefix (e.g. R945) and the suffix (e.g. D) separately. Then you can join them into one string using the String Manipulation:

join($start$,$end$)

Result:

Workflow:

bruno29a · July 19, 2022, 4:51pm

The substr() is probably much faster than regex, as regex would need do to a search. But the direct use of substr() only works if the position is fixed.
This (via String Manipulation) join(substr($column0$, 15, 4), substr($column0$, 21)) should do

lelloba · July 19, 2022, 4:54pm

6n91ql

wisemanleo · July 19, 2022, 4:57pm

You guys are all awesome and hilarious with the meme as well.

The position of R appears fixed, but I wanted to find it as opposed to call it by position. The length of the string starting with R is fixed, however.

@qqilihq’s solution worked great! This is the one I tried first, but @lelloba appears to work as well!

bruno29a · July 19, 2022, 5:00pm

Hi @wisemanleo , it’s definitely good to know how not to depend on the position. However if the position is fixed, the substr() function will do the job much faster than regex, as substr() is just going at the exact position, while regex needs to first do a search through the string each time.

Both will work, but substr() will be the most efficient

wisemanleo · July 19, 2022, 5:18pm

Perfect, thanks @bruno29a. The workflow overall is simple enough to allow me to run both, which I’ll do for learning as well as to explore the tipping point at which the data size becomes large enough where the processing speed begins to diverge significantly.

bruno29a · July 20, 2022, 5:46am

No problem @wisemanleo , and as I said, it’s good to know how to be able to do this via different solutions - that’s the common thing about Knime, there are almost always different ways to reach a solution.

Keep in mind though, if you do have more than 1 R, only the substr() with fixed position will work.

For example:
950.00 NR046693R94660A

The substr() solution will give you the expected value. Regex won’t (you can test it for yourself), unless the Regex also include some positional info.

Happy Kniming.

system · July 27, 2022, 5:46am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.