Identifying Duplicate Entries with a varying ending suffix

Rokank · March 4, 2020, 10:40pm

I have a column of invoices where duplicates are created with abbreviations A, B, and sometimes A2.
I want Knime to help create a second column where it identifies the original invoice.

For example:
Column 1:
ABC12345
ABC12345A
ABC12345B
EFG12345
EFG12345A
EFG12345A2

Knime Output:
Column 1: Column 2
ABC12345 : ABC12345
ABC12345A : ABC12345
ABC12345B : ABC12345
EFG12345 : EFG12345
EFG12345A : EFG12345
EFG12345A2 : EFG12345

Thanks in advance!

elsamuel · March 4, 2020, 11:52pm

Hi @Rokank

If the stem of the invoice numbers is always the same length (apart from the A/B/A2, so in your examples here, the length is 8) then you can use the substring function in the String Manipulation node to extract the leftmost 8 characters and leave behind the A/B/A2.

arofab · March 5, 2020, 9:19am

Hi @Rokank
If your invoice ids are of a fixed length (excluding the duplicate suffix) or follow a pattern like a set of letters followed by a set of numbers, optionally followed by the duplicate suffix (A/B/A2…), then it could be possible with a regex.

If this is not the case, it can be difficult to clearly define what is a duplicate.

Rokank · March 5, 2020, 2:33pm

Hi,

The Invoice IDs are not fixed length. They vary depending on the vendor.
There could be “JKL1234567” and “JKL1234567A”.

However, there would not be “JKL12345”

ipazin · March 5, 2020, 3:19pm

Hi there @Rokank,

if you are aware of all suffixes that can happen you can use Regex Split node with following syntax that could use improvement as I’m far from regex expert

(^[A-Z0-9.]+)([A2]{2}|[AB]{1})

Br,
Ivan

system · September 4, 2020, 3:19am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.