Extracting String from a field of text

Hi, I am knew to KNIME and I have a problem that I am trying to solve with KNIME. I have an excel file with 4 columns case number, date opened, account and description. The description column has text that contains VIN numbers. I would like to extract the VIN and create another excel file that has the same columns but also a column called VIN that contains only VIN and nothing else(no other text). The description column from the input file can have cells with multiple VINs in them. So in the final Excel sheet it would be ok and normal to have , for example 4 rows with the same case number, date opened and account values but different VINs (4 VINs) . Not sure how to approach this problem? any suggestions or help would be much appreciated.

Is there possible more than 1 VIN in a column for a given row - or just 1 VIN per row?

Are these North American VINs or EU VINs or ISO 3779 VINs or a mixture or ?

If they’re N.A. VINs, do you need to calculate their validity?

1 Like
  1. Yes it is possible to have more than 1 VIN in a column for a given row.

  2. These are US VINs

  3. I don’t need to check there validity.

The issue is something like this
Original Data
SR-Number Date Opened Description Account
1234 6/17/2019 16:50 Hi, Issue with VINS 1234,5678,32432 xyz

Required output
SR-Number Date Opened VIN Account
1234 6/17/2019 16:50 1234 xyz
1234 6/17/2019 16:50 5678 xyz
1234 6/17/2019 16:50 32432 xyz

Here’s a workflow that does the low hanging fruit version of it; if you don’t understand regular expressions, it would be good to get to learn them for future projects.
adi’s vins.knwf (8.2 KB)

2 Likes

Hi @Adiwakar

I created a workflow and I think it wil work. Have a look ! I’m not familiar with VIN numbers. I took some random numbers from http://randomvin.com/

extract_vin.knwf (28.6 KB)

More Nodes, less Java like @quaeler :upside_down_face:

gr. Hans

2 Likes

@HansS and @quaeler Thank you both for your inputs. That was helpful. I see the logic that both of you were using and I have incorporated with the rest of my workflow.

2 Likes

@quaeler The Java snippet works well however, there seemed to be a few outliers, situations in which the Vin was not extracted.

Scenario 1:
Man Reed/114026/1G1ZD5ST9KF131503/

Scenario 2:
1GCHSBEA8K1211793 Not Found
1GCTRCQA9K1216744
1GCTRCQA9K1213452

Scenario 3:
VIN: 1GCTRCQA9K1216744.
VIN:1GCTRCQA9K1216744

Not sure why they were not picked up

The regexs are defined to have the VINs be bounded by spaces or commas, and also line end and line start (i thought) - so Scenario 2 is surprising… you could play fast and loose and change them to match any 17 length span of numbers or letters - which would only then fail if you had text that was not a VIN but had more than 17 letters/numbers - Pattern.compile("[a-zA-Z\\d]{17}") or you could change the boundary to [^a-zA-Z\d] instead of the space or comma ([\\s,]) - but this boundary being enforced would still need ORing (|) to include the cases in which the VIN borders the start or end of the string, whereas the “take all 17-length things” would not.

1 Like

Do you mean something like this [\s,][a-zA-z\d]{17}[\s,]|[\s,][a-zA-z\d]{17}$|^[a-zA-z\d]{17}[\s,]|^[a-zA-z\d]{17}$|[^a-zA-Z\d][a-zA-z\d]{17}[\s,]

Seems to be working but the data also pulled out names (Full names) that are 17 characters long with no spaces in them.

I tried this too
[\s,][a-zA-z\d]{17}[\s,]|[\s,][a-zA-z\d]{17}$|^[a-zA-z\d]{17}[\s,]|^[a-zA-z\d]{17}$|[^a-zA-Z\d][a-zA-z\d]{17}[^a-zA-Z\d]

but this returned full names and email ids.

Thoughts?

Is there way don’t return a 17 character long string if it does not have a digit? Because US vins will have a few digits for sure. Would this work?

Ya - if you don’t include a boundary condition (beginning of line, end of line, ‘not-a-number-and-not-a-letter’) then you will also capture portions of text runs that are longer than 17 characters.

Give the attached a shot.adi’s vins.knwf (8.6 KB)

ps. make sure that when you write script that need be exact in the forum (like regexs, surround it with backticks (aka backquotes) so that nothing is lost by the forum trying to mark things up - when surrounded by a backtick your text will look like this)

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.