Information Extraction from Filename

regex
java
#1

forumnFeed.knwf (65.5 KB)

Hi guys, I have been struggling to extract 5-digit numbers from filenames. It seems to work in some cases and breaks in others.

0 Likes

#2

Hi @shubhamss,

I think the Pattern.compile entry is wrong: \b matches a word boundary, but the underscore (_) is itself a word character (word characters: [a-zA-Z_0-9]), so there is no word boundary between an underscore and a digit. I’m using \D (a non-digit character) outside the group brackets to delimit the 5-digit matches. I’m also using a String list to build the string array.
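The difference between the two patterns can be sketched in plain Java. This is only an illustration, not the code from the attached workflow: the class name, the helper findAll and the sample filename are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FiveDigitDemo {

    // Collects capture group 1 of every match into a String list.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String name = "report_12345_final.pdf"; // hypothetical filename

        // '_' and '1' are both word characters, so \b finds no boundary here.
        System.out.println(findAll("\\b(\\d{5})\\b", name)); // prints []

        // \D accepts the underscores as delimiters; the group keeps only the digits.
        List<String> numbers = findAll("\\D(\\d{5})\\D", name);
        System.out.println(numbers); // prints [12345]

        // Convert the list to a String array if an array is needed downstream.
        String[] asArray = numbers.toArray(new String[0]);
        System.out.println(asArray.length); // prints 1
    }
}
```

Collecting into a list first avoids having to know the number of matches before the array is created.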

Bildschirmfoto von 2019-07-09 22-02-30

The following workflow is a little bit shorter

Bildschirmfoto von 2019-07-09 22-01-53

and the String to Number node only matches the Split Value columns

Bildschirmfoto von 2019-07-09 22-02-07

I hope it helps
Andrew

3 Likes

#3

Thanks, Andrew. It now seems to pull the 5-digit numbers from the string, but some unwanted symbols are still left in the data. Also, I am not using KNIME 4.0.

Here, ArrInt is the size of each potential 5-digit string. The code used is as follows:

0 Likes

#4

Hi @shubhamss,

You are using a Pattern.compile entry similar to the one in your first workflow.

Bildschirmfoto von 2019-07-10 21-20-29

  • “(\\D\\d{5}\\D)” matches every 5-digit string together with 1 non-digit character before and after it, and the group captures all of that
  • so m.group(0) returns strings like "_12345_", " 12345 ", "#12345 " and so on: the 5 digits plus 1 non-digit character on each side
  • Put the round brackets around the 5-digit definition only
  • “\\D(\\d{5})\\D” still matches every 5-digit string with 1 non-digit character before and after it
  • m.group(0) still returns strings like "_12345_", " 12345 ", "#12345 " and so on
  • but m.group(1) returns just your 5-digit string, like "12345", without the non-digit characters before and after it
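A minimal sketch of the group(0) versus group(1) difference, assuming a made-up filename (the class name and the helper firstMatch are also just for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    // Returns {group(0), group(1)} for the first match, or null if nothing matches.
    static String[] firstMatch(String input) {
        Matcher m = Pattern.compile("\\D(\\d{5})\\D").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = firstMatch("scan#12345 v2.tif"); // hypothetical filename

        // group(0) is the whole match, including the delimiter on each side.
        System.out.println("[" + r[0] + "]"); // prints [#12345 ]

        // group(1) is only what the round brackets enclose.
        System.out.println("[" + r[1] + "]"); // prints [12345]
    }
}
```

One thing to keep in mind: \D needs a real character on each side, so a 5-digit number at the very start or end of the string is not matched by this pattern. If that case can occur, a lookaround variant such as (?<!\d)\d{5}(?!\d) is a possible alternative, since it checks the neighbours without consuming them.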

Best regards
Andrew

3 Likes

#5

Thanks Andrew, that works. One last question: an email column is displayed like this:
image

but when I copy the cell value and paste it into a Rule Engine, it gives me something like this:
image

Would you happen to know why?

0 Likes

#6

It looks like a character set mismatch … US-ASCII uses 1 byte per character, while UTF-8 uses 1 to 4 bytes and UTF-16 uses 2 or 4 bytes per character …
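To see how the same text takes a different number of bytes depending on the character set, here is a small sketch. The sample strings and the helper byteLength are assumptions for illustration; UTF_16LE is used so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {

    // Number of bytes the string occupies in the given character set.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String plain = "a@b.com";         // 7 characters, all ASCII
        String accented = "\u00e9@b.com"; // 7 characters, first one is e with acute accent

        System.out.println(byteLength(plain, StandardCharsets.US_ASCII)); // prints 7
        System.out.println(byteLength(plain, StandardCharsets.UTF_8));    // prints 7
        System.out.println(byteLength(plain, StandardCharsets.UTF_16LE)); // prints 14

        // The accented character needs 2 bytes in UTF-8.
        System.out.println(byteLength(accented, StandardCharsets.UTF_8)); // prints 8
    }
}
```

Text that is written in one character set and read in another is a common source of stray symbols like the ones in the screenshot.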

0 Likes

#7

Hi there @shubhamss,

To avoid copy-pasting, maybe you can use the Rule Engine (Dictionary) node instead ;)

Br,
Ivan

1 Like

#8

I was just copy-pasting as a way to check why cells containing “@” and “.com” were not being extracted. The reason is that there are unwanted symbols in the data, possibly due to different character sets being in place.
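One way to check for such symbols is to list every character outside printable ASCII by its code point. This is a rough sketch under the assumption that the unwanted symbols are invisible or non-ASCII characters; the sample cell value, the class name and both helper methods are hypothetical.

```java
class HiddenCharDemo {

    // Lists each character outside printable ASCII (0x20 to 0x7E) as U+XXXX.
    static String describeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E) {
                sb.append(String.format("U+%04X ", cp));
            }
        });
        return sb.toString().trim();
    }

    // Drops everything outside printable ASCII. This also removes
    // legitimate accented or non-Latin letters, so use it with care.
    static String keepPrintableAscii(String s) {
        return s.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        // Hypothetical cell value with a zero-width space and a no-break space.
        String cell = "john\u200B@example.com\u00A0";

        System.out.println(describeNonAscii(cell));   // prints U+200B U+00A0
        System.out.println(keepPrintableAscii(cell)); // prints john@example.com
    }
}
```

Listing the code points first shows what is actually in the cell before anything is removed.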

1 Like

#9

Hi,

OK, I get it now. Did you solve it or find a workaround? How did you obtain that column, by the way?

Br,
Ivan

0 Likes