String Splitter (Regex): Multiline Extraction does not work

mwiegand · August 1, 2024, 6:21am

Hi,

Chances are I made a mistake, but I cannot spot it. Since everything looks plausible to me and I double-checked the regex in an editor, I am raising this as a bug.

Despite the RegEx matching and the node being set up to extract either a list or multiple rows, it only ever returns the first result.

RegEx
^([^\n]+\s\d{13}[^\n]+)\n
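For comparison outside KNIME, here is a minimal Python sketch of how this pattern behaves on multi-line input. The sample text is made-up ASCII stand-in data, not the PDF text, and Python's `re` module is only assumed to be close enough to the node's regex engine for this pattern. It shows two things: `^` matches at every line start only when a multiline flag is set, and a find-all style call is needed to get more than the first hit.

```python
import re

PATTERN = r"^([^\n]+\s\d{13}[^\n]+)\n"

# Hypothetical stand-in for the PDF text: a header line, then rows that
# each contain a 13-digit ID followed by a running number.
SAMPLE = (
    "header line 2007\n"
    "0 0 fail 74 76 name one  1624122007001 1\n"
    "89.71 628 pass 51 70 name two  1624122007002 2\n"
    "\n"
    "0 0 fail 36 28 name three  1624122007003 3\n"
)

# Without a multiline flag, ^ only matches at the very start of the input,
# and the header line contains no 13-digit number.
without_flag = re.findall(PATTERN, SAMPLE)

# With the multiline flag, ^ matches after every line break.
with_flag = re.findall(PATTERN, SAMPLE, flags=re.MULTILINE)

# A single search stops at the first match, even with the flag set.
first_only = re.search(PATTERN, SAMPLE, flags=re.MULTILINE)

print(len(without_flag), len(with_flag), first_only.group(1))
```

If the node behaves like the single `search` call, seeing only the first row would be expected regardless of the pattern; that is an assumption to check against the node's settings, not a confirmed diagnosis.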

Data (Arabic Language)
Source: Assistance Needed: Extracting Specific Highlighted Text from Arabic PDF Files

4�ϰϟ·�1��ϥϣ�ΔΣϔλϟ΍�ϡϗέ 2007  : ΔγέΩϣϟ΍�ίϣέ ΓέλΑϟ΍  : ΔϳέϳΩϣϟ΍

  Ε΍ίϳϣΗϣϠϟ�ϑϼϳϻ΍�Δϳϭϧ Ύ Λ     : Δ˰˰˰˰˰˰γέΩϣϟ΍
΍ ϲϣϠϋ  : ωέ˰˰˰ϔϟ΍

2024/2023 ϝϭϷ΍ έϭΩϟ
ϝΩόϣϟ΍ ωϭϣΟϣϟ΍ ΔΟϳΗϧϟ΍ ΕΎϐϠϟ΍ ˯Ύϳίϳϔϟ΍ ˯Ύϳϣϳϛϟ΍ ΕΎϳοΎϳέϟ΍ ˯ΎϳΣϻ΍ ΔϳίϳϠϛϧϻ΍ ΔϳΑέόϟ΍ Δϳϣϼγϻ΍ Ώ˰˰˰˰˰˰ϟΎρϟ΍�ϡγ΍ ϲϧΎΣΗϣϻ΍�ϡϗέϟ΍ ϝγϠγΗ

0 0 Ωϳόϣ έϔλ έϔλ 74 76 έϔλ 95 84 94 έΑΎΟ�έϭλϧϣ�ϝϳΑϧ�ϑΎϳρ΃  1624122007001 1
89.71 628 ΢ΟΎϧ 51 70 79 92 96 99 95 97 ϥϳγΣϟ΍�ΩΑϋ�ΩΣ΍ϭϟ΍�ΩΑϋ�ϥϳγΣϟ΍�ΩΑϋ�˯ϻ΁  1624122007002 2

0 0 Ωϳόϣ 36 28 37 50 42 78 86 ϡ�ύ ϑϠΧ�ϥΎΣέϓ�˯ϼϋ�ϡΎϳέ΍  1624122007003 3
69.14 484 ΢ΟΎϧ 50 50 50 70 50 87 84 93 ω΍ίϫ�ϥϳγΣϟ΍�ΩΑϋ�Ω΋΍έ�Ϫϳ΍  1624122007005 4

0 0 Ωϳόϣ ϡ 87 ϡ 96 93 100 94 100 ϲΗΑγ�ϥΎϧΩϋ�ΩϟΎΧ�ϥΎΑ  1624122007006 5
0 0 Ωϳόϣ 72 100 ϡ 100 100 97 99 100 ϲΑϧϟ΍�ΩΑϋ�ϙϟΎϣ�ϡίΎΣ�ϝϭΗΑ  1624122007007 6
0 0 Ωϳόϣ 56 ϡ 55 έϔλ 83 94 75 98 ΩΣ΍ϭϟ΍�ΩΑϋ�ϡυΎϛ�έΩϳΣ�ϥϳϧΑ  1624122007008 7
99 693 ΢ΟΎϧ 89 100 98 100 95 100 100 100 έϛγϋ�Ϫϳρϋ�ϲϠϋ�ϪϧΎϣΟ  1624122007009 8
0 0 Ωϳόϣ 80 έϔλ 91 100 έϔλ 99 98 100 Ωϭ΍Ω�αϳϗ�ϲλϗ�ϪϧΎϣΟ  1624122007010 9
0 0 Ωϳόϣ 86 έϔλ 61 έϔλ 84 99 έϔλ 97 ϥϳγΣϟ΍�ΩΑϋ�έΎΑΟϟ΍�ΩΑϋ�έΎϣϋ�Ύϳϟ΍Ω  1624122007011 10
0 0 Ωϳόϣ 71 έϔλ έϔλ 100 96 98 91 100 ϰϔρλϣ�ΩϟΎΧ�Ωϳϟϭ�˯ΎϋΩ  1624122007012 11
0 0 Ωϳόϣ έϔλ έϔλ έϔλ 98 94 97 96 100 ϲϠϳϓ�ϲϘΗ�υϓΎΣ�ϝϳΣ΍έ  1624122007013 12
0 0 Ωϳόϣ ϡ έϔλ 96 100 91 98 98 100 ϥϳγΎϳ�ΩΟΎϣ�ΙέΎΣ�Ϫϳϗέ  1624122007014 13
0 0 Ωϳόϣ 73 έϔλ 95 100 94 95 96 100 ϥϭϳΧ�Ϫϔρϛ�ϭΑ΍�ϡγΎΑ�ϥ΍ϭέ  1624122007015 14
0 0 Ωϳόϣ 73 έϔλ 91 95 98 100 100 98 ϲρϼΧ�ϱΩϳέϛ�ϥΎϧΩϋ�ϥ΍ϭέ  1624122007016 15
0 0 Ωϳόϣ έϔλ έϔλ έϔλ έϔλ έϔλ 94 έϔλ έϔλ ϲϠϋ�ϡγΎΟ�ΩϣΣϣ�ϥ΍ϭέ  1624122007017 16
0 0 Ωϳόϣ 82 ϡ έϔλ έϔλ 94 100 100 100 αϧϭϳ�ϥγΣ�ϰοΗέϣ�ϥ΍ϭέ  1624122007018 17
0 0 Ωϳόϣ 79 ϡ 82 έϔλ ϡ ϡ 93 ϡ έϳοΧ�ϡϟΎγ�α΍έΑϧ�ϡΎϳέ  1624122007019 18
0 0 Ωϳόϣ 81 έϔλ 81 95 90 98 96 97 ΩϭϣΣϣ�έϛΎη�ΩϭϣΣϣ�ϡίϣί  1624122007020 19
0 0 Ωϳόϣ 80 έϔλ 80 100 88 100 95 99 ϡυΎϛ�ϡγΎΟ�ΩϣΣ΍�˯΍έϫί  1624122007021 20
0 0 Ωϳόϣ έϔλ έϔλ έϔλ 72 ϡ 95 88 93 ϑΎγϋ�ϝϳϣΟ�έ΋ΎΛ�˯΍έϫί  1624122007022 21
0 0 Ωϳόϣ 55 έϔλ 93 95 97 96 97 100 ϊρΎϛ�ϡΛϳϣ�ϡγΎΟ�˯΍έϫί  1624122007023 22
0 0 Ωϳόϣ 84 έϔλ έϔλ 100 ϡ 95 98 99 ΕϭΑέΑ�ΏγΎΟ�ΩϣΎΣ�˯΍έϫί  1624122007024 23
0 0 Ωϳόϣ 77 έϔλ 92 83 89 97 95 100 έϛΎη�ϡγΎϗ�ௌ�ΩΑϋ�˯΍έϫί  1624122007025 24
0 0 Ωϳόϣ 50 έϔλ ϡ έϔλ έϔλ 85 69 έϔλ ϱ΍Ωϋ�ϡυΎϛ�ϲϠϋ�˯΍έϫί  1624122007026 25

ϥγΣ�ϲΑϳόϟ�έϳϣϷ΍�ΩΑϋ
ίϛέϣϟ΍�έϳΩϣ

![Screenshot 2024-08-01 081811|690x414](upload://lxMVmPPFMAI0NjCesZHFELehY6e.png)


Best
Mike

mlauber71 · August 1, 2024, 6:59am

@mwiegand have you checked whether double quotes might be needed?

mwiegand · August 1, 2024, 7:25am

The PDF has none. However, I believe the cause of the issue might be rooted in the character processing, as something isn’t rendered properly.

I am working on a sample workflow right now to confirm my suspicion.

mlauber71 · August 1, 2024, 7:37am

@mwiegand no, the regex code in KNIME often needs double escaping: the \d has to be \\d …
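To make the double-escaping point concrete, here is a small Python sketch. Python is used as a stand-in; whether the KNIME node's pattern field adds a string-literal layer is not established in this thread, so treat that part as an assumption. In source code, `\\d` inside a normal string literal is just the two characters `\d` by the time the regex engine sees it. If the engine itself receives `\\d`, it looks for a literal backslash followed by `d` instead of a digit.

```python
import re

text = r"id 42 and a literal \d token"

# In source code, "\\d" is a string literal for the two characters \d.
from_literal = "\\d"
assert from_literal == r"\d"

# The engine receives \d: this matches digits.
digits = re.findall(from_literal, text)

# The engine receives \\d (written as four backslash characters in a
# literal): this matches a backslash followed by the letter d, not digits.
literal_backslash_d = re.findall("\\\\d", text)

print(digits, literal_backslash_d)
```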

mwiegand · August 1, 2024, 7:51am

Got you. But that would mean that the tested RegEx wouldn’t extract anything at all, yet it does. I just tested that idea, and double escaping is not doing the trick. The node description only mentions double escaping if the literal character should be used like here:

Here is a test workflow that produced even stranger results.

The test data from the Arabic PDF only ever extracted the first match, while the second example only extracted the last match, “8”.
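The second example is not shown in this thread, so the following is only one possible explanation, sketched in Python with made-up input. When a capturing group sits under a quantifier, the group keeps just its final iteration, which can look like only the last match being extracted.

```python
import re

digits = "1 2 3 4 5 6 7 8"

# A capturing group under a quantifier retains only its final iteration.
repeated = re.search(r"(?:(\d)\s?)+", digits)
last_kept = repeated.group(1)

# Finding each match separately returns all of them.
all_matches = re.findall(r"\d", digits)

print(last_kept, all_matches)
```

Whether the test workflow's pattern has this shape is unknown; the sketch only illustrates a regex behavior that produces the same symptom.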