Hello everyone,
I would like to extract a certain data set from a PDF (converted to text with the Tika Parser) using the Regex Extractor.
Text from the Tika Parser:
...Marketing (regional) ___Investitionen/Aufwand 13.750.875,14 € 6,0% ...
My regex:
(?:\(regional\) ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
This does not work.
If I do this instead:
(?: ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
it works, but I need the additional "(regional)" qualifier.
What is the correct expression?
Many thanks!
Hi @sabsab,
Which node are you trying to use the regex with? The syntax sometimes has to be modified slightly to work with certain nodes.
Also, can you tell us which part of the text you actually want to capture? You have only said “a certain data set” but not told us what that is.
It is easier to assist if you can show input and expected output along with a description of the requirement.
On the face of it, the regex that you say doesn’t work has more spaces between (regional) and ___Investitionen than your data does, so is that possibly part of the problem?
Edit: I wonder if this is what you want (e.g. using String Replacer), but I don’t actually know what result you are after:
.*(?:\(regional\) ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
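For readers wondering why the leading `.*` matters, here is a minimal sketch in Python rather than KNIME. It assumes that a "find"-style node behaves like `re.search` and that a node requiring the pattern to cover the whole cell behaves like `re.fullmatch`; that mapping is my assumption, not something confirmed in this thread. Note that Python spells named groups `(?P<Invest>...)`, whereas the Java syntax used by KNIME is `(?<Invest>...)`.

```python
import re

# Sample line as posted in the thread (single spaces assumed).
text = "...Marketing (regional) ___Investitionen/Aufwand 13.750.875,14 € 6,0% ..."

# Same pattern as in the thread, with Python's (?P<name>...) group syntax.
core = r"(?:\(regional\) ___Investitionen/Aufwand\s)(?P<Invest>(?:(?!___).)*)"

# "Find"-style matching: the pattern may start anywhere in the text.
found = re.search(core, text)

# Whole-string matching: fails unless a prefix consumes the leading text.
whole_without_prefix = re.fullmatch(core, text)
whole_with_prefix = re.fullmatch(".*" + core, text)

print(found.group("Invest"))
print(whole_without_prefix)
print(whole_with_prefix.group("Invest"))
```

On this sample the search and the prefixed whole-string match capture the same value, while the unprefixed whole-string match returns nothing.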
Hi @takbb,
I am using the Regex Extractor for this with KNIME 4.7.
Sorry, the different spaces only crept in here in the post through copying back and forth.
The original text from Tika is this (it's a table):
... Marketing (regional) ___Investitionen/Aufwand 13.750.875,14 € 6,0%
The expression is this:
(?:Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
The result is
Invest
13.750.875,14 € 6,0% (it's OK, it works)
The problem is
There are too many items with “Investitionen/Aufwand” in the document, so I need to narrow the match further with “(regional)”.
My idea was
(?:\(regional\) ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
but this does not work; it no longer produces a match.
Best regards
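To illustrate the ambiguity described above, here is a small Python sketch. The two-row excerpt is hypothetical: only the "(regional)" row comes from this thread, and the first row and the `___` separators between rows are invented for illustration. Python needs `(?P<Invest>...)` where KNIME's Java syntax uses `(?<Invest>...)`.

```python
import re

# Hypothetical two-row excerpt; only the "(regional)" row is from the thread.
text = (
    "Vertrieb (national) ___Investitionen/Aufwand 1.000,00 € 1,0% "
    "___Marketing (regional) ___Investitionen/Aufwand 13.750.875,14 € 6,0%"
)

# Broad pattern: matches every "Investitionen/Aufwand" row.
broad = re.compile(r"(?:Investitionen/Aufwand\s)(?P<Invest>(?:(?!___).)*)")

# Narrow pattern: additionally requires the "(regional)" label just before.
narrow = re.compile(
    r"(?:\(regional\) ___Investitionen/Aufwand\s)(?P<Invest>(?:(?!___).)*)"
)

# The tempered dot (?:(?!___).)* stops right before the next "___".
broad_hits = [m.group("Invest").strip() for m in broad.finditer(text)]
narrow_hits = [m.group("Invest").strip() for m in narrow.finditer(text)]

print(broad_hits)
print(narrow_hits)
```

With single plain spaces, as in this sketch, the narrow pattern selects only the regional row, which is the behaviour being asked for.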
Hi @sabsab,
Using your regex:
(?:\(regional\) ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
when I plug this into Regex Extractor on 4.7.8, it appears to work ok, so I’m not sure what doesn’t work:
This is what I see:
What result are you getting?
My result: a match with this:
(?:___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
but no match with this:
(?:\(regional\) ___Investitionen/Aufwand\s)(?<Invest>(?:(?!___).)*)
The only difference I see is that my entire text is in just one row. Maybe that’s the problem?!
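The thread does not establish the cause, but one hypothesis worth testing is that the gap between "(regional)" and "___Investitionen" in the real Tika output is not a single plain space (for example a non-breaking space, a tab, or a line break). The Python sketch below uses invented text with such a gap to show how to inspect the characters and how a whitespace-tolerant pattern behaves. The class `[\s\u00A0]` lists U+00A0 explicitly because Java's `\s` is ASCII-only by default; whether the KNIME node enables Unicode character classes is not something I know.

```python
import re

# Hypothetical Tika output: a non-breaking space (U+00A0) and a newline
# separate "(regional)" from "___Investitionen" instead of one plain space.
text = "Marketing (regional)\u00a0\n___Investitionen/Aufwand 13.750.875,14 € 6,0%"

# Pattern from the thread (Python group syntax): expects exactly one space.
strict = r"\(regional\) ___Investitionen/Aufwand\s(?P<Invest>(?:(?!___).)*)"

# Tolerant variant: any run of whitespace, with U+00A0 listed explicitly.
tolerant = (
    r"\(regional\)[\s\u00A0]*___Investitionen/Aufwand[\s\u00A0]+"
    r"(?P<Invest>(?:(?!___).)*)"
)

strict_match = re.search(strict, text)
tolerant_match = re.search(tolerant, text)

# Dump the characters around "(regional)" with escapes made visible.
start = text.find("(regional)")
gap_repr = ascii(text[start:start + 16])

print(gap_repr)
print(strict_match)
print(tolerant_match.group("Invest"))
```

If the escaped dump of the real text shows anything other than a plain space in the gap, the tolerant form (with `(?<Invest>...)` in KNIME) is a candidate fix.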
Hi @sabsab, could you perhaps upload a small workflow containing some anonymized sample text, as this will make it easier for people to assist?
Hi @takbb,
Here is my example. I don’t know where the error lies.
Best regards
regex_demo.knwf (19.4 KB)