Extract specific data from multiple PDFs

Ankit_smart · September 5, 2020, 8:47am

Hi Team,

Thanks for your support over this period. I have learned many new things, which has increased my overall productivity. For the past 2 weeks I have been trying to extract some data from multiple PDFs, without success, so I did it manually. Now that I have the time, I want to understand how I can extract specific data from a PDF.

I have seen the example pdf_extract.knwf (57.8 KB), which matches my case.

I understand that I need to follow this sequence to get my data.

The problem I have is that I am not able to write the regex to extract my specific data.

I need to extract

Name of the Trust - Page 1
Income of Trust estate - Page 2
Total tax losses carried forward to later income years - Page 3

Link to the PDF file: Blank Test 1.pdf - Google Drive

Can someone please help me create the regex for this? Hopefully, with this I will be able to understand the extraction technique and then add a few more items that I need to extract.

Thanks
Ankit

JanDuo · September 7, 2020, 6:17am

Hi @Ankit_smart
There are online regex testers available. I use https://regex101.com/ often to get my regex working. Maybe this works for you too.

Andrew_Steel · September 7, 2020, 8:13pm

Hi @Ankit_smart,

This regular expression extracts your 3 values:

(?:Name of trust)(?<NameOfTrust>(?:(?!Australian business number).)*)|(?:Income of the trust estate A\s)(?<IncomeOfTrustEstate>(?:[0-9,])*)|(?:Total tax losses carried forward to later income years J[\s.,]*)(?<TotalTaxLosses>(?:[0-9,])*)

Description of the parts, for building further expressions:
(?:Name of trust) - anchor before your value
(?<NameOfTrust> - name of your value
(?!Australian business number) - anchor that terminates your value (if needed)
(?:(?!Australian business number).)*) - matches your value up to the terminating anchor
(?:[0-9,])*) - matches your value without a terminating anchor (digits and thousands separator)
‘|’ - separates your values
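
To try the pattern outside KNIME, here is a minimal Python sketch. It makes two assumptions: Python's re module spells named groups (?P<Name>...) instead of (?<Name>...), so the pattern is adapted accordingly, and SAMPLE_TEXT is made-up text standing in for what a PDF parser might return, not the content of the linked file. The helper name extract_fields is also just illustrative.

```python
import re

# The pattern from this post, with named groups rewritten from (?<Name>...)
# to Python's (?P<Name>...) spelling. No flags are passed.
PATTERN = re.compile(
    r"(?:Name of trust)(?P<NameOfTrust>(?:(?!Australian business number).)*)"
    r"|(?:Income of the trust estate A\s)(?P<IncomeOfTrustEstate>(?:[0-9,])*)"
    r"|(?:Total tax losses carried forward to later income years J[\s.,]*)"
    r"(?P<TotalTaxLosses>(?:[0-9,])*)"
)

# Hypothetical text (an assumption), standing in for parsed PDF content.
SAMPLE_TEXT = (
    "Name of trust Example Family Trust Australian business number 00 000 000 000 "
    "Income of the trust estate A 12,345 "
    "Total tax losses carried forward to later income years J 6,789"
)


def extract_fields(text):
    """Keep the first match found for each named group."""
    fields = {}
    for match in PATTERN.finditer(text):
        # Each match fills only one branch; the other groups are None.
        for name, value in match.groupdict().items():
            if value is not None and name not in fields:
                fields[name] = value.strip()
    return fields


print(extract_fields(SAMPLE_TEXT))
```

Because the three branches are joined with ‘|’, every match fills exactly one named group, which is why the sketch loops over all matches rather than reading a single one.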

Best Regards
Andrew

Ankit_smart · September 9, 2020, 1:40am

Thanks Andrew, and thanks for explaining the logic behind the regex. However, it does not seem to be working. I tried testing one line at a time, but none of them seems to work.

I am also attaching my workflow for reference.

PDF file Extractor.knwf (16.4 KB)

izaychik63 · September 9, 2020, 12:02pm

You can also look at

node and examples.

Andrew_Steel · September 9, 2020, 12:46pm

Hi @Ankit_smart,

In my Regex Extractor configuration I didn’t use any flags.
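
As a rough illustration of why a flag can break an otherwise correct pattern, here is a small Python sketch. It assumes a "verbose" style flag (re.VERBOSE in Python, comparable to what Java calls COMMENTS); which flags were actually enabled in the attached workflow is not stated in this thread, so treat this as one possible cause only. The input text is made up, and the named group uses Python's (?P<Name>...) spelling.

```python
import re

# Hypothetical one-line input (an assumption, not taken from the real PDF).
TEXT = "Income of the trust estate A 12,345"

# One branch of the pattern from this thread.
BRANCH = r"(?:Income of the trust estate A\s)(?P<IncomeOfTrustEstate>(?:[0-9,])*)"

# No flags: the spaces inside the anchor are matched literally.
no_flags = re.search(BRANCH, TEXT)

# Verbose mode: unescaped whitespace in the pattern is ignored, so the anchor
# effectively becomes "IncomeofthetrustestateA\s" and no longer matches.
verbose = re.search(BRANCH, TEXT, flags=re.VERBOSE)

print(no_flags.group("IncomeOfTrustEstate") if no_flags else None)
print(verbose)
```

Without flags the value is found; in verbose mode the search returns no match, because the anchor text contains spaces.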

Best regards
Andrew

Ankit_smart · September 12, 2020, 11:18pm

Thanks a ton, Andrew. I didn’t know what the flags do, but thanks for pointing me in the right direction. This helps me heaps. I never thought I could do it. Really love KNIME.

Ankit_smart · September 12, 2020, 11:21pm

Thanks Izaychik63. This one looks interesting as well. I will give it a shot soon.
