Extract specific data from multiple PDFs

Hi Team,

Thanks for your support over this period. I have been able to learn new things, which has increased my overall productivity. I have been trying to extract some data from multiple PDFs for 2 weeks, but I haven’t had any success. So I did it manually, but now that I have the time, I want to understand how I can extract specific data from a PDF.

I have seen the example pdf_extract.knwf (57.8 KB), which matches my case.

I understand that I need to follow this sequence to get my data.

image

The problem I have is that I am not able to write the regex to extract my specific data.

image

I need to extract

  1. Name of the Trust - Page 1
  2. Income of Trust estate - Page 2
  3. Total tax losses carried forward to later income years - Page 3

Link to the PDF file: https://drive.google.com/file/d/1n5-MBs5R4Fhv_IuakRgiewzhCUTCVwxS/view?usp=sharing

Can someone please help me create the regex for this? Hopefully, with this I will be able to understand the extraction technique, and then I can add a few more items that I need to extract.

Thanks
Ankit


Hi @Ankit_smart
There are online regex testers available. I use https://regex101.com/ often to get my regex working. Maybe this works for you too.


Hi @Ankit_smart,

This regular expression extracts your 3 values:

(?:Name of trust)(?<NameOfTrust>(?:(?!Australian business number).)*)|(?:Income of the trust estate A\s)(?<IncomeOfTrustEstate>(?:[0-9,])*)|(?:Total tax losses carried forward to later income years J[\s.,]*)(?<TotalTaxLosses>(?:[0-9,])*)

Description of the parts, so you can build further expressions:
(?:Name of trust) - anchor before your value
(?<NameOfTrust> - name of your value
(?!Australian business number) - anchor that terminates your value (if needed)
(?:(?!Australian business number).)*) - matches your value up to the terminating anchor
(?:[0-9,])*) - matches your value without a terminating anchor (digits and thousands separators)
‘|’ - separates your values
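
If you want to try the expression outside KNIME, here is a minimal sketch in Python. Two assumptions: Python's re module wants (?P<name>...) instead of (?<name>...), so the group syntax is rewritten, and SAMPLE_TEXT is a made-up stand-in for the text extracted from the PDF, not the contents of the real file. The helper extract_values is also just an illustrative name.

```python
import re

# The expression from this post, with (?<name>...) rewritten as
# (?P<name>...) because Python's re module requires that spelling.
PATTERN = re.compile(
    r"(?:Name of trust)(?P<NameOfTrust>(?:(?!Australian business number).)*)"
    r"|(?:Income of the trust estate A\s)(?P<IncomeOfTrustEstate>(?:[0-9,])*)"
    r"|(?:Total tax losses carried forward to later income years J[\s.,]*)"
    r"(?P<TotalTaxLosses>(?:[0-9,])*)"
)

# Hypothetical sample text (an assumption, NOT taken from the real PDF).
SAMPLE_TEXT = (
    "Name of trust Example Family Trust Australian business number 00 000 000 000 "
    "Income of the trust estate A 12,345 "
    "Total tax losses carried forward to later income years J 6,789"
)


def extract_values(text):
    """Return the first non-empty value found for each named group."""
    values = {}
    for match in PATTERN.finditer(text):
        # Each match comes from one alternative, so only one group is filled.
        for name, value in match.groupdict().items():
            if value and name not in values:
                values[name] = value.strip()
    return values


result = extract_values(SAMPLE_TEXT)
print(result)
```

Each alternative separated by ‘|’ produces its own match, so the loop visits three matches and keeps whichever named group was filled in each one.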

Best Regards
Andrew


Thanks, Andrew, for explaining the logic behind the regex. However, it does not seem to be working. I was trying to test one line at a time, but none of them seems to work.


I am also attaching my workflow for reference.

PDF file Extractor.knwf (16.4 KB)

You can also look at


node and examples.

Hi @Ankit_smart,

In my Regex Extractor config I didn’t use any flags.
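
As a rough illustration of why a flag can break an expression that is otherwise fine, here is a small Python sketch. I don't know which flag was ticked in your node, so this is only an assumption: it uses Python's re.VERBOSE, which behaves like Java's COMMENTS flag and ignores whitespace inside the pattern. The sample text is made up.

```python
import re

# First alternative of the expression, in Python's (?P<name>...) spelling.
pattern = r"(?:Name of trust)(?P<NameOfTrust>(?:(?!Australian business number).)*)"

# Hypothetical sample text (an assumption, not the real PDF contents).
text = "Name of trust Example Family Trust Australian business number"

# Without flags, the spaces in "Name of trust" are matched literally.
no_flags = re.search(pattern, text)

# With re.VERBOSE, unescaped whitespace in the pattern is ignored, so the
# anchor effectively becomes "Nameoftrust" and no longer occurs in the text.
with_verbose = re.search(pattern, text, flags=re.VERBOSE)

print(no_flags.group("NameOfTrust").strip() if no_flags else None)
print(with_verbose)
```

Other flags change matching in different ways, so it is worth switching them all off first and enabling only the ones you actually need.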

Best regards
Andrew

Thanks a ton, Andrew. I didn’t know what the flags do, but thanks for pointing me in the right direction. This helps me heaps. I never thought I could do this. Really love KNIME.

Thanks Izaychik63. This one looks interesting as well. I will give it a shot soon.
