Hello everyone, I’m new to KNIME and have been struggling with extracting certain data from a batch of PDF files that then needs to be exported to Excel. The data extracted from each file then has to be matched with the file it came from when exporting to Excel.
So far I’ve created a workflow using Tika Parser, String Manipulation, Row Filter, and Excel Writer, but it’s not giving me clean results. I have attached a picture of one of the pages in the PDF file. I want to extract the full name I marked in red, and the address. All the pages and files are in the same format.
Hello, thank you for the reply. So this invoice cannot be done without Python? All I need is the information I marked in red. It’s in the same location for every page and file.
Okay, I see. I don’t know how to use the R node. Well, yes, this is just an example. I need to be able to extract names and addresses from any location on the page. Sometimes there’s a “Name:” in front of the name, but sometimes there isn’t. I need to figure out which node will show me the rows of text on a page; then I could choose which row to extract from. In this case I’d want whatever rows hold the name and address in this invoice.
Note: when I load these types of files through Tika Parser, it puts all the data into one column, the Content column. Could I use Column Splitter with a delimiter to split the lines of text? Any insight on what delimiter I could use? And what other node could I then use to choose which row to extract?
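For illustration, here is a rough Python sketch of what I mean; the “File”/“Content” column names and the “Name:”/“Address:” labels are just assumptions based on my example:

```python
import pandas as pd

# Hypothetical Tika Parser output: one row per file, with the file path
# and the whole page text crammed into a single "Content" cell.
df = pd.DataFrame({
    "File": ["invoice_001.pdf"],
    "Content": ["Invoice\nName: Jane Doe\nAddress: 123 Main St\nTotal: 42.00"],
})

def extract_field(text, label):
    # Tika joins the page into one string, so "\n" acts as the row delimiter.
    for line in text.split("\n"):
        if line.startswith(label):
            return line[len(label):].strip()
    return None

df["Name"] = df["Content"].apply(lambda t: extract_field(t, "Name:"))
df["Address"] = df["Content"].apply(lambda t: extract_field(t, "Address:"))
print(df[["File", "Name", "Address"]])
```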
@NabilEnn you will most likely need some sort of structure to identify the content you want. This might be a position or a keyword. Another thing you could try is to feed the extracted text (or parts of it) into an LLM and instruct it to answer with a JSON object that contains the content.
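A hedged sketch of that LLM idea using the OpenAI Python client (the model name, the prompt, and the sample invoice text are placeholders; any model that can return JSON would do):

```python
import json
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()
page_text = "Invoice\nJane Doe\n123 Main St"  # e.g. text from Tika Parser

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; pick any JSON-capable model
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": ("Extract the customer name and address from the invoice "
                     'text. Answer only with JSON: {"name": ..., "address": ...}')},
        {"role": "user", "content": page_text},
    ],
)

fields = json.loads(response.choices[0].message.content)
print(fields["name"], fields["address"])
```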
The question is whether you could provide a sample of PDFs that represents an overview of your challenges without spilling any secrets.
I can’t seem to find anything downloadable online; I’d have to create a PDF using Docs or something. I’m also running into issues with the String Manipulation node, where I can’t get the right regex or keywords for the column I want to extract from.
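To show what I am attempting with the regex, here is a quick sketch I tried outside KNIME; the pattern and the optional “Name:” label are just my assumptions:

```python
import re

# The label is sometimes present and sometimes not, so it is optional here.
pattern = re.compile(r"^(?:Name:\s*)?(?P<name>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)$")

for line in ["Name: Jane Doe", "Jane Doe", "Total: 42.00"]:
    m = pattern.match(line)
    if m:
        print(m.group("name"))  # prints "Jane Doe" twice, skips the total
```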
It will automatically extract all tables that are in a list of PDFs, and you can also define areas from which to extract text, also across multiple pages. Maybe you can give this approach a try:
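Outside of KNIME, the same area-based idea can be sketched in Python with pdfplumber; the file name and the bounding-box coordinates below are made up and would need to be measured against your real layout:

```python
import pdfplumber  # pip install pdfplumber

# Hypothetical coordinates of the name/address block, in PDF points:
# (x0, top, x1, bottom), measured from the top-left corner of the page.
ADDRESS_BOX = (50, 100, 300, 180)

with pdfplumber.open("invoice_001.pdf") as pdf:
    for page in pdf.pages:
        # Crop each page to the fixed area and read only that text,
        # which works because the layout is identical on every page.
        region = page.within_bbox(ADDRESS_BOX)
        print(region.extract_text())
```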
Extracting data from PDF files will almost never immediately result in structured data. Even well-organized PDFs containing well-structured tables like this:
Whilst PDFs do have an internal structure, each “PDF printer” generates slightly different results, and Adobe still holds some sort of monopoly on the format. Hence, assuming PDFs will be parsed perfectly just because they are neatly designed on screen will almost always fail miserably.
That is why I developed some principles to get around most of these limitations:
Do not parse line breaks into multiple cells; instead, read everything in one go, keeping the column and row separators
Use String Manipulation to inject unique markers like “$PAGE_BREAK$”, e.g. via a regex replace
Factor in that the String Replacer and String Manipulation nodes struggle to cope with line breaks within cells
Sanitize your data by removing unwanted strings and (temporarily) replacing line breaks (“\r\n”) with unique markers like “$LINE_BREAK$”
Re-establish the structure by splitting columns and cells on the unique markers injected before (see the sketch after this list)
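A minimal Python sketch of these principles; the sample text, the page-footer pattern, and the form-feed page separator are assumptions:

```python
import re
import pandas as pd

# Hypothetical parser output: Windows line breaks inside pages,
# a form feed (\f) between pages, and a footer we want to drop.
raw = ("Name: Jane Doe\r\nAddress: 123 Main St\r\nPage 1 of 2\f"
       "Name: John Roe\r\nAddress: 456 Oak Ave\r\nPage 2 of 2")

# Inject unique markers so no real line break survives into the table.
text = raw.replace("\r\n", "$LINE_BREAK$").replace("\f", "$PAGE_BREAK$")

# Sanitize: remove unwanted strings such as page footers.
text = re.sub(r"Page \d+ of \d+", "", text)

# Re-establish the structure by splitting on the injected markers.
rows = []
for page in text.split("$PAGE_BREAK$"):
    fields = {}
    for cell in page.split("$LINE_BREAK$"):
        if ":" in cell:
            key, value = cell.split(":", 1)
            fields[key.strip()] = value.strip()
    rows.append(fields)

print(pd.DataFrame(rows))  # one row per page, with Name and Address columns
```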