Using OpenAI to analyze PDFs

well… let’s say it depends.

When you use a node like the Tika Parser, it extracts all information as plain text - that means the information from tables is included as well, but without maintaining the "tabular" structure. When you then ask the LLM to summarise, that information will be part of the prompt, but in general LLMs tend to be fairly poor at understanding tabular data, and as far as I know this has not changed.
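
Just to illustrate that flow outside of KNIME: below is a minimal Python sketch of "extract flat text with Tika, then ask an LLM to summarise it". It assumes the tika and openai packages plus a Java runtime are available; the file name and model are placeholders, not a recommendation.

```python
# Minimal sketch: extract flat text from a PDF with Tika, then ask an LLM to summarise it.
from tika import parser
from openai import OpenAI

parsed = parser.from_file("report.pdf")   # placeholder file name
flat_text = parsed.get("content", "") or ""

# Note: any tables in the PDF arrive here as plain lines of text;
# their row/column structure is gone at this point.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "user", "content": f"Summarise the following document:\n\n{flat_text[:100000]}"},
    ],
)
print(response.choices[0].message.content)
```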

With regards to images: yes, there are LLMs that can take images as input. That said, right now, at least with the nodes included in the GenAI extension developed by KNIME, it is not possible to send images to an LLM. As a workaround I can point you again to my extension above, which includes a vision model prompter node, or alternatively there is an example workflow from @roberto_cadili that takes care of this via a POST request.
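
To give you an idea of the general shape of such a POST request, here is a rough Python sketch of the standard OpenAI chat completions call with an embedded image (this is not @roberto_cadili's workflow; API key, file name, model and prompt are placeholders):

```python
import base64
import requests

API_KEY = "sk-..."        # your OpenAI API key (placeholder)
IMAGE_PATH = "page.png"   # e.g. a PDF page rendered as an image (placeholder)

# Encode the image as base64 so it can be embedded in the JSON payload
with open(IMAGE_PATH, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "gpt-4o-mini",  # placeholder; any vision-capable model works
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarise the text, tables and figures on this page."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```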

So in your context there are some additional challenges:

  1. Your PDF contains text, tables and images => picking one PDF apart and extracting text, tables and images separately is, I think, challenging, at least if you want to go the entirely low-code way (no Python scripts etc…) - see the rough sketch after this list for what that would involve
  2. Even if you manage to solve 1), it will likely be tricky to set up feeding this data to an LLM in a structured way so that it interprets the content, and the relationships between text, tables and images, correctly
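
To give a sense of how much plumbing point 1 involves if you do drop down to scripting (which is exactly what you wanted to avoid), here is a rough Python sketch using pdfplumber for text/tables and PyMuPDF for embedded images; the file name is a placeholder and real PDFs will need more error handling:

```python
import pdfplumber
import fitz  # PyMuPDF

texts, tables = [], []
with pdfplumber.open("input.pdf") as pdf:        # placeholder file name
    for page in pdf.pages:
        texts.append(page.extract_text() or "")
        tables.extend(page.extract_tables())     # each table = list of rows (lists of cell strings)

images = []
doc = fitz.open("input.pdf")
for page in doc:
    for img in page.get_images(full=True):
        xref = img[0]
        images.append(doc.extract_image(xref)["image"])  # raw image bytes

print(len(texts), "pages of text,", len(tables), "tables,", len(images), "embedded images")
```

And even with all three pieces extracted, you still have to decide how to stitch them back together in the prompt, which is point 2.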

So my thought right now is:

  1. Try and see how far just sending the text extracted via the Tika Parser gets you - if the results are good enough and solve your use case, perfect.
  2. If 1) does not work out:
    Maybe try to convert the PDFs to images and feed these images to an LLM with vision capability alongside your prompt (a small conversion sketch follows after the video link below). Vision has improved significantly in the last 6 months, and text recognition, table recognition and image recognition seem to work very well.
    Sorry again for some more shameless self-promotion, but I happen to have experimented with vision models on various use cases and published an article and a video on it - the last test I did was feeding vision models an image of a PDF that contained text and graphs - take a look here (24:05):
    https://www.youtube.com/watch?v=ueDN0jsQiHE&t=1447s
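
Since the conversion from PDF pages to images is the part people often ask about, here is a small sketch of just that step (pdf2image needs poppler installed on the system; the file name and DPI are only examples):

```python
# Render each PDF page as an image so a vision-capable model can "look" at it.
from pdf2image import convert_from_path

pages = convert_from_path("report.pdf", dpi=200)   # one PIL image per page (placeholder file name)
for i, page_image in enumerate(pages):
    page_image.save(f"page_{i:03d}.png")

# Each PNG can then be base64-encoded and sent to the model,
# e.g. via a POST request like the one sketched earlier in this reply.
```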

I did test gpt-4o-mini, but not the stronger gpt-4o models. From an expectation perspective, I think the stronger gpt-4o models probably perform similarly to Anthropic's Claude 3.5 Sonnet (which did very well on this task when I tested it in the video above).

Edit: If you also think it'd be great to be able to prompt vision models using the KNIME-developed GenAI extension, I'd appreciate it if you "voted" for my feature request here :slight_smile:
