Hi,
Is it possible to read the information from a PDF file by specifying a page range?
I have a PDF that contains information about one person on every two pages, so I need to read only two pages at a time to analyze the information there.
I am trying to do it with the Tika Parser, but it returns all of the PDF’s content in a single row.
Does anyone know how I can configure it to achieve what I want?
Or does anyone know how to do it without the Tika Parser?
Hi @armingrudd,
The PDF has a standard format, so I know how to extract the information with string manipulation. What I need is to take only two pages of the full PDF in each iteration, because every two pages describe a new person, and then parse those pages the way I want.
I have found a Python library called PyPDF2. Using this code in a Python Edit Variable node, I can get what I want to achieve:
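from PyPDF2 import PdfFileReader

# Path to the PDF, taken from the KNIME flow variable 'Location'
pdf_document = flow_variables['Location']
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()  # document metadata (title, author, ...)
    pages = pdf.getNumPages()     # total number of pages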
It may well be that you have to use a loop. Maybe you could provide us with an example of a PDF you want to use and the extract you would expect as a result.
It looks as if the output in this case is a KNIME flow variable. Would that be sufficient or would you need a table?
How many pages are we typically talking about? A KNIME loop might be more costly than a loop inside Python, but that depends on what you want to extract.
Hi @mlauber71,
Yes, I plan to use a loop, with the number of pages taken per iteration as a variable. I know how to set up the loop, but I don’t know how to change the code: it only takes one page, and I need it to take a range of pages (which I would later turn into a variable).
I can’t show you an example of the PDF, because it contains real information about people. Its structure should not matter, though, because the only thing I don’t know is how to change the code above so that it takes a range of pages in each iteration. After that, as you say, I get the information as output in the KNIME variable, not as a table, and then I use string manipulation to extract the information I want.
Do you know what change I must make in the code to read not just one page, but a range of PDF pages?
P.S. The PDF has only 50 pages, so it is not very long.
I would assume it is the getPage method, but I have not tested it.
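To make that idea a bit more concrete, here is a minimal sketch, untested against your file. It assumes the legacy PyPDF2 API used above (before version 3.0), where pages are fetched with getPage and their text with extractText. The helper names extract_page_range and read_range_from_file are made up for illustration and are not part of PyPDF2. In PyPDF2 3.x and pypdf these calls were renamed (PdfReader, reader.pages[i], extract_text()), so adjust accordingly if you are on a newer version.

```python
def extract_page_range(reader, start, stop):
    """Return the text of pages start..stop-1 (0-based) as one string.

    `reader` is assumed to be a PyPDF2 PdfFileReader (legacy API, < 3.0).
    This helper is hypothetical and not part of PyPDF2 itself.
    """
    texts = []
    for index in range(start, stop):
        page = reader.getPage(index)       # page indices are 0-based
        texts.append(page.extractText())   # legacy text extraction call
    return "\n".join(texts)


def read_range_from_file(path, start, stop):
    """Open the PDF at `path` and extract the given page range."""
    # Imported here so the helper above can be reused without PyPDF2 loaded.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as filehandle:
        reader = PdfFileReader(filehandle)
        # Never read past the last page of the document.
        stop = min(stop, reader.getNumPages())
        return extract_page_range(reader, start, stop)
```

In the Python Edit Variable node you could then call something like read_range_from_file(flow_variables['Location'], 0, 2) for the first person and write the result to an output flow variable. How clean the extracted text is depends on the PDF, so the string manipulation afterwards may need some tuning.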
You might need two Python scripts: one to determine the number of pages and one to loop through them.
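With only 50 pages, both steps might also fit into a single script. The sketch below shows just the bookkeeping part: turning a page count (what getNumPages returns) into two-page ranges, one per person. The function name page_chunks and the parameter pages_per_person are my own placeholders.

```python
def page_chunks(num_pages, pages_per_person=2):
    """Yield (start, stop) page ranges, 0-based, with stop exclusive.

    The last range is shortened if num_pages is not an exact multiple
    of pages_per_person.
    """
    for start in range(0, num_pages, pages_per_person):
        yield start, min(start + pages_per_person, num_pages)


# Example with the 50 pages mentioned in this thread.
chunks = list(page_chunks(50, 2))
print(len(chunks))            # 25
print(chunks[0], chunks[-1])  # (0, 2) (48, 50)
```

Each (start, stop) pair can then drive an inner loop over pdf.getPage(i) for i in range(start, stop), either inside one Python node or with start and stop passed in as flow variables from a KNIME loop.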
A working example would still go a long way. If you could provide one that represents your case without revealing any confidential data, that would be good and might help other people understand these tools.