Read PDF page by page

daviddelos · April 7, 2020, 10:31pm

Hi,
Is it possible to read the information in a PDF file by specifying a page range?
I have a PDF with information about one person every two pages, so I need to read only two pages per iteration and analyze the information there.
I am trying to do it with the Tika Parser, but it puts all the PDF text in a single row.

Does anyone know how I can configure it to achieve what I want?
Or does anyone know how to do it other than with tika parser?

izaychik63 · April 8, 2020, 12:45am

Looks like it is not possible using just KNIME. See below

armingrudd · April 8, 2020, 5:12am

Hi @daviddelos,

Not sure what exactly your case is but if there is a pattern, you can use regex to extract and split the sections you need.

An example PDF would help here.
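As a rough illustration of the regex idea, here is a minimal sketch that splits one long text (such as the single row produced by the Tika Parser) into per-person sections. It assumes every record starts with a line beginning with `Name:`; that marker, the sample text, and the helper name `split_sections` are all hypothetical and would have to be adapted to the real layout.

```python
import re

def split_sections(text, marker=r"Name:"):
    """Split text into sections that each start at a line beginning with marker.

    The default marker is a made-up example; replace it with whatever
    pattern reliably opens a new record in the real document.
    """
    # Zero-width lookahead: split before each marker line without consuming it
    parts = re.split(r"(?=^" + marker + r")", text, flags=re.MULTILINE)
    # Drop empty fragments (for example, the one before the first marker)
    return [part.strip() for part in parts if part.strip()]

# Invented sample text standing in for the extracted PDF content
sample = "Name: Ann\nAge: 30\nName: Bob\nAge: 41\n"
for section in split_sections(sample):
    print(repr(section))
```

This only works if the marker really appears once per person; if the only reliable boundary is the page break, a page-based approach is the safer choice.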

daviddelos · April 8, 2020, 3:17pm

Hi @armingrudd,
The PDF is in a standard format, so I know how to extract the information with string manipulation. What I need is to take only two pages of the full PDF in each iteration, since every two pages cover a new person, so that I can extract the information the way I want.

daviddelos · April 8, 2020, 3:23pm

Hi @izaychik63,

I have found a Python library called PyPDF2. Using this code in a Python Edit Variable node, I can get what I want to achieve:

from PyPDF2 import PdfFileReader

# flow_variables is supplied by the KNIME Python node
pdf_document = flow_variables['Location']
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()
    pages = pdf.getNumPages()

    print(info)
    print("number of pages: %i" % pages)

    # Page numbers are zero-based; read the page while the file is still open
    page1 = pdf.getPage(0)
    print(page1)
    print(page1.extractText())
    flow_variables['content'] = page1.extractText()

Do you have any idea how to take a range of pages with this code, rather than one specific page?
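One possible way to do this, sketched with the same legacy PyPDF2 calls used in the code above (`getNumPages`, `getPage`, `extractText`): loop over the page indices in the range and join the text. The helper name `extract_page_range` and its parameters are made up for illustration, and the sketch has not been run against the real file.

```python
def extract_page_range(pdf, first_page, page_count):
    """Return the text of page_count pages starting at first_page (zero-based).

    pdf is any reader object exposing getNumPages() and getPage(i), such as
    a PyPDF2 PdfFileReader. The function name and parameters are illustrative.
    """
    # Clamp the end so a short final group does not raise an index error
    last_page = min(first_page + page_count, pdf.getNumPages())
    texts = []
    for page_number in range(first_page, last_page):
        texts.append(pdf.getPage(page_number).extractText())
    return "\n".join(texts)
```

Inside the `with open(...)` block, something like `flow_variables['content'] = extract_page_range(pdf, 0, 2)` would then give the first person, `extract_page_range(pdf, 2, 2)` the second, and so on.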

izaychik63 · April 8, 2020, 3:25pm

Sorry, but I’m not a Python guy. Redirect it to @mlauber71.

mlauber71 · April 8, 2020, 3:29pm

Could very well be you would have to use a loop or something. Maybe you could provide us with an example of a PDF you want to use and an extract you would expect as a result.

It looks as if the output in this case is a KNIME flow variable. Would that be sufficient or would you need a table?

How many pages are we typically talking about? A KNIME loop might be more costly than a loop inside Python. But that depends on what you want to extract.

daviddelos · April 8, 2020, 3:44pm

Hi @mlauber71,
Yes, I plan to use a loop, with the number of pages it takes per iteration as a variable. I know how to do that, but I don’t know how to change the code: it only takes one page, and I need it to take a range of pages (which I would later convert to a variable).
I can’t show you an example of the PDF because it contains real information about people. Don’t worry about its structure, though. The only thing I don’t know is how to change the code above so it takes a range of pages in each iteration. After that, as you say, I get the information as output in the KNIME variable, not as a table, and then I do string manipulation to extract the information I want.

Do you know what change I must make in the code to read a range of PDF pages rather than a single page?
P.S. The PDF only contains 50 pages, so it is not very long.

mlauber71 · April 9, 2020, 5:50am

I would assume it is the getPage command but have not tested it.

It could be you would need two Python scripts, one determining the number of pages and one to loop through them.
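The loop bounds themselves need only the page count. A small sketch, assuming a fixed number of pages per person; the name `page_groups` is hypothetical and nothing here touches the PDF itself:

```python
def page_groups(total_pages, pages_per_group=2):
    """Return (start, stop) zero-based page index pairs, with stop exclusive."""
    if pages_per_group < 1:
        raise ValueError("pages_per_group must be at least 1")
    return [
        (start, min(start + pages_per_group, total_pages))
        for start in range(0, total_pages, pages_per_group)
    ]

# For the 50-page PDF mentioned above, with two pages per person
groups = page_groups(50, 2)
print(len(groups))   # 25
print(groups[0])     # (0, 2)
print(groups[-1])    # (48, 50)
```

Each pair could drive one iteration, either in a KNIME loop or in a Python `for` loop that reads pages `start` to `stop - 1` with `getPage`.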

A working example would still go a long way. If you could provide one representing your case without spilling secrets, that would be good and might help other people understand these tools.

I would have to see if I can construct one later.