Converting a Transcript to Text and Retaining References

Right at the outset: I’m new. Like, really new. So please forgive me if my question is basic.

To provide some context, I have a PDF of a deposition transcript where the left margin contains line numbers, and page numbers are located at the bottom of each page. The speakers’ names are capitalized and followed by a colon. My objective is to convert this PDF into an Excel file while preserving the page and line number references. If a sentence or paragraph spans multiple lines (and, as a result, multiple cells, at least as I am envisioning it), I’d like it to be clear that they are all part of the same idea and not distinct pieces of data just because they’re on different lines, if that makes sense.

Is this even possible? And, if so, how? I’ve been banging my head against the wall for days on this.

Thanks a ton in advance.

Hi @pgs2011,
Welcome to the KNIME Forum! That is actually an interesting problem and may need some complex processing. Our nodes for retrieving text from PDFs, such as the Tika Parser or PDF Parser, only output the contained text, which may make it difficult for you to parse out the required information, as any layout info is lost. But maybe give it a try and see if, with some clever processing, you can extract what you need.
Alternatively, you may use a more sophisticated method such as Azure Form Recognizer. It not only outputs the plain text, it also gives you the bounding box for each word, so you can identify individual lines or columns of text. Some time ago I published a component for the Azure Form Recognizer on our Community Hub:
Azure Form Recognizer – KNIME Community Hub. Give it a shot and let me know how it goes!
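To illustrate why per-word bounding boxes help, here is a minimal Python sketch of how words could be regrouped into lines by their vertical position. It is only an illustration under assumptions: `Word`, `group_into_lines` and `y_tolerance` are hypothetical names, the coordinates are made up, and none of this reflects the actual response format of the Form Recognizer API or the internals of the component.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # assumed: left edge of the word's bounding box
    y: float  # assumed: vertical position of the word's bounding box


def group_into_lines(words, y_tolerance=0.05):
    """Group words into text lines by vertical position (hypothetical helper)."""
    lines = []
    # Walk through the words from top to bottom.
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the first word placed in the current line.
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words from left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


words = [
    Word("Yes.", 2.0, 1.02),
    Word("1", 0.5, 1.00),
    Word("WITNESS:", 1.5, 1.01),
    Word("THE", 1.0, 1.00),
    Word("2", 0.5, 1.30),
    Word("Correct.", 1.0, 1.31),
]
print(group_into_lines(words))
```

The tolerance would have to be tuned to the units and line spacing of the real coordinates, and a multi-page document would need the grouping done per page.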
Kind regards,
Alexander

Thanks a ton for the response. I used PDF Parser and Document Data Extractor, and all of the information from the PDF was loaded into a single cell. Now the issue is figuring out how to split the cell based on the needs I outlined above. In glancing around, I’ve seen about 50 different approaches. Some suggest using String Manipulation; others suggest Rule Engine or Regex Split.

So yeah, it’s all a bit overwhelming. Any thoughts?

Hi,
Can you share a sample of your data? Otherwise it will be tough to come up with ideas.
Kind regards
Alexander

Here’s a Google drive link to what I’m working with: https://drive.google.com/file/d/1lYjPueXyiwcTKWEtnvZUWin6l0ee0q8N/view?usp=sharing

It’s an excerpt of a PDF deposition transcript. It’ll give you a good idea of the formatting constraints I’m working with. Thank you so much, seriously.

Also, I saw you had requested access to view the PDF. I thought the permissions were open, but granted your request just in case. Thanks a ton!

Hi,
Took a while, but I got a proof of concept. I used the free tier of Azure Form Recognizer, which only parses two pages, but for those the result looks good. Please have a look at the attached workflow. I commented it, but here is what it does:

  1. Read the file and send it to the Azure Form Recognizer API for OCR (I did not want to share my credentials, so I included the result of the API call as a file in the workflow)
  2. Parse the JSON response and extract the individual words and their location
  3. Identify lines
  4. Find persons in lines
  5. Associate all following lines that do not start with a person’s name + colon with the last mentioned person
  6. Group everything together

The result can easily be exported using an Excel Writer. You may want to filter out certain parts of the text based on their coordinates and appearance; e.g., the text EXAMINATION is a headline and is currently associated with the text block coming before it.
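For readers who want to see the logic of steps 4 to 6 outside KNIME, a minimal Python sketch might look like the following. It is not the implementation used in the attached workflow. It assumes each recovered line arrives as a `(page, text)` pair whose text starts with the margin line number, and that a speaker is a run of capital letters followed by a colon; `build_rows`, `SPEAKER` and `NUMBERED` are hypothetical names, and the sample lines are invented.

```python
import re

# Assumed speaker pattern: capitals, dots, apostrophes and spaces, then a colon.
SPEAKER = re.compile(r"^([A-Z][A-Z.' ]+):\s*(.*)$")
# Assumed line layout: margin line number, whitespace, then the text.
NUMBERED = re.compile(r"^(\d+)\s+(.*)$")


def build_rows(lines):
    """Merge transcript lines into one row per speaker turn (hypothetical helper)."""
    rows = []
    for page, raw in lines:
        numbered = NUMBERED.match(raw)
        if not numbered:
            continue  # skip footers and other lines without a margin number
        line_no, text = int(numbered.group(1)), numbered.group(2)
        speaker = SPEAKER.match(text)
        if speaker:
            # A new speaker starts a new row.
            rows.append({
                "speaker": speaker.group(1),
                "start": (page, line_no),
                "end": (page, line_no),
                "text": speaker.group(2),
            })
        elif rows:
            # Continuation line: attach it to the last mentioned speaker.
            rows[-1]["end"] = (page, line_no)
            rows[-1]["text"] = (rows[-1]["text"] + " " + text).strip()
        # Lines before the first speaker are dropped in this sketch.
    return rows


sample = [
    (1, "1 MR. SMITH: Where were you on"),
    (1, "2 the night in question?"),
    (1, "3 THE WITNESS: At home."),
    (1, "Page 1"),
    (2, "1 I did not leave."),
]
for row in build_rows(sample):
    print(row)
```

Each row keeps the page and line where the turn starts and ends, so a statement that spans several lines or pages stays in a single row. As with the workflow, a headline such as EXAMINATION would still be appended to the preceding block unless it is filtered out first.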

Let me know if you have any questions!
Alexander

Deposition Parser.knwf (599.6 KB)
