Text mining PDF documents and locating the search results

Hello everyone,

I would like to ask for some help with a text search.

I have some long PDF files in which I would like to search for words.

Is there a way to export the locations (page/row) of the search results?

I would like to see which page/row the results are on.

For example, I search for some text (like “volvo”).

The results would then be listed on a separate page where I can see which pages/rows of the documents contain the word “volvo” (e.g. page 2, page 5, page 13, etc.).

Thank you in advance!

Tamas

There is no easy way to do that (that I know of), but why would you want to do this? Are you trying to see the context in which the word “volvo” was used, for instance? You could do that with a context window. See challenge 38 of Just KNIME It! for how to do so.

If you want general information on how to extract text from a PDF, here is the webinar I ran in August:


@janszky_nav I adapted an older example (How to read multiple lines from PDF File - #6 by mlauber71) that uses R code to extract the text from PDF files and then finds the page and line where a specific word (in this case “Procter”) appears.
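
For readers who just want the idea, here is a minimal sketch of the page/line lookup in Python (not the R code from the linked workflow). It assumes the text has already been extracted into one string per page; the function name `find_word_locations` and the sample pages are made up for illustration.

```python
import re

def find_word_locations(pages, word):
    """Return (page_number, line_number, line_text) for each line containing `word`.

    `pages` is assumed to be a list of strings, one per PDF page, produced by
    whatever PDF reader you use. Matching is case-insensitive and whole-word.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((page_no, line_no, line.strip()))
    return hits

# Made-up sample pages standing in for text extracted from a PDF
sample_pages = [
    "Annual report\nThe Volvo fleet grew.\nNo other news.",
    "Nothing relevant here.",
    "Costs\nFuel\nvolvo service costs rose.",
]

for page_no, line_no, text in find_word_locations(sample_pages, "volvo"):
    print(f"page {page_no}, line {line_no}: {text}")
```

The resulting list of (page, line, text) tuples can be written to a table or CSV, which gives the "results on a separate page" the question asks for.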


I want to do some string manipulation, not word embeddings or other text processing. I just made it up for the challenge as a small lab for string, character, punctuation, and word manipulation, etc.
Thanks for the webinar info.


Dear mlauber71,
Thank you for your solution.


Dear Victor,

Thank you for your answer, it was very useful, and I watched the webinar too.

May I ask one more question?

I have some PDFs like the sample, and I would like to extract the text message/call data (numbers and text) from them.
The call text is located before the text “A jogosult ügy…”.
Which cell splitter is the best option?

Thank you.
20221027_split_part_of_text.knwf (39.6 KB)
This is a sample PDF. Unfortunately, I can’t upload it as a PDF file, sorry.
KNIME_PDF_Sample_m.txt (316.4 KB)

I would target the phrase “A jogosult ügy” and then extract the 20-30 characters (or more) after it. Then get the data you want from that window.

Unfortunately, the data in the txt file seems corrupted?


To do the window targeting, I would use Regex.
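
As a rough sketch of that window idea, written in Python rather than as a KNIME node: the helper name `window_after`, the sample string, and the window size are assumptions for illustration only, not taken from the attached workflow.

```python
import re

def window_after(text, phrase, size=30):
    """Return up to `size` characters following each occurrence of `phrase`."""
    # re.escape keeps any special characters in the phrase literal;
    # re.DOTALL lets the window run across line breaks.
    pattern = re.compile(re.escape(phrase) + r"(.{0," + str(size) + r"})", re.DOTALL)
    return [m.group(1) for m in pattern.finditer(text)]

# Made-up sample text; real input would be the text extracted from the PDF
sample = "x A jogosult ügy 12345 abc"

for window in window_after(sample, "A jogosult ügy", size=8):
    numbers = re.findall(r"\d+", window)  # pull the digits out of the window
    print(repr(window), numbers)
```

If the data sits before the phrase instead, as the question describes, the capture group can be moved in front of the escaped phrase so the window covers the preceding characters.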

Here is my cheat sheet for Regex:

Here is what you can use to target that text and extract words after it:

Finally, challenge 38 from Just KNIME It! may help you accomplish the same task.


No, it isn’t corrupted. It is a PDF; I simply changed the file extension (pdf -> txt).
