Assistance Needed: Extracting Specific Highlighted Text from Arabic PDF Files

Hi guys,
I am reaching out to seek assistance with extracting specific text from OCR-processed Arabic PDF files. I need to organize the extracted information into an Excel sheet, as described below:

Thanks

Hi @Ahmed_Kadhim,

Welcome to the community. Just out of curiosity, is this a “one-time job” offering, or are you seeking more general help and support from the community?

If the latter is what you are aiming for, you might want to submit your request in this category instead:

In case this is a paid job / task, I’d be happy to assist you.

Best
Mike

Hi Mike,

Thank you for your response and for welcoming me to the community.

I was initially seeking assistance with a specific task, but if you are able to help streamline the process, I would be open to discussing a paid arrangement. I can provide you with all the details of the requirements, and we can agree on a suitable compensation for your time and expertise.

Looking forward to your thoughts on this.

Regards,

Ahmed

Hi Mike,
I would appreciate it if you could let me know how to reach you to discuss my requirements further.

Ahmed
Email: ahmed.kadhim001@gmail.com
WhatsApp: +9647705634227

Hi Ahmed,

Apologies for my late reply. I have been quite busy with my full-time job, a vacation is coming up, and I have also been providing feedback to the KNIME developers, since some issues slipped through in the 5.3 release.

That said, I have already had a look at your PDF, and KNIME struggles to display it, even though I have the Arabic Language Pack for Text Processing installed.

The Tika Parser, or the font used in KNIME, does not display the characters properly:


The new PDF Parser also runs, but the Arabic letters are not displayed, which in turn makes providing support a little more challenging.

I tried changing the font to make the text from the Tika Parser render properly, but got lost along the way.

@DanielBog can you or a colleague of yours give me some guidance on how to display Arabic properly?

Best
Mike

Hi Ahmed,

It sounds like you’re working with some interesting NLP challenges.
Extracting data from OCR’d PDFs can be tricky, especially with Arabic script.

I would be happy to take this on.

You can reach out to me on my email here

Colin

Hi Colin,

Thank you for your interest! I’m excited about the opportunity to work on these challenges.
Please let me know when you’re available so we can set up a meeting to discuss the details.

Looking forward to your reply.

Best regards,
Ahmed

Hi Mike,

Thank you for your effort in trying to help with this. I appreciate it! Is there any way we could explore to find a solution? Perhaps we could use a different OCR tool or a script to improve the text extraction? Let me know if you have any ideas or if you’d like to brainstorm together.

Thanks again.

Regards,
Ahmed

Hi Ahmed,

Out of curiosity, I have explored a few things. Based on the current situation, I believe the issue originates from the data itself (your PDF), from KNIME’s ability to process or display Arabic characters, or from a combination of these.
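One way to tell a data problem from a display problem is to inspect the extracted characters outside KNIME. The following is only a minimal Python sketch under stated assumptions: it assumes the text has already been extracted from the PDF into a string by some other tool, and the function names (`is_arabic`, `arabic_ratio`, `normalize_presentation_forms`) are hypothetical helpers, not part of KNIME or any parser mentioned here.

```python
import unicodedata


def is_arabic(ch: str) -> bool:
    """Return True if the character lies in one of the main Arabic Unicode blocks."""
    cp = ord(ch)
    return (
        0x0600 <= cp <= 0x06FF      # Arabic
        or 0x0750 <= cp <= 0x077F   # Arabic Supplement
        or 0xFB50 <= cp <= 0xFDFF   # Arabic Presentation Forms-A
        or 0xFE70 <= cp <= 0xFEFC   # Arabic Presentation Forms-B (U+FEFF, the BOM, excluded)
    )


def arabic_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Arabic (0.0 for empty input)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_arabic(c) for c in chars) / len(chars)


def normalize_presentation_forms(text: str) -> str:
    """Map Arabic presentation forms to their base letters via NFKC.

    Note: NFKC also folds other compatibility characters (ligatures,
    full-width digits, ...) and does not repair reversed (visual) ordering.
    """
    return unicodedata.normalize("NFKC", text)
```

If `arabic_ratio` is close to zero for a page that visibly contains Arabic, the extracted text itself is likely damaged (for example by a wrong encoding or a broken OCR text layer). If the ratio is high but the text still looks wrong on screen, the cause is more likely the font or rendering.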

I have already checked a different workflow, and there seems to be a bug, since I face similar issues when using it:

Can you try to run the above workflow and check if the Arabic text is displayed correctly for you please?

Best
Mike

Hi Mike,

I have good news! I was able to obtain the Arabic characters using only the PDF Parser, with the following settings:

Now, I would appreciate it if you could assist me with extracting the words located at the top of the table, along with the table’s information:

And exporting them to an Excel file.
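To make the goal concrete, here is a rough Python sketch of the step I have in mind, outside KNIME. It rests on several assumptions: the page text is already available as a list of lines, table cells are separated by tabs or by two or more spaces, and every line before the first multi-cell line belongs to the heading. The function names and the sample lines are invented for illustration. It writes a CSV with a UTF-8 byte order mark, which Excel generally opens with Arabic intact, rather than a native .xlsx file.

```python
import csv
import re

# Assumed cell separator: a tab, or a run of two or more whitespace characters.
CELL_SEP = re.compile(r"\t|\s{2,}")

# Invented sample input standing in for text extracted from one PDF page.
SAMPLE_LINES = [
    "كشف الحساب",
    "",
    "الاسم    المبلغ",
    "أحمد    100",
]


def split_heading_and_rows(lines):
    """Split page lines into (heading text, list of table rows)."""
    heading, rows = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cells = [c.strip() for c in CELL_SEP.split(line)]
        if len(cells) > 1 or rows:
            # The first multi-cell line starts the table; every later line is a row.
            rows.append(cells)
        else:
            heading.append(line)
    return " ".join(heading), rows


def write_excel_friendly_csv(path, heading, rows):
    """Write the heading as the first row, then the table rows, as UTF-8 with BOM."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow([heading])
        writer.writerows(rows)
```

For example, `heading, rows = split_heading_and_rows(SAMPLE_LINES)` followed by `write_excel_friendly_csv("output.csv", heading, rows)` would produce a file with the heading in the first row and the table below it. Real OCR output will probably need a more robust column-splitting rule than this.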

Thank you for your continued support.

Regards,
Ahmed

Hi Ahmed,

That is strange, as I used the very same settings as you did but got, and still get, strange characters …

Can you share your workflow? I would apply the parsing process to convert the unstructured data into a table.

Best
Mike

Hi Mike,

You are right, I made a mistake. I was pulling in another file that was already processed with a different program. I apologize for the confusion—this was my fault.

Best regards,
Ahmed


Hi Mike,

After looking into the issue, it seems that KNIME may not be capable of handling this type of file properly. Do you have any suggestions for alternative software that might be more suitable for this task?

Additionally, do you know of any freelancers who could assist with this? Your recommendations would be greatly appreciated.

Looking forward to your thoughts.

Best regards,
Ahmed

@Ahmed_Kadhim I am not sure about the nature of the tables in the PDFs, i.e. whether they are pictures or a separate structure. You could try these approaches:


Hi @mlauber71

The issue is that I’m not fully familiar with all the functionalities of KNIME as I’m still a new user and don’t have enough knowledge about how to use it effectively. The tasks you’ve asked for seem to require someone with more expertise in KNIME.

I’m not sure if you are an expert with the skills to help with this, but I could send you the PDF file. You might be able to assist with it.

Let me know if you can help or if there’s someone else you would recommend.

Best regards,
