OCR - Tess4j Plugin does not work on Analytics Platform [4.3.3 and 4.4] - macOS Catalina 10.15.7

patrick-roch · July 29, 2021, 1:32pm

Hello Knime Nation!

I encountered the following error while trying to test and run the Tess4j plugin:

ERROR Tess4J               0:1209     Error initializing Tesseract.
ERROR Tess4J               0:1209     Execute failed: Unable to load library 'tesseract': Native library (darwin/libtesseract.dylib) not found in resource path (/Applications/KNIME 4.4.0.app/Contents/MacOS//../Eclipse/plugins/org.eclipse.equinox.launcher_1.6.100.v20201223-0822.jar)

I searched the forum and noticed that a few folks encountered the same error a while back, but there wasn’t any resolution.
Note: I have tesseract installed, and I can run it from the terminal without any issue.

If you’ve experienced this particular issue, how did you go about resolving it?

Any suggestions or tips are much appreciated.

Best,

Patrick

sjporter · July 29, 2021, 2:40pm

Hey @patrick-roch,

This is a known issue, and an issue has been opened on our backlog for it.

As for a quick fix, you can use a Python Script node along with the pytesseract package to perform the same task. Check out the contents of this Python Script node:

import base64
from io import BytesIO

import pytesseract
from PIL import Image

# copy input table to output table
output_table_1 = input_table_1.copy()

# decode the base64 string into raw image bytes
encoded_image = base64.decodebytes(input_table_1['Image'].iloc[0].encode())

# create a byte stream from the image and convert it to the Image type (pillow)
stream = BytesIO(encoded_image)
image = Image.open(stream).convert('RGBA')
stream.close()

# define the command for starting tesseract
pytesseract.pytesseract.tesseract_cmd = r"./data/tesseract"

# extract the text from the image
try:
	output_table_1['Image Text'] = pytesseract.image_to_string(image)
except:
	output_table_1['Image Text'] = None

# save the image to the node outport
output_image_1 = encoded_image

In the example above, I’m decoding a PNG that was sent to my workflow via REST as a base64 encoded string. You may not need to perform the same decoding and type conversion steps; it’s just an example. Once the image is in a usable state, I’m using pytesseract to extract the text.

Please note - the following line is critical:

# define the command for starting tesseract
pytesseract.pytesseract.tesseract_cmd = r"./data/tesseract"

The line of code above points Python to the tesseract executable, which needs to be downloaded to your PC (see this documentation for more details). In the example above I put tesseract inside the data/ directory within my workflow so that it could be easily referenced with a relative path. If you don’t have a valid executable for your operating system downloaded and referenced in this manner, pytesseract will fail to execute.
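
One way to catch a bad path before pytesseract fails is to resolve the executable up front. The sketch below is only an illustration: resolve_tesseract_cmd is a hypothetical helper (not part of pytesseract), and the candidate paths are simply the ones mentioned in this thread.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates):
    """Return the first candidate that is an existing, executable file.

    Falls back to searching PATH for 'tesseract'; returns None if nothing is found.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return os.path.abspath(path)
    # shutil.which returns None when the command is not on PATH
    return shutil.which("tesseract")


# Hypothetical candidate list; adjust to wherever tesseract lives on your machine.
cmd = resolve_tesseract_cmd(["./data/tesseract", "/usr/local/bin/tesseract"])
if cmd is None:
    print("tesseract executable not found")
else:
    print("using tesseract at", cmd)
    # pytesseract.pytesseract.tesseract_cmd = cmd
```

If this prints that nothing was found, the relative path is probably being resolved against a different working directory than expected, so an absolute path is the safer choice.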

Hope this helps!

Cheers,

@sjporter

patrick-roch · August 18, 2021, 12:34am

Thank you for the above steps and code sample. I’ve not been able to make it work. In my use case, I’m iterating over several png images from the Tika node output.
My python script runs without errors; however, there isn’t any text output. When I run Tesseract locally, I’m able to output text.
Here is my python script code:

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Copy input to output
output_table_1 = input_table_1.copy()


# Loop over every cell in the 'Img' column
for index,input_cell in input_table_1['Image'].iteritems():
	image = input_cell.array
	
	
# define the command for starting tesseract
pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"


# extract the text from the image

try:

	output_table_1['Image Text'] = pytesseract.image_to_string(image)

except:

	output_table_1['Image Text'] = None



# save the image to the node outport

output_image_1 = image

Any additional help is greatly appreciated.

Best,

Patrick

patrick-roch · August 19, 2021, 2:17pm

@sjporter I tried an approach similar to your implementation without success. See above.
Let me know if I’m missing anything on my end. I can’t extract any text when iterating over several jpg files.

sjporter · August 19, 2021, 4:47pm

Hey @patrick-roch,

Try replacing…

try:
	output_table_1['Image Text'] = pytesseract.image_to_string(image)
except:
	output_table_1['Image Text'] = None

with…

try:
	output_table_1['Image Text'] = pytesseract.image_to_string(image)
except:
	output_table_1['Image Text'] = 'whoops!'

My gut feeling is that you aren’t seeing text output because an exception is being raised and quietly handled by the try/except. If you see whoops! as the output then we can confirm.

Cheers,

@sjporter

Popo_Lino · August 19, 2021, 5:28pm

yeah, unfortunately a common error

patrick-roch · August 19, 2021, 5:54pm

@sjporter I can see the 'whoops!' all over.

What to do?

sjporter · August 19, 2021, 9:44pm

Great, at least we can pinpoint that the exception is occurring within your try/except block. Go ahead and output the exception message instead of whoops!, like so…

try:
    output_table_1['Image Text'] = pytesseract.image_to_string(image)
    output_table_1['Success'] = True
    output_table_1['Error Message'] = None
except Exception as e:
    output_table_1['Image Text'] = None
    output_table_1['Success'] = False
    output_table_1['Error Message'] = str(e)

Please keep in mind I’m writing the code above from memory, so it is untested.

patrick-roch · August 19, 2021, 10:56pm

@sjporter As per your suggestion above, I now get some interesting errors. It looks like there is an issue with the image data type?

patrick-roch · August 20, 2021, 4:23am

@sjporter Looking at the error message above, I’ve modified the Python script to read as follows:

import pytesseract
from PIL import Image

# Loop over every cell in the 'Img' column
for index, input_cell in input_table_1['Image_normalized'].iteritems():
	image = input_cell.array
	image = Image.fromarray(image[0:,0]).convert('RGBA')

# define the command for starting tesseract
pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"

# extract the text from the image
try:
    output_table_1['Image Text'] = pytesseract.image_to_string(image)
    output_table_1['Success'] = True 
    output_table_1['Error Message'] = None
except Exception as e:
    output_table_1['Image Text'] = None
    output_table_1['Success'] = False
    output_table_1['Error Message'] = str(e)
    
# save the image to the node outport
output_image_1= image

The output is below, but I still can’t extract the image text.

sjporter · August 20, 2021, 4:48pm

Hey @patrick-roch,

From the looks of it, the text in the image is sideways and it looks to be very low resolution(?).

The Tesseract OCR algorithm is only going to be able to do so much in terms of text extraction - there are preprocessing steps that may need to take place, and if the text is unreadable to the human eye it will almost certainly be unreadable to the algorithm.

Please check out this guide on KNIME Image Processing for more details & concepts: KNIME Image Processing | KNIME
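
As a rough illustration of the kind of preprocessing meant here, the sketch below rotates an image array in 90-degree steps and enlarges it with plain NumPy. It assumes the array is laid out as height x width (optionally with a trailing channel axis), which may not match what a KNIME image cell provides; rotate_and_upscale is a hypothetical helper, and nearest-neighbour enlargement is a crude stand-in for proper resampling.

```python
import numpy as np


def rotate_and_upscale(image, quarter_turns=1, factor=2):
    """Rotate an image array by 90-degree steps, then enlarge it.

    Enlargement repeats each pixel `factor` times along both image axes
    (nearest-neighbour), which is crude but dependency-free.
    """
    # np.rot90 rotates counterclockwise in the plane of the first two axes
    rotated = np.rot90(image, k=quarter_turns)
    enlarged = np.repeat(np.repeat(rotated, factor, axis=0), factor, axis=1)
    return enlarged


sample = np.array([[1, 2], [3, 4]], dtype=np.uint8)
print(rotate_and_upscale(sample).shape)  # (4, 4)
```

Whether one quarter turn or three is the right correction depends on which way the text is lying, so it is worth trying both on a single image before batching.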

Cheers,

@sjporter

patrick-roch · August 23, 2021, 5:07am

@sjporter Thank you for all your help. I’m now able to extract the text; the issue was the data type. In my use case, I could run Tesseract from the terminal without any pre-processing and get the desired output. In KNIME, on the other hand, I want to batch the images from a PDF, store them in an array, and iterate over them from the input node. To make Tesseract extract the text, image = input_cell.array had to be changed to image = input_cell.array.astype(np.uint8).
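
A note on that cast: a plain astype(np.uint8) wraps integer values outside 0..255 instead of clipping them. The sketch below shows a more defensive conversion. It is an illustration only; to_uint8 is a hypothetical helper, and treating float arrays with a maximum of at most 1.0 as normalized to 0..1 is an assumption about the input, not something confirmed in this thread.

```python
import numpy as np


def to_uint8(array):
    """Convert an image array to uint8 without wrap-around.

    Assumption: float arrays whose maximum is <= 1.0 are normalized to 0..1
    and get scaled to 0..255; everything else is clipped to 0..255.
    """
    array = np.asarray(array)
    if array.dtype == np.uint8:
        return array
    if np.issubdtype(array.dtype, np.floating) and array.size and array.max() <= 1.0:
        array = array * 255.0
    # round to the nearest integer, clip to the valid range, then cast
    return np.clip(np.rint(array), 0, 255).astype(np.uint8)


print(to_uint8(np.array([-5, 100, 300])))  # [  0 100 255]
```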

How can I output the extracted text for each individual image separately in the Extracted Text column?

See the image below: the text extracted from image 2 also shows up as the extracted text for image 1, which is not ideal.

Here is my updated code:

import pytesseract
from PIL import Image
import numpy as np
import pandas as pd

# Copy input to output
output_table_1 = input_table_1.copy()


# Loop over every cell in the 'Img' column
for index, input_cell in input_table_1['Image'].iteritems():
	# get image from cell
	image = input_cell.array.astype(np.uint8)


# define the command for starting tesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
    
   # extract the text from images
text = pytesseract.image_to_string(image)


try:
    
    output_table_1['Extracted Text'] = text
    output_table_1['Success'] = True
    output_table_1['Error Message'] = None

except Exception as e:
    output_table_1['Extracted Text'] = None
    output_table_1['Success'] = False
    output_table_1['Error Message'] = str(e)

sjporter · August 23, 2021, 5:38pm

Hey @patrick-roch,

The reason you’re seeing duplicated output is that you are reading an image in a for loop but processing it outside of the for loop, so you are only ever performing OCR on the contents of the last image read in the loop.

However, indenting the code highlighted in orange above won’t solve the problem. Your code needs a bit of refactoring to work in the way you’re expecting.

Here’s an example from GeeksforGeeks which demonstrates how to calculate a new column for a dataframe in a row-wise fashion: Create a column using for loop in Pandas Dataframe - GeeksforGeeks

Please also see the documentation for the pandas.DataFrame.apply function which is commonly used in scenarios such as this one: pandas.DataFrame.apply — pandas 1.3.2 documentation
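
To make the row-wise idea concrete, here is a minimal sketch using pandas.DataFrame.apply with axis=1. It is not tested inside a KNIME Python Script node; add_ocr_columns and fake_ocr are hypothetical names, and fake_ocr merely stands in for a call such as pytesseract.image_to_string so the example runs without Tesseract.

```python
import pandas as pd


def add_ocr_columns(table, image_column, ocr):
    """Return a copy of `table` with per-row OCR results appended.

    `ocr` is any callable that maps one image cell to a string.
    """
    def ocr_row(row):
        try:
            return pd.Series({"Extracted Text": ocr(row[image_column]),
                              "Success": True, "Error Message": None})
        except Exception as exc:  # record the failure for this row only
            return pd.Series({"Extracted Text": None,
                              "Success": False, "Error Message": str(exc)})

    # axis=1 calls ocr_row once per row; the returned Series become columns
    results = table.apply(ocr_row, axis=1)
    return pd.concat([table, results], axis=1)


# Stand-in for the real OCR call, so the sketch runs anywhere.
def fake_ocr(cell):
    if cell == "bad":
        raise ValueError("bad image")
    return cell.upper()


demo = pd.DataFrame({"Image": ["ok", "bad"]})
result = add_ocr_columns(demo, "Image", fake_ocr)
print(result)
```

Because the try/except sits inside the per-row function, one unreadable image marks only its own row as failed instead of overwriting the whole column.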

Cheers,

@sjporter

patrick-roch · August 25, 2021, 2:52am

@sjporter, I figured as much. Thank you for the additional resources and pointers. I’m not a Python expert yet; I’m merely a “bricoleur”. I was hoping to get this script working so that I could deploy it to our KNIME Server and enable the average Joe to extract text, but it looks like it will take more time than I expected.

Do you know when the dev team will fix the Tess4j node for macOS? Like every other node in KNIME, I expected it to just work.
I currently have more than 1000 PDF docs that we need to extract text from.

Here is the script that worked with one document, but it is unstable: sometimes I get Execute failed: An error occurred during serialization. See log for errors. and sometimes it works. It needs to be refactored, and any help from an expert in this forum is greatly appreciated.

import pytesseract
from PIL import Image
import numpy as np
import pandas as pd
from skimage.util import img_as_ubyte

# Copy input to output
output_table_1 = input_table_1.copy()

output_column = []
# Loop over every cell in the 'Img' column

for index, input_cell in input_table_1['Image'].iteritems():
	# get image from cell
	image = input_cell.array.astype(np.uint8)
	# define the command for starting tesseract
	pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
    # extract the text from images
	result = pytesseract.image_to_string(image)
	output_column.append(result)
	try:
		output_table_1['Extracted Text'] = output_column
		output_table_1['Success'] = True
		output_table_1['Error Message'] = None

	except Exception as e:
		output_table_1['Extracted Text'] = None
		output_table_1['Success'] = False
		output_table_1['Error Message'] = str(e)
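
For readers following along, one possible refactoring of the loop above is sketched here: the OCR call is wrapped per cell, results are collected in three aligned lists, and the output columns are assigned once after the loop rather than on every iteration. This is a sketch under assumptions, not a verified fix; ocr_cells and fake_ocr are hypothetical names, and whether it also removes the intermittent serialization error is not known.

```python
def ocr_cells(cells, ocr):
    """Run `ocr` once per cell and return three aligned lists.

    `ocr` is any callable mapping one image to a string. A failure on one
    cell is recorded for that cell only and does not stop the loop.
    """
    texts, successes, errors = [], [], []
    for cell in cells:
        try:
            text = ocr(cell)
        except Exception as exc:
            texts.append(None)
            successes.append(False)
            errors.append(str(exc))
        else:
            texts.append(text)
            successes.append(True)
            errors.append(None)
    return texts, successes, errors


# Stand-in OCR function so the sketch runs without Tesseract installed.
def fake_ocr(cell):
    if cell is None:
        raise ValueError("empty image")
    return "text from " + cell


texts, successes, errors = ocr_cells(["page 1", None, "page 3"], fake_ocr)

# In the KNIME node, the same pattern would look roughly like this (untested):
# pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
# texts, successes, errors = ocr_cells(
#     input_table_1['Image'],
#     lambda cell: pytesseract.image_to_string(cell.array.astype(np.uint8)))
# output_table_1['Extracted Text'] = texts
# output_table_1['Success'] = successes
# output_table_1['Error Message'] = errors
```

Setting tesseract_cmd once before the loop is enough; it does not need to be reassigned for every image.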

djarrett · September 16, 2021, 3:29am

@patrick-roch did you get this to work?

patrick-roch · September 17, 2021, 3:36pm

@djarrett Thanks for the follow-up. I did not get it to work.

djarrett · September 17, 2021, 8:04pm

Ok. Thanks. I am going to keep fiddling to see if I can get it to work. I have a similar use case to yours.

system · June 2, 2023, 9:11pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.