I have seen several posts with a similar question, but was not able to find a solution.
We are using an external service called Mindee, which processes documents and sends the extracted data back.
Has anyone been able to get a working solution in the meantime?
Another way I tried was with the new Python nodes, but the error message I received was:
Execute failed: TypeError: expected str, bytes or os.PathLike object, not ArrowSourceTable
Source Example:
from mindee import Client, PredictResponse, product
# Init a new client
mindee_client = Client(api_key="my-api-key-here")
custom_endpoint = mindee_client.create_endpoint("my_endpoint", "my_user")
# Load a file from disk
input_doc = mindee_client.source_from_path("/path/to/the/file.ext")
# Load a file from disk and parse it.
# The endpoint name must be specified since it cannot be determined from the class.
result: PredictResponse = mindee_client.parse(product.CustomV1, input_doc, endpoint=custom_endpoint)
# Print a brief summary of the parsed data
print(result.document)
When I use the os module in the node to get the files, the following error occurred:
KnimeUserError: Output table "0" must be of type knime.api.Table or knime.api.BatchOutputTable, but got <class 'mindee.parsing.common.document.Document'>
So I need to get result.document into a KNIME table.
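One way to do that is to first flatten the extracted fields into a Pandas DataFrame and then hand that back as the node's output table. The sketch below assumes the field names and values have already been pulled out of the Mindee result into a plain dict (how exactly depends on your Mindee SDK version); the KNIME part at the end uses the current knime.scripting.io API:

```python
import pandas as pd

# Hypothetical helper: turn a dict of extracted fields into a one-row
# DataFrame (one column per field, values as strings).
def document_to_dataframe(fields: dict) -> pd.DataFrame:
    return pd.DataFrame([{name: str(value) for name, value in fields.items()}])

# Example with made-up field names - replace with the fields your
# custom Mindee endpoint actually returns.
df = document_to_dataframe({"invoice_number": "12345", "total": "99.90"})

# Inside a KNIME Python Script node you would then do:
# import knime.scripting.io as knio
# knio.output_tables[0] = knio.Table.from_pandas(df)
```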
Thanks for your reply.
I adapted the Python node and now have a Pandas DataFrame, but currently I'm not able to pass just one file through: input_doc = mindee_client.source_from_path("/path/to/the/file.ext")
for every_file in files_to_process.iterrows():
    file_to_check = Path(str(files_to_process.iloc[[i]]))
    print(file_to_check)
    i += 1
Then when I execute input_doc = mindee_client.source_from_path(file_to_check),
the following error occurs:
[Errno 2] No such file or directory: ' Path\nRow0 G:\\Meine Ablage\\Projekte\\KNIME...'
Traceback (most recent call last):
File "<string>", line 23, in <module>
File "C:\Users\Sven\.conda\envs\Mindee\lib\site-packages\mindee\client.py", line 460, in source_from_path
input_doc = PathInput(input_path)
File "C:\Users\Sven\.conda\envs\Mindee\lib\site-packages\mindee\input\sources.py", line 250, in __init__
self.file_object = open(filepath, "rb") # pylint: disable=consider-using-with
FileNotFoundError: [Errno 2] No such file or directory: ' Path\nRow0 G:\\Meine Ablage\\Projekte\\KNIME...'
file_to_check, the variable I would like to use, looks as follows:
FileNotFoundError: [Errno 2] No such file or directory: ' Path\nRow0 G:\\Meine Ablage\\Projekt\\KNIME...'
It does look like 'file_to_check' contains both Row0 and the path in one cell. This is what gets passed to the code, and it will not work because 'Row0' is not part of the path.
Where is this information coming from? Is it possible to provide just the path?
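The "Row0" prefix comes from pandas itself: the KNIME row ID becomes the DataFrame index, and str() on a one-row slice renders the whole frame including header and index. A minimal reproduction (not the real KNIME table, just a stand-in with the same shape):

```python
import pandas as pd

# Stand-in for the KNIME input: row ID "Row0" becomes the pandas index.
files_to_process = pd.DataFrame(
    {"Path": [r"G:\Meine Ablage\file1.pdf"]}, index=["Row0"]
)

bad = str(files_to_process.iloc[[0]])    # one-row DataFrame -> header + index leak in
good = files_to_process["Path"].iloc[0]  # just the cell value, a plain string
```

With this, bad contains "Path" and "Row0" just like the FileNotFoundError above, while good is only the path and can be handed to source_from_path.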
@mlauber71 the information is coming from a List Files/Folders node.
My goal is:
Send every file to the api endpoint
Retrieve the OCR data of the file
Use the data in the following nodes
It looks like this Row0 is the index. But it is also there when I do: file_to_check = Path(str(files_to_process.Path.iloc[[i]])), where Path is the Path column of the List Files/Folders node,
or file_to_check = Path(str(files_to_process.Location.iloc[[i]])), where Location is a string from the Path to String node.
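That is expected: with double brackets, .iloc[[i]] on a column still returns a pandas Series, and str() on a Series renders the index too. Single brackets return the scalar value. A small sketch of the difference:

```python
import pandas as pd

# Stand-in for the Path column with KNIME-style row IDs as index.
s = pd.Series([r"G:\a.pdf", r"G:\b.pdf"], index=["Row0", "Row1"], name="Path")

as_series = s.iloc[[0]]  # Series of length 1 -> str() still shows "Row0"
as_scalar = s.iloc[0]    # plain Python string, safe for source_from_path
```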
In the Python Node I use: files_to_process = input_table_1.copy()
Currently used Code:
from mindee import Client, PredictResponse, product
from pathlib import Path
import os
# Copy input to output
files_to_process = input_table_1.copy()
# Init a new client
mindee_client = Client(api_key="....")
custom_endpoint = mindee_client.create_endpoint("endpoint", "user")
i = 0
for every_file in files_to_process.iterrows():
    print(every_file)
    file_to_check = Path(str(files_to_process.iloc[[i]]))
    print(file_to_check)
    i += 1
# Load a file from disk
input_doc = mindee_client.source_from_path(file_to_check)
What does input_table_1 look like?
iterrows normally creates tuples. I would assume every_file[1] contains the data? But it's probably easier to debug with a sample so people in the forum can help.
Without seeing the data I can only guess, but you could try to use
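something along these lines. iterrows yields (row_id, row) pairs, so unpacking the tuple and taking the row's Path cell avoids the index leaking in entirely; this sketch assumes a "Path" column like the one the List Files/Folders node produces:

```python
import pandas as pd

# Stand-in for input_table_1 with KNIME-style row IDs as index.
files_to_process = pd.DataFrame(
    {"Path": [r"G:\a.pdf", r"G:\b.pdf"]}, index=["Row0", "Row1"]
)

paths = []
for row_id, row in files_to_process.iterrows():
    paths.append(row["Path"])  # plain string, no "Row0" prefix
    # input_doc = mindee_client.source_from_path(row["Path"])  # one call per file
```

This also removes the need for the manual i counter.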