I find it kind of strange that some quite popular nodes, like the PDF Parser node (1300+ workflows on the hub use it), are still not able to read their input relative to the workflow / workflow data area.
Here is a post from June 2020 (!) that mentions it is in the pipeline:
However, the feature still does not seem to be there.
Trying the workaround mentioned in that very post also does not work with the PDF Parser node; it throws an error saying that the path provided by the flow variable (a string??) is not valid.
With the rise of agentic AI (and the growing need to store e.g. PDFs online so their information can be put into vector stores), it is kind of annoying that this very basic feature, which is available in lots of other nodes, is missing from both the PDF Parser and Tika Parser nodes.
Any recommendation on how to solve this annoying problem?
@kowisoft I believe the issue is that you're passing the full file path instead of just the directory into the Directory field. Try providing only the folder path, and the PDF Parser should work as expected.
For my specific use case I now use a combination of local (creating the vector store, then uploading it) and remote on the KBH (a data app that lets you chat with the document).
But this is somehow not sustainable, as I would envision a solution where the user could upload any document in a (different) data app.
@kowisoft Sorry, I completely missed the Business Hub part earlier.
Unfortunately, this is still not possible with the PDF Parser node. The solution above only works with local files.
On Business Hub, the PDF Parser can access files only inside the workflow data area, e.g.:
knime://knime.workflow/data/
but not paths like:
knime://knime.workflow/../data/
We have an internal ticket AP-14476 to add a filesystem connection port to this node so it can properly support Business Hub execution.
As a workaround, you can transfer the files to a temporary folder inside the workflow's data area and access them from there.
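A minimal sketch of that workaround in Python, assuming the workflow data area resolves to a plain filesystem path (the `data_area` argument and the `tmp_uploads` folder name below are illustrative, not part of any KNIME API):

```python
import shutil
from pathlib import Path

def stage_into_data_area(source_file: str, data_area: str) -> Path:
    """Copy a file into the workflow's data area so that nodes like the
    PDF Parser can then read it via knime://knime.workflow/data/."""
    # hypothetical staging folder inside the data area
    target_dir = Path(data_area) / "tmp_uploads"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # preserves file metadata
    return target
```

In a KNIME Python Script node you would substitute the actual location of the workflow data area for `data_area`.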
I did a tutorial on how to set up a RAG pipeline for a chatbot on the hub.
I set up a structure where new PDFs / PPTs / Excel files etc. could be saved to a specific folder on SharePoint.
A RAG ingestion workflow would then parse the data. (I used a Python script that uses Microsoft's MarkItDown package, which is incredibly flexible in turning lots of different file formats into markdown, which can then easily be chunked, vectorised and inserted into an existing vector store.) It would then save / overwrite the old vector store model that lives "relative to" the current workflow.
A second workflow (same folder as the RAG ingestion workflow) is then the chat app, deployed as a Data App on Business Hub / Team Plan or similar.
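The conversion step of the ingestion workflow could be sketched roughly like this. I'm hedging here: the `MarkItDown().convert(path).text_content` call follows the package's documented usage, but the fallback branch and the `to_markdown` helper name are my own illustration:

```python
from pathlib import Path

try:
    # Microsoft's markitdown package; API as documented in its README
    from markitdown import MarkItDown
    HAVE_MARKITDOWN = True
except ImportError:
    HAVE_MARKITDOWN = False

def to_markdown(path: str) -> str:
    """Convert a document (PDF, PPT, Excel, HTML, ...) to markdown text.
    Falls back to a plain text read when markitdown is not installed,
    e.g. when Conda environment propagation is not available."""
    if HAVE_MARKITDOWN:
        return MarkItDown().convert(path).text_content
    return Path(path).read_text(encoding="utf-8")
```

The resulting markdown string is what the rest of the pipeline (chunking, embedding, vector store update) operates on.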
I came up with that structure when thinking about how I would have done something like that in the infrastructure of one of my past employers… Maybe your infrastructure is not that different.
Links to the folder with both workflows:
Update:
I also tried something more hacky: building a path using information from the Extract Context Properties node. The issue is that the absolute path to the workflow, when executed on the Hub, points to the job that was created, so then again it cannot access anything relative to the original workflow…
Ok, this did the job for me. I could now envision enhancing part 1 of my workflow into a data app that stores a PDF, via a simple file upload, in the workflow's data area.
Wow @MartinDDDD - I was aware of the video but couldn't make the connection. This is exactly what I needed (except for the Conda environment, as this is not possible within my environment). Very, very cool, thanks a lot.
Glad to hear! If you can't use Conda environment propagation, then you'll "just" miss out on the great functionality of markitdown… all the other pieces, like moving files from SharePoint to the workflow data area, should still work (and you can then point the PDF / Tika Parser there for processing…).
Quick question: when using AI privately, I basically also convert nearly everything to .md files. Have you seen a significant increase in performance compared to other file formats? Does the AI understand it "better" (for lack of a better term)?
Late to the party, but nice to hear that you found a working solution. Nevertheless, I totally agree that this node, as well as some others, should receive an update to work with KNIME URLs more easily. There is already a ticket for it (internal reference: AP-15946) and I will try to push its priority.
I can't prove it for sure, but my rationale is that LLMs have been trained on a lot of GitHub content, which contains many .md files. .md files also generally provide good structure.
I think the main benefit, and why I chose markitdown over the Tika Parser etc., is simply the flexibility you get in terms of the document types you can reliably parse. PDF, PPT, Excel, CSV, HTML, JSON, you name it: markitdown doesn't care and converts everything reliably. If you have smart graphics in PowerPoint, it even has logic to understand how the shapes overlap, plus information about text boxes / labels, and when you connect it with a vision LLM it even processes any images and provides a description.
Once it is in markdown, it can easily be chunked, vectorised and then dumped into your vector store…
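As an illustrative sketch of that chunking step (the function name and the heading/size-based strategy are my own choices, not from the thread), one nice property of markdown is that headings give you natural chunk boundaries before embedding:

```python
def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown text into chunks, starting a new chunk at each
    heading and whenever max_chars would otherwise be exceeded."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines():
        is_heading = line.startswith("#")
        # flush the running chunk at a heading or when it gets too large
        if current and (is_heading or size + len(line) > max_chars):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the newline
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each returned chunk can then be embedded and upserted into the vector store of your choice.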