We’ve encountered a performance bottleneck when using KNIME to read Parquet data from Azure Blob Storage into our local environment, and we’d like to ask if anyone has optimization suggestions:
Data Scale: 33 Parquet files in total, under 100 MB combined, about 900,000 rows
Current Issue: Reading the data locally takes about 1 hour; even when executed on KNIME Hub, it still takes dozens of minutes
Temporary Workaround: For now we can only shorten the wait by reducing the data volume, but this is not a long-term solution
I wonder if anyone has performance optimization suggestions for Azure Blob + Parquet reading? For example, node configuration adjustments, parallel reading strategies, or other efficient reading methods?
Hi,
ActionAndi
Regarding the Java heap space configuration, we have followed your instructions and updated the -Xmx setting in the knime.ini file, followed by a full restart of KNIME. In our current tests, increasing the heap size did not lead to a noticeable improvement in the Parquet reading performance, and the issue still persists.
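For reference, the change we made in knime.ini looked roughly like this (the 8g value is just an example; the actual figure should be adjusted to the machine's available RAM, and the -Xmx line must come after -vmargs):

```ini
-vmargs
-Xmx8g
```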
At the moment, the Parquet files are read directly from Azure Blob Storage via a network path, and based on our observations so far, storage access latency along that path is likely a significant contributor to the overall performance bottleneck. As a next step, following your suggestion, we will copy the Parquet files to a local temporary directory and read them from there; we are currently requesting the necessary permissions to download the files. Once the comparison between local and remote reads is complete, we will share the results to confirm whether the issue is indeed related to the connection or network latency.
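To keep the local-vs-remote comparison apples-to-apples, a minimal timing sketch like the one below can wrap each read (the read function is a placeholder for whatever actually loads the data, e.g. a local-file read versus a blob read):

```python
import time

def timed_read(label, read_fn):
    """Run read_fn once and report its wall-clock duration.

    read_fn is a placeholder: pass in whatever callable performs
    the actual read (local file vs. blob storage path).
    """
    start = time.perf_counter()
    result = read_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed
```

Timing the same 33 files through the same harness, once locally and once remotely, separates the transfer/latency cost from the Parquet decoding cost.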
I’d also try connecting via DuckDB’s JDBC driver and operating on the data with DB nodes (or just a DB Reader plus KNIME native nodes if you prefer).
You can put all the config statements in the DB SQL Executor node. Then you’ll likely know whether the issue is with how KNIME implements the connectors or whether it is indeed related to the network.
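As a sketch of what those config statements might look like, assuming DuckDB’s azure extension is used (the container, path, and connection string below are placeholders):

```sql
-- Run inside the DB SQL Executor node; fill in the placeholders.
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = '<connection-string>';

-- One glob reads all 33 files; DuckDB scans Parquet row groups in parallel.
SELECT count(*) FROM read_parquet('az://<container>/<path>/*.parquet');
```

If this query is fast but the KNIME Parquet Reader is slow over the same connection, that points at the connector; if both are slow, the network path is the more likely culprit.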