We’re trying to read in parquet files with sensor data. It’s a mix of analog readings and true/false values. Each parquet file contains 30,000 rows. It is taking the reader over 3 minutes to read each file. This is a problem because we are receiving new files every minute. Is there a way to speed up the parquet file reading process?
Could you give us more context about the size of the files and the configuration of the machine (RAM, SSD)? Are they loaded via a remote connection or a cloud drive?
Our smallest files are 3MB. A file that size takes about 1 minute 20 seconds to read.
28 cores Intel® Xeon® Gold 5120 CPU @ 2.20GHz
512GB of RAM.
KNIME version 4.0.2
Files are coming in over the network via two 10GbE connections from a NAS that is a few rows away in the data center.
How can I configure the Parquet Reader to read off my KNIME Server? I’m using the Analytics Platform on my desktop and don’t see an option to point the reader to the KNIME Server. Whenever I configure the reader, it only lets me pick files on my local desktop.
The KNIME Server is installed on a Linux OS on bare metal, so it has access to all the resources I mentioned above.
My impression was that there have been certain challenges with Parquet and KNIME in the past. It would be great if there could be a more stable environment, especially since Parquet is also useful as an internal storage format for KNIME.
I would not call it a known issue, but there have been some reports of problems with Parquet and KNIME before, and I had some strange issues that I was not able to reproduce systematically.
Also, I had the impression that I ran into problems on my Mac when I used Parquet as the internal storage format (the column store approach with Parquet is still marked “experimental”).
Today I briefly had the impression that one Parquet file I used had misplaced some columns, but I have not investigated further since a colleague was able to use it without issue.
Hey @stu_marroquin,
we are eager to reproduce and work on this issue.
Could you give me a bit more detail about the files: what is the schema of the Parquet file? How many columns do you have, and of what type? Or might it even be possible to share the file and reader settings with us? Did you use the default type mapping?
Column headers: timestamp, sensor_name
Columns contain timestamp and sensor reading, either boolean or float.
Each file contains 30,000 rows, and each file can contain from 600 to 2,500 columns (possibly more).
File size varies from 3MB to 12MB. Files come in at a rate of 1 file per minute, which is why it’s critical to get the files read in under 30 seconds.
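For illustration, here is a small sketch of that layout using the R arrow package (the sensor names and values are made up, and the real files have far more columns):

```r
# Rough sketch of the file layout: one timestamp column plus many sensor
# columns holding either float or boolean readings (names/values made up).
library(arrow)

n_rows <- 30000
example <- data.frame(
  timestamp     = as.POSIXct("2020-01-01", tz = "UTC") + seq_len(n_rows),
  sensor_temp_1 = runif(n_rows, 20, 80),   # analog reading (float)
  sensor_flow_2 = runif(n_rows, 0, 5),     # analog reading (float)
  sensor_door_3 = sample(c(TRUE, FALSE), n_rows, replace = TRUE)  # true/false
)

# The real files have 600-2,500 such sensor columns and are 3-12MB each.
write_parquet(example, "example_layout.parquet")
```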
Any help you can provide would be greatly appreciated.
One workaround you could test is to use the R package arrow to import the data into R first and pass it from there to KNIME. I have constructed a small workflow to demonstrate how that could be done.
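As a rough sketch of the idea (not the exact workflow), the R side could look something like this, assuming an R Source (Table) node and a placeholder file path:

```r
# Sketch: read the Parquet file with arrow inside an R Source (Table) node
# and hand the resulting data frame to KNIME via knime.out.
library(arrow)

# Placeholder path - point this at the file on the NAS / server share.
parquet_path <- "/data/sensors/example.parquet"

# read_parquet() returns the file as a data frame.
sensor_data <- read_parquet(parquet_path)

# knime.out is what the KNIME R scripting nodes pick up as the output table.
knime.out <- sensor_data
```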
Hey,
one first step would be to switch to KNIME 4.1: we made some changes regarding the type mapping that resulted in noticeable performance gains. I tried to build a table similar to your use case:
a 30MB file with 2,500 columns and 20,000 rows.
In 4.0, reading it from local disk took 111,121 milliseconds, in contrast to 26,585 milliseconds in 4.1. So the first step would be to try 4.1. However, this might still not be optimal, and I will have a deeper look into the performance issues.
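If anyone wants to reproduce a comparable test, a file of roughly that shape could be generated, for example, with the R arrow package (arbitrary column names and random values; purely random data compresses poorly, so the resulting file will come out considerably larger than 30MB):

```r
# Build a wide test table roughly matching the benchmark shape:
# 20,000 rows x 2,500 double columns, written out as a Parquet file.
library(arrow)

n_rows <- 20000
n_cols <- 2500

test_table <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
names(test_table) <- paste0("col_", seq_len(n_cols))  # arbitrary names

# Random doubles hardly compress, so this file will be much bigger than 30MB.
write_parquet(test_table, "benchmark_test.parquet")
```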
But for now I recommend trying the current KNIME version.
Best regards,
Mareike
Thank you, Mareike. I have upgraded to KNIME 4.1 and it has improved significantly. I am able to open files in less than a minute now. Obviously the time varies depending on the number of columns, but the same data in CSV format opens almost instantly.
Thanks,
Stu
I upgraded to the latest version of the KNIME Analytics Platform and performance improved slightly. We’d like to run the entire workflow from the KNIME Server; however, we can’t get Parquet files to read directly from it. Even if the workflow is on the KNIME Server, it still only lets us point to our PC’s local drives in the Parquet Reader configuration. It works fine if we use the regular File Reader node, but not the Parquet Reader. As a result, we’re unable to fully test the performance of our workflow using the KNIME Server.
Anyone have any thoughts on this?