Hello KNIME Community,
I am reading a 4.6 GB CSV file from an S3 bucket, in two ways.
First, using the S3 Connection node and the File Reader node; this takes around 15 minutes.
Second, from a Python Source node using the following code:
import time
import s3fs
import pandas as pd

start = time.time()

# Replace the <...> placeholders with your own values.
s3_conn_obj = s3fs.S3FileSystem(key="<your_access_key>", secret="<your_secret_key>")

def download_file(source, destination):
    s3_conn_obj.get(source, destination)

def upload_file(source, destination):
    s3_conn_obj.put(source, destination)

dest_path = "<local_path>"
download_file("<source_path>", dest_path)

t1 = time.time()
print("Time to download")
print(t1 - start)

output_table = pd.read_csv(dest_path)

t2 = time.time()
print("Time to read")
print(t2 - t1)

print("Total time")
print(t2 - start)
+++++++++++++++++Print statement results+++++++++++++
INFO Python Source 0:2 Time to download
INFO Python Source 0:2 110.61795043945312
INFO Python Source 0:2 Time to read
INFO Python Source 0:2 90.32301354408264
INFO Python Source 0:2 Total time
INFO Python Source 0:2 200.94096398353577
So I am able to download the file and parse it into a pandas DataFrame in ~200 seconds, but the overall node execution still takes more than 15 minutes. I have tried all three serialization library options (CSV, Apache Arrow, Flatbuffers Column Serialization), and RAM consumption is also quite high, around 14-15 GB.
Any help will be appreciated.
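One thing that may reduce the RAM pressure, independent of the KNIME serialization step, is to read the CSV in chunks with explicit dtypes instead of letting pandas infer types over the whole 4.6 GB file at once. A minimal sketch of this idea (the column names, dtypes, and chunk size below are only illustrative, not from the original workflow):

```python
import io
import pandas as pd

def read_csv_in_chunks(path_or_buf, chunksize=1_000_000, dtype=None):
    """Read a large CSV in fixed-size row chunks and concatenate.

    Passing an explicit dtype mapping avoids pandas' type inference
    keeping wide object columns in memory during the read.
    """
    chunks = pd.read_csv(path_or_buf, chunksize=chunksize, dtype=dtype)
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory demonstration (placeholder data):
csv_data = io.StringIO("a,b\n1,x\n2,y\n3,z\n")
df = read_csv_in_chunks(csv_data, chunksize=2, dtype={"a": "int32", "b": "string"})
print(len(df))  # 3 rows, reassembled from two chunks
```

For a real run you would pass `dest_path` instead of the in-memory buffer and a dtype map matching your actual columns; whether this helps depends on how much of the 14-15 GB is the DataFrame itself versus KNIME's serialization overhead.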
P.S.: KAP (KNIME Analytics Platform) is running on an EC2 machine in the AWS environment.
Thanks,
Wizard Dk