KNIME workflow taking too much time to read from S3 bucket

Hello KNIME Community,

I am reading a 4.6 GB CSV file from an S3 bucket in two ways.

First, I read it using the S3 Connection node and the File Reader node; this takes around 15 minutes.
Second, I read it from a Python Source node using the following code.

import time

import pandas as pd
import s3fs

start = time.time()
# Credentials are placeholders; replace them with your own values.
s3_conn_obj = s3fs.S3FileSystem(key="<your_access_key>", secret="<your_secret_key>")


def download_file(source, destination):
    # Copy an object from S3 to the local file system.
    s3_conn_obj.get(source, destination)


def upload_file(source, destination):
    # Copy a local file to S3 (not used below).
    s3_conn_obj.put(source, destination)


dest_path = "<local_path>"
download_file("<source_path>", dest_path)
t1 = time.time()
print("Time to download")
print(t1 - start)

# Parse the downloaded file into a pandas DataFrame.
output_table = pd.read_csv(dest_path)
t2 = time.time()
print("Time to read")
print(t2 - t1)
print("Total time ")
print(t2 - start)

+++++++++++++++++Print statement results+++++++++++++
INFO Python Source 0:2 Time to download
INFO Python Source 0:2 110.61795043945312
INFO Python Source 0:2 Time to read
INFO Python Source 0:2 90.32301354408264
INFO Python Source 0:2 Total time
INFO Python Source 0:2 200.94096398353577

So I am able to download the file and parse it into a pandas DataFrame in ~200 seconds,
but the overall node execution takes more than 15 minutes. I have tried all 3 options for the serialization library (CSV, Apache Arrow, Flatbuffers Column Serialization), and RAM consumption is also quite high, around 14-15 GB.
Any help will be appreciated :slight_smile:
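
To tell how much of the time and RAM comes from pandas itself rather than from handing the table back to KNIME, the parse step can be measured on its own. Below is a minimal sketch, not the code used in the workflow above: `read_csv_in_chunks` and the tiny generated sample file are hypothetical stand-ins for the real 4.6 GB file, and the chunk size is an arbitrary assumption. Reading in chunks bounds the parser's working memory, but the final DataFrame still holds the full table.

```python
import os
import tempfile
import time

import pandas as pd


def read_csv_in_chunks(path, chunksize=100_000):
    """Parse a CSV chunk by chunk; return the table, parse time, and size in MB."""
    start = time.time()
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    chunks = [chunk for chunk in pd.read_csv(path, chunksize=chunksize)]
    df = pd.concat(chunks, ignore_index=True)
    elapsed = time.time() - start
    # deep=True also counts the memory held by string (object) columns.
    mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    return df, elapsed, mem_mb


# Hypothetical stand-in for the real file: a small CSV written to a temp path.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n")
    for i in range(1000):
        tmp.write(f"{i},{i * 0.5}\n")
    sample_path = tmp.name

df, elapsed, mem_mb = read_csv_in_chunks(sample_path, chunksize=250)
print(f"rows={len(df)} parse_time={elapsed:.3f}s memory={mem_mb:.3f} MB")
os.remove(sample_path)
```

If the reported DataFrame size is far below the 14-15 GB seen during execution, the extra memory is more likely spent outside the pandas parse step.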

P.S. KNIME Analytics Platform (KAP) is running on an EC2 machine in AWS.

Thanks,
Wizard Dk

Using the File Reader node for large CSV files is quite slow and RAM-hungry compared to other solutions such as Alteryx, Python, or R ingestion.

+1 for help there

Hello,

How about using the Simple File Reader node, if possible?

Br,
Ivan

Hi @ipazin
Thanks for your help. I was working on version 4.1; I have now updated to 4.2 and used the Simple File Reader as you suggested. It works like a charm :grinning:
The same file can now be downloaded in 6-8 minutes, and RAM consumption is limited to 2 GB.
From now on I will keep things SIMPLE :wink:
@Luca_Italy you can give it a try too; it might help in your case as well.

Thanks,
Wizard Dk


Hello @Wizard_dk,

glad to hear that!

Br,
Ivan
