Hello KNIME Community,
I am reading a 4.6 GB CSV file from an S3 bucket, in two ways.
First, using the S3 Connection node and the File Reader node; this takes around 15 minutes.
Second, from a Python Source node using the following code:
import time
import s3fs
import pandas as pd

start = time.time()

# Replace the <...> placeholders with your own values.
s3_conn_obj = s3fs.S3FileSystem(key="<your_access_key>", secret="<your_secret_key>")

def download_file(source, destination):
    s3_conn_obj.get(source, destination)

def upload_file(source, destination):
    s3_conn_obj.put(source, destination)

dest_path = "<local_path>"
download_file("<source_path>", dest_path)

t1 = time.time()
print("Time to download")
print(t1 - start)

output_table = pd.read_csv(dest_path)

t2 = time.time()
print("Time to read")
print(t2 - t1)

print("Total time")
print(t2 - start)
+++++++++++++++++Print statement results+++++++++++++
INFO Python Source 0:2 Time to download
INFO Python Source 0:2 110.61795043945312
INFO Python Source 0:2 Time to read
INFO Python Source 0:2 90.32301354408264
INFO Python Source 0:2 Total time
INFO Python Source 0:2 200.94096398353577
So I am able to download the file and parse it into a pandas DataFrame in ~200 seconds, but the overall node execution still takes more than 15 minutes. I have tried all three serialization library options (CSV, Apache Arrow, Flatbuffers Column Serialization), and RAM consumption is also quite high, around 14-15 GB.
Any help will be appreciated.
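One thing that may reduce the RAM pressure, independent of the KNIME serialization step, is to read the CSV in chunks with explicit dtypes instead of letting pandas infer types over the whole 4.6 GB file at once. A minimal sketch of this idea (the column names, dtypes, and chunk size below are only illustrative, not from the original workflow):

```python
import io
import pandas as pd

def read_csv_in_chunks(path_or_buf, chunksize=1_000_000, dtype=None):
    """Read a large CSV in fixed-size row chunks and concatenate.

    Passing an explicit dtype mapping avoids pandas' type inference
    keeping wide object columns in memory during the read.
    """
    chunks = pd.read_csv(path_or_buf, chunksize=chunksize, dtype=dtype)
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory demonstration (placeholder data):
csv_data = io.StringIO("a,b\n1,x\n2,y\n3,z\n")
df = read_csv_in_chunks(csv_data, chunksize=2, dtype={"a": "int32", "b": "string"})
print(len(df))  # 3 rows, reassembled from two chunks
```

For a real run you would pass `dest_path` instead of the in-memory buffer and a dtype map matching your actual columns; whether this helps depends on how much of the 14-15 GB is the DataFrame itself versus KNIME's serialization overhead.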
P.S.: KAP (KNIME Analytics Platform) is running on an EC2 machine in the AWS environment.
Thanks,
Wizard Dk