How to read a file in KNIME with Python?

Hi there,
I’m really new to this tool and the related functions.

Is it possible for KNIME to run a Python script which fetches a file from my local folder?
I tried adding a CSV reader before the script node, but it was not working.

Here is part of my script and the workflow:

import json
import os
import csv
import sys
import codecs
import requests
#import urllib.request

Long=[]
Lat=[]
addr=[]
shipto=[]

f = open('feed2.csv', encoding='utf-8')
for row in csv.reader(f, delimiter=',', quotechar='"'):
    print(row[0])
    addr.append(row[0])


@Dison welcome to the KNIME forum.

If you use the CSV reader, the file from your data flow will be present in Python under the name

input_table_1 (like in this example)

(If you add more ports to the node it will be input_table_2 … and so on)
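To make that concrete, here is a minimal standalone sketch of what the data looks like inside the node. Note that the column name `address` and the sample rows are assumptions for illustration; inside a real KNIME Python Script node the variable `input_table_1` is injected by KNIME, not created by you:

```python
# Minimal standalone sketch. Inside a KNIME "Python Script" node the
# variable input_table_1 is injected by KNIME as a pandas DataFrame;
# the DataFrame below is only a stand-in so the snippet runs on its own.
import pandas as pd

input_table_1 = pd.DataFrame({"address": ["1600 Amphitheatre Pkwy", "10 Downing St"]})

# no open() call needed -- the data arrives through the node's input port
addr = input_table_1["address"].tolist()

# whatever is assigned to output_table_1 leaves the node on its output port
output_table_1 = input_table_1.copy()
```

In other words: once the CSV Reader feeds the node, you read the DataFrame instead of opening the file yourself.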

You can also import files within the python node directly if you provide the path. Here is an example on the KNIME hub with parquet files
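As a rough sketch of that path-based approach (the directory, file name, and demo content below are assumptions so the snippet is self-contained; in your node you would point `csv_path` at the real feed2.csv instead of creating a throwaway file):

```python
# Hedged sketch: read a CSV directly inside the Python node by giving
# Python an explicit path. A throwaway demo file is written first so the
# snippet runs on its own; replace csv_path with your actual file location.
import csv
import os
import tempfile

demo_dir = tempfile.mkdtemp()
csv_path = os.path.join(demo_dir, "feed2.csv")  # hypothetical location
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    f.write('"1600 Amphitheatre Pkwy"\n"10 Downing St"\n')

addr = []
with open(csv_path, encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter=",", quotechar='"'):
        if row:  # skip empty lines
            addr.append(row[0])
```

The key point is that the path must be one Python can resolve on the machine running KNIME, e.g. an absolute path.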

More about KNIME and python here:

And watch out for the new 4.5 release coming out this week which should have additional python integration features.

2 Likes

Hi @mlauber71
thank you for the prompt reply.

I changed the file name to match the table name, but it still doesn't work.
Is there anything else I can do?

The whole story is that I am trying to integrate action 2 below into KNIME:

  1. manually update an Excel file A.
  2. use Python to process file A and generate a file B.
  3. use KNIME to read file B and cross-filter it with some other files to generate file C.

@Dison in your example I think you will have to provide a path in order for python to know where the address file might be.

The example with the parquet file should provide one. I could see if I can make a scaled down example later.

If you could share the file in question with maybe dummy data one could work on your specific problem.

Concerning KNIME and python and Excel I have a lot of examples in my hub repository with the names

kn_example_python_excel_<…>

You might want to check them out

1 Like

Thank you @mlauber71 (I just added your hub to my favorites!!)
It seems that the forum only accepts certain kinds of files.

For feed2.csv, there are only addresses (manually entered) in the first column.
The goal is to fetch the latitude and longitude for the addresses I put in “feed2.csv”.

The steps are like:

  1. manually input addresses in feed.csv.
  2. run Python to fetch latitude and longitude from Google.
  3. Python generates another CSV file with address/latitude/longitude.

Here is the full Python script.

# -*- coding: utf8 -*-
# coding: utf8


import json
import os
import csv
import sys
import codecs
import requests
#import urllib.request



Long=[]
Lat=[]
addr=[]
shipto=[]

f = open('feed2.csv', encoding='utf-8')
for row in csv.reader(f, delimiter=',', quotechar='"'):
    print(row[0])
    addr.append(row[0])
f.close()


export = codecs.open('address_geolocation_dison.csv', 'a', 'utf-8')
#output1 = addr[0] 
#export.write(codecs.BOM_UTF8)#
#export.write(output1)
#export.write('\n')
#export.close()




n=0
cnt=0
wrng=0
t=-1
for m in addr:
    t=t+1
    address = m
    n=m
    print (m)
    endpoint = 'https://maps.googleapis.com/maps/api/geocode/json?'
    api_key = 'AIzaSyD4u_ThlfI40oFPXR6XCki9MB2cK-9iVgc'
#    api_key = 'AIzaSyBOT9pZR8deQocwJrnfDzma17k4eqaqHHY'
    geoloc = 'address={}&key={}'.format(address,api_key)
    site = endpoint + geoloc
    print (site)
    response = requests.get(site)
    #response = urllib.request.urlopen(site)
    #print ("II")
    #print (response)
    geolocation = response.json()#json.loads(response)
    print ("II")
    a = geolocation['results']
    for i in a:
        lat = str(i['geometry']['location']['lat'])
        lng = str(i['geometry']['location']['lng'])
        
        #output = shipto[t] + ',' + addr[t] + ',' + Long[t] + ',' + Lat[t] 
#        export = codecs.open ('address_geolocation.csv','a','utf8')
#        print (m.encode('utf8').decode('utf8'))
        if lat!= "" :
            export.write(m)
            export.write(',')
            export.write(lat)
            export.write(',')
            export.write(lng)
            export.write('\n')
            cnt=cnt+1
            print (lat)
            print (lng)
            print ('\n')
        else:
            wrng=wrng+1

print ("total successful pt")
print (cnt)
print ("total blank pt")
print (wrng)

export.close()

I wouldn’t bother reading files in Python unless there is some critical performance-related reason to do so.
In the current KNIME version you pay a fairly large penalty when moving data between KNIME and Python. So if you are reading in millions of rows, then yes, reading them in Python instead of with a file reader beforehand will help a lot with performance. But otherwise you are just not benefiting from the “ease of use” of KNIME.

If the CSV Reader doesn’t work in KNIME, you could also try the File Reader or the File Reader (Complex Format). The first is much faster but has fewer configuration options.

Hello @kienerj,

Latest KNIME version (4.5.0) makes Python just as fast in KNIME as it is anywhere else :wink:

Br,
Ivan

5 Likes

Just to complement what @ipazin just posted: KNIME Python Integration Guide. We’re happy to receive your feedback!

4 Likes

I just tried one of my use cases and I’m actually a bit surprised, as the speedup is much lower than expected: only 20% faster with to_pandas. to_arrow requires major code changes, and since I’m not really familiar with pyarrow, what I did is probably not optimized, but it led to a 25% speedup.
To be fair, this use case does “heavy processing”, so the serialization overhead becomes less important. I guess the exact use case will have a huge impact on the speed improvement.

What would help is a guide on how to work with the pyarrow Table, especially how to iterate over it and generate new columns based on existing columns (e.g. like pandas.apply).

2 Likes

We can only speed up the data transfer in/out itself, not the actual execution of the Python script. That should be equally fast as outside KNIME.

Out of curiosity: did you enable the columnar backend as well?

2 Likes

Hi Christian, yes, it’s clear that only the data transfer was made faster. I was just assuming that for this specific case it was a larger part of the overall runtime than it really was.

The columnar backend is enabled, yes, with “maximize performance”.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.