@TheLeo
Thanks for the data, that helps a bit. Three thoughts/ideas for you to consider (this will be a long post as I will include examples).
I’ve uploaded the workflow that contains all the examples to KNIME hub here.
1/ Convert date/time to ISO8601 string.
When writing dates and times to file formats/databases that don't support a native date/time type, ISO8601 format strings should be used. This is an internationally recognised standard that is widely supported across software platforms and packages. When reading an ISO8601 string, Pandas can be configured to parse it into a Pandas datetime dtype.
In KNIME you would convert the dateTime column to a string.

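For reference, if you were doing that conversion in Python rather than with a KNIME node, a minimal Pandas sketch (with made-up data) would be:

import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2022-01-15", "2022-02-20"]),
                   "Quantity": [10, 20]})
# Format the datetime column as ISO8601 date strings before writing.
df["Date"] = df["Date"].dt.strftime("%Y-%m-%d")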
And in Python you would configure Pandas to parse the string when reading it back. The additional parameter compared with my prior post is parse_dates, which is passed a list of columns containing ISO8601 format strings.
import sqlalchemy as db
import pandas as pd

engine = db.create_engine("sqlite:///D:\\data\\sqlite.sqlite")

with engine.connect() as db_connection:
    table_name = 'Export'
    df = pd.read_sql(f'SELECT * FROM {table_name}',
                     con=db_connection,
                     parse_dates=["Date"])
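You can confirm the parse worked by checking the dtypes:

# The Date column should report a datetime64[ns] dtype.
print(df.dtypes)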
2/ Implement a data schema
Your Country column is categorical and contains a lot of redundant duplicate strings, which will increase your output file size considerably. You might want to consider creating dimension tables that hold the distinct values and are referenced by the original (fact) table. In KNIME this is relatively easy to do.
In the following workflow, a dimension table is created by grouping the Country column in Create Testdata. A counter generator is used to generate a set of sequential keys (integers). The table is then manipulated to rename the counter-generated integer column to CountryKey. This table is then output from the metanode as the dimension table (bottom leg of the workflow).
The dimension table is then joined to the original table matching on the country column. The original Country column is dropped and the CountryKey is added. This replaces the duplicate Country strings with an integer which references the actual country name in the dimension table. This will reduce the size of your output file significantly.
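Outside KNIME, the same split can be done in Pandas. A rough sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"Country": ["Germany", "France", "Germany", "Spain"],
                   "Quantity": [10, 20, 30, 40]})

# Dimension table: one row per distinct country plus a sequential integer key.
dim_country = df[["Country"]].drop_duplicates().reset_index(drop=True)
dim_country["CountryKey"] = dim_country.index

# Fact table: replace the Country strings with the integer key.
fact = df.merge(dim_country, on="Country").drop(columns="Country")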
The benefit of this approach is (a) it is good practice if you move to an online database; (b) it will help pick up errors in categorical fields, such as slightly misspelled items, missing items and other corruption. The downside is that the end user will need to use the dimension tables to add the string representation back to the fact/data file (though with Pandas this should not be too difficult, as the sketch below shows).
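Continuing the sketch above, restoring the country names is a single merge:

# Join the dimension table back in to recover the string representation.
restored = fact.merge(dim_country, on="CountryKey")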
3/ Consider HDF5 file format
HDF5 file format provides a greater degree of flexibility compared with many other file formats. The downside is that KNIME does not have a native node supporting it (I’ve never understood why not). It is possible to use a Python node to export data to an HDF5 file though it will require a bit of work due to the evolving nature of KNIME’s Python implementation.
Pre-requisites:
You will need a conda environment with the KNIME base packages, into which you will also need to add pytables. Note: there is another Python package, h5py, which provides more control, but for the purposes of keeping it simple I will stick with pytables.
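A quick way to check that pytables is available in the environment (note the import name is tables, not pytables):

import tables
print(tables.__version__)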
Why Python in KNIME is a pain point
Pandas is a wrapper around underlying data types. This gives it great flexibility in abstracting functionality from implementation: when an operation is defined using Pandas, its implementation is passed down to the underlying data type. When a third-party package consumes a Pandas dataframe (as PyTables does), it assumes data is represented using conventional data types, such as int for integers, object for strings and Python datetime for dates and times. However, KNIME does not conform to these conventions: it stores strings as string instead of object, and datetimes as a Java-inspired type. It's a pain!
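You can see the difference in plain Pandas (a minimal sketch; the exact dtypes KNIME hands you may vary by version):

import pandas as pd

df = pd.DataFrame({"Country": ["Germany", "France"]})
print(df["Country"].dtype)   # object - the conventional representation

df2 = df.astype({"Country": "string"})
print(df2["Country"].dtype)  # string - the extension dtype third-party packages may not expect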
The reason for the above rantorial is to explain why the following is more complicated than it needs to be.
Before adding the Python node, the date column needs to be converted to an ISO8601 string.

The Python script is as follows:
- Set the output file path and the name of the table in the hdf5 file (hdf5 files can contain multiple tables in a folder-like structure).
- Convert the ISO8601 date strings to Pandas datetime format, and convert strings to objects. Copy the remaining columns to the output table.
- Write the output file, setting the compression library and compression level.
import knime_io as knio
import pandas as pd
from os import path

# Configuration
output_path = path.abspath("D:\\Data\\dataExample.hdf")
output_table_name = "Export"

input_table = knio.input_tables[0].to_pandas()
output_table = pd.DataFrame()

# Convert ISO8601 dates to Pandas datetime format.
dates_to_convert = ["Date"]
for column in dates_to_convert:
    output_table[column] = pd.to_datetime(input_table[column],
                                          format="%Y-%m-%d")

# Convert strings to objects.
strings_to_convert = ["ID", "Country"]
for column in strings_to_convert:
    output_table[column] = input_table[column].astype(object)

# Copy remaining columns to the output table.
all_other_columns = ["Quantity"]
for column in all_other_columns:
    output_table[column] = input_table[column]

# Output to hdf file.
# mode = write a new file (w)
# complib = compression library
# complevel = compression level (1=faster, 9=more compression)
output_table.to_hdf(output_path,
                    output_table_name,
                    mode="w",
                    complib="zlib",
                    complevel=5)

# Pass through input.
knio.output_tables[0] = knio.write_table(input_table)
To read the file back in Python you would use the following:
import pandas as pd
from os import path

input_path = path.abspath("D:\\Data\\dataExample.hdf")
input_table_name = "Export"

df = pd.read_hdf(input_path, key=input_table_name)
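Since hdf5 files can hold multiple tables, you can also list what the file contains before reading:

import pandas as pd
from os import path

input_path = path.abspath("D:\\Data\\dataExample.hdf")
# List the tables (keys) stored in the file, e.g. ['/Export'].
with pd.HDFStore(input_path, mode="r") as store:
    print(store.keys())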
I hope this helps.