Python Integration: working with ArrowSourceTables

Hello KNIME Community,

I am working on a KNIME workflow where I want to manipulate tabular data (from an Excel file) in a Python Script node, but without using additional libraries like pandas. Calling help() on the input table shows that it is of type ArrowSourceTable, and I don’t see how to access its elements.

My goal is to compare two implementations of the same data manipulations in the Python script: pandas operations on a pandas DataFrame versus raw Python string operations on the ArrowSourceTable (or some other table representation). I want to find out whether pandas is necessary for my use case or whether I can rely on raw Python alone.

Does anyone have experience with these ArrowSourceTables? And does anyone know whether raw Python has efficiency advantages over pandas?
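
For reference, this is roughly what I am doing in the node so far (a minimal sketch; knio is the knime.scripting.io module that the Python Script node provides, and to_pandas() is the only access path I have found):

import knime.scripting.io as knio

table = knio.input_tables[0]   # help(table) reports type ArrowSourceTable
# the only element access I can find goes through a conversion:
df = table.to_pandas()         # pandas route - exactly what I want to avoid
knio.output_tables[0] = knio.Table.from_pandas(df)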

Any help is truly appreciated.
Alex

@makaveli the Python integration comes with support for pandas DataFrames and Arrow.

The data exchange between KNIME and Python is handled via Arrow:
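
Roughly like this (a minimal sketch of the current scripting API; names may vary between KNIME versions):

import knime.scripting.io as knio

# both objects are views of the same Arrow-backed data
df = knio.input_tables[0].to_pandas()     # pandas DataFrame
at = knio.input_tables[0].to_pyarrow()    # pyarrow.Table

knio.output_tables[0] = knio.Table.from_pandas(df)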


Hey mlauber71,

thank you for the quick reply. I know that you can work with the pandas and pyarrow libraries. But I am looking for a way to transform the data without any external libraries, using only raw Python, mostly string manipulations (as far as that is possible). So far I can only find the to_pandas() and to_pyarrow() methods for working with my input table. I have already looked into the documentation but cannot find any other operations that access the table directly. Do you know a way to do this?
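
The closest I have come so far is going through pyarrow once and then dropping down to plain Python structures (a sketch, assuming pyarrow’s Table.to_pylist() works as documented; the 'Name' column is just an example):

import knime.scripting.io as knio

# one unavoidable conversion step, then everything is plain Python
rows = knio.input_tables[0].to_pyarrow().to_pylist()   # list of dicts
for row in rows:
    row["Name"] = str(row["Name"]).strip().upper()     # raw string manipulation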

I really appreciate your thoughts. Thank You!
Alex

Not really sure what kind of data format that would be. Pandas is the de facto standard in Python and is integrated into the KNIME Python extension.

Also, to get data into and out of the Python node you will have to use Arrow, or you can try CSV, SQLite or Parquet. But that again would mean some additional packages.
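
Although, for CSV at least, the standard library alone would carry you once the data is on disk (a minimal sketch; the file name is made up):

import csv

# DictReader yields plain dicts - no third-party package involved
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0])   # note: csv hands every value back as a string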

I think it is in the ‘nature’ of Python to use several packages and I am not sure why you would not want to do that except for some experiments - or maybe you lost a bet :slight_smile:


My Excel file is rather small (20 columns, 10k rows) and I want to do the data transformations without pandas or other libraries specialized for such transformations, because I want to observe the overhead of the pandas DataFrame operations. Libraries like openpyxl will still be needed for reading and writing the file, though.

My goal is to replace pandas operations like forward-fill, drop column or fill NaN with raw Python code (like a parser) and observe the execution time and overhead of both implementations.

Do you know if there is a faster or more efficient way to rewrite pandas operations like drop_column, fill_na and forward-fill using only string manipulations? I am currently trying to benchmark these methods for small Excel files.
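
For the timing itself I am planning something simple like this (a self-contained sketch with dummy data, not my real Excel content):

import time
import pandas as pd

# dummy data in the same shape as one of my columns (~10k rows)
rows = [{"Age": None if i % 3 == 0 else i} for i in range(10_000)]
df = pd.DataFrame(rows)

t0 = time.perf_counter()
df["Age"] = df["Age"].fillna(0)            # pandas version
t_pandas = time.perf_counter() - t0

t0 = time.perf_counter()
for row in rows:                           # raw-Python version
    if row["Age"] is None:
        row["Age"] = 0
t_raw = time.perf_counter() - t0

print(f"pandas: {t_pandas:.6f}s  raw: {t_raw:.6f}s")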

Hmm, from my experience OpenPyxl will add significantly to every operation, since it has to heavily manipulate the Excel structures depending on what you want to do (good news though: the package is integrated into the KNIME Python extension).

If you use OpenPyxl to import or export Excel files you might end up with a (well) Pandas or Arrow dataframe nonetheless …
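
That said, OpenPyxl itself can hand you plain Python values if you insist on avoiding pandas (a sketch; the file name is made up, and you would still need a conversion to get the result back into KNIME):

from openpyxl import load_workbook

wb = load_workbook("input.xlsx", read_only=True)   # read_only keeps memory low
ws = wb.active
rows = list(ws.iter_rows(values_only=True))        # tuples of plain Python values
header, data_rows = rows[0], rows[1:]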

Another currently hyped option is Polars. Another one is Feather - which I am not really familiar with … but they are all additional packages.

As I said: if you work with KNIME and Excel and Python you will face some sort of data transfer in any case. Otherwise this might very well be a Python question.

But while we are at it, I put your question to ChatGPT :slight_smile:

data = [
    {"Name": "Alice", "Age": 30, "City": "New York"},
    {"Name": "Bob", "Age": None, "City": "Paris"},
    {"Name": "Charlie", "Age": 25, "City": None},
    {"Name": "David", "Age": None, "City": "London"}
]

# Dropping a column: remove the key from every row
def drop_column(data, column):
    for row in data:
        row.pop(column, None)   # default avoids KeyError if a row lacks the key
    return data

# Filling NA values: replace None with a constant
def fill_na(data, column, fill_value):
    for row in data:
        if row.get(column) is None:
            row[column] = fill_value
    return data

# Forward fill: carry the last non-None value down the column
def forward_fill(data, column):
    last_valid = None
    for row in data:
        if row.get(column) is not None:
            last_valid = row[column]
        elif last_valid is not None:
            row[column] = last_valid
    return data

# Example usage - note the order: forward-fill 'City' before dropping it,
# otherwise there is nothing left to fill
data = forward_fill(data, 'City')   # Charlie's City becomes 'Paris'
data = fill_na(data, 'Age', 0)      # Bob's and David's Age become 0
data = drop_column(data, 'City')

Thank you for your thoughts on this. I also think that directly manipulating the Excel file with Python’s standard string manipulation methods will not be more efficient than pandas. But I will try to benchmark it.

If anyone has experience with other libraries that I could combine with self-written standard-Python string manipulation (like a parser over the dataframe or file), please share it with me. Thank you guys!

I would also consider a different data format. Excel is inherently slow (not KNIME specific!)
br

