How to get a numpy array as output?

Niklas · July 25, 2018, 11:44am

Hi everyone,

What I want to do:
Read batch data from my hard drive (in numpy .npz format) and output a numpy array that I can feed into a neural network executor node. The network is a very sophisticated named entity recognition deep learning model composed of a BiLSTM and a CNN (the model works perfectly fine in plain Python and produces state-of-the-art results).

What I did:
I used the Python Source node with the following code (code chunk I):

import pandas as pd
import numpy as np
# Path to the .npz file
path = '…/test_batches_casefeat_punct_unk.npz'
test_batches = np.load(path, encoding='latin1')['arr_0']
#output_table = test_batches
#print(output_table[0])
output_table = pd.DataFrame(data=test_batches)
# Print the first row of the table
print(output_table.iloc[0])

My Problem(s) with this:

First, as described, I need a numpy array as output, not a pandas DataFrame.
Second, if I read output_table from the code above with a Python Script (1 => 1) node and inspect my four columns, I only see booleans (there should be integers), e.g. (code chunk 2):

a = input_table['0']
print(a)

yields:

Row0 True
Row1 True
…
Row64 True
Row65 True
Name: 0, Length: 66, dtype: bool

but it should look like this (as a pandas DataFrame):

0 [[112], [2790], [3495], [7564], [549498], [217…
1 [[[8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
2 [[4], [3], [3], [3], [2], [3], [4], [3], [3], …
3 [[[4]], [[4]], [[4]], [[6]], [[2]], [[6]], [[4…
Name: 0, dtype: object

(this is the result from the print call inside of the Python Source node above in code chunk I)

So how do I get a numpy array as a node output?
And why does the Python Source node create this weird boolean output?

Thanks in advance,
Niklas

beginner · July 26, 2018, 10:47am

Well, you are working with KNIME, so you can only output a KNIME table (which is generated by serializing the pandas DataFrame). If you then follow it with a Python Script node, the table just gets serialized again and loaded into a DataFrame. So splitting the work doesn’t make much sense; it is easier to simply put all the code in the Python Source node.

That raises the question: what is the goal you want to achieve here? I like KNIME a lot and advocate its usage, but here I fail to see an advantage if you are only using Python code anyway.

Niklas · July 26, 2018, 11:16am

Hi beginner, thank you for your response.
The idea is to provide a pre-built workflow where people who are not familiar with coding can “easily” choose (e.g. from a folder) and exchange trained models with the Keras Network Reader node.
I don’t understand what you mean by serialized. Can you explain to me why the Python Source node creates the weird boolean output?

The Python Source node from the Example 07_Sentiment_Analysis_with_Deep_Learning uses the following code:

from pandas import DataFrame
#Create empty table
from keras.datasets import imdb
max_features = flow_variables['maxFeatures']
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
output_table = DataFrame()
output_table['sequence'] = x_train
output_table['sentiment'] = y_train

and outputs a regular pandas DataFrame which looks like this:

[screenshot: output1]

I tried to recreate this. My Python Source node has the following code:

#import these two packages
import pandas as pd
import numpy as np
#path to the numpy file
path = '…/test_batches_casefeat_punct_unk.npz'
test_batches = np.load(path, encoding='latin1')['arr_0']
words = test_batches[:,0]
#create pandas DataFrame
output_table = pd.DataFrame()
output_table['words'] = words

But it outputs this strange boolean thing:

[screenshot: output2]

In pure python, it works perfectly fine.

beginner · July 26, 2018, 11:30am

No, I can’t, because I don’t have your dataset. Your code reads your numpy array, then outputs the first column and names it words. I assume the first column in your array contains only 1s and 0s.

Niklas · July 26, 2018, 11:38am

I know what the code does, and no, it does not contain only 1s and 0s.
Like I said, it works perfectly fine if I run it in pure Python. I think it is more likely that the Python Source node is bugged. Here is the .npz file (zipped because it is not allowed to upload numpy files):

test_batches_casefeat_punct_unk.zip (453.1 KB)

They contain word- (col 1), character- (col 2) and id- embeddings (col 3) as well as labels (col 4).

beginner · July 27, 2018, 9:05am

I have never used npz files. Are they safe to use across different versions of Python and/or numpy? Are you using the same Python and numpy versions in KNIME as outside of it?

EDIT:

With your code and file I actually get an error, and it doesn’t output anything at all.

And what you put in the DataFrame’s words column is a complex array structure. Each column in the DataFrame must be a simple column (number, string, boolean), not an array. I have a feeling you either don’t understand your data structure or don’t understand how it should be made importable into KNIME.

Niklas · July 30, 2018, 6:42am

Hi beginner,

Thank you for your efforts!
The .npz files are the standard format of saved numpy arrays, so yes, they should not be an issue.
Unfortunately this is the exact data structure I need for the neural network as input (the complex array structure).
Yes, I am using the same versions in- and outside of KNIME.
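
One way to confirm that the versions match is to run the same short snippet in the KNIME Python node and in the standalone interpreter and compare the printed values. This is only a sketch; the dictionary name `versions` is made up for illustration.

```python
import sys

import numpy as np

# Run this in both environments and compare the output line by line.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "executable": sys.executable,
}
for name, value in versions.items():
    print(f"{name}: {value}")
```

If the `executable` lines differ, the two runs use different Python installations, even if the version numbers happen to agree.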

beginner · July 30, 2018, 8:44am

I can only repeat that from Python you need to output a DataFrame (i.e. a table) that is serializable, which means each column must be a supported primitive data type (int, float, string, boolean, …). Columns containing object types such as arrays will probably fail or lead to unexpected results.

In your case it is probably best to combine the Python Source and Python Script (1 => 1) nodes into the Source node only, as that is far more efficient anyway.
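
If the nested arrays still have to cross a node boundary, one possible workaround for the primitive-column requirement is to encode each array as a JSON string and decode it downstream. The sketch below uses synthetic stand-in data, and the helpers `encode_cell` and `decode_cell` are hypothetical names. Whether the string column passes through KNIME unchanged would need to be checked in the actual workflow.

```python
import json

import numpy as np
import pandas as pd


def encode_cell(arr):
    """Turn a (possibly nested) numpy array into a JSON string."""
    return json.dumps(np.asarray(arr).tolist())


def decode_cell(text):
    """Rebuild a numpy array from a JSON string."""
    return np.array(json.loads(text))


# Synthetic stand-in for one object column of nested integer arrays.
words = [np.array([[112], [2790], [3495]]), np.array([[8], [15]])]

# Every cell is now a plain string, i.e. a primitive column type.
output_table = pd.DataFrame({"words": [encode_cell(w) for w in words]})

# In a downstream node, decode the strings back into arrays.
restored = [decode_cell(s) for s in output_table["words"]]
```

The round trip keeps shapes and integer values. It does not record the original dtype, so a cast may be needed after decoding.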

mlauber71 · May 11, 2021, 12:08pm

@ricslator welcome to the KNIME forum

You could save a numpy array as a pickle file:
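
A minimal sketch of the pickle route, assuming a synthetic stand-in array and a hypothetical temporary file path (neither comes from the thread):

```python
import os
import pickle
import tempfile

import numpy as np

# Synthetic stand-in array; the real batches are not reproduced here.
batch = np.array([[112], [2790], [3495]])

# Hypothetical file location inside a temporary directory.
pickle_path = os.path.join(tempfile.mkdtemp(), "batch.pkl")

# Write the array in one step ...
with open(pickle_path, "wb") as fh:
    pickle.dump(batch, fh)

# ... and read it back later, given the same path.
with open(pickle_path, "rb") as fh:
    loaded = pickle.load(fh)
```

Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.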

Or you could save it as a binary file or archive of files:
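
A minimal sketch of the binary-file route with `np.save` (single array) and `np.savez` (archive of named arrays). The arrays and the output directory are synthetic, hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in arrays.
words = np.array([[112], [2790], [3495]])
labels = np.array([[4], [3], [3]])

out_dir = tempfile.mkdtemp()  # hypothetical output location

# A single array in a binary .npy file.
npy_path = os.path.join(out_dir, "words.npy")
np.save(npy_path, words)
words_back = np.load(npy_path)

# Several named arrays in one .npz archive.
npz_path = os.path.join(out_dir, "batch.npz")
np.savez(npz_path, words=words, labels=labels)
with np.load(npz_path) as archive:
    labels_back = archive["labels"]
    keys = sorted(archive.files)
```

Arrays passed to `np.savez` without a keyword get the names `arr_0`, `arr_1`, and so on. Recent NumPy versions refuse to load object arrays unless `allow_pickle=True` is passed to `np.load`.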
