python scripting: apache arrow backend and google-bigquery upload via python API (error)

tgoedeke · January 29, 2022, 6:24pm

Hi guys,

this is just to let you know: There is a problem I guess, and I could not find it in Python itself.
…hours later I switched from apache arrow backend back to serialized…and suddenly everything works.

AP: 4.5.1 windows
And yes, I am using a manual environment on purpose (no conda), since that will also be the case on knime server (out of my hands).
Python 3.9.9 with all packages up to date.

I tried uploading some data to bigquery.
With apache arrow:
with bigquery API directly:

‘index_level_0’
Traceback (most recent call last):
File “”, line 3, in
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\google\cloud\bigquery\client.py”, line 2635, in load_table_from_dataframe
job_config.schema = _pandas_helpers.dataframe_to_bq_schema(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\google\cloud\bigquery_pandas_helpers.py”, line 383, in dataframe_to_bq_schema
sample_data = _first_valid(dataframe[column])
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\core\frame.py”, line 3506, in getitem
indexer = self.columns.get_loc(key)
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\core\indexes\base.py”, line 3623, in get_loc
raise KeyError(key) from err
KeyError: ‘index_level_0’

with pandas_gbq: (essentially the same error, its just an easier wrapper)
‘index_level_0’
Traceback (most recent call last):
File “”, line 1, in
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\core\frame.py”, line 2055, in to_gbq
gbq.to_gbq(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\io\gbq.py”, line 212, in to_gbq
pandas_gbq.to_gbq(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas_gbq\gbq.py”, line 1142, in to_gbq
connector.load_data(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas_gbq\gbq.py”, line 559, in load_data
chunks = load.load_chunks(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas_gbq\load.py”, line 237, in load_chunks
load_parquet(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas_gbq\load.py”, line 129, in load_parquet
client.load_table_from_dataframe(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\google\cloud\bigquery\client.py”, line 2635, in load_table_from_dataframe
job_config.schema = _pandas_helpers.dataframe_to_bq_schema(
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\google\cloud\bigquery_pandas_helpers.py”, line 383, in dataframe_to_bq_schema
sample_data = _first_valid(dataframe[column])
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\core\frame.py”, line 3506, in getitem
indexer = self.columns.get_loc(key)
File “C:\Users\torsten.goedeke\knime_manual_env\lib\site-packages\pandas\core\indexes\base.py”, line 3623, in get_loc
raise KeyError(key) from err
KeyError: ‘index_level_0’

Once I switch back to serialized buffer in knime preferences…it suddenly works (both versions).

I can check for logs in a few days, if you can’t replicate the error.

best,
Torsten

carstenhaubold · January 31, 2022, 10:01am

Hi Torsten,

Thanks for letting us know about this issue you are seeing! It seems as if bigquery expects a certain index in the Pandas DataFrame and that this index is not present when we are using the Apache Arrow serialization.

Could you provide us with a Python Script example how you sent the DataFrame provided by KNIME to google bigquery so that we can try to reproduce the error?

Best,
Carsten

mlauber71 · January 31, 2022, 10:10am

@carstenhaubold maybe you could take a look at this entry. I also had problems that seem to be related to something going on with the index in a Pandas data frame and trying to bring that back to KNIME. Tried to ‘fix’ the index but to no avail:

tgoedeke · January 31, 2022, 11:41am

Hi Carsten,

I cannot really send you the workflow, because I authorize via service account json (which I obviously cannot share).
The code within the python node could not be much simpler, it is just supposed to append data to a table.

x = input_table_1.copy()

from google.cloud import bigquery
from google.oauth2 import service_account

bigquery_client = bigquery.Client.from_service_account_json(‘serviceAccountFile-json’)

job = bigquery_client.load_table_from_dataframe(x, ‘schemaTablename’)
res = job.result()

carstenhaubold · January 31, 2022, 12:12pm

The script is exactly what I was looking for, thanks Torsten
We’ll try to reproduce the issue you’re seeing.

carstenhaubold · February 2, 2022, 8:50am

It seems as if the only difference between the dataframes that you obtain when using the Arrow serialization vs. Flatbuffers serialization is, that the pandas DataFrame index has a name set, namely to __index_level_0__ as in the error message.

It is not clear to me why bigquery has a problem with a named index, but you can use the Apache Arrow serialization if you just drop the name from the DataFrame before loading it. So in your script, as second line you could put:
x.index.name = None

I’ll create an internal ticket to make sure that we drop the name of the index in the next KNIME release.

Thanks again for letting us know!
Carsten

tgoedeke · February 4, 2022, 8:51pm

Thanks for your help. Dropping the index works fine now with Arrow backend!

tgoedeke · February 7, 2022, 8:09am

Sorry to start again… I have two more things.

Knime-Server: 4.5.1: the executor on the server does not support arrow backend for python nodes, did I read that correctly?

And regarding versions of packages…the server needs (maximum) flatbuffers==1.12 (not anything newer; won’t work with higher versions at all; even empty python script with output=input)