PySpark Script and NumPy problem

Hi,
I’m trying to use the PySpark Script nodes, but they fail because they require numpy, which is not installed on the Hadoop cluster.
Is this the cause?
Are there any other required modules?

Attached is the error.
Thanks a lot )))
Giorgio

Traceback (most recent call last):
  File "/hadoop/disk3/yarn/nm/usercache/sa_mkcho-svil/appcache/application_1707041036019_0365/container_e175_1707041036019_0365_01_000002/tmp/pythonScript_e6459943_5133_4314_a863_3e17b2ac1ca27354869988235983952.py", line 3, in <module>
    from pyspark.mllib.common import _py2java, _java2py
  File "/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p2000.37147774/lib/spark/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 28, in <module>
ImportError: No module named numpy

Hi @salvatorigio,

Yes, the missing numpy is the problem: as the traceback shows, pyspark.mllib imports numpy when it is loaded. As far as I remember, numpy is the only external dependency PySpark requires, so installing it on the cluster nodes should fix this.
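If you want to confirm which side is missing the package, here is a minimal sketch that tests the executor-side Python environment rather than the driver. It assumes a working SparkSession is available (in the KNIME PySpark Script node one usually is); the partition count of 8 is arbitrary and only there to spread the check across executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def check_numpy(_):
    # This function runs on the executors, so it tests the
    # worker-side Python environment, not the driver's.
    try:
        import numpy
        return "numpy " + numpy.__version__
    except ImportError:
        return "numpy missing"

# Run the check in 8 partitions and collect the distinct results.
print(spark.sparkContext.parallelize(range(8), 8).map(check_numpy).distinct().collect())

If it reports "numpy missing", either install numpy on every worker node with the same Python the executors use, or ship a packed environment to the cluster (for example with spark-submit --archives plus the PYSPARK_PYTHON environment variable, as described in the Spark documentation on Python package management).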

Cheers,
Sascha

