HDFS Connector: download not working, but listing files does

I’ve created a minimal example using an HDP sandbox installation. I am trying to download a CSV file that I can successfully list using the “List Remote Files” node. After the timeout period I get this message:
“ERROR Download 2:45 Execute failed: Could not obtain block: BP-32082187- file=/user/maria_dev/data/geolocation.csv”

What am I doing wrong? Using Ambari, I can download the file from the VM to the host, so the connection and HDFS itself are working. File listing works in KNIME as well (including navigating the HDFS directory tree).

EDIT: I can connect and read data using Python, so the problem is not the connection itself. Note that in the Python code I am connecting directly to the NameNode:

from hdfs import InsecureClient
import pandas as pd
import io

hostname = ''
port = 8020
hdfs_path = '/user/maria_dev/data/trucks.csv'
local_path = 'C:/tmp'

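# InsecureClient speaks WebHDFS (HTTP) to the NameNode web port 50070,
# not the RPC port 8020 defined above.
client = InsecureClient('http://localhost:50070', user='maria_dev')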

# Loading a file in memory.
with client.read(hdfs_path) as reader:
  features = reader.read()
data = pd.read_csv(io.BytesIO(features), encoding='utf8', sep=",", lineterminator='\r')

Thanks for your help.

Hi Ingo,

If you can list files but not download them from HDFS, this is usually a problem with how the VM-internal network ports are (or are not) mapped to ports on the host machine. In the VirtualBox config you need to forward not only the NameNode port (8020) but also the DataNode’s data transfer port (50010).
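A quick way to test the port mapping from the host is to try a plain TCP connection to each port. The following is only a minimal sketch: it assumes the sandbox ports are forwarded to `localhost`, and the helper names `port_is_open` and `check_ports` are made up for illustration. An open port is necessary for a download to work, but it may not be sufficient on its own.

```python
import socket

# Ports mentioned in this thread (Hadoop 2.x / HDP sandbox defaults).
PORTS = {
    8020: "NameNode RPC",
    50010: "DataNode data transfer",
    50070: "NameNode HTTP / WebHDFS",
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host="localhost", ports=PORTS):
    """Map each port number to whether it is reachable from this machine."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    for port, reachable in check_ports().items():
        state = "open" if reachable else "NOT reachable"
        print(f"{port} ({PORTS[port]}): {state}")
```

If 8020 is reported as open but 50010 is not, that would match the symptom described here: listing works (NameNode only), while reading file blocks fails.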

Could it be that you are running the Python script inside the VM? (where all service ports are reachable)

Best, Björn

Hi Björn,

Thank you for your reply. I am running both KNIME and the Python script on the local machine. As you can see from the script, I am addressing the NameNode's HTTP port (50070) directly.

I also tried using the ports directly (50070 and 50010) in KNIME, but both resulted in an error. HDP is running in a VirtualBox environment. Is there a special port mapping that KNIME requires (and Python doesn’t)?


OK, I solved it myself. Instead of the plain HDFS Connection node, the WebHDFS connector must be used.
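For context, WebHDFS is the REST interface to HDFS over HTTP, which is also what the Python `hdfs` client in the question uses. As a rough illustration of what such a read looks like on the wire, the sketch below builds the URL for a WebHDFS `OPEN` request. The helper name `webhdfs_open_url` is hypothetical, and host, port, path and user are simply the values from this thread.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host, port, hdfs_path, user):
    """Build the WebHDFS REST URL that reads a file (op=OPEN)."""
    query = urlencode({"op": "OPEN", "user.name": user})
    # The HDFS path is appended after the fixed /webhdfs/v1 prefix.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


url = webhdfs_open_url(
    "localhost", 50070, "/user/maria_dev/data/geolocation.csv", "maria_dev"
)
print(url)
```

Opening such a URL with any HTTP client (following redirects) should return the file contents, provided the HTTP ports are reachable from the host.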

