BlockMissingException when trying to read in files from my Hadoop Cluster in KNIME

I am currently trying to read a CSV file from my Hadoop cluster in KNIME and keep getting this exception: org.apache.hadoop.hdfs.BlockMissingException - Could not obtain block: BP-788889731-172.18.0.7-1689243803391:blk_1073741840_1016 file=/data/openbeer/breweries/breweries.csv.

My Hadoop cluster runs locally in Docker. I successfully connected my Hadoop cluster to KNIME through the HDFS Connector. But whenever I try to read a simple CSV file that I stored in my HDFS file system, it can't seem to access the file, which is really weird because KNIME is able to browse my HDFS file structure with the CSV Reader.

I already went through some posts with similar problems and tried several solutions, including some where corrupt files were identified as the cause of this exception, but I already checked with commands like hdfs fsck and my nodes seem to be healthy.
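For reference, these are roughly the checks I ran (just a sketch; the file path is the one from the exception above):

```
# Overall health report of the HDFS file system
hdfs fsck /

# Check the specific file: list its blocks and which datanodes hold them
hdfs fsck /data/openbeer/breweries/breweries.csv -files -blocks -locations
```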

Hi @StefanJoinPlus,

Welcome to the KNIME community!

Not sure what containers you are running in your local Docker setup, but what is the health state of your cluster? Maybe check the Hadoop Web UI and the logs to see if there are problems distributing the blocks in the cluster.
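For example, something along these lines from the command line (the container names are just placeholders for whatever your compose file uses):

```
# Summary of live/dead datanodes, capacity and under-replicated blocks
docker exec -it namenode hdfs dfsadmin -report

# Scan the namenode/datanode logs for block-related errors
docker logs namenode 2>&1 | grep -i block
docker logs datanode 2>&1 | grep -i error
```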

Cheers,
Sascha


Hi @sascha.wolke ,

Thanks for having me :smile:.

I am currently running a Docker container based on this setup: I built a working Hadoop-Spark-Hive cluster on Docker. Here is how. | Expedition Data

I checked the Web UI and there seems to be nothing wrong there.

Here are the screenshots for reference.

And I am able to access my CSV file from the shell without any problems.
In theory that means my Namenode has unrestricted access to my Datanodes,
so we can rule out any host/port related problems.
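These are roughly the commands I use from the namenode shell (just a sketch; the path is the same one from the exception):

```
# List the directory and print the first lines of the file
hdfs dfs -ls /data/openbeer/breweries
hdfs dfs -cat /data/openbeer/breweries/breweries.csv | head
```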

Is there any way to see which specific commands each individual KNIME node uses?
For example, which commands the CSV Reader node runs when it fetches the file.

By the way, is it OK if we continue in German? I might be better at describing the problem that way.

Thanks for your support

Cheers,

Stefan

I managed to establish the same connection through Hive. So my question would be: why does it work with Hive and not with HDFS?

Hi @StefanJoinPlus,

Your screenshot mentions that the data is under-replicated. The default replication factor in HDFS is 3, but you only have one datanode. You might be able to fix this on a single file using the CLI (see here) or change the default via dfs.replication in hdfs-site.xml (see here).
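A rough sketch of both options (untested; adjust the path to your setup):

```
# Lower the replication factor of the single file to match one datanode;
# -w waits until the new replication factor is reached
hdfs dfs -setrep -w 1 /data/openbeer/breweries/breweries.csv

# Or set the default for newly written files in hdfs-site.xml and restart the cluster:
#   <property>
#     <name>dfs.replication</name>
#     <value>1</value>
#   </property>
```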

Not sure if this already solves your problem?

Cheers,
Sascha


Hi @sascha.wolke ,

Unfortunately not. I fixed the under-replication as you suggested, but the nodes still won't work.


Do you have any other suggestions I could try? I really want to know how exactly the CSV Reader works and which commands it uses when it tries to fetch the CSV file.

In theory it should be possible, because I am able to access the files on the Datanodes from my Namenode through simple commands in the shell.

Cheers,

Stefan

My problem got resolved; it was the same problem as in this thread: ERROR Download file from HDFS in docker

KNIME didn't have access to all my nodes from Docker, so I switched from HDFS to HttpFS.
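In case someone runs into the same thing, this is roughly how the HttpFS route can be tested from outside the cluster (host, port 14000 and the user name are just the defaults of my setup and may differ for you):

```
# Read the file through the HttpFS/WebHDFS REST API instead of the native HDFS protocol;
# HttpFS proxies the read, so the client never needs to reach the datanodes directly.
curl -L "http://localhost:14000/webhdfs/v1/data/openbeer/breweries/breweries.csv?op=OPEN&user.name=hdfs"
```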

