When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive ?

Hadoop is a big data eco system. HBase is a key value store (mostly), Hive is a system to execute SQL-like queries on a Hadoop system, Pig is a special query language to access big data. If you google these terms you will come up with a lot of architecture diagrams with a lot of elephants on them.

If you ask for the benefits of data storage and queries I would sum it up like this:

  • HBase for large amounts of data where fast access is key but only limited number of columns and not too much joins (Facebook uses/used HBase)

  • other storages like Parquet on HDFS for datasets with a certain structure (daily partitions) and lots of columns (more like a traditional data warehouse)

  • Hive (or Impala) if you want to work with SQL like queries

  • PIG (Latin) if you want a query language specific to big data environments

But of course that is not all there are much more components to a working big data environment and it very much depends on what you want to do. There are distributors who bundle the various big data components into a unified system like Cloudera or Hortonworks (who will merge in 2019)

If you want to get an idea what it is Cloudera offers a sandbox in a virtual container so you can have your own working big data environment and see what it all means:

And also KNIME has the ability to create a working local big data environment where you can test all nodes related to big data, Hive, Impala and Spark (very useful and highly recommended). Here you can test Hive queries:


Please note the landscape around big data systems is evolving constantly and new systems and vendors are entering the market. So it is not always easy to keep track. You have to do a lot of research and think hard about what your specific purpose is and what software and system would fit it best.

To give you an idea about the enormity of the landscape you can take a look at Matt Turck’s famous Big Data landscape:


Reasons why hadoop is suitable to use are:

There are range of data sources.
Its takes less processing time.

Ok. Thank you all for answer. Very helpful answers.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.