2. Brief Overview of Hadoop and Hive

Hadoop

An Apache open source software framework which makes extensive use of map-reduce. It can distribute data storage, retrieval, and processing over a cluster of computers.

Hadoop has some useful properties:

  • It is horizontally scalable. If you want to add more storage or more processing power, you buy more machines and add them as nodes to the cluster. Horizontal scaling has a linear cost profile for growth. Vertical scaling (buying more expensive servers and storage) has an exponential cost profile.

  • It has provisions for hardware fault tolerance, especially redundant storage, automatic fault detection, and ease of removing and adding nodes in a cluster. This means you can use cheap, relatively unreliable hardware for your nodes without compromising the availability or reliability of the cluster.

  • It is open source, and available for free from a number of sources. Some of those sources also sell commercial support for a particular distribution of the product.

You can read more about Hadoop at hadoop.apache.org, and download a release to try for yourself.

Hive

A query infrastructure built on top of a Hadoop data store which supports a SQL-like syntax for retrieving and analyzing its contents. Hive was originally developed by Facebook, but is now maintained by Apache. Amazon maintains its own fork of Hive which is available on its Web Services frameworks.

You can read more about Hive at hive.apache.org, and download a release to try for yourself.