Linux & Open Source Slideshow: Why Hadoop Has Google Fans (and Rivals) Excited

By David F. Carr  |  Posted 04-02-2009

Why Hadoop Has Google Fans (and Rivals) Excited

Hadoop has attracted the attention of major companies facing Internet-scale data analysis challenges, including Amazon's A9.com, AOL, Facebook, Fox Interactive Media, IBM, The New York Times, Veoh, and Yahoo.

Hadoop is a solution for analyzing large unstructured or semi-structured data sets, for tasks such as indexing the web or identifying spam email. It typically works in batch mode: building the index that feeds a search engine, for example, rather than executing the live queries.

You can use Hadoop for ad hoc analysis of large data sets, or to extract, transform, and load data into a traditional data warehouse. It can also be applied to machine learning, computer modeling, and scientific computing.

Google's powerful analytic software for tasks such as indexing the web runs on a distributed system of cheap computers, each of which completes some small part of the task.

Academic papers on "The Google File System" (2003) and "MapReduce: Simplified Data Processing on Large Clusters" (2004) revealed enough details to allow for the creation of an open source implementation.

Doug Cutting, a veteran of search technology research and development for Excite, Apple, and Xerox PARC, created Hadoop as a spin-off of the Apache Nutch and Lucene open source search technology projects.

In 2006, Yahoo hired Cutting and became a major sponsor of the Hadoop Project. Yahoo has since incorporated Hadoop into the process of producing the Yahoo search index.

Hadoop includes an implementation of MapReduce, with a JobTracker/TaskTracker system for submitting MapReduce jobs, executing their tasks on nodes of the computing cluster, and restarting any that fail.
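
To make the submit, execute, and restart cycle concrete, here is a rough sketch in plain Python. It is not Hadoop's API: the names run_job, make_flaky_task, and max_attempts are hypothetical, and a simple in-process queue stands in for a real cluster scheduler.

```python
from collections import deque


def run_job(tasks, max_attempts=3):
    """Run each task, re-queueing failures until max_attempts is reached.

    tasks: dict mapping a task id to a zero-argument callable.
    Returns (results, attempts), both keyed by task id.
    """
    queue = deque(tasks)                      # task ids waiting to run
    attempts = {task_id: 0 for task_id in tasks}
    results = {}
    while queue:
        task_id = queue.popleft()
        attempts[task_id] += 1
        try:
            results[task_id] = tasks[task_id]()
        except Exception:
            if attempts[task_id] >= max_attempts:
                raise RuntimeError(f"task {task_id} failed {max_attempts} times")
            queue.append(task_id)             # restart the failed task later
    return results, attempts


def make_flaky_task(value, failures):
    """Build a task that fails `failures` times, then returns `value`."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("simulated node failure")
        return value

    return task


if __name__ == "__main__":
    job = {"split-0": make_flaky_task(10, 0), "split-1": make_flaky_task(20, 2)}
    print(run_job(job))
```

In this toy version a failed task simply goes to the back of the queue. A real cluster would typically reschedule it on a different machine, which this sketch does not model.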

Data is distributed over many computers in a distributed file system cluster. Map programs on each computer analyze their own subset of the data and return intermediate results as key-value pairs. The Reduce step sorts and aggregates those intermediate results, then returns a final result.
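
The classic illustration of this flow is a word count. The sketch below runs the map, sort/group, and reduce steps in a single Python process, so it shows only the shape of the computation, not Hadoop's actual interfaces. The function names (map_words, shuffle, reduce_counts, word_count) are made up for this example.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce step: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then sort, group, and reduce the results."""
    intermediate = []
    for chunk in chunks:          # on a cluster, each chunk would be mapped on its own node
        intermediate.extend(map_words(chunk))
    return [reduce_counts(key, values) for key, values in shuffle(intermediate)]


if __name__ == "__main__":
    print(word_count(["the web the index", "index the web"]))
```

Because each call to map_words looks only at its own chunk, the map work can be spread across as many machines as there are chunks. Only the grouped intermediate pairs need to be brought together for the reduce step.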

Hadoop provides support for distributed file systems, including its own Hadoop Distributed File System (HDFS), which is essentially a clone of the Google File System and supports petabytes of storage across many cheap computers.
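
One way to picture how a file system of this kind spreads data across cheap machines is to cut each file into blocks and keep copies of every block on more than one node. The Python sketch below is a simplified illustration under that assumption. The round-robin placement, the function names, and the tiny block size are invented for the example and are not Hadoop's actual placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replicas):
    """Assign each block index to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", 4)          # toy block size of 4 bytes
    print(blocks)
    print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 2))
```

With two copies of each block on different nodes, losing any single machine still leaves every block readable, which is what makes unreliable commodity hardware workable.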

Hadoop also runs on other distributed file systems, including Amazon's S3 cloud storage service. Hadoop MapReduce jobs can also run on Amazon's EC2 elastic compute cloud service.

HBase is a database management system for very large tables that runs on top of the Hadoop Distributed File System. It's a clone of another publicly disclosed Google system, Bigtable.
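
Google's Bigtable paper describes its data model as a sorted map indexed by row key, column key, and timestamp. The toy Python class below sketches that idea in memory. TinyTable and its methods are hypothetical names for illustration and bear no relation to HBase's real client API.

```python
class TinyTable:
    """In-memory toy: maps (row, column) to timestamped versions of a value."""

    def __init__(self):
        self._cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for rows in [start_row, stop_row), in sorted order."""
        for row, column in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)


if __name__ == "__main__":
    table = TinyTable()
    table.put("com.example/a", "contents:html", "<html>v1</html>", 1)
    table.put("com.example/a", "contents:html", "<html>v2</html>", 2)
    print(table.get("com.example/a", "contents:html"))
```

Keeping rows in sorted order is what makes range scans cheap: related rows (here, pages from the same domain) sit next to one another.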

Pig is a high-level language for distributed programming. It provides an alternative to working directly with MapReduce but runs atop the same runtime infrastructure.

Hive is a data warehouse infrastructure for executing SQL-like ad hoc queries on Hadoop. It started as an internal project at Facebook, and the developers there contributed it to the open source community.

You can download the code from http://hadoop.apache.org/. Test it on a Java-enabled computer, load it onto your own computer cluster, or deploy it to Amazon's EC2 and S3 cloud services.
