The Future of Presidential Debates

I recently discussed with a friend the idea of having IBM’s Watson moderate a presidential debate, or at least using it to fact-check the candidates’ claims in real time. My argument was that you cannot just “fact check” like that, per se. The facts the candidates quote come from various studies, all of which have their own degree of bias and/or error. Or the candidates manipulate their language so that they appear to be saying one thing when in fact they mean something else. That’s politics.

Watson was optimized for Jeopardy’s style of game play, and it does not yet have the linguistic analysis abilities needed to keep up with politics. Metaphors, euphemisms, sarcasm, and the like would all confuse it. Some day, though.

More info about IBM’s Watson from Yahoo!:

So what makes Watson’s genius possible? A whole lot of storage, sophisticated hardware, super fast processors and Apache Hadoop, the open source technology pioneered by Yahoo! and at the epicenter of big data and cloud computing.
Hadoop was used to create Watson’s “brain,” or the database of knowledge and facilitation of Watson’s processing of enormously large volumes of data in milliseconds. Watson depends on 200 million pages of content and 500 gigabytes of preprocessed information to answer Jeopardy questions. That huge catalog of documents has to be searchable in seconds. On a single computer, it would be impossible to do, but by using Hadoop and dividing the work on to many computers it can be done.
In 2005, Yahoo! created Hadoop and since then has been the most active contributor to Apache Hadoop, contributing over 70 percent of the code and running the world’s largest Hadoop implementation, with more than 40,000 servers. As a point of reference, our Hadoop implementation processes 1.5 times the amount of data in the printed collections in the Library of Congress per day, approximately 16 terabytes of data.

How to copy /etc/hosts to all machines

I recently installed Hadoop CDH4 on a new 10-node test cluster, and instead of manually entering all the hosts in /etc/hosts on every machine, I wrote a quick command to copy the current hosts file to all of them.
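The exact one-liner isn’t shown here, but a minimal sketch, assuming passwordless SSH as root and hostnames node1 through node10, would be:

    for i in $(seq 1 10); do
      # push the local hosts file to each node in the cluster
      scp /etc/hosts root@node$i:/etc/hosts
    done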

Hopefully this saves you some time!

Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode

From Facebook:

The Hadoop Distributed Filesystem (HDFS) forms the basis of many large-scale storage systems at Facebook and throughout the world. Our Hadoop clusters include the largest single HDFS cluster that we know of, with more than 100 PB physical disk space in a single HDFS filesystem. Optimizing HDFS is crucial to ensuring that our systems stay efficient and reliable for users and applications on Facebook.

Read more:

Hadoop incompatible build versions

Today I came across a build error with a newer version of CDH (0.20.2+320) installed on a machine intended to be a new datanode. The datanode failed to join the cluster, and the log is shown below. To fix this error, the build version of a datanode has to match that of the namenode exactly.
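The original log output is not reproduced here. One way to compare the builds is to run hadoop version on both the namenode and the new datanode and confirm the reported build revisions match:

    hadoop version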

Writing A Hadoop MapReduce Program In PHP

I recently came across an approach to writing MapReduce jobs in PHP.

Here is a great example:

Map: mapper.php
Save the following code in the file /home/guest/mapper.php:
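The original listing is not included here; a minimal PHP word-count mapper, reading lines from STDIN and emitting tab-separated word/count pairs, might look like this:

    #!/usr/bin/php
    <?php
    // Read input lines from STDIN, split them into words,
    // and emit one "<word>\t1" pair per word.
    while (($line = fgets(STDIN)) !== false) {
        $words = preg_split('/\W+/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            echo $word . "\t" . "1" . PHP_EOL;
        }
    }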

Reduce: reducer.php

Save the following code in the file /home/guest/reducer.php:
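Again, this is a sketch rather than the original listing; it relies on Hadoop Streaming sorting the mapper output by key, so all counts for a given word arrive on consecutive lines:

    #!/usr/bin/php
    <?php
    // Sum the counts for each word; input arrives sorted by word.
    $currentWord  = null;
    $currentCount = 0;
    while (($line = fgets(STDIN)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        list($word, $count) = explode("\t", $line);
        if ($word === $currentWord) {
            $currentCount += (int) $count;
        } else {
            if ($currentWord !== null) {
                echo $currentWord . "\t" . $currentCount . PHP_EOL;
            }
            $currentWord  = $word;
            $currentCount = (int) $count;
        }
    }
    // Emit the final word.
    if ($currentWord !== null) {
        echo $currentWord . "\t" . $currentCount . PHP_EOL;
    }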

Don’t forget to set execution rights for these files:
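For example:

    chmod +x /home/guest/mapper.php /home/guest/reducer.php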

Running the PHP code on Hadoop:
Download example input data
Like Michael, we will use three ebooks from Project Gutenberg for this example:

Download each ebook and store it in a temporary directory of your choice, for example /tmp/gutenberg.

Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
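A typical way to do this, assuming the downloads are in /tmp/gutenberg and that relative HDFS paths resolve to your HDFS home directory:

    hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
    hadoop dfs -ls gutenberg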

Run the MapReduce job
We’re all set and ready to run our PHP MapReduce job on the Hadoop cluster. We use Hadoop Streaming to pass data between our map and reduce code via STDIN and STDOUT.
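The original command isn’t shown; a sketch using the streaming jar (its path varies by distribution, so /usr/lib/hadoop/contrib/streaming/ here is an assumption) would be:

    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -file /home/guest/mapper.php  -mapper /home/guest/mapper.php \
        -file /home/guest/reducer.php -reducer /home/guest/reducer.php \
        -input gutenberg/* \
        -output gutenberg-output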


The job will read all the files in the HDFS directory gutenberg, process them, and store the results in a single result file in the HDFS directory gutenberg-output.

You can track the status of the job using Hadoop’s web interface. Go to http://localhost:50030/

When the job has finished, check whether the result was successfully stored in the HDFS directory gutenberg-output:
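For example:

    hadoop dfs -ls gutenberg-output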

You can then inspect the contents of the file with the dfs -cat command:
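Something like the following, where part-00000 is the usual name of the single reducer’s output file:

    hadoop dfs -cat gutenberg-output/part-00000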

Installing Ganglia on CentOS


In this article we will install the Ganglia monitoring system on a set of machines running CentOS. There are two kinds of machines involved:

  • The meta node: one machine that receives all measurements and presents them to clients through a website.
  • The monitoring nodes: machines that run only the monitoring daemon and send the measurements to the meta node.

Meta node

For this example we assume the meta node has a known, fixed IP address. We start by installing the necessary software:
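The original package list isn’t shown; a plausible set on CentOS (these packages come from EPEL, enabled below) would be:

    yum install ganglia ganglia-gmetad ganglia-gmond ganglia-web rrdtool httpd php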

To enable EPEL for CentOS 5:

For 32-bit:

For 64-bit:

To enable EPEL for CentOS 4:

For 32-bit:

For 64-bit:


If you want to monitor the meta node as well as the monitoring nodes, edit the gmond configuration file /etc/gmond.conf:
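The configuration isn’t reproduced here; a minimal unicast-style sketch follows, where the cluster name "test" and the host meta.example.com are placeholders rather than values from the original post:

    cluster {
      name  = "test"
      owner = "unspecified"
    }

    /* send this node's own metrics to the meta node */
    udp_send_channel {
      host = meta.example.com
      port = 8649
    }

    /* the meta node also receives metrics sent by the monitoring nodes */
    udp_recv_channel {
      port = 8649
    }

    /* gmetad polls gmond over TCP on this port */
    tcp_accept_channel {
      port = 8649
    }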

Start the gmond service and make sure it starts at boot:
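On CentOS this would typically be:

    service gmond start
    chkconfig gmond on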

Edit the gmetad configuration file /etc/gmetad.conf:
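The key line is the data_source directive; the cluster name "test" below is a placeholder, and localhost works because gmond also runs on the meta node:

    data_source "test" localhost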

Start the gmetad service and make sure it starts at boot:
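Again, typically:

    service gmetad start
    chkconfig gmetad on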

Enable the HTTP daemon to be able to see the pretty monitoring pictures:
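For example:

    service httpd start
    chkconfig httpd on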

Monitoring nodes

On all the monitoring nodes, start by installing the necessary software:
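Only the monitoring daemon is needed here; assuming EPEL is enabled as described above:

    yum install ganglia ganglia-gmond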

Edit the gmond configuration file /etc/gmond.conf. You can use an exact replica of the gmond configuration file shown for the meta node.
Start the gmond service and make sure it starts at boot:
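As on the meta node:

    service gmond start
    chkconfig gmond on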

If you would like to emit your own measurements (called metrics in Ganglia) and view them on the website, call the gmetric program:
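A sketch of a one-off metric; the name, value, and units are made up for illustration:

    gmetric --name=room_temperature --value=23 --type=int32 --units=Celsius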

To use the output of a program you wrote as a metric, simply call it like this, making sure to use backticks (`) instead of quotes (‘):
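For example, with a hypothetical script check_disk_free.sh that prints a number:

    gmetric --name=disk_free --value=`./check_disk_free.sh` --type=int32 --units=MB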

Apache Hadoop 0.23.0 has been released

The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant since it is the first major release of Hadoop in over a year, and incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation, and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, as well as some significant performance improvements which we’ll be covering in future posts. Note, however, that 0.23.0 is not a production release, so please don’t install it on your production cluster.


Using PHP/cURL to Display HBase Version Using REST Server

I am using a REST server on port 59999 with Hadoop/HBase.

To start a rest server:
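One way to start it on a custom port (a sketch; the exact invocation depends on your HBase version):

    hbase rest start -p 59999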

You must have PHP/cURL installed to use this code:
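The original script isn’t included; a minimal sketch that queries the REST server’s /version endpoint and prints the response might look like this:

    <?php
    // Ask the HBase REST server for its version information.
    $ch = curl_init('http://localhost:59999/version');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: text/plain'));
    $response = curl_exec($ch);
    if ($response === false) {
        die('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    echo $response . PHP_EOL;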

Starting Hadoop HBase REST Server on CDH3 Ubuntu 10.04

Here we start the REST server on port 59999:
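On CDH3 a typical way to do this is via the daemon script (the path below is an assumption based on the usual CDH layout):

    /usr/lib/hbase/bin/hbase-daemon.sh start rest -p 59999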

Test that the REST server has started:
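A quick check with cURL against the /version endpoint:

    curl http://localhost:59999/version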


Hadoop Default Ports Quick Reference

Is it 50030 or 50300 for that JobTracker UI? I can never remember!

Hadoop’s daemons expose a handful of ports over TCP. Some of these ports are used by Hadoop’s daemons to communicate amongst themselves (to schedule jobs, replicate blocks, etc.). Other ports listen directly to users, either via an interposed Java client, which communicates via internal protocols, or via plain old HTTP.

This post summarizes the ports that Hadoop uses; it’s intended to be a quick reference guide both for users, who struggle with remembering the correct port number, and systems administrators, who need to configure firewalls accordingly.

Web UIs for the Common User

The default Hadoop ports are as follows:

Daemon                         Default Port   Configuration Parameter
HDFS  Namenode                 50070          dfs.http.address
HDFS  Datanodes                50075          dfs.datanode.http.address
HDFS  Secondarynamenode        50090          dfs.secondary.http.address
HDFS  Backup/Checkpoint node†  50105          dfs.backup.http.address
MR    Jobtracker               50030          mapred.job.tracker.http.address
MR    Tasktrackers             50060          mapred.task.tracker.http.address

† Replaces the secondarynamenode in 0.21.

Hadoop daemons expose some information over HTTP. All Hadoop daemons expose the following:

  • /logs Exposes, for downloading, log files in the Java system property hadoop.log.dir.
  • /logLevel Allows you to dial up or down log4j logging levels. This is similar to hadoop daemonlog on the command line.
  • /stacks Stack traces for all threads. Useful for debugging.
  • /metrics Metrics for the server. Use /metrics?format=json to retrieve the data in a structured form. Available in 0.21.
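For example, you can hit these endpoints with cURL; here namenode is a placeholder for your namenode’s hostname and 50070 is its default web port:

    curl http://namenode:50070/stacks
    # JSON output requires 0.21 or later
    curl 'http://namenode:50070/metrics?format=json'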

Individual daemons expose extra daemon-specific endpoints as well. Note that these are not necessarily part of Hadoop’s public API, so they tend to change over time.

The Namenode exposes:

  • /dfshealth.jsp Shows information about the namenode as well as the HDFS. There’s a link from here to browse the filesystem, as well.
  • /dfsnodelist.jsp?whatNodes=(DEAD|LIVE) Shows lists of nodes that are disconnected from (DEAD) or connected to (LIVE) the namenode.
  • /fsck Runs the “fsck” command. Not recommended on a busy cluster.
  • /listPaths Returns an XML-formatted directory listing. This is useful if you wish (for example) to poll HDFS to see if a file exists. The URL can include a path (e.g., /listPaths/user/philip) and can take optional GET arguments: /listPaths?recursive=yes will return all files on the file system; /listPaths/user/philip?filter=s.* will return all files in the home directory that start with s; and /listPaths/user/philip?exclude=.txt will return all files except text files in the home directory. Beware that filter and exclude operate on the directory listed in the URL, and they ignore the recursive flag.
  • /data and /fileChecksum These forward your HTTP request to an appropriate datanode, which in turn returns the data or the checksum.

Datanodes expose the following:

  • /browseBlock.jsp, /browseDirectory.jsp, /tail.jsp, /streamFile, /getFileChecksum These are the endpoints that the namenode redirects to when you are browsing filesystem content. You probably wouldn’t use these directly, but this is what’s going on underneath.
  • /blockScannerReport Every datanode verifies its blocks at configurable intervals. This endpoint provides a listing of that verification.

The secondarynamenode exposes a simple status page with information including which namenode it’s talking to, when the last checkpoint was, how big it was, and which directories it’s using.

The jobtracker’s UI is commonly used to look at running jobs, and, especially, to find the causes of failed jobs. The UI is best browsed starting at /jobtracker.jsp. There are over a dozen related pages providing details on tasks, history, scheduling queues, jobs, etc.

Tasktrackers have a simple page (/tasktracker.jsp), which shows running tasks. They also expose /taskLog?taskid=<id> to query logs for a specific task. They use /mapOutput to serve the output of map tasks to reducers, but this is an internal API.

Under the Covers for the Developer and the System Administrator

Internally, Hadoop mostly uses Hadoop IPC to communicate amongst servers. (Part of the goal of the Apache Avro project is to replace Hadoop IPC with something that is easier to evolve and more language-agnostic; HADOOP-6170 is the relevant ticket.) Hadoop also uses HTTP (for the secondarynamenode communicating with the namenode and for the tasktrackers serving map outputs to the reducers) and a raw network socket protocol (for datanodes copying around data).

The following table presents the ports and protocols (including the relevant Java class) that Hadoop uses. This table does not include the HTTP ports mentioned above.

Daemon       Default Port         Configuration Parameter              Protocol                                            Used for
Namenode     8020                 fs.default.name‡                     IPC: ClientProtocol                                 Filesystem metadata operations
Datanode     50010                dfs.datanode.address                 Custom Hadoop Xceiver: DataNode and DFSClient       DFS data transfer
Datanode     50020                dfs.datanode.ipc.address             IPC: InterDatanodeProtocol, ClientDatanodeProtocol  Block metadata operations and recovery
Backupnode   50100                dfs.backup.address                   Same as namenode                                    HDFS metadata operations
Jobtracker   Ill-defined*         mapred.job.tracker                   IPC: JobSubmissionProtocol, InterTrackerProtocol    Job submission, task tracker heartbeats
Tasktracker  Unused local port¤   mapred.task.tracker.report.address   IPC: TaskUmbilicalProtocol                          Communicating with child jobs

‡ This is the port part of hdfs://host:8020/.
* The default is not well-defined. Common values are 8021, 9001, or 8012. See MAPREDUCE-566.
¤ Binds to an unused local port.

That’s quite a few ports! I hope this quick overview has been helpful.

Reference from