Logging the client IP behind Amazon ELB with Apache

When you place your Apache Web Server behind an Amazon Elastic Load Balancer, Apache receives all requests from the ELB’s IP address.

Therefore, if you wish to do anything with the real client IP address, such as logging or whitelisting, you need to make use of the X-Forwarded-For HTTP Header Amazon ELB includes in each request which contains the IP address of the original host.

Solution for logging the true client IP



The one downside is that depending on how ELB treats X-Forwarded-For, it may allow clients to spoof their source IP.

Hopefully this helps out anyone experiencing this issue.

The Future of Presidential Debates

I recently discussed a topic with a friend about having IBM’s Watson moderate a presidential debate or at least using it to instant fact check their claims. My argument would be that you cannot just “fact check” like that per say. The facts that the candidates are quoting are from various studies, all of which have their own degree of bias and/or error. Or they manipulate the language that they use so that they can appear to be saying something when in fact they’re doing something else. That’s politics.

Watson was optimized for Jeopardy’s style of game play. Also, it does not have the linguistic analysis abilities needed to keep up with politics. For example, metaphors, euphemisms, sarcasm and things of the like would all confuse Watson. Some day though.

More info about IBM’s Watson from Yahoo!:

So what makes Watson’s genius possible? A whole lot of storage, sophisticated hardware, super fast processors and Apache Hadoop, the open source technology pioneered by Yahoo! and at the epicenter of big data and cloud computing.
Hadoop was used to create Watson’s “brain,” or the database of knowledge and facilitation of Watson’s processing of enormously large volumes of data in milliseconds. Watson depends on 200 million pages of content and 500 gigabytes of preprocessed information to answer Jeopardy questions. That huge catalog of documents has to be searchable in seconds. On a single computer, it would be impossible to do, but by using Hadoop and dividing the work on to many computers it can be done.
In 2005, Yahoo! created Hadoop and since then has been the most active contributor to Apache Hadoop, contributing over 70 percent of the code and running the world’s largest Hadoop implementation, with more than 40,000 servers. As a point of reference, our Hadoop implementation processes 1.5 times the amount of data in the printed collections in the Library of Congress per day, approximately 16 terabytes of data.

Distributed Apache Flume Setup With an HDFS Sink

I have recently spent a few days getting up to speed with Flume, Cloudera‘s distributed log offering. If you haven’t seen this and deal with lots of logs, you are definitely missing out on a fantastic project. I’m not going to spend time talking about it because you can read more about it in the users guide or in the Quora Flume Topic in ways that are better than I can describe it. But I will tell you about is my experience setting up Flume in a distributed environment to sync logs to a HDFS sink.

I have 3 kinds of servers all running Ubuntu 10.04 locally:

hadoop-agent-1: This is the agent which is producing all the logs
hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc)
hadoop-master-1: This is the flume master node which is sending out all the commands

To add the CDH3 repository:

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:

is the name of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the command above.

(To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb command. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the command above.)

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

This key enables you to verify that you are downloading genuine packages

Initial Setup
On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).

On hadoop-master-1:

First let’s jump onto the agent and set that up. Tune the hadoop-master-1 and hadoop-collector-1 variables appropriately, but change your /etc/flume/conf/flume-site.xml to look like:

Now on to the collector. Same file, different config.

Web Based Setup

I chose to do the individual machine setup via the master web interface. You can get to this pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with public/private DNS IP of your flume master or setup /etc/hosts for a hostname). Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running flume could talk to all ports on all other hosts running flume. You can certainly lock this down to the individual ports for security once everything is up and running.

At this point, you should go to hadoop-agent-1 and hadoop-collector-1 run /etc/init.d/flume-node start. If everything goes well, then the master (whose IP is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the text lines as follows (use what is in bold):

Agent Node: hadoop-agent-1
Source: tailDir(“/var/logs/apache2/”,”.*.log”)
Sink: agentBESink(“hadoop-collector-1”,35853)
Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am ok with losing log lines if the case arises.

Now click Submit Query and go back to the config page to setup the collector:

Agent Node: hadoop-collector-1
Source: collectorSource(35853)
Sink: collectorSink(“hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00″,”server”)
This is going to tell the collector that we are sinking to HDFS with the with an initial folder of ‘flume’. It will then log to sub-folders with “flume/logs/YYYY/MM/DD/HH00” (or 2011/02/03/1300/server-.log). Now click Submit Query and go to the ‘master’ page and you should see 2 commands listed as “SUCCEEDED” in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start:

Always use double quotes (“) since single quotes (‘) aren’t interpreted correctly. UPDATE: Single quotes are interpreted correctly, they are just not accepted intentionally (Thanks jmhsieh)
In your regex, use something like “.*\\.log” since the ‘.’ is part of the regex.
In your regex, ensure that your blackslashes are properly escaped: “foo\\bar” is the correct version of trying to match “foo\bar”.

Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.

At this point everything should work. Admittedly I had a lot of trouble getting to this point. But with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going. The logs sadly aren’t too helpful here in most cases (but look anyway cause they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, then let me know.

Apache Web Server Virtual Hosts

15.10. Virtual Hosts

Using virtual hosts, host several domains with a single web server. In this way, save the costs and administration workload for separate servers for each domain. One of the first web servers that offered this feature, Apache offers several possibilities for virtual hosts:

  • Name-based virtual hosts
  • IP-based virtual hosts
  • Operation of multiple instances of Apache on one machine

15.10.1. Name-Based Virtual Hosts

With name-based virtual hosts, one instance of Apache hosts several domains. You do not need to set up multiple IPs for a machine. This is the easiest, preferred alternative. Reasons against the use of name-based virtual hosts are covered in the Apache documentation.

Configure it directly by way of the configuration file (/etc/apache2/httpd.conf). To activate name-based virtual hosts, a suitable directive must be specified: NameVirtualHost *. * is sufficient to prompt Apache to accept all incoming requests. Subsequently, the individual hosts must be configured:

In the case of Apache 2, however, the paths of log files as shown in the above example (and in any examples further below) should be changed from /var/log/httpd to /var/log/apache2. A VirtualHost entry also must be configured for the domain originally hosted on the server (www.mycompany.com). So in this example, the original domain and one additional domain (www.myothercompany.com) are hosted on the same server.

Just as in NameVirtualHost, a * is used in the VirtualHost directives. Apache uses the host field in the HTTP header to connect the request with the virtual host. The request is forwarded to the virtual host whose ServerName matches the host name specified in this field.

For the directives ErrorLog and CustomLog, the log files do not need to contain the domain name. Here, use a name of your choice.

ServerAdmin designates the e-mail address of the responsible person that can be contacted if problems arise. In the event of errors, Apache gives this address in the error messages it sends to the client.

15.10.2. IP-Based Virtual Hosts

This alternative requires the setup of multiple IPs for a machine. In this case, one instance of Apache hosts several domains, each of which is assigned a different IP. The following example shows how Apache can be configured to host the original IP ( plus two additional domains on additional IPs ( and This particular example only works on an intranet, as IPs ranging from to are not routed on the Internet. Configuring IP Aliasing

For Apache to host multiple IPs, the underlying machine must accept requests for multiple IPs. This is called multi-IP hosting. For this purpose, IP aliasing must be activated in the kernel. This is the default setting in SUSE LINUX.

Once the kernel has been configured for IP aliasing, the commands ifconfig and route can be used to set up additional IPs on the host. These commands must be executed as root. For the following example, it is assumed that the host already has its own IP (such as, which is assigned to the network device eth0.

Enter the command ifconfig to find out the IP of the host. Further IPs can be added with commands such as the following:

All these IPs will be assigned to the same physical network device (eth0). Virtual Hosts with IPs

Once IP aliasing has been set up on the system or the host has been configured with several network cards, Apache can be configured. Specify a separate VirtualHost block for every virtual server:

VirtualHost directives are only specified for the additional domains. The original domain (www.mycompany.com) is configured through its own settings (under DocumentRoot, etc.) outside the VirtualHost blocks.

15.10.3. Multiple Instances of Apache

With the above methods for providing virtual hosts, administrators of one domain can read the data of other domains. To segregate the individual domains, start several instances of Apache, each with its own settings for User, Group, and other directives in the configuration file.

In the configuration file, use the Listen directive to specify the IP handled by the respective Apache instance. For the above example, the directive for the first Apache instance would be as follows:

For the other two instances:

Apache Hadoop 0.23.0 has been released

The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant since it is the first major release of Hadoop in over a year, and incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation, and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, as well as some significant performance improvements which we’ll be covering in future posts. Note, however, that 0.23.0 is not a production release, so please don’t install it on your production cluster.

Credit: http://www.cloudera.com/blog/2011/11/apache-hadoop-0-23-0-has-been-released/

Monitoring and Scaling Production Setup using Scalr

Scalr is a fully redundant, self-curing and self-scaling hosting environment utilizing Amazon’s EC2. It is an initiative delivered by Intridea. It allows you to create server farms through a web-based interface using prebuilt AMI’s for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of. The health of the farm is continuously monitored and maintained. When the Load Average on a type of node goes above a configurable threshold a new node is inserted into the farm to spread the load and the cluster is reconfigured. When a node crashes a new machine of that type is inserted into the farm to replace it. 4 AMI’s are provided for load balancers, mysql databases, application servers, and a generic base image to customize. Scalr allows you to further customize each image, bundle the image and use that for future nodes that are inserted into the farm. You can make changes to one machine and use that for a specific type of node. New machines of this type will be brought online to meet current levels and the old machines are terminated one by one. The project is still very young, but we’re hoping that by open sourcing it the AWS development community can turn this into a robust hosting platform and give users an alternative to the current fee based services available.

Run a Hadoop Cluster on EC2 the easy way using Apache Whirr

To set up the cluster, you need two things-
1.    An AWS account
2.    A local machine running Ubuntu (Mine was running lucid)
The following steps should do the trick-
Step 1 – Add the JDK repository to apt and install JDK (replace lucid with your Ubuntu version, check using lsb_release – c in the terminal) –

Step 2 – Create a file named cloudera.list in /etc/apt/sources.list.d/ and paste the following content in it (again, replace lucid with your version)-

Step 3 – Add the Cloudera Public Key to your repository, update apt,  install Hadoop and Whirr-

Step 4 – Create a file hadoop.properties in your $HOME folder and paste the following content in it.

Step 5 – Replace [AWS ID] and [AWS KEY] with your own AWS Access Identifier and Key. You can find them in the Access Credentials section of your Account. Notice the third line, you can use it to define the nodes that will run on your cluster. This cluster will run a node as combined namenode (nn) and jobtracker (jt) and another node as combined datanode (dn) and tasktracker (tt).

Step 6 – Generate a RSA keypair on your machine. Do not enter any passphrase.

Step 7 – Launch the cluster! Navigate to your home directory and run-

This step will take some time as Whirr creates instances and configures Hadoop on them.

Step 8 – Run a Whirr Proxy. The proxy is required for secure communication between master node of the cluster and the client machine (your Ubuntu machine). Run the following command in a new terminal window-

Step 9 – Configure the local Hadoop installation to use Whirr for running jobs.

Step 10 – Add $HADOOP_HOME to ~/.bashrc file by placing the following line at the end-

Step 11 – Test run a MapReduce job-

Step 12 (Optional) – Destroy the cluster-

Note: This tutorial was prepared using material from the CDH3 Installation Guide

Main Focus

I recently attended Cloudera’s Developer Training for Apache Hadoop in Dallas which will play a big part in our application development.

What is Hadoop?

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.