OS X Mountain Lion and Java

My bash profile defined my JAVA_HOME, and after upgrading to Mountain Lion I saw this when logging in:

This will prompt an install of Java:
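Mountain Lion ships without a Java runtime, so a stale JAVA_HOME breaks at login and any Java invocation triggers Apple's install prompt. A sketch (the JAVA_HOME line is illustrative of a typical pre-upgrade profile):

```shell
# a pre-Mountain-Lion line like this in ~/.bash_profile now fails at login
export JAVA_HOME=$(/usr/libexec/java_home)

# any Java invocation pops up Apple's Java install dialog
java -version
```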

Distributed Apache Flume Setup With an HDFS Sink

I have recently spent a few days getting up to speed with Flume, Cloudera's distributed log offering. If you haven't seen this and deal with lots of logs, you are definitely missing out on a fantastic project. I'm not going to spend time talking about it because you can read more about it in the user guide or in the Quora Flume Topic in ways that are better than I can describe it. But I will tell you about my experience setting up Flume in a distributed environment to sync logs to an HDFS sink.

I have 3 kinds of servers all running Ubuntu 10.04 locally:

hadoop-agent-1: This is the agent which is producing all the logs
hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc)
hadoop-master-1: This is the flume master node which is sending out all the commands

To add the CDH3 repository:

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:
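Per the CDH3 install docs, the contents are two apt source lines like these (lucid is illustrative; substitute your own codename):

```
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
```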

The part before -cdh3 is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the line above.

(To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb command. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the command above.)

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

This key enables you to verify that you are downloading genuine packages.
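The command, per Cloudera's CDH3 docs, is along these lines:

```shell
# fetch the Cloudera public GPG key and register it with apt
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
```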

Initial Setup
On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).

On hadoop-master-1:
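With the CDH3 repository enabled, the installs look like this (package names from the CDH3 Flume packaging):

```shell
# on hadoop-agent-1 and hadoop-collector-1
sudo apt-get update
sudo apt-get install flume-node

# on hadoop-master-1
sudo apt-get install flume-master
```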

First let's jump onto the agent and set that up. Tune the hadoop-master-1 and hadoop-collector-1 variables appropriately, and change your /etc/flume/conf/flume-site.xml to look like:
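A minimal sketch for the agent; the key Flume 0.9.x setting is flume.master.servers, which tells the node where the master lives (treat the rest as an assumption and check the Flume user guide for your version):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
  </property>
</configuration>
```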

Now on to the collector. Same file, different config.
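A sketch for the collector: the same flume.master.servers property plus the collector's own host and port (property names assumed from the Flume 0.9.x configuration; the port matches the collectorSource used later in this post):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
  </property>
  <property>
    <name>flume.collector.event.host</name>
    <value>hadoop-collector-1</value>
  </property>
  <property>
    <name>flume.collector.port</name>
    <value>35853</value>
  </property>
</configuration>
```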

Web Based Setup

I chose to do the individual machine setup via the master web interface. You can get to it by pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with the public/private DNS or IP of your flume master, or set up /etc/hosts entries for the hostname). Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running flume could talk to all ports on all other hosts running flume. You can certainly lock this down to the individual ports for security once everything is up and running.

At this point, you should go to hadoop-agent-1 and hadoop-collector-1 and run /etc/init.d/flume-node start. If everything goes well, the master (whose IP is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the text lines as follows (use the values shown below):

Agent Node: hadoop-agent-1
Source: tailDir("/var/logs/apache2/", ".*.log")
Sink: agentBESink("hadoop-collector-1", 35853)
Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am ok with losing log lines if the case arises.

Now click Submit Query and go back to the config page to setup the collector:

Agent Node: hadoop-collector-1
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00", "server")
This is going to tell the collector that we are sinking to HDFS with an initial folder of 'flume'. It will then log to sub-folders of the form flume/logs/YYYY/MM/DD/HH00 (e.g. 2011/02/03/1300/server-.log). Now click Submit Query and go to the 'master' page, where you should see 2 commands listed as "SUCCEEDED" in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start):

Always use double quotes (") since single quotes (') aren't interpreted correctly. UPDATE: Single quotes are interpreted correctly; they are just intentionally not accepted (thanks jmhsieh)
In your regex, use something like ".*\\.log" since the '.' is part of the regex.
In your regex, ensure that your backslashes are properly escaped: "foo\\bar" is the correct way to try to match "foo\bar".

Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.

At this point everything should work. Admittedly I had a lot of trouble getting to this point, but with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going. The logs sadly aren't too helpful here in most cases (but look anyway, because they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, let me know.

Intro to Chef

Chef is an incredible tool, but despite its beginnings in 2008/2009 it still lacks an effective quick start, or even an official "hello world", so it takes too long to really get started as you desperately search for tutorials/examples/use cases. The existing quick starts and tutorials take too long and fail to explain Chef's scope or what it is doing. This is really unfortunate, because the world could be a better place if more people used Chef (or even Puppet) and we didn't have to guess how to configure a server for various applications.

Official Chef Explanation

Officially, Chef is a “configuration management tool” where you write “recipes”. It’s written in Ruby and you’ll need at least version 1.8.6. Here’s the official Chef Quick Start guide but I think it fails at succinctly presenting the scope of Chef or even how to use it, hence this document.

Simpler Chef Explanation

Put more simply, Chef is a Ruby DSL (domain-specific language) for configuring GNU/Linux (or BSD) machines (Windows is not well supported). It has 2 flavors, "Chef Server" and "Chef Solo"; in this document I'm talking about Chef Solo because it's easier to get started with, and it works well as a complement to Rails apps.

Simplest Chef Explanation and actual working example

Put even more simply: Chef is a Ruby script that uses "recipes" (a recipe is a Ruby file that uses the Chef DSL) to install software and run scripts on GNU/Linux servers. You can run Chef over and over again safely, because most recipes know not to, for example, reinstall something that already exists (sometimes you have to code this check yourself, but most of the DSL's built-in resources do it already).

Think of Chef as having 4 components:
install a binary/executable (chef-solo), installable via Ruby Gems
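The install itself is a single gem command (you may need sudo depending on your Ruby setup):

```shell
gem install chef
```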

create one or more ruby files that they call “recipes” in a structure like this
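For a single cookbook named vim, that structure is the standard Chef cookbook layout, roughly:

```
my_cookbooks/
└── vim/
    └── recipes/
        └── default.rb
```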

Vim install recipe example
If we want a recipe for installing vim, here's one quick and simple way to do it:

Just duplicate the directory structure I have listed above, and in the default.rb file you only need 1 line. That "package" method knows which package-management software to use depending on what OS is running, and then leverages it.
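Here is that one line, in my_cookbooks/vim/recipes/default.rb:

```ruby
# uses the platform's package manager (apt, yum, etc.) to install vim
package "vim"
```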

Create MySQL DB recipe example
There are a lot of methods available for recipes. Take “bash” for example. Pass the “bash” method a block, and inside the block you can use methods like “code” (which executes a string of bash commands) and “user” which specifies which OS user to run the commands as.

bash "really awesome way to create a mysql database from chef using the bash method" do
  user "root"
  code "mysql -u root -e 'CREATE DATABASE ottobib_production'"
  # dont if the db already exists
  not_if "mysql -u root -e 'SHOW DATABASES' | grep -q ottobib_production"
end

JSON file with array of recipes that you’ll point the binary at
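A minimal example of that JSON file (I'll call it node.json; the filename is your choice). The recipe names assume the vim and mysql cookbooks from this post:

```json
{
  "recipes": ["vim", "mysql"]
}
```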

Finally, a ruby file with more configuration options
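A minimal version, conventionally named solo.rb; the paths are from this post's example layout and are otherwise arbitrary:

```ruby
# solo.rb — chef-solo configuration
cookbook_path "/home/evan/my_cookbooks"
json_attribs "/home/evan/node.json"
```

Then run it with something like: chef-solo -c solo.rb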

NOW you can run it over and over again, and your system will end up with Vim and an ottobib_production database. If you want to get CRAZY: add a recipe that checks out the latest copy of your application source code, and then set up a cron job to execute your chef script every minute!

Here’s what your /home/evan/my_cookbooks dir should look like:
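Assuming the vim cookbook plus a mysql cookbook holding the database recipe, roughly:

```
my_cookbooks/
├── vim/
│   └── recipes/
│       └── default.rb
└── mysql/
    └── recipes/
        └── default.rb
```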


Installing Ganglia on CentOS


In this article we will install the Ganglia monitoring system on a set of machines running CentOS. There are two kinds of machines involved:

  • The meta node: one machine that receives all measurements and presents them to a client through a website.
  • The monitoring nodes: machines that run only the monitoring daemon and send the measurements to the meta node.

Meta node

For this example we assume the meta node has a fixed IP address that the monitoring nodes can reach. We start by installing the necessary software:

To enable EPEL for CentOS 5

For 32-bits

For 64-bits

To enable EPEL for CentOS 4

For 32-bits

For 64-bits


If you want to monitor the meta node as well as the monitoring nodes, edit the gmond configuration file /etc/gmond.conf:

Start the gmond service and make sure it starts at boot:
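A sketch of both steps; the cluster name is arbitrary, and 8649 is gmond's default port:

```shell
# in /etc/gmond.conf, set a cluster name and point UDP traffic at the meta node, e.g.:
#   cluster { name = "my cluster" }
#   udp_send_channel { host = <meta node IP>  port = 8649 }
#   udp_recv_channel { port = 8649 }
# then enable and start the daemon:
chkconfig gmond on
service gmond start
```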

Edit the gmetad configuration file /etc/gmetad.conf:

Start the gmetad service and make sure it starts at boot:
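The main thing in gmetad.conf is a data_source line naming the cluster; then enable the service:

```shell
# in /etc/gmetad.conf, a line such as:
#   data_source "my cluster" localhost
chkconfig gmetad on
service gmetad start
```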

Enable the http daemon, to be able to see the pretty monitoring pictures:
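On CentOS:

```shell
chkconfig httpd on
service httpd start
```

The Ganglia web frontend is then typically served at http://&lt;meta node&gt;/ganglia/ (the exact path depends on the ganglia-web package).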

Monitoring nodes

On all the monitoring nodes start by installing the necessary software:

Edit the gmond configuration file /etc/gmond.conf. You can use an exact replica of the gmond configuration file shown for the meta node.
Start the gmond service and make sure it starts at boot:
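As on the meta node:

```shell
chkconfig gmond on
service gmond start
```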

If you would like to emit your own measurements (called metrics in Ganglia) and view them on the website, call the gmetric program:

To use the output of a program you wrote as a metric, simply call it like this, making sure to use backticks (`) instead of quotes ('):
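For example (flag names per the gmetric man page; /usr/local/bin/read_temp is a hypothetical script standing in for your own program):

```shell
# publish a one-off metric
gmetric --name=my_metric --value=42 --type=int32 --units=widgets

# use a program's output as the value (note the backticks)
gmetric --name=disk_temp --value=`/usr/local/bin/read_temp` --type=int32
```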

Installing MySQL 5.5.12 on Ubuntu 10.04 (Lucid)

Here we have a step by step tutorial for installing MySQL 5.5.12 on a machine with a fresh install of Ubuntu 10.04 (lucid).

I could not find a tutorial and we needed MySQL 5.5 over MySQL 5.1 for one of our development farms.

Let’s start:

Get MySQL 5.5.12
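Download and unpack the source (the mirror URL is illustrative; any MySQL archive mirror carrying 5.5.12 works):

```shell
wget http://downloads.mysql.com/archives/mysql-5.5/mysql-5.5.12.tar.gz
tar xzf mysql-5.5.12.tar.gz
```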

Change directory into mysql-5.5.12 and make && make install
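MySQL 5.5 switched its build system to CMake, so the configure step looks like this (the install prefix is an assumption; adjust to taste):

```shell
cd mysql-5.5.12
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local/mysql
make && make install
```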

Chown/Chgrp and copy files
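Assuming the conventional mysql user/group and the /usr/local/mysql prefix from the build step:

```shell
groupadd mysql
useradd -r -g mysql mysql
cd /usr/local/mysql
chown -R mysql .
chgrp -R mysql .
# copy a sample config and the init script
cp support-files/my-medium.cnf /etc/my.cnf
cp support-files/mysql.server /etc/init.d/mysql.server
```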

Start the services and create MySQL users
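These steps follow the standard MySQL source-install postinstall sequence:

```shell
cd /usr/local/mysql
scripts/mysql_install_db --user=mysql
chown -R root .
chown -R mysql data
bin/mysqld_safe --user=mysql &
# set a password for the root account (replace new-password)
bin/mysqladmin -u root password 'new-password'
```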

Task: list MySQL databases. mysql is a simple command-line tool and it is very easy to use. Invoke it from the prompt of your command interpreter as follows:


You may need to provide the mysql username, password and hostname; use:
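For example (substitute your own credentials and host):

```shell
mysql -u root -p -h localhost
```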

To list databases, type the following command:
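At the mysql> prompt:

```
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
+--------------------+
```

Your list may include more databases, such as performance_schema or test.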

information_schema and mysql are the names of two databases. To use one of these databases and list its available tables, type the following two commands:

mysql> use mysql;

Output:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Now list tables:
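Still at the mysql> prompt:

```
mysql> show tables;
```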

Run a Hadoop Cluster on EC2 the easy way using Apache Whirr

To set up the cluster, you need two things-
1.    An AWS account
2.    A local machine running Ubuntu (Mine was running lucid)
The following steps should do the trick-
Step 1 – Add the JDK repository to apt and install the JDK (replace lucid with your Ubuntu version; check using lsb_release -c in the terminal) –
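At the time, the Sun JDK lived in the Canonical partner repository, so the step looked roughly like this (replace lucid with your codename):

```shell
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
```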

Step 2 – Create a file named cloudera.list in /etc/apt/sources.list.d/ and paste the following content in it (again, replace lucid with your version)-
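Per the CDH3 docs, the contents are two apt source lines like these (lucid is illustrative):

```
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
```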

Step 3 – Add the Cloudera Public Key to your repository, update apt, and install Hadoop and Whirr –
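Following the CDH3 guide, something like (package names per CDH3):

```shell
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install hadoop-0.20 whirr
```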

Step 4 – Create a file hadoop.properties in your $HOME folder and paste the following content in it.
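The CDH3 guide's version of this file looks like the following; the cluster name is arbitrary, and [AWS ID]/[AWS KEY] are placeholders filled in during the next step:

```
whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt
whirr.provider=ec2
whirr.identity=[AWS ID]
whirr.credential=[AWS KEY]
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.hadoop-install-runurl=cloudera/cdh/install
whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
```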

Step 5 – Replace [AWS ID] and [AWS KEY] with your own AWS Access Identifier and Key. You can find them in the Access Credentials section of your account. Notice the third line: you can use it to define the nodes that will run on your cluster. This cluster will run one node as combined namenode (nn) and jobtracker (jt), and another node as combined datanode (dn) and tasktracker (tt).

Step 6 – Generate an RSA keypair on your machine. Do not enter any passphrase.
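For example:

```shell
ssh-keygen -t rsa -P ''
```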

Step 7 – Launch the cluster! Navigate to your home directory and run-
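Assuming the properties file from Step 4 sits in your home directory:

```shell
whirr launch-cluster --config hadoop.properties
```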

This step will take some time as Whirr creates instances and configures Hadoop on them.

Step 8 – Run a Whirr Proxy. The proxy is required for secure communication between master node of the cluster and the client machine (your Ubuntu machine). Run the following command in a new terminal window-
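Whirr writes a proxy script under ~/.whirr/&lt;cluster-name&gt;/; run it and leave that terminal open (replace myhadoopcluster with your whirr.cluster-name):

```shell
. ~/.whirr/myhadoopcluster/hadoop-proxy.sh
```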

Step 9 – Configure the local Hadoop installation to use Whirr for running jobs.
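The CDH3 guide does this by creating a new conf directory from the hadoop-site.xml Whirr generated and registering it with update-alternatives (replace myhadoopcluster with your whirr.cluster-name):

```shell
sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
sudo rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
sudo cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
```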

Step 10 – Add $HADOOP_HOME to ~/.bashrc file by placing the following line at the end-
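CDH3 installs Hadoop under /usr/lib/hadoop-0.20, so the line is:

```shell
export HADOOP_HOME=/usr/lib/hadoop-0.20
```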

Step 11 – Test run a MapReduce job-
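The CDH3 guide's smoke test is the wordcount example from the Hadoop examples jar:

```shell
hadoop fs -mkdir input
hadoop fs -put $HADOOP_HOME/CHANGES.txt input
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
hadoop fs -cat output/part-* | head
```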

Step 12 (Optional) – Destroy the cluster-
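This terminates the EC2 instances, so you stop paying for them:

```shell
whirr destroy-cluster --config hadoop.properties
```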

Note: This tutorial was prepared using material from the CDH3 Installation Guide

Install Maven 3 CENTOS 5.5

This tutorial will show you how to install Maven 3 on CentOS 5.5.

First, cd to /downloads/

cd /downloads/

Next, let’s tar/unzip Maven
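The version and mirror below are illustrative; grab a current Maven 3 binary tarball from the Apache archives:

```shell
cd /downloads/
wget http://archive.apache.org/dist/maven/binaries/apache-maven-3.0.4-bin.tar.gz
tar xzf apache-maven-3.0.4-bin.tar.gz
mv apache-maven-3.0.4 /usr/local/maven
```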

Then, we need to vi /etc/profile OR /etc/bashrc and add the following lines at the end:
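Assuming Maven was moved to /usr/local/maven as above (adjust the path to wherever you unpacked it):

```shell
export M2_HOME=/usr/local/maven
export PATH=$M2_HOME/bin:$PATH
```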

To check that Maven was installed, type mvn -version, but first you might need to type "exit" and log back in so the new profile settings take effect. That's it, simple enough.

Installing Git on CENTOS 5

The following packages are dependencies for Git, so make sure they are installed.
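On CentOS 5 these are the commonly cited build dependencies (adjust as yum reports):

```shell
yum install zlib-devel openssl-devel perl cpio expat-devel gettext-devel
```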

Next, if you do not already have Curl installed, follow the next step to get it up and running.
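A source build into /usr/local, with an illustrative version number; check curl's download page for a current tarball:

```shell
wget http://curl.haxx.se/download/curl-7.19.7.tar.gz
tar xzf curl-7.19.7.tar.gz
cd curl-7.19.7
./configure --prefix=/usr/local
make && make install
```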

Next, make sure /usr/local/lib is in your ld.so.conf; this is required for git-http-push to correctly link against the Curl version you are installing.

(Insert the following)


Save the file, then run:
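The line to insert is /usr/local/lib, and the follow-up command is ldconfig; equivalently, from the shell:

```shell
echo '/usr/local/lib' >> /etc/ld.so.conf
ldconfig
```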

Now download and install Git
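Again with an illustrative version number; grab a current tarball from the git site or the kernel.org mirrors:

```shell
wget https://www.kernel.org/pub/software/scm/git/git-1.7.4.1.tar.gz
tar xzf git-1.7.4.1.tar.gz
cd git-1.7.4.1
./configure --prefix=/usr/local
make && make install
git --version
```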

That's all there is to it! Simple enough.