The Future of Presidential Debates

I recently discussed with a friend the idea of having IBM’s Watson moderate a presidential debate, or at least using it to fact-check the candidates’ claims in real time. My argument is that you cannot just “fact check” like that, per se. The facts the candidates quote come from various studies, all of which have their own degree of bias and/or error. Or they manipulate the language they use so that they appear to be saying one thing when in fact they’re doing something else. That’s politics.

Watson was optimized for Jeopardy’s style of game play. It also does not have the linguistic analysis abilities needed to keep up with politics: metaphors, euphemisms, sarcasm and the like would all confuse Watson. Some day, though.

More info about IBM’s Watson from Yahoo!:

So what makes Watson’s genius possible? A whole lot of storage, sophisticated hardware, super fast processors and Apache Hadoop, the open source technology pioneered by Yahoo! and at the epicenter of big data and cloud computing.
Hadoop was used to create Watson’s “brain,” or the database of knowledge and facilitation of Watson’s processing of enormously large volumes of data in milliseconds. Watson depends on 200 million pages of content and 500 gigabytes of preprocessed information to answer Jeopardy questions. That huge catalog of documents has to be searchable in seconds. On a single computer, it would be impossible to do, but by using Hadoop and dividing the work on to many computers it can be done.
In 2005, Yahoo! created Hadoop and since then has been the most active contributor to Apache Hadoop, contributing over 70 percent of the code and running the world’s largest Hadoop implementation, with more than 40,000 servers. As a point of reference, our Hadoop implementation processes 1.5 times the amount of data in the printed collections in the Library of Congress per day, approximately 16 terabytes of data.

How to copy /etc/hosts to all machines

I recently installed Hadoop CDH4 on a new 10 node test cluster and instead of manually entering all the hosts in /etc/hosts I wrote a quick command to copy the current hosts file to all machines.
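The exact command isn’t reproduced here; a minimal sketch of the idea, with placeholder hostnames and a root login, looks like this:

# Push the local hosts file to every node in the cluster (node1..node10 are placeholders).
for i in $(seq 1 10); do
  scp /etc/hosts root@node$i:/etc/hosts
done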

Hopefully this saves you some time!

Distributed Apache Flume Setup With an HDFS Sink

I have recently spent a few days getting up to speed with Flume, Cloudera’s distributed log offering. If you haven’t seen it and you deal with lots of logs, you are definitely missing out on a fantastic project. I’m not going to spend time talking about it here because the user guide and the Quora Flume topic describe it better than I can. What I will tell you about is my experience setting up Flume in a distributed environment to sync logs to an HDFS sink.

Context
I have 3 kinds of servers all running Ubuntu 10.04 locally:

hadoop-agent-1: This is the agent which is producing all the logs
hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc)
hadoop-master-1: This is the flume master node which is sending out all the commands

To add the CDH3 repository:

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:
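The file contents aren’t shown here; based on Cloudera’s CDH3 packaging documentation of that era, the entries look roughly like this (verify the repository URL against Cloudera’s docs):

deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib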

where <RELEASE> is the codename of your distribution, which you can find by running lsb_release -c. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the command above.

(To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb command. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the command above.)

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
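The command isn’t shown here; per Cloudera’s CDH3 documentation it was roughly the following (treat the URL as an approximation and check the current docs):

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -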

This key enables you to verify that you are downloading genuine packages.

Initial Setup
On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector).
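The install command isn’t shown; with the CDH3 apt repository configured above it should be something like:

sudo apt-get update
sudo apt-get install flume-node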

On hadoop-master-1:
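Again the command isn’t shown; presumably the Flume master package is installed here (flume-master is the CDH3 package name, to the best of my knowledge):

sudo apt-get install flume-master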

First let’s jump onto the agent and set that up. Tune the hadoop-master-1 and hadoop-collector-1 values appropriately, and change your /etc/flume/conf/flume-site.xml to look like:
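The original flume-site.xml isn’t reproduced here; a minimal sketch for the agent, using the Flume 0.9.x property names I believe were current at the time, would be:

<?xml version="1.0"?>
<configuration>
  <!-- Where the Flume master lives -->
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
  </property>
  <!-- Default collector this agent ships events to -->
  <property>
    <name>flume.collector.event.host</name>
    <value>hadoop-collector-1</value>
  </property>
  <property>
    <name>flume.collector.port</name>
    <value>35853</value>
  </property>
</configuration>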

Now on to the collector. Same file, different config.
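Again, a sketch rather than the exact file; the collector mainly needs to know where the master is and which port to listen on:

<?xml version="1.0"?>
<configuration>
  <!-- Where the Flume master lives -->
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master-1</value>
  </property>
  <!-- Port the collector listens on for agent traffic -->
  <property>
    <name>flume.collector.port</name>
    <value>35853</value>
  </property>
</configuration>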

Web Based Setup

I chose to do the individual machine setup via the master web interface. You can get to it by pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with the public/private DNS name or IP of your Flume master, or set up /etc/hosts with a hostname). Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running Flume could talk to all ports on all other hosts running Flume. You can certainly lock this down to the individual ports for security once everything is up and running.

At this point, you should go to hadoop-agent-1 and hadoop-collector-1 and run /etc/init.d/flume-node start on each. If everything goes well, then the master (whose IP is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the text fields as follows (use the values shown below):

Agent Node: hadoop-agent-1
Source: tailDir("/var/logs/apache2/",".*.log")
Sink: agentBESink("hadoop-collector-1",35853)
Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am ok with losing log lines if the case arises.

Now click Submit Query and go back to the config page to set up the collector:

Agent Node: hadoop-collector-1
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00","server")
This is going to tell the collector that we are sinking to HDFS with an initial folder of ‘flume’. It will then log to sub-folders of the form “flume/logs/YYYY/MM/DD/HH00” (e.g. 2011/02/03/1300/server-.log). Now click Submit Query and go to the ‘master’ page and you should see 2 commands listed as “SUCCEEDED” in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start):

Always use double quotes (") since single quotes (') aren’t interpreted correctly. UPDATE: single quotes are interpreted correctly, they are just intentionally not accepted (thanks jmhsieh).
In your regex, use something like ".*\\.log" since the ‘.’ is part of the regex.
In your regex, ensure that your backslashes are properly escaped: "foo\\bar" is the correct way to match "foo\bar".

Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.

At this point everything should work. Admittedly I had a lot of trouble getting to this point. But with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going. The logs sadly aren’t too helpful here in most cases (but look anyway, because they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, let me know.

Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode

From Facebook:

The Hadoop Distributed Filesystem (HDFS) forms the basis of many large-scale storage systems at Facebook and throughout the world. Our Hadoop clusters include the largest single HDFS cluster that we know of, with more than 100 PB physical disk space in a single HDFS filesystem. Optimizing HDFS is crucial to ensuring that our systems stay efficient and reliable for users and applications on Facebook.

Read more:
https://www.facebook.com/notes/facebook-engineering/under-the-hood-hadoop-distributed-filesystem-reliability-with-namenode-and-avata/10150888759153920

Foreign Key MySQL Tutorial

Creating a foreign key in MySQL is part of enforcing referential integrity in the database. A foreign key connects two tables. A foreign key is used in conjunction with a primary key, which is the main identifier for a record in the dataset. For instance, a primary key could be used on a customers table: the customer ID is a unique field that distinctly identifies the customer. A foreign key is then placed on the orders table, which connects each order to the customer who placed it.

Primary Key
Before creating a foreign key, a table that holds a primary key field needs to be created for referential integrity. In this example, creating the table for customers and orders can be accomplished using the MySQL command line. The syntax for creating a table is below:
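The original statement isn’t shown; a sketch of a customers table with CustId as the primary key (the other columns are made up for illustration) looks like this:

CREATE TABLE customers (
  CustId INT NOT NULL AUTO_INCREMENT,  -- uniquely identifies the customer
  FirstName VARCHAR(50),               -- illustrative column
  LastName VARCHAR(50),                -- illustrative column
  PRIMARY KEY (CustId)
) ENGINE=InnoDB;  -- InnoDB so the foreign key in the next section is enforced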

In this example, a table is created using the “create table” statement. If a primary key is undetermined, the programmer can leave out the primary key clause until he knows on which field to place the key. However, it’s important for tables to contain a primary key, because the key’s index speeds up lookups. In this example, a primary key is created on the “CustId” field, which is used to distinctly identify the customer. The values assigned to a primary key column must be unique.

Foreign Key
Now that the primary key is created, a foreign key is created on the orders table. Again, if the database developer is unsure of the foreign key to use at the time of table creation, it can be added later. The following code creates an orders table with a foreign key that points to the customers table:
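Again, the columns other than the keys are made up; the important parts are the PRIMARY KEY on OrderId and the FOREIGN KEY pointing back at customers:

CREATE TABLE orders (
  OrderId INT NOT NULL AUTO_INCREMENT,  -- uniquely identifies the order
  CustId INT NOT NULL,                  -- links the order to a customer
  OrderDate DATE,                       -- illustrative column
  PRIMARY KEY (OrderId),
  FOREIGN KEY (CustId) REFERENCES customers (CustId)
) ENGINE=InnoDB;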

Notice that a primary key was created for this table as well using the OrderId, which is also a unique value. The statement that defines the foreign key is the last one in the table syntax. It defines the foreign key and tells the database where its primary key is located. In this example, the CustId field in the orders table references the CustId in the customers table.

Create a MySQL table with a primary key

A primary key uniquely identifies a row in a table. One or more columns may be identified as the primary key. The values in a single column used as the primary key must be unique (like a person’s social security number). When more than one column is used, the combination of column values must be unique.
When creating the contacts table described in Create a basic MySQL table, the column contact_id can be made a primary key using PRIMARY KEY(contact_id) as with the following SQL command:
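The command itself isn’t shown; assuming the contacts table has just contact_id and name (any other columns come from the earlier post), it would look something like:

CREATE TABLE contacts (
  contact_id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100),
  PRIMARY KEY (contact_id)  -- contact_id uniquely identifies each row
);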

Additional columns can be identified as part of the primary key with a comma separated list in the PRIMARY KEY command, like PRIMARY KEY (contact_id, name).

Install Cloudera Manager 3.7.5 Free Edition CentOS 6.2

I recently installed Cloudera Manager Free Edition on my test cluster. I ran into a few problems during install so hopefully this will help you with your troubles.

Cloudera Manager Free Edition will build and configure your single or multi-node CDH cluster and help you manage future changes to it. This software is free to use for up to 50 nodes with no term limit.

Cloudera Documentation (PDF):

https://ccp.cloudera.com/download/attachments/18779712/scm-3.7.5-free-installation-guide.pdf?version=2&modificationDate=1334944267000

To download and run the Cloudera Manager installer:
1. Download cloudera-manager-installer.bin from the Cloudera Downloads page

2. After downloading cloudera-manager-installer.bin, change it to have executable permission.

3. Run cloudera-manager-installer.bin.
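The commands for steps 2 and 3 aren’t shown; they amount to making the binary executable and running it as root:

chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin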

4. Read the Cloudera Manager Readme and then click Next.

5. Read the Cloudera Manager License and then click Next. Click Yes to confirm you accept the license.

6. Read the Oracle Binary Code License Agreement and then click Next. Click Yes to confirm you accept the Oracle Binary Code License Agreement.

The Cloudera Manager installer begins installing the Oracle JDK (if necessary), CDH3, and Cloudera Manager repo files and then installs the packages. The installer also installs the embedded PostgreSQL database and the Cloudera Manager Server.

Note
If the error message “Failed to start server” appears while running cloudera-manager-installer.bin, exit the installation program and check the Cloudera Manager Server log file for the cause. If SELinux is the problem, you can permanently disable SELinux by setting the following property in the /etc/selinux/config file on the Cloudera Manager Server host:
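The property isn’t shown above; the standard setting is:

# /etc/selinux/config
SELINUX=disabled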

After editing the /etc/selinux/config file, reboot your system.

7. Click Close.

8. Click Finish to quit the installer program.

Note

If the installation is interrupted for some reason, you may need to clean up before you can re-run it. See Uninstalling Cloudera Manager Free Edition.

Step 2: Start the Cloudera Manager Admin Console

The Cloudera Manager Admin console enables you to use Cloudera Manager to configure, manage, and monitor Hadoop on your cluster. In this release, the Cloudera Manager Admin console supports the following browsers:
• Internet Explorer 8 and 9
• Google Chrome
• Safari 5
• Firefox 3.6 and later

To start the Cloudera Manager Admin console:
1. In a web browser, type the following URL: http(s)://serverhost:port

where:
“serverhost” is the fully-qualified domain name or IP address of the host machine where the Cloudera Manager Server is installed.

“port” is the port configured for the Cloudera Manager Server. The default port is 7180.

For example, if you are on the host where the Cloudera Manager Server is installed, enter the following URL: http://localhost:7180/

If you are on another host, use a URL such as the following:
http://myhost.example.com:7180/

The login screen for Cloudera Manager appears.

2. Log into Cloudera Manager. The default credentials are:

Username: admin (you can add other admin user accounts and remove the default admin account later using Cloudera Manager after you run the wizard in the next section)

Password: admin (you can change the default password later)

Note
You must have “127.0.0.1 localhost” in your /etc/hosts file.

Test that port 7180 is open:
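The exact check isn’t shown; one simple way (assuming telnet is installed) is:

telnet localhost 7180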

Change from DHCP to Static IP on Fedora/CentOS

Static IP addresses are manually assigned to a computer by an administrator. The exact procedure varies according to platform. This contrasts with dynamic IP addresses, which are assigned either by the computer interface or host software itself, as in Zeroconf, or assigned by a server using Dynamic Host Configuration Protocol (DHCP).

Even though IP addresses assigned using DHCP may stay the same for long periods of time, they can generally change. In some cases, a network administrator may implement dynamically assigned static IP addresses. In this case, a DHCP server is used, but it is specifically configured to always assign the same IP address to a particular computer. This allows static IP addresses to be configured centrally, without having to specifically configure each computer on the network in a manual procedure.

In the absence of both an administrator (to assign a static IP address) and a DHCP server, the operating system may assign itself an IP address using stateless autoconfiguration methods, such as Zeroconf.

1. Log in as root

2. Get your current IP address with this command:

ifconfig

The output will look something like this:

Or you can get the IP from the network configuration file with this command:

cat /etc/sysconfig/network-scripts/ifcfg-eth0

The output will look like this (I have it set to DHCP):
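The output isn’t reproduced here; a typical DHCP-configured ifcfg-eth0 on Fedora/CentOS looks roughly like this (the MAC address is a placeholder):

DEVICE=eth0
HWADDR=00:0C:29:00:00:00
BOOTPROTO=dhcp
ONBOOT=yes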

3. Now we are going to change from DHCP to a static IP. Edit the /etc/sysconfig/network-scripts/ifcfg-eth0 file with this command:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
(you can use another editor if you want, like gedit, pico, nano, or emacs)

Change it from the DHCP configuration shown above to a static one, as in the sketch below.
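Example values only; substitute your own IP address, netmask, and gateway:

DEVICE=eth0
HWADDR=00:0C:29:00:00:00
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.100
NETMASK=255.255.255.0
GATEWAY=192.168.1.1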

4. Now save your changes and reboot your server.

The One Thing in Life You Can Control: Effort

It was right around then I heard something that I would hear a lot once I bought the Mavs.

In sports, the only thing a player can truly control is effort. The same applies to business. The only thing any entrepreneur, salesperson or anyone in any position can control is their effort.

I had to kick myself in the ass and recommit to getting up early, staying up late and consuming everything I possibly could to get an edge. I had to commit to making the effort to be as productive as I possibly could. It meant making sure that every hour of the day that I could contact a customer was selling time, and when customers were sleeping, I was doing things that prepared me to make more sales and to make my company better.

And finally, I had to make sure I wasn’t lying to myself about how hard I was working. It would have been easy to judge effort by how many hours a day passed while I was at work. That’s the worst way to measure effort. Effort is measured by setting goals and getting results. What did I need to do to close this account? What did I need to do to win this segment of business? What did I need to do to understand this technology or that business better than anyone? What did I need to do to find an edge? Where does that edge come from, and how was I going to get there?

The one requirement for success in our business lives is effort. Either you make the commitment to get results or you don’t.

–Mark Cuban
“How to Win at the Sport of Business.”

5 lessons for money seeking entrepreneurs

Here are 5 lessons for money seeking entrepreneurs from the Shark Tank:

1. Know your numbers: I can’t tell you how many times variations on the following conversation occur on the show:

Kevin O’Leary (shark): “Let me get this right. You are offering 20% of your company in exchange for an investment of $500,000?”
Entrepreneur: “You bet, we have a great product!”
O’Leary: “But that means that you are valuing your company at $2.5 million. You only have $120,000 in sales. Your company is not worth close to that my friend.”
Entrepreneur: “Uh, uh . . .”

Investors want to know how you are valuing your business, how much money you are going to make, how much profit you have made, and why you need their money. Potential is great and all, but numbers talk, BS walks.

2. Understand that money has no feelings: Kevin O’Leary is fond of pointing this out. This is about making a profit for the investors, nothing more, nothing less. How, exactly, will you do that? How you feel about your business is fairly irrelevant.

3. Have a real business that can be scaled: The business cannot be you doing labor, unless that labor can be duplicated en masse. If you make homemade cedar toy chests that cost $300 but take 25 hours to build, it is difficult to see how that is a business that can be ramped up to sell mass quantities. A business that makes widgets for $2 that retail for $4 is a business that is scalable.

4. Have real (not false) confidence and be emotionally intelligent: Yes, it’s all about the numbers, but then again, it’s not all about the numbers. You have to be a cheerleader for your business while being able to read the room.

Says Barbara Corcoran, “Make sure you can sell your product, because if you as the head of the company can’t sell it, who will? Also be sure you’re ready to answer the two key questions too many of the entrepreneurs who come on the show can’t: 1) What will you do with my money? and 2) How will I get my investment back?”

5. Be unique in the marketplace: The products that get some love are usually those that are 1) different, and 2) clearly serve a market need. Again, Barbara Corcoran puts it well: “If your business idea clearly answers a need in the marketplace, it’s probably a good idea. If the need is already being met by well-entrenched competitors, it can still be a good idea if it’s a new, cheaper or more clever way of doing it.”

Follow these rules and hopefully you won’t be eaten by the sharks.