Monday, June 25, 2012

Install Apache Hadoop in a Single Node mode

Requirements:

You need the following software before you can proceed:

  1. Linux Machine (I am using Ubuntu)

  2. Java JDK 1.6 (Oracle's version is preferred)

  3. The latest Hadoop distribution* (1.0.3 in my case)

  4. Install an SSH server on the Linux machine (on Ubuntu: sudo apt-get install openssh-server)

  5. Install rsync (on Ubuntu: sudo apt-get install rsync)

  6. Set up passphraseless SSH

    1. If you type ssh localhost and SSH asks for a password, perform these steps:


    2. ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

    3. Type ssh localhost again; it should no longer ask for a password.
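The steps above can be run as one short snippet. This version uses an RSA key instead of DSA, since newer OpenSSH releases no longer generate DSA keys (either type works for passphraseless login):

```shell
# Create the key only if one does not already exist, then authorize it
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```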




*It is better to download the .zip or .tar.gz version instead of the .deb or .rpm package.

Brief Overview of Hadoop:

Before we install Hadoop, you should know a few things about it. I will provide a brief overview of each topic; please refer to Apache Hadoop's site for more details.

Hadoop has three important pieces:

  1. Distributed File System

  2. Map Reduce Framework

  3. Hadoop Common


Distributed File System

The DFS (Distributed File System) has two important pieces: the Name Node and the Data Node.

  1. Name Node: The Name Node is like a controller in DFS. Its primary responsibilities include:

    • Dividing files into chunks (blocks) and distributing them across Data Nodes

    • Maintaining a record of which blocks are present on which Data Nodes

    • Making sure the configured replication factor of each file is always met, etc.



  2. Data Node: This node's primary responsibility is to store the data blocks and periodically report to the Name Node which blocks it is holding.
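As a toy illustration of the chunking idea (plain shell, not Hadoop code): HDFS divides a file into fixed-size blocks, much as split does with a local file. HDFS's default block size is 64 MB; the sketch below cuts a 4 MB file into 1 MB pieces:

```shell
# Make a 4 MB file, then cut it into 1 MB "blocks" the way the
# Name Node logically divides a file across Data Nodes
dd if=/dev/zero of=/tmp/bigfile bs=1048576 count=4 2>/dev/null
split -b 1048576 /tmp/bigfile /tmp/block_
ls /tmp/block_*    # four pieces: /tmp/block_aa through /tmp/block_ad
```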


Map Reduce Framework

The MR (Map Reduce) framework is used to create and run jobs that work on the data stored in DFS. Jobs are submitted using a job client. There are two important pieces: the Job Tracker and the Task Tracker. The Job Tracker is responsible for handling a job, dividing it up, and distributing the pieces among Task Trackers. I will go into the details in another blog.
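The map/shuffle/reduce flow can be mimicked with an ordinary shell pipeline (an analogy only, not a real MR job): the map step emits one record per word, sorting plays the role of the shuffle that groups identical keys, and counting each group is the reduce step.

```shell
# map: one word per line; shuffle: sort groups identical words;
# reduce: uniq -c counts each group
printf 'to be or not to be\n' | tr ' ' '\n' | sort | uniq -c
# prints each distinct word with its count, e.g. "2 be" and "2 to"
```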

Hadoop Common

Hadoop Common is a set of utilities used by HDFS, the Map Reduce framework, and various other subprojects.

Setup:

  1. Unpack the Hadoop*.tar.gz to a location and create an environment variable called HADOOP_HOME pointing to that location

  2. Open the file HADOOP_HOME/conf/hadoop-env.sh, remove the # in front of “export JAVA_HOME=”, and set it to the place where you have installed Java
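For example, the uncommented line in hadoop-env.sh might look like this (the path below is just an example; use your own JDK location):

```shell
# hadoop-env.sh -- example only; point this at your actual JDK install
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```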

  3. Open the file HADOOP_HOME/conf/core-site.xml and replace the <configuration> tag with:



<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>


  4. Open the file HADOOP_HOME/conf/hdfs-site.xml and replace the <configuration> tag with:



<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>


  5. Open the file HADOOP_HOME/conf/mapred-site.xml and replace the <configuration> tag with:



<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>


Installation:

  1. Set up the Name Node:

    1. cd $HADOOP_HOME

    2. bin/hadoop namenode -format

    3. The above command creates a folder /tmp/hadoop-userName, which is where Hadoop holds all its files (userName is the name of the user who performed step 2)



  2. Start Hadoop by running the following commands:

    1. cd $HADOOP_HOME

    2. bin/start-all.sh

    3. This starts all the Hadoop daemons

    4. Try running jps from the command line; you should get something like the following:





3884 NameNode
4825 Jps
4437 SecondaryNameNode
4525 JobTracker
4192 DataNode
4774 TaskTracker


Congratulations, you have successfully installed Hadoop. If you want to play a bit more, open the following URLs in a web browser:

  1. http://localhost:50030 for the Job Tracker

  2. http://localhost:50060 for the Task Tracker

  3. http://localhost:50070 for DFS


I will go into the details of how to add files, run map reduce jobs, etc. in another blog post.

Please go to Apache Hadoop's site for more details.
