Installing Hadoop 2.2.0 cluster on Ubuntu 12.04 x86 64 Desktops


To install a hadoop 2.2.0 cluster on Ubuntu 12.04 x86_64 Desktops or VMs use the following steps:

  1. Setup proper hostnames on all machines using:
    1. Edit file '/etc/hostname' and enter name 'master' on master and 'slave1', 'slave2', etc. on slaves.
    2. Find out LAN IP address of master and slaves using 'ifconfig' command
    3. Edit file '/etc/hosts' and associate the LAN IPs with master, slave1 and slave2 on all three machines. Note that you should be able to ping the slaves using 'ping slave1', 'ping slave2', etc. from master and similarly you should be able to ping master using 'ping master' from each slave. Sample entries are shown at the end of this step.
    4. Update the hostname on the running machine using 'sudo hostname master', 'sudo hostname slave1', 'sudo hostname slave2', etc.
    5. Verify that both commands 'hostname' and 'hostname --fqdn' return master or slave1 or slave2 respectively.
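    For example, assuming hypothetical LAN IPs of 192.168.1.10 for master, 192.168.1.11 for slave1 and 192.168.1.12 for slave2 (replace these with the actual addresses reported by 'ifconfig'), the '/etc/hosts' entries on every node would look similar to:
      192.168.1.10    master
      192.168.1.11    slave1
      192.168.1.12    slave2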
  2. Reboot all nodes. Without a reboot the next step of installing Java will not succeed and fails with 'No protocol specified' and 'Exception in class main' errors due to a problem with the X11 connection.
  3. Install Java on all nodes (master and slaves) using Installing Java on Ubuntu 12.04 x86 64 Desktop. Install Java in the same folder, such as /opt, on all nodes.
  4. Create user account and group for hadoop using:
    sudo groupadd hadoop
    sudo useradd hadoop -b /home -g hadoop -m -s /bin/bash
    cd /home/hadoop
    sudo cp -rp /etc/skel/.[^.]* .
    sudo chown -R hadoop:hadoop .
    sudo chmod -R o-rwx .
    on all nodes. Note that the hadoop user name and group name should match on all nodes.
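    To confirm that the account and group were created consistently, the following can be run on every node and the output compared (the user and group names, not necessarily the numeric ids, must match):
    id hadoop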
  5. Install openssh-server on all nodes using:
    sudo apt-get -y install openssh-server
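    If in doubt, the state of the SSH daemon can be checked afterwards using:
    sudo service ssh status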
  6. Configure password for 'hadoop' user on all three machines using:
    sudo passwd hadoop
  7. Setup password-less ssh from hadoop user of master to hadoop user of master itself and all slaves using:
    sudo su - hadoop
    ssh-keygen
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys
    ssh-copy-id hadoop@slave1
    ssh-copy-id hadoop@slave2
    # To test the configuration; each command should echo hadoop without asking for a password
    ssh hadoop@master "echo $USER"
    ssh hadoop@slave1 "echo $USER"
    ssh hadoop@slave2 "echo $USER"
    exit
    Password-less ssh from the slaves to the master is not required.
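    A stricter test, run as the hadoop user, is to force non-interactive authentication; the following should print slave1 without any password prompt (BatchMode makes ssh fail instead of prompting for a password):
    ssh -o BatchMode=yes hadoop@slave1 hostname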
  8. Download the hadoop source from one of the mirrors linked at https://www.apache.org/dyn/closer.cgi/hadoop/common/. Download the latest stable .tar.gz release from the stable folder (for example hadoop-2.2.0.tar.gz). Copy the same archive to the slaves using something similar to 'rsync -vaH hadoop-* hadoop@slave1:'
  9. Extract hadoop sources in '/opt/hadoop' and make hadoop:hadoop its owner:
    sudo mkdir /opt/hadoop
    cd /opt/hadoop/
    sudo tar xzf <path-to-hadoop-source>
    sudo mv hadoop-2.2.0 hadoop
    sudo chown -R hadoop:hadoop .
    Note that hadoop should be installed at the same location on all nodes.
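    A quick sanity check on each node is to verify the ownership and the expected layout (the extracted tree should contain directories such as bin, etc, sbin and share):
    ls -ld /opt/hadoop/hadoop
    ls /opt/hadoop/hadoop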
  10. Configure hadoop cluster setup using these steps on all nodes:
    1. Login as user hadoop:
      sudo su - hadoop
    2. Edit '~/.bashrc' and append
      export JAVA_HOME=/opt/jdk1.7.0_40
      export HADOOP_INSTALL=/opt/hadoop/hadoop
      export HADOOP_PREFIX=/opt/hadoop/hadoop
      export HADOOP_HOME=/opt/hadoop/hadoop
      export PATH=$PATH:$HADOOP_INSTALL/bin
      export PATH=$PATH:$HADOOP_INSTALL/sbin
      export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_HOME=$HADOOP_INSTALL
      export HADOOP_HDFS_HOME=$HADOOP_INSTALL
      export YARN_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
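      The new variables only take effect in a new shell; to verify them in the current shell, something similar to the following can be used ('which hadoop' should point into /opt/hadoop/hadoop/bin):
      source ~/.bashrc
      echo $HADOOP_INSTALL $HADOOP_CONF_DIR
      which hadoop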
    3. Change folder to '/opt/hadoop/hadoop/etc/hadoop'
    4. Edit 'hadoop-env.sh' and set a proper value for JAVA_HOME such as '/opt/jdk1.7.0_40'. Do not leave it as ${JAVA_HOME} as that does not work.
    5. Edit '/opt/hadoop/hadoop/libexec/hadoop-config.sh' and prepend following line at start of script:
      export JAVA_HOME=/opt/jdk1.7.0_40
    6. Exit from hadoop user and relogin using 'sudo su - hadoop'. Check hadoop version using 'hadoop version' command.
    7. Again change folder to '/opt/hadoop/hadoop/etc/hadoop'
    8. Create the hadoop temporary folder using 'mkdir /opt/hadoop/tmp'
    9. Edit 'core-site.xml' and add following between <configuration> and </configuration>:
      <property>
      <name>fs.default.name</name>
      <value>hdfs://master:9000</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/hadoop/tmp</value>
      </property>
    10. Setup folders for HDFS using:
      cd ~
      mkdir -p mydata/hdfs/namenode
      mkdir -p mydata/hdfs/datanode
      cd /opt/hadoop/hadoop/etc/hadoop
    11. Edit 'hdfs-site.xml' and add following between <configuration> and </configuration>
      <property>
      <name>dfs.replication</name>
      <value>2</value>
      </property>
      <property>
      <name>dfs.permissions</name>
      <value>false</value>
      </property>
      <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/hadoop/mydata/hdfs/namenode</value>
      </property>
      <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/hadoop/mydata/hdfs/datanode</value>
      </property>
    12. Copy mapred-site.xml template using 'cp mapred-site.xml.template mapred-site.xml'
    13. Edit 'mapred-site.xml' and add following between <configuration> and </configuration>:
      <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
      </property>
    14. Edit 'yarn-site.xml' and add following between <configuration> and </configuration>:
      <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
      </property>
      <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master:8025</value>
      </property>
      <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master:8030</value>
      </property>
      <property>
      <name>yarn.resourcemanager.address</name>
      <value>master:8040</value>
      </property>
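      A typo in any of these XML files leads to confusing start-up errors later; if the 'libxml2-utils' package (which provides xmllint) is available, the edited files can be checked for well-formedness using:
      xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml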
    15. Format the namenode storage using 'hdfs namenode -format'. Strictly only the master (which runs the NameNode) needs this, but running it on slave1, slave2, etc. as well is harmless.
  11. Do following only on master machine:
    1. Edit 'slaves' files so that it contains:
      slave1
      slave2
      If master is also expected to serve as datanode (store hdfs files) then add 'master' to the slaves file as well.
    2. Run the 'start-dfs.sh' and 'start-yarn.sh' commands
    3. Run 'jps' on the master and verify that 'ResourceManager', 'NameNode' and 'SecondaryNameNode' are running. Run 'jps' on the slaves and verify that 'NodeManager' and 'DataNode' are running.
    4. Access NameNode at http://master:50070 and ResourceManager at http://master:8088
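      The same information is also available from the command line on the master as the hadoop user; the reports should list all configured datanodes and node managers:
      hdfs dfsadmin -report
      yarn node -list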
  12. Run sample map reduce job using:
    1. Setup input file for wordcount using:
      cd ~
      mkdir in
      cat > in/file <<EOF
      This is one line
      This is another one
      EOF
    2. Add input directory to HDFS:
      hdfs dfs -copyFromLocal in /in
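      To confirm the upload, list the directory in HDFS:
      hdfs dfs -ls /in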
    3. Run wordcount example provided:
      hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
    4. Check the output:
      hdfs dfs -cat /out/*
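      Note that the job fails if the output directory already exists; to re-run the example, first remove '/out' using:
      hdfs dfs -rm -r /out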
  13. Stop cluster using:
    stop-yarn.sh
    stop-dfs.sh


Steps learned from http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/, https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Release-Notes/cdh5rn_topic_3_3.html and Installing Hadoop 2.2.0 on single Ubuntu 12.04 x86_64 Desktop.

