Tuesday, 21 April 2015

Installing Hadoop Single Node - 2.6

Get Started

This post shows how to install the stable version of Apache Hadoop on a server running Ubuntu 14 x64; the steps should work on all Debian-based systems. To start, we need to get the Hadoop package and have Java installed. If Java is not already installed, follow my install java post. To check which versions of Java are supported by Hadoop, see Hadoop Java Versions.

Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Apache Hadoop 2.6 Installation

Configuring Secure Shell (SSH)   

Communication between the master and slave nodes uses SSH, so we need to make sure an SSH server is installed and the SSH daemon is running.

Install the server with the following command:

 ~$ sudo apt-get install openssh-server  

You can check the status of the server with this command:

 ~$ /etc/init.d/ssh status  

To start the SSH server (this needs root privileges) use:

 ~$ sudo /etc/init.d/ssh start  

Now that the SSH server is running, we need to set up a local SSH connection that works without a password. To enable passphraseless SSH, use one of the following key types:

 ~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
 ~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
 OR
 ~$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
 ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To check that SSH works without a password prompt:

 ~$ ssh localhost  
 ~$ exit  
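
If `ssh localhost` still asks for a password, overly permissive modes on `~/.ssh` are one common cause, because sshd's `StrictModes` setting (on by default) checks file modes and ownership before accepting a key. The sketch below is one possible way to tighten the modes and to test the login without an interactive prompt. The function names `fix_ssh_perms` and `check_passwordless_ssh` are made up for this example and are not part of OpenSSH.

```shell
# Hypothetical helper: tighten permissions on an SSH directory
# (defaults to ~/.ssh).
fix_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    [ -d "$ssh_dir" ] || { echo "no such directory: $ssh_dir" >&2; return 1; }
    chmod 700 "$ssh_dir"
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys"
    fi
}

# Hypothetical helper: try a key-based login without ever prompting.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
    host="${1:-localhost}"
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "passwordless SSH to $host works"
    else
        echo "passwordless SSH to $host failed" >&2
        return 1
    fi
}
```

Run `fix_ssh_perms` and then `check_passwordless_ssh` after the first manual `ssh localhost`, so that the host key has already been accepted.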

Disabling IPv6

We need to make sure IPv6 is disabled. It is best to disable it because all Hadoop communication between nodes is IPv4-based.

For this, first access the file /etc/sysctl.conf


 ~$ sudo nano /etc/sysctl.conf  

Add the following lines to the end of the file:

 net.ipv6.conf.all.disable_ipv6 = 1  
 net.ipv6.conf.default.disable_ipv6 = 1  
 net.ipv6.conf.lo.disable_ipv6 = 1  
Save and exit.

Reload sysctl for the changes to take effect:

 ~$ sudo sysctl -p /etc/sysctl.conf  

If the following command returns 1 (after reboot), it means IPv6 is disabled.

 ~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6  
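
The same check can be wrapped in a small function that prints a readable answer. This is only a sketch: the name `ipv6_status` is invented here, and the optional path argument exists mainly so the function can be tried against a test file.

```shell
# Hypothetical helper: report whether IPv6 is disabled, based on the
# kernel flag file used in the check above.
ipv6_status() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
        return 2
    fi
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
        return 1
    fi
}
```

Calling `ipv6_status` with no argument reads the real kernel flag.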

Install Hadoop

Download Version 2.6.0 (Stable Version)

 ~$ su -  
 ~$ cd /usr/local  
 ~$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz  
 ~$ tar xzf hadoop-2.6.0.tar.gz  
 ~$ mkdir hadoop  
 ~$ mv hadoop-2.6.0/* hadoop/  
 ~$ exit  
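
Before going further it can help to confirm that the archive unpacked into the expected layout. The sketch below checks for a few files and directories that a Hadoop 2.x distribution ships with. The function name `check_hadoop_layout` is invented for this example, and the list of paths is a small sample rather than a complete inventory.

```shell
# Hypothetical helper: verify that a directory looks like an unpacked
# Hadoop 2.x distribution (defaults to /usr/local/hadoop).
check_hadoop_layout() {
    home="${1:-/usr/local/hadoop}"
    missing=0
    for path in bin/hadoop bin/hdfs sbin/start-dfs.sh sbin/start-yarn.sh etc/hadoop; do
        if [ ! -e "$home/$path" ]; then
            echo "missing: $home/$path" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Because the files were unpacked as root, a non-root user who later runs the daemons will probably also need write access to /usr/local/hadoop, for example for its logs directory.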

Update .bashrc with the Hadoop-related environment variables. The file belongs to your own user, so sudo is not needed:

 ~$ nano ~/.bashrc  

Add the following lines at the end:

export HADOOP_HOME=/usr/local/hadoop  
export HADOOP_MAPRED_HOME=$HADOOP_HOME  
export HADOOP_COMMON_HOME=$HADOOP_HOME  
export HADOOP_HDFS_HOME=$HADOOP_HOME  
export YARN_HOME=$HADOOP_HOME  
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native  
export JAVA_HOME=/usr/  
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin  

Save & Exit

Reload bashrc

 ~$ source ~/.bashrc
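
To confirm that the variables took effect, something like the following sketch can be used. Both function names, `path_contains` and `check_hadoop_env`, are invented for this example. They only inspect the current shell environment and do not call Hadoop itself.

```shell
# Hypothetical helper: succeed if directory $1 is an entry in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report on the Hadoop-related environment.
check_hadoop_env() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    for dir in "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin"; do
        if path_contains "$dir"; then
            echo "on PATH: $dir"
        else
            echo "not on PATH: $dir" >&2
            return 1
        fi
    done
}
```

If `check_hadoop_env` reports both directories, running `hadoop version` should then print the installed release.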

Update JAVA_HOME in hadoop-env.sh

 ~$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh  

Add the following line at the end:

 export JAVA_HOME=/usr/  

Or, if Java was installed manually, double-check your installed version of Java and update the path accordingly. I have assumed 1.7.0_51:

 export JAVA_HOME=/usr/local/java/jdk1.7.0_51  

Save and exit
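
If you are not sure which value to use, one rough way to find a JAVA_HOME candidate is to resolve the `java` binary on your PATH through its symlinks and strip the trailing /bin/java. The sketch below assumes GNU `readlink -f` is available, as it is on Ubuntu. The function name `java_home_from_binary` is invented for this example.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the full path
# of a java binary by stripping the trailing /bin/java.
java_home_from_binary() {
    dirname "$(dirname "$1")"
}

# Resolve the java found on PATH through any symlinks, if there is one.
if command -v java >/dev/null 2>&1; then
    java_bin="$(readlink -f "$(command -v java)")"
    echo "JAVA_HOME candidate: $(java_home_from_binary "$java_bin")"
else
    echo "java not found on PATH"
fi
```

Treat the printed path as a suggestion and check that it contains a bin/java before putting it into hadoop-env.sh.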

Hadoop Configurations

Next, we update the configuration files for the Hadoop installation:

 ~$ cd /usr/local/hadoop/etc/hadoop  

Modify core-site.xml – Core Configuration

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

 ~$ sudo nano core-site.xml  

Add the following lines between the <configuration> and </configuration> tags. The property name fs.default.name is deprecated in Hadoop 2 in favour of fs.defaultFS, but it still works:

   <property>   
    <name>fs.default.name</name>   
    <value>hdfs://localhost:9000</value>   
   </property>  
Your file will look like:
 <configuration>  
   
   <property>   
    <name>fs.default.name</name>   
    <value>hdfs://localhost:9000</value>   
   </property>  
     
 </configuration>  
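
To double-check what ended up in the file, the value of a property can be read back with a few lines of awk. This is a rough sketch, not a real XML parser: it assumes each <name> and <value> tag sits on its own line, as in the listing above. The function name `get_property` is invented for this example.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop *-site.xml file. Assumes one tag per line.
get_property() {
    file="$1"
    name="$2"
    awk -v want="$name" '
        /<name>/ {
            line = $0
            sub(/.*<name>[ \t]*/, "", line)
            sub(/[ \t]*<\/name>.*/, "", line)
            found = (line == want)
        }
        found && /<value>/ {
            line = $0
            sub(/.*<value>[ \t]*/, "", line)
            sub(/[ \t]*<\/value>.*/, "", line)
            print line
            exit
        }
    ' "$file"
}
```

For example, `get_property core-site.xml fs.default.name` prints the configured address. Once the environment is set up, `hdfs getconf -confKey fs.default.name` asks Hadoop itself for the same value.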

Modify mapred-site.xml – MapReduce configuration

This file specifies which MapReduce framework we are using. By default, Hadoop ships only a template of it, so we first need to copy mapred-site.xml.template to mapred-site.xml.

 ~$ sudo cp mapred-site.xml.template mapred-site.xml  
 ~$ sudo nano mapred-site.xml  

Add the following lines between configuration tags.

   <property>   
    <name>mapreduce.framework.name</name>   
    <value>yarn</value>   
   </property>  
Your file should look like:
 <configuration>  
   
   <property>   
    <name>mapreduce.framework.name</name>   
    <value>yarn</value>   
   </property>  
   
 </configuration>  

* Note: you may have other configurations defined later; here we are assuming a fresh install.
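
If the installation is not fresh, copying the template again would overwrite an existing mapred-site.xml. A guarded copy along the lines of this sketch avoids that. The function name `ensure_mapred_site` is invented for this example, and the default directory is the one used in this post.

```shell
# Hypothetical helper: create mapred-site.xml from the shipped template
# unless the file already exists.
ensure_mapred_site() {
    conf_dir="${1:-/usr/local/hadoop/etc/hadoop}"
    if [ -f "$conf_dir/mapred-site.xml" ]; then
        echo "mapred-site.xml already present, leaving it untouched"
    elif [ -f "$conf_dir/mapred-site.xml.template" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
        echo "created mapred-site.xml from template"
    else
        echo "no template found in $conf_dir" >&2
        return 1
    fi
}
```

With the root-owned directory from this post, the function would have to run as root.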

Modify yarn-site.xml – YARN

This file is used to configure YARN for Hadoop.

 ~$ sudo nano yarn-site.xml  

Add the following lines between the configuration tags:

   <property>   
    <name>yarn.nodemanager.aux-services</name>   
    <value>mapreduce_shuffle</value>   
   </property>  
Your file should look like:
 <configuration>  
   
   <property>   
    <name>yarn.nodemanager.aux-services</name>   
    <value>mapreduce_shuffle</value>   
   </property>  
     
 </configuration>  

Modify hdfs-site.xml – File Replication

This file contains settings such as the replication factor (we use 1 for this single-node setup) and the NameNode and DataNode paths on your local file system. These paths are where Hadoop stores its data.

 ~$ sudo nano hdfs-site.xml  

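Add the following lines between the configuration tags, and check that the file paths suit your system. The names dfs.name.dir and dfs.data.dir are the older forms of dfs.namenode.name.dir and dfs.datanode.data.dir; Hadoop 2 still accepts them:
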
   <property>   
    <name>dfs.replication</name>   
    <value>1</value>   
   </property>   
   <property>   
    <name>dfs.name.dir</name>   
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>   
   </property>   
   <property>   
    <name>dfs.data.dir</name>  
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>   
   </property>  
Your file should look like:
 <configuration>  
   
   <property>   
    <name>dfs.replication</name>   
    <value>1</value>   
   </property>   
   <property>   
    <name>dfs.name.dir</name>   
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>   
   </property>   
   <property>   
    <name>dfs.data.dir</name>  
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>   
   </property>  
     
 </configuration>  
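
The two storage paths in this file have to be creatable and writable by the user who runs the daemons. The sketch below shows one way to prepare them up front. The function name `make_hdfs_dirs` is invented for this example, and the default base path assumes the /home/hadoop location used in the configuration above.

```shell
# Hypothetical helper: create the NameNode and DataNode storage
# directories under a base path and keep them private to their owner.
make_hdfs_dirs() {
    base="${1:-/home/hadoop/hadoopinfra/hdfs}"
    mkdir -p "$base/namenode" "$base/datanode" || return 1
    chmod 700 "$base/namenode" "$base/datanode"
    ls -ld "$base/namenode" "$base/datanode"
}
```

Run `make_hdfs_dirs` as the user who will start Hadoop, or pass a different base path if your home directory is elsewhere and adjust hdfs-site.xml to match.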

Initializing the Single-Node Cluster

Formatting the Name Node:

While setting up the cluster for the first time, we need to initially format the Name Node in HDFS.

 ~$ cd ~  
 ~$ hdfs namenode -format  

Starting Hadoop dfs daemons:

 ~$ start-dfs.sh  

Starting Yarn daemons:

 ~$ start-yarn.sh  

Check all daemon processes:

 ~$ jps  

 6069 NodeManager  
 5644 DataNode  
 5827 SecondaryNameNode  
 4692 ResourceManager  
 6165 Jps  
 5491 NameNode  

* The process IDs will differ on every run; the main idea is to check that all of these processes are running.
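
Instead of reading the list by eye, the jps output can be piped through a small check. This is a sketch under the assumption that the five daemons in the listing above are the ones you expect on a single node. The function name `check_daemons` is invented for this example.

```shell
# Hypothetical helper: read `jps` output on stdin and report which of
# the expected daemons are present.
check_daemons() {
    jps_output="$(cat)"
    status=0
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare the whole second field, so that "NameNode" does not
        # match the "SecondaryNameNode" line.
        if printf '%s\n' "$jps_output" | awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "running: $daemon"
        else
            echo "MISSING: $daemon"
            status=1
        fi
    done
    return "$status"
}
```

Usage: `jps | check_daemons`. The function returns a non-zero status if any daemon is missing, so it can also be used in a script.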

You should now be able to open the name-node web interface in your browser (after a short delay for start-up) at the following URL:

name-node: http://localhost:50070/

Stopping all daemons:

 ~$ stop-dfs.sh  
 ~$ stop-yarn.sh  
   

Now you can run the examples. If you are looking for examples that run without changing your style of code, I am going to run Python MapReduce on the new version of Hadoop, so wait for that post.
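
In the meantime, a quick smoke test is possible with the examples jar that ships inside the Hadoop 2.6.0 distribution. The sketch below runs the bundled "pi" job while the daemons are up. The helper name `examples_jar` is invented for this example, and the default install path is the /usr/local/hadoop used in this post.

```shell
# Hypothetical helper: build the path of the examples jar bundled
# with a Hadoop 2.x release.
examples_jar() {
    home="${1:-/usr/local/hadoop}"
    version="${2:-2.6.0}"
    echo "$home/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar"
}

# Run the "pi" example with 2 map tasks and 5 samples per map, but only
# if the hadoop command and the jar are actually available.
jar_path="$(examples_jar "${HADOOP_HOME:-/usr/local/hadoop}")"
if command -v hadoop >/dev/null 2>&1 && [ -f "$jar_path" ]; then
    hadoop jar "$jar_path" pi 2 5
else
    echo "hadoop or $jar_path not found; skipping example run"
fi
```

The job prints an estimate of pi when it finishes; with so few samples the estimate is rough, which is fine for checking that YARN and HDFS work together.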
