Friday 24 April 2015

Installing Single Node Hadoop 2.6 using Bash Script

Get Started

We are going to write a simple bash script and execute it to install Hadoop 2.6 with all of its dependencies. We will install a stable version of Apache Hadoop from a bash script on a server running Ubuntu 14 x64, but it should work on all Debian-based systems. Write the script to a text file using your favorite text editor, give it permission to execute, and enjoy. I hope it works without error; if not, let me know the issue so I can update it. It works fine for me on two different machines running Linux Mint 17 and Ubuntu 14.04 respectively.

What is Bash?

Descended from the Bourne Shell, Bash is a GNU product, the "Bourne Again SHell." It's the standard command line interface on most Linux machines. It excels at interactivity, supporting command line editing, completion, and recall. It also supports configurable prompts - most people realize this, but don't know how much can be done.
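As a tiny illustration of the configurable prompt, the prompt string is just the PS1 variable; a minimal example (add it to ~/.bashrc to keep it across sessions, and adjust the format to taste):

 # Show user, host and current working directory in the prompt
 export PS1='\u@\h:\w\$ '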

Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Write Script File

To write the installation script, open a new file named install_hadoop.sh and put the following content in it.
#!/bin/bash  

# Script to install Sun Java and Hadoop 2.6 

clear  

# Tell the shell to run the installation in non-interactive mode  
# and auto-accept the license agreement for Sun Java  
export DEBIAN_FRONTEND=noninteractive  
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections  
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections  


echo "Bash Script for Installing Sun Java for Ubuntu!"  

echo "Now Script will try to purge OpenJdk if installed..."  

# purge openjdk if installed to remove conflict  
apt-get purge openjdk-\* -y  

echo "Now we will update repository..."  

apt-get update -y  

echo "Adding Java Repository...."  

apt-get install python-software-properties -y  
add-apt-repository ppa:webupd8team/java -y  

echo "Updating Repository to load java repository"  

apt-get update -y  

echo "Installing Sun Java....."  
sudo -E apt-get purge oracle-java7-installer -y  
sudo -E apt-get install oracle-java7-installer -y  


echo "Installation completed...."  

echo "Installed java version is...."  

java -version  


apt-get install openssh-server -y  
/etc/init.d/ssh status  
/etc/init.d/ssh start  

mkdir -p ~/.ssh  
ssh-keyscan -H localhost > ~/.ssh/known_hosts  
yes | ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa  
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys  
ssh-add  

cd /usr/local  
sudo wget http://mirror.sdunix.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz  
tar xzf hadoop-2.6.0.tar.gz  
mkdir hadoop  
mv hadoop-2.6.0/* hadoop/  

echo "Now script is updating Bashrc for export Path etc"  

# Quote the heredoc delimiter so $PATH stays literal in .bashrc  
cat >> ~/.bashrc << 'EOL'  
export HADOOP_HOME=/usr/local/hadoop  
export HADOOP_MAPRED_HOME=/usr/local/hadoop  
export HADOOP_COMMON_HOME=/usr/local/hadoop  
export HADOOP_HDFS_HOME=/usr/local/hadoop  
export YARN_HOME=/usr/local/hadoop  
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/lib/native  
export JAVA_HOME=/usr/  
export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:$JAVA_HOME/bin  
EOL  

cat ~/.bashrc  

source ~/.bashrc  

echo "Now script is updating hadoop configuration files"  

cat >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh << EOL  
export JAVA_HOME=/usr/  
EOL  

cd /usr/local/hadoop/etc/hadoop  

cat > core-site.xml << EOL  
<configuration>  
<property>  
<name>fs.default.name</name>  
<value>hdfs://localhost:9000</value>  
</property>  
</configuration>  
EOL  

cp mapred-site.xml.template mapred-site.xml  
cat > mapred-site.xml << EOL  
<configuration>  
<property>  
<name>mapreduce.framework.name</name>  
<value>yarn</value>  
</property>  
</configuration>  
EOL  

cat > yarn-site.xml << EOL  
<configuration>  
<property>  
<name>yarn.nodemanager.aux-services</name>  
<value>mapreduce_shuffle</value>  
</property>  
</configuration>  
EOL  

cat > hdfs-site.xml << EOL  
<configuration>  
<property>  
<name>dfs.replication</name>  
<value>1</value>  
</property>  
<property>  
<name>dfs.name.dir</name>  
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>  
</property>  
<property>  
<name>dfs.data.dir</name>  
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>  
</property>  
</configuration>  
EOL  

echo "Completed process Now Reloading Bash Profile...."  
cd ~  

echo "You may require reloading bash profile, you can reload using following command."  
echo "source ~/.bashrc"  

echo "To Start you need to format Name Node Once you can use following command."  
echo "hdfs namenode -format"  

echo "Hadoop configured. now you can start hadoop using following commands. "  
echo "start-dfs.sh"  
echo "start-yarn.sh"  

echo "To stop hadoop use following scripts."  
echo "stop-dfs.sh"  
echo "stop-yarn.sh"  


Now we will give it permission to make it executable.
 ~$ chmod 755 install_hadoop.sh

Now we can execute the script using the following command. Since the installation requires root access, you need to log in as root or switch to root using the command "su -".
 ~$ ./install_hadoop.sh  
After the script completes successfully, you can move forward to formatting the Name Node and starting Hadoop.

You may face a "command not recognized" issue, which means the bash profile was not reloaded. The safe way is to reload it manually using the following command.
 ~$ source ~/.bashrc

Initializing the Single-Node Cluster

Formatting the Name Node:

While setting up the cluster for the first time, we need to initially format the Name Node in HDFS.

 ~$ cd ~  
 ~$ hdfs namenode -format  

Starting Hadoop dfs daemons:

 ~$ start-dfs.sh  

Starting Yarn daemons:

 ~$ start-yarn.sh  

Check all daemon processes:

 ~$ jps  

 6069 NodeManager  
 5644 DataNode  
 5827 SecondaryNameNode  
 4692 ResourceManager  
 6165 Jps  
 5491 NameNode  

* Process IDs will differ on each run; the main point is to check that these processes are running.
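
If you prefer a scripted check instead of reading the jps output by eye, a small loop like the following works (a minimal sketch; adjust the daemon list to your setup):

 for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
   jps | grep -q "$d" && echo "$d is running" || echo "$d is NOT running"
 done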

You should now be able to reach the name-node web interface in your browser (after a short delay for start-up) at the following URL:

name-node: http://localhost:50070/

Stopping all daemons:

 ~$ stop-dfs.sh  
 ~$ stop-yarn.sh  
   
Enjoy.

Installing Sun-Java JDK Using Bash Script :: Ubuntu

Getting Started

We are going to write a simple bash script and execute it to install Sun Java. Write the script to a text file using your favorite text editor, give it permission to execute, and enjoy. I hope it works without error; if not, let me know the issue so I can update it. It works fine for me on two different machines running Linux Mint 17 and Ubuntu 14.04 respectively.

What is Bash?

Descended from the Bourne Shell, Bash is a GNU product, the "Bourne Again SHell." It's the standard command line interface on most Linux machines. It excels at interactivity, supporting command line editing, completion, and recall. It also supports configurable prompts - most people realize this, but don't know how much can be done.

Writing Script:

To write the installation script, open a new file named install_java.sh and put the following content in it.
#!/bin/bash

# Script to install Sun Java

clear

# Tell the shell to run the installation in non-interactive mode
# and auto-accept the license agreement for Sun Java
export DEBIAN_FRONTEND=noninteractive
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections


echo "Bash Script for Installing Sun Java for Ubuntu!"

echo "Now Script will try to purge OpenJdk if installed..."

# purge openjdk if installed to remove conflict
sudo apt-get purge openjdk-\* -y

echo "Now we will update repository..."

sudo apt-get update -y

echo "Adding Java Repository...."

sudo apt-get install python-software-properties -y
sudo add-apt-repository ppa:webupd8team/java -y

echo "Updating Repository to load java repository"

sudo apt-get update -y

echo "Installing Sun Java....."
sudo -E apt-get purge oracle-java7-installer -y
sudo -E apt-get install oracle-java7-installer -y

echo "Installation completed...."

echo "Installed java version is...."

java -version
   

Now we will give it permission to make it executable.
 ~$ chmod 755 install_java.sh

Now we can execute the script using the following command. Since the installation requires root access and uses sudo, the script may prompt for your password. Provide the password to start the installation process.
 ~$ ./install_java.sh  


Enjoy.

How to Write a Basic Bash Script (Shell Script)

Get Started

To automate installations, configurations, or routine tasks, we often need to write scripts that run under bash. Here we will learn how to write a simple bash script and make it executable.

What is Bash?

Descended from the Bourne Shell, Bash is a GNU product, the "Bourne Again SHell." It's the standard command line interface on most Linux machines. It excels at interactivity, supporting command line editing, completion, and recall. It also supports configurable prompts - most people realize this, but don't know how much can be done. 

Write your first script

Writing a bash script involves two steps:
  • Writing the bash script
  • Giving it execute permission to make it executable

Script:

You can write the script in any text file, using any of your favorite text editors. I will use nano for this post to avoid arguments between Emacs and vim(vi) users; you may use any editor. A shell script is a simple file that contains plain ASCII text. Now we will write a simple script.
 ~$ nano hello_test.sh

Now put the following content in the file, then save and close it (for nano: CTRL+O, Enter, CTRL+X).
 #!/bin/bash
 # My Hello World Script
 echo "Hello World!"

We have written the classic "Hello World" program. This script simply prints Hello World! and quits. The first line of a script tells the shell which program should interpret the script; in this case we are using bash. This path may change if your bash is installed in some other location; to check your bash path, simply type the following command.
 ~$ which bash
 /bin/bash

The second line is a comment; anything that appears after # on a line is skipped by bash. Comments are highly recommended to make your script readable. The echo command is used to print output in bash.
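
Putting these pieces together, here is a slightly larger illustrative script that uses a comment, a variable, and echo (the variable name is just an example):

 #!/bin/bash
 # Greet the current user by name
 NAME=$(whoami)          # command substitution stores the output of whoami
 echo "Hello, $NAME!"    # variables are expanded inside double quotes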

Give Permission:

Now that the script is ready, we will give it execute permission. Use the following command:
 ~$ chmod 755 hello_test.sh

Now we can test our script by executing it from the shell using the following command.
 ~$ ./hello_test.sh

Enjoy.

Thursday 23 April 2015

Apache Hive? Compile Hive with Ubuntu 14 :: HADOOP 2.6.0

Get Started Hive

Apache Hive™ is data warehouse infrastructure built on top of Hadoop that facilitates querying and managing large data sets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. For details visit the WIKI.

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL and PostgreSQL can optionally be used. While Yahoo was working with Pig for deployment on Hadoop, Facebook started its own warehouse solution on Hadoop, which resulted in Hive; the motivation was that traditional warehousing solutions were getting expensive. Although initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.

Components of Hive:

HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.
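
As a quick illustration, once a WebHCat server is running you can query its status endpoint over plain HTTP (this assumes the default WebHCat port 50111; adjust the host and port to your setup):

 ~$ curl -s 'http://localhost:50111/templeton/v1/status'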

Hive is not built for quick-response applications and thus cannot be compared with systems designed for low response times. It is built for data mining applications that post-process data distributed over a Hadoop cluster.

Features of Hive include:
  • Indexing to provide acceleration.
  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
  • Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Spark jobs.

HiveQL

HiveQL is based on SQL, but it does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and create table as select, but only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited sub-query support. Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
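
For instance, a create-table-as-select statement can be run non-interactively through Beeline (Beeline setup is shown later in this post); this sketch assumes the mydb.testdata table created in the session shown below, and the target table name is only an example:

 ~$ $HIVE_HOME/bin/beeline -u jdbc:hive2:// -e "CREATE TABLE mydb.testdata_copy AS SELECT id, data FROM mydb.testdata;"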

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.

Data in Hive is organized into three units.
Tables: They are very similar to RDBMS tables and contain rows and columns. Hive is layered over the Hadoop Distributed File System (HDFS), so tables map directly to directories in the file system. Hive also supports tables stored in other native file systems.

Partitions: Hive tables can have one or more partitions. These are mapped to sub-directories in the underlying file system.

Buckets: In Hive, data may be further divided into buckets. Buckets are stored as files within a partition's directory in the underlying file system (see the example below).
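
To make the partition and bucket ideas concrete, here is a hedged sketch of a partitioned and bucketed table definition run through Beeline (the table and column names are illustrative only):

 ~$ $HIVE_HOME/bin/beeline -u jdbc:hive2:// -e "CREATE TABLE mydb.logs (id INT, msg STRING) PARTITIONED BY (dt STRING) CLUSTERED BY (id) INTO 4 BUCKETS;"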

Hive stores its metastore in a relational database containing metadata about the Hive schema.

For details on how to use HiveQL, check the Hive wikis.

Prerequisite for Hive

Since Hive is built on Hadoop and uses HDFS, Hadoop must be installed before installing Hive. We have used Hadoop 2.6.0, the latest release to date. If Hadoop is not installed and configured already, follow my post on How to Install Hadoop 2.6.0.

Next we require Apache Maven and Subversion. Apache Maven is needed to build Apache Hive, while Subversion is needed to check out the source for compilation. To avoid an Apache Maven version conflict, we will remove maven2 (if installed) before installing Maven.
 ~$ sudo apt-get update  
 ~$ sudo apt-get remove maven2  
 ~$ sudo apt-get install maven  
   
 ~$ sudo apt-get install subversion  
   

Compile Hive

After the prerequisites are installed, we can move forward to compiling Hive. We will check out the source from the repository using Subversion and build it using Maven.

 svn co http://svn.apache.org/repos/asf/hive/trunk hive  
 cd hive  
   
 mvn clean install -Phadoop-2,dist -e -DskipTests  

Once compilation completes successfully, we will export the Hive path. Change the version number in the export command if yours differs; this can be verified by listing the build output directory.
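
A quick way to list the build output and confirm the version string (this assumes the source was checked out to /root/hive as in the export line below; your path may differ):

 ~$ ls /root/hive/packaging/target/ | grep apache-hive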

 ~$ nano ~/.bashrc  

Add following line:
 export HIVE_HOME=/root/hive/packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin  


Save and close the file, then reload .bashrc using the source command.
 ~$ source ~/.bashrc  

Now we will create the Hive warehouse and a temp directory in Hadoop, and set write permissions on them.
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /tmp  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/hive/  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/hive/warehouse  
   
 ~$ $HADOOP_HOME/bin/hadoop fs -chmod g+w  /tmp  
 ~$ $HADOOP_HOME/bin/hadoop fs -chmod g+w  /user/hive/warehouse  

To start the Beeline CLI, use the following command.
 ~$ $HIVE_HOME/bin/beeline -u jdbc:hive2://  


Now we are ready to use Hive. For testing we ran a few statements, and a snap of the shell session is provided below. We ran this on a machine with 8 GB RAM, a 4-core CPU, and SSD drives.
 0: jdbc:hive2://> create database mydb;  
 15/04/22 08:31:23 [HiveServer2-Background-Pool: Thread-32]: WARN metastore.ObjectStore: Failed to get database mydb, returning NoSuchObjectException  
 OK  
 No rows affected (1.565 seconds)  
 0: jdbc:hive2://> CREATE TABLE mydb.testdata (  
 0: jdbc:hive2://>   id  INT,  
 0: jdbc:hive2://>   data VARCHAR(30)  
 0: jdbc:hive2://> );  
 OK  
 No rows affected (0.561 seconds)  
 0: jdbc:hive2://> INSERT into mydb.testdata(id, data) values(1, 'Testing 1');  
 Query ID = root_20150422083821_5e2e9f2c-dc47-4650-9280-a9e52cb61c7c  
 Total jobs = 3  
 Launching Job 1 out of 3  
 Number of reduce tasks is set to 0 since there's no reduce operator  
 15/04/22 08:38:22 [HiveServer2-Background-Pool: Thread-59]: ERROR mr.ExecDriver: yarn  
 15/04/22 08:38:23 [HiveServer2-Background-Pool: Thread-59]: WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.  
 Starting Job = job_1429700414970_0001, Tracking URL = http://hadoop:8088/proxy/application_1429700414970_0001/  
 Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1429700414970_0001  
 WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.  
 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
 15/04/22 08:38:31 [HiveServer2-Background-Pool: Thread-59]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead  
 2015-04-22 08:38:31,952 Stage-1 map = 0%, reduce = 0%  
 2015-04-22 08:38:39,433 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.24 sec  
 MapReduce Total cumulative CPU time: 2 seconds 240 msec  
 Ended Job = job_1429700414970_0001  
 Stage-4 is selected by condition resolver.  
 Stage-3 is filtered out by condition resolver.  
 Stage-5 is filtered out by condition resolver.  
 Moving data to: hdfs://localhost:9000/user/hive/warehouse/mydb.db/testdata/.hive-staging_hive_2015-04-22_08-38-21_684_4213030447536309872-1/-ext-10000  
 Loading data to table mydb.testdata  
 Table mydb.testdata stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]  
 MapReduce Jobs Launched:  
 Stage-Stage-1: Map: 1  Cumulative CPU: 2.24 sec  HDFS Read: 3732 HDFS Write: 77 SUCCESS  
 Total MapReduce CPU Time Spent: 2 seconds 240 msec  
 OK  
 No rows affected (19.328 seconds)  
 0: jdbc:hive2://> SELECT * FROM mydb.testdata;  
 OK  
 +----------+------------+--+  
 | testdata.id | testdata.data |  
 +----------+------------+--+  
 | 1    | Testing 1 |  
 +----------+------------+--+  
 1 row selected (0.223 seconds)  
   
 0: jdbc:hive2://> SELECT * FROM mydb.testdata WHERE testdata.id%2==1;  
 OK  
 +----------+------------+--+  
 | testdata.id | testdata.data |  
 +----------+------------+--+  
 | 1    | Testing 1 |  
 +----------+------------+--+  
 1 row selected (0.129 seconds)  
 0: jdbc:hive2://>  
   


Enjoy.

Tuesday 21 April 2015

Installing Hadoop Single Node - 2.6

Get Started

Now we will see how to install a stable version of Apache Hadoop on a server running Ubuntu 14 x64; the steps should work on all Debian-based systems. To start, we need to acquire the Hadoop package and have Java installed. To install Java, if it is not already installed, follow my install Java post. To check which versions of Java are supported by Hadoop, see Hadoop Java Versions.

Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Apache Hadoop 2.6 Installation

Configuring Secure Shell (SSH)   

Communication between master and slave nodes uses SSH, so make sure the SSH server is installed and the SSH daemon is running.

Install the server with the following command:

 ~$ sudo apt-get install openssh-server  

You can check the status of the server with this command:

 ~$ /etc/init.d/ssh status  

To start ssh server use:

 ~$ /etc/init.d/ssh start  

Now that the SSH server is running, we need to set up a local SSH connection that does not require a password. To enable passphraseless SSH, use:

 ~$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
 ~$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
 OR
 ~$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
 ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To check SSH:

 ~$ ssh localhost  
 ~$ exit  

Disabling IPv6

We need to make sure IPv6 is disabled; it is best to disable it because all Hadoop communication between nodes is IPv4-based.

For this, first access the file /etc/sysctl.conf


 ~$ sudo nano /etc/sysctl.conf  

Add the following lines at the end:

 net.ipv6.conf.all.disable_ipv6 = 1  
 net.ipv6.conf.default.disable_ipv6 = 1  
 net.ipv6.conf.lo.disable_ipv6 = 1  
Save and exit


Reload sysctl for changes to take effect

 ~$ sudo sysctl -p /etc/sysctl.conf  

If the following command returns 1 (after reboot), it means IPv6 is disabled.

 ~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6  

Install Hadoop

Download Version 2.6.0 (Stable Version)

 ~$ su -  
 ~$ cd /usr/local  
 ~$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz  
 ~$ tar xzf hadoop-2.6.0.tar.gz  
 ~$   
 ~$ mkdir hadoop  
 ~$ mv hadoop-2.6.0/* hadoop/  
 ~$   
 ~$ exit  

Update .bashrc with Hadoop-related environment variables

 ~$ sudo nano ~/.bashrc  

Add following lines at the end:

export HADOOP_HOME=/usr/local/hadoop  
export HADOOP_MAPRED_HOME=$HADOOP_HOME  
export HADOOP_COMMON_HOME=$HADOOP_HOME  
export HADOOP_HDFS_HOME=$HADOOP_HOME  
export YARN_HOME=$HADOOP_HOME  
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native  
export JAVA_HOME=/usr/  
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin  

Save & Exit

Reload bashrc

 ~$ source ~/.bashrc

Update JAVA_HOME in hadoop-env.sh

 ~$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh  

Add following line at the end:
 export JAVA_HOME=/usr/  

Or, if Java was installed manually: double-check your installed version of Java and update the path accordingly. I have assumed 1.7.0_51.

 export JAVA_HOME=/usr/local/java/jdk1.7.0_51  

Save and exit
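
If you are unsure where your JVM actually lives, one quick way to find out is to resolve the java binary (the printed path will vary from system to system):

 ~$ readlink -f $(which java)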

Hadoop Configurations

Now we will update the configuration files for the Hadoop installation.

 ~$ cd /usr/local/hadoop/etc/hadoop  

Modify core-site.xml – Core Configuration

The core-site.xml file contains settings such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of read/write buffers.
Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

 ~$ sudo nano core-site.xml  

Add the following lines between configuration tags

   <property>   
    <name>fs.default.name</name>   
    <value>hdfs://localhost:9000</value>   
   </property>  
Your file will look like
 <configuration>  
   
   <property>   
    <name>fs.default.name</name>   
    <value>hdfs://localhost:9000</value>   
   </property>  
     
 </configuration>  

Modify mapred-site.xml – MapReduce configuration

This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template, so we need to copy mapred-site.xml.template to mapred-site.xml.

 ~$ sudo cp mapred-site.xml.template mapred-site.xml  
 ~$ sudo nano mapred-site.xml  

Add the following lines between configuration tags.

   <property>   
    <name>mapreduce.framework.name</name>   
    <value>yarn</value>   
   </property>  
Your file should look like:




 <configuration>  
   
   <property>   
    <name>mapreduce.framework.name</name>   
    <value>yarn</value>   
   </property>  
   
 </configuration>  

* Note: you may have other configurations defined later; we are assuming a fresh install.

Modify yarn-site.xml – YARN

This file is used to configure YARN for Hadoop.

 ~$ sudo nano yarn-site.xml  

Add following lines between configuration tags:


   <property>   
    <name>yarn.nodemanager.aux-services</name>   
    <value>mapreduce_shuffle</value>   
   </property>  
Your file should look like:
 <configuration>  
   
   <property>   
    <name>yarn.nodemanager.aux-services</name>   
    <value>mapreduce_shuffle</value>   
   </property>  
     
 </configuration>  

 Modify hdfs-site.xml – File Replication

This file contains settings such as the replication factor (we have used 1) and the name-node and data-node paths on your local file system. These will be the locations where Hadoop stores its data.

 ~$ sudo nano hdfs-site.xml  

Add following lines between configuration tags and check file path:


   <property>   
    <name>dfs.replication</name>   
    <value>1</value>   
   </property>   
   <property>   
    <name>dfs.name.dir</name>   
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>   
   </property>   
   <property>   
    <name>dfs.data.dir</name>  
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>   
   </property>  
Your file should look like:
 <configuration>  
   
   <property>   
    <name>dfs.replication</name>   
    <value>1</value>   
   </property>   
   <property>   
    <name>dfs.name.dir</name>   
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>   
   </property>   
   <property>   
    <name>dfs.data.dir</name>  
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>   
   </property>  
     
 </configuration>  

Initializing the Single-Node Cluster

Formatting the Name Node:

While setting up the cluster for the first time, we need to initially format the Name Node in HDFS.

 ~$ cd ~  
 ~$ hdfs namenode -format  

Starting Hadoop dfs daemons:

 ~$ start-dfs.sh  

Starting Yarn daemons:

 ~$ start-yarn.sh  

Check all daemon processes:

 ~$ jps  

 6069 NodeManager  
 5644 DataNode  
 5827 SecondaryNameNode  
 4692 ResourceManager  
 6165 Jps  
 5491 NameNode  

* Process IDs will differ on each run; the main point is to check that these processes are running.

You should now be able to reach the name-node web interface in your browser (after a short delay for start-up) at the following URL:

name-node: http://localhost:50070/

Stopping all daemons:

 ~$ stop-dfs.sh  
 ~$ stop-yarn.sh  
   

Now you can run examples. If you are looking for examples to run without changing your style of code, I am going to run Python MapReduce on the new version of Hadoop; wait for that post.
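
In the meantime, a quick way to smoke-test the installation is the examples jar bundled with Hadoop (this assumes the install path used above; the jar name follows the Hadoop version):

 ~$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5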

Monday 13 April 2015

MonetDB Basic Example with Python

Overview

When the size of your application database grows into millions of records distributed over different tables, and business intelligence or science becomes the prevalent application domain, a column-store database management system is called for. Unlike traditional row-stores, such as MySQL and PostgreSQL, a column-store provides a modern and scalable solution without calling for substantial hardware investments.

In an earlier blog post we compiled MonetDB from the source tarball and connected to its shell to test SQL from the SQL reference manual. Now we are going to explore the Python API for connecting to a MonetDB database and executing SQL commands.


Python Package:

The Python package, hosted by MonetDB itself, is available in the PyPI repository and can be installed using the following command.

pip install python-monetdb

Or download the source tarball and install it manually using:

wget https://pypi.python.org/packages/source/p/python-monetdb/python-monetdb-11.19.3.2.tar.gz#md5=9031fd2ea4b86a2bc2d5dd1ab4b10a77
tar xvf python-monetdb-11.19.3.2.tar.gz
cd python-monetdb-11.19.3.2
python setup.py install

Create Test Table:

Now we will connect to the database created in the last post; you can substitute your own database.

mclient -u monetdb -d mydatabase
Create the table using the following SQL:

CREATE TABLE "sys"."test" (
    "id"   INTEGER,
    "data" VARCHAR(30)
);



Now here is Python code to insert data:

 import monetdb.sql  
 connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="mydatabase")  
 cursor = connection.cursor()  
 cursor.arraysize = 100  
 for a in range(1, 200):    
   cursor.execute("INSERT into sys.test(id, data) values(%s, '%s')"%(a, 'testing %s'%a))  
 connection.commit()  
 cursor.execute("SELECT * FROM sys.test LIMIT 1")  
 # To Fetch all rows as list  
 print cursor.fetchall()  
 # To Fetch single row as list  
 print cursor.fetchone()  




You can run any query through cursor.execute. For query syntax and SQL details, use the MonetDB SQL Reference manual.
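
For quick checks outside Python, the same kind of query can also be run non-interactively from the shell with mclient and its -s statement option (you will still be prompted for the password):

 ~$ mclient -u monetdb -d mydatabase -s "SELECT COUNT(*) FROM sys.test"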

Why MonetDB? Compile MonetDB with Ubuntu.

Overview

Column-store technology has found its way into the product offerings of all major commercial vendors. The market for applications empowered by these techniques provides ample space for further innovation.

WHY?

When your database grows into millions of rows you really need a scalable solution, and a column-store database management system is a good choice.
MonetDB innovates at all layers of a DBMS, e.g. a storage model based on vertical fragmentation, a modern CPU-tuned query execution architecture, automatic and self-tuning indexes, run-time query optimization, and a modular software architecture.

MonetDB has pioneered column-store solutions for high-performance data warehouses for business intelligence and eScience since 1993. It achieves its goal through innovations at all layers of a DBMS. It is based on the SQL 2003 standard with full support for foreign keys, joins, views, triggers, and stored procedures. It is fully ACID compliant and supports a rich spectrum of programming interfaces (JDBC, ODBC, PHP, Python, RoR, C/C++, Perl).

INSTALL MONETDB:

OS: UBUNTU 14.10

Download a copy of the MonetDB source tarball; I have fetched the latest available copy, 11.19.9, using the commands below. Extract it and go to the MonetDB directory:

~ # wget https://www.monetdb.org/downloads/sources/Oct2014-SP2/MonetDB-11.19.9.tar.bz2
~ # tar xvf MonetDB-11.19.9.tar.bz2
~ # cd MonetDB-11.19.9/

Now we will compile source for installation:

To configure and compile, we need the following packages to be installed:
  • make
  • pkg-config
  • bison
  • openssl
  • pcre
  • libxml2

To install the packages listed above, use the following commands on Ubuntu (sudo users should prefix each command with sudo, e.g. "sudo apt-get update").

apt-get update
apt-get install make
apt-get install pkg-config
apt-get install bison
apt-get install openssl
apt-get install libssl-dev
apt-get install libpcre3 libpcre3-dev
apt-get install libxml2 libxml2-dev


To configure and install the MonetDB source, use the following commands:
./configure
make
make install

To add the missing paths to the MonetDB libraries, run:
ldconfig -v

Now MonetDB is installed. To continue, we need to create and start a dbfarm to store data; use the following commands to create one.

monetdbd create /root/my_dbform
monetdbd start /root/my_dbform

After starting the dbfarm, you can create a database using the following commands.

monetdb create mydatabase
monetdb release mydatabase

To start the DB shell, use the following command. The default username/password for a fresh installation is monetdb.
mclient -u monetdb -d mydatabase    <------ Hit Enter and you will be asked for password

For the SQL reference, use the link provided below.

https://www.monetdb.org/Documentation/SQLreference