Monday, 13 April 2015

MonetDB Basic Example with Python

Overview

When the size of your application database grows into millions of records distributed over different tables, and business intelligence or science becomes the prevalent application domain, a column-store database management system is called for. Unlike traditional row-stores such as MySQL and PostgreSQL, a column-store provides a modern and scalable solution without calling for substantial hardware investments.
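The intuition can be sketched in a few lines of plain Python. This is an illustration only, not MonetDB internals: a column layout keeps the values of one attribute contiguous, so an aggregate touches only that column instead of every full record.

```python
# Illustrative sketch (not MonetDB internals): the same three records
# stored row-wise and column-wise. An aggregate over one attribute must
# walk every record object in a row store, but only one list in a
# column store.
rows = [
    {"id": 1, "name": "ann", "salary": 100},
    {"id": 2, "name": "bob", "salary": 200},
    {"id": 3, "name": "cid", "salary": 300},
]
columns = {
    "id": [1, 2, 3],
    "name": ["ann", "bob", "cid"],
    "salary": [100, 200, 300],
}

row_total = sum(r["salary"] for r in rows)  # scans whole records
col_total = sum(columns["salary"])          # scans one column only
assert row_total == col_total == 600
```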

In an earlier blog post we compiled MonetDB from the source tarball and connected to its shell to test SQL from the SQL reference manual. Now we are going to explore the Python API for connecting to a MonetDB database and executing SQL commands.


Python Package:

The Python package, maintained by MonetDB itself, is available in the PyPI repository and can be installed with the following command.

pip install python-monetdb

or download the source tarball and install it manually using:

wget https://pypi.python.org/packages/source/p/python-monetdb/python-monetdb-11.19.3.2.tar.gz#md5=9031fd2ea4b86a2bc2d5dd1ab4b10a77
tar xvf python-monetdb-11.19.3.2.tar.gz
cd python-monetdb-11.19.3.2
python setup.py install

Create Test Table:

Now we will connect to the database created in the last post; change the database name to your own if needed.

mclient -u monetdb -d mydatabase
Create the table using the following SQL:

CREATE TABLE "sys"."test" (
    "id"   INTEGER,
    "data" VARCHAR(30)
);



Here is the Python code to insert and fetch data:

import monetdb.sql

connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="mydatabase")
cursor = connection.cursor()
cursor.arraysize = 100  # batch size used by fetchmany()
for a in range(1, 200):
    # Parameterized query: the driver escapes the values, so we do not
    # interpolate them into the SQL string ourselves
    cursor.execute("INSERT INTO sys.test (id, data) VALUES (%s, %s)", (a, 'testing %d' % a))
connection.commit()
cursor.execute("SELECT * FROM sys.test")
# Fetch a single row as a tuple
print cursor.fetchone()
# Fetch the remaining rows as a list
print cursor.fetchall()




You can run any query through cursor.execute. For the full set of SQL statements, consult the MonetDB SQL Reference manual.
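python-monetdb implements the standard Python DB-API 2.0 interface, so the parameterized-query and batched-fetch pattern is the same as for any DB-API driver. The sketch below uses the stdlib sqlite3 module purely so it runs without a MonetDB server; note that python-monetdb uses %s placeholders where sqlite3 uses ?, and the table name here is just an illustration.

```python
# DB-API 2.0 cursor pattern, demonstrated with sqlite3 so the sketch is
# runnable without a MonetDB server. With python-monetdb the calls are
# the same, but placeholders are %s instead of ?.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test (id INTEGER, data VARCHAR(30))")

# executemany() inserts a whole batch with one parameterized statement
cur.executemany(
    "INSERT INTO test (id, data) VALUES (?, ?)",
    [(i, "testing %d" % i) for i in range(1, 200)],
)
conn.commit()

cur.execute("SELECT id, data FROM test ORDER BY id")
batch = cur.fetchmany(100)  # fetch rows in chunks, like cursor.arraysize
print(len(batch))           # 100
print(batch[0])             # (1, 'testing 1')
```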

Why Monetdb? Compile Monetdb with Ubuntu.

Overview

Column-store technology has found its way into the product offerings of all major commercial vendors. The market for applications empowered by these techniques provides ample space for further innovation.

WHY?

When your database grows into millions of rows, a column-store database management system is a good choice.
MonetDB innovates at all layers of a DBMS: a storage model based on vertical fragmentation, a modern CPU-tuned query execution architecture, automatic and self-tuning indexes, run-time query optimization, and a modular software architecture.

MonetDB has pioneered column-store solutions for high-performance data warehouses for business intelligence and eScience since 1993. It achieves this goal through innovations at all layers of a DBMS. It is based on the SQL 2003 standard with full support for foreign keys, joins, views, triggers, and stored procedures. It is fully ACID compliant and supports a rich spectrum of programming interfaces (JDBC, ODBC, PHP, Python, RoR, C/C++, Perl).

INSTALL MONETDB:

OS: UBUNTU 14.10

Download a copy of the MonetDB source tarball; I fetched the latest available copy, 11.19.9, using the commands below. Extract it and change into the MonetDB directory:

~ # wget https://www.monetdb.org/downloads/sources/Oct2014-SP2/MonetDB-11.19.9.tar.bz2
~ # tar xvf MonetDB-11.19.9.tar.bz2
~ # cd MonetDB-11.19.9/

Now we will compile the source for installation.

To configure and compile, the following packages need to be installed:
  • make
  • pkg-config
  • bison
  • openssl
  • pcre
  • libxml2

To install the packages listed above on Ubuntu, use the following commands (non-root users should prefix each command with sudo, as in "sudo apt-get update"):

apt-get update
apt-get install make
apt-get install pkg-config
apt-get install bison
apt-get install openssl
apt-get install libssl-dev
apt-get install libpcre3 libpcre3-dev
apt-get install libxml2 libxml2-dev


To configure, build, and install MonetDB from source, use the following commands:
./configure
make
make install

To register the newly installed MonetDB libraries with the dynamic linker, run:
ldconfig -v

Now MonetDB is installed. To continue, we need to create and start a dbfarm to store the data; use the following commands to create one.

monetdbd create /root/my_dbform
monetdbd start /root/my_dbform

After starting the dbfarm you can create a database using the following commands.

monetdb create mydatabase
monetdb release mydatabase

To start the DB shell, use the following command. The default username/password for a fresh installation is monetdb.
mclient -u monetdb -d mydatabase    <------ hit Enter and you will be prompted for the password

For the SQL reference, use the link below.

https://www.monetdb.org/Documentation/SQLreference 

Thursday, 8 May 2014

Running your Example On hadoop 2.2.0 using python


Overview

Although the Hadoop framework is written in Java, we can use other languages, such as Python and C++, to write MapReduce programs for Hadoop. Hadoop's documentation suggests translating your code into a Java JAR file using Jython, which is not very convenient and can even be problematic if you depend on Python features not provided by Jython.

Example

We will write a simple WordCount MapReduce program in pure Python. The input is text files and the output is a file with words and their counts. You can use other languages, such as Perl, in the same way.

Prerequisites

You should have a Hadoop cluster running. If you do not have a cluster ready yet, try this post to start with a single-node cluster.

MapReduce

The idea behind the Python code is to use the Hadoop streaming API to transfer data and results between our map and reduce code via STDIN (sys.stdin) and STDOUT (sys.stdout): we read input from STDIN and print output to STDOUT.

Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs. All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output. The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. 

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
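The map-count arithmetic above can be checked in a couple of lines; the number of map tasks is roughly the total input size divided by the HDFS block size.

```python
# Number of map tasks is roughly: total input size / HDFS block size.
TB = 1024 ** 4
MB = 1024 ** 2

input_size = 10 * TB   # the 10 TB input from the text
block_size = 128 * MB  # the 128 MB block size from the text

num_maps = input_size // block_size
print(num_maps)  # 81920, i.e. roughly the 82,000 maps quoted above
```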

Reducer

Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer has 3 primary phases: shuffle, sort and reduce.

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

Reduce

In this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the File-system. Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
*The output of the Reducer is not sorted.

How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
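The rule of thumb above is easy to express directly; the cluster size and slot count below are hypothetical, chosen only to show the two scaling factors side by side.

```python
# Rule of thumb from the text: reduces = factor * (nodes * max reduce
# slots per node), with factor 0.95 (one wave) or 1.75 (two waves).
def suggested_reduces(nodes, slots_per_node, factor):
    return int(factor * nodes * slots_per_node)

# Hypothetical 10-node cluster with 2 reduce slots per TaskTracker:
print(suggested_reduces(10, 2, 0.95))  # 19 -> all reduces launch at once
print(suggested_reduces(10, 2, 1.75))  # 35 -> faster nodes run a second wave
```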

Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the File-system, into the output path. The framework does not sort the map-outputs before writing them out to the File-system.

Sample Code:

mapper.py


import sys

# Hadoop streaming feeds each line of the input split on STDIN
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    # Emit one tab-separated "<word> 1" pair per word
    for word in words:
        print '%s\t%s' % (word, 1)


reducer.py

import sys

current_word = None
current_count = 0
word = None

# Hadoop streaming sorts the mapper output by key, so all counts for a
# given word arrive as consecutive lines
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Key changed: emit the total for the previous word
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Emit the last word, if any
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
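Because the mapper and reducer only read STDIN and write STDOUT, their logic can be sanity-checked without a cluster. The sketch below reproduces Hadoop's map, shuffle/sort, and reduce phases in-process on a couple of sample lines, using the same word-count logic as mapper.py and reducer.py above.

```python
# Local sanity check of the streaming logic: reproduce Hadoop's
# map -> shuffle/sort -> reduce pipeline in-process.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Same job as mapper.py: emit (word, 1) per word
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Hadoop sorts map output by key before the reducer sees it,
    # so grouping consecutive equal keys gives per-word totals
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

result = dict(reduce_phase(map_phase(["the quick fox", "the lazy dog the"])))
print(result)  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```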


Running Hadoop's Job

Download Example Data to home directory like /home/elite/Downloads/examples/
Book1
Book2
Book3



Start Cluster

$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
$ sbin/mr-jobhistory-daemon.sh start historyserver

Copy Data from Local to dfs File System
$ bin/hadoop dfs -mkdir /wordscount
$ bin/hadoop dfs -copyFromLocal /home/hdpuser/gutenberg/ /wordscount/


Check files on dfs
$ bin/hadoop dfs -ls /wordscount/gutenberg

Run MapReduce Job

I have both mapper.py and reducer.py in /home/hdpuser/; here is the command to run the job.

$ bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -file  /home/hdpuser/mapper.py -mapper /home/hdpuser/mapper.py -file /home/hdpuser/reducer.py -reducer /home/hdpuser/reducer.py -input /wordscount/gutenberg/* -output /wordscount/wc.out
You can check the status from the terminal or from the web page http://elite-pc:19888/jobhistory, which provides extensive details about the executed job.



Check Result

Browse the URL below and check the created files; this URL is reached from http://localhost:50070 to access the file system.

http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=127.0.0.1:54310



Stop running cluster

$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/mr-jobhistory-daemon.sh stop historyserver

Wednesday, 7 May 2014

Installing Hadoop Single Node - 2.2

Get Started

Now we will look at how to install a stable version of Apache Hadoop on a laptop running Linux Mint 15; the steps will work on all Debian-based systems, including Ubuntu. To start we need to acquire the Hadoop package and have Java installed. To install Java, if it is not already installed, follow my install-Java post; to check which versions of Java are supported by Hadoop, see Hadoop Java Versions. The next step is to acquire Hadoop, which can be downloaded from the Hadoop webpage. We opted for hadoop-2.2.0 in this post.

Apache Hadoop NextGen MapReduce (YARN)

MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN.
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.





The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource Container which incorporates elements such as memory, cpu, disk, network etc. In the first version, only memory is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.
The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

Apache Hadoop 2.0 Installation

Create Dedicated Hadoop User

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hdpuser

Give user sudo rights

$ sudo nano /etc/sudoers
add this to end of file
hdpuser ALL=(ALL:ALL) ALL

Configuring Secure Shell (SSH)   

Communication between master and slave nodes uses SSH, so we need to ensure that an SSH server is installed and the SSH daemon is running.

Install the server with the following command:

$ sudo apt-get install openssh-server

You can check the status of the server with this command:

$ /etc/init.d/ssh status

To start ssh server use:

$ /etc/init.d/ssh start

Now that the SSH server is running, we need to set up a local SSH connection without a password. To enable passphraseless SSH, use:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

to check ssh

$ ssh localhost
$ exit

Disabling IPv6

We need to make sure IPv6 is disabled; it is best to disable it because all Hadoop communication between nodes is IPv4-based.

For this, first access the file /etc/sysctl.conf

$ sudo nano /etc/sysctl.conf
add following lines to end
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and exit

Reload sysctl for changes to take effect

$ sudo sysctl -p /etc/sysctl.conf

If the following command returns 1 (after reboot), it means IPv6 is disabled.

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Install Hadoop

Download Version 2.2.0 (Stable Version)

Make Hadoop installation directory

$ sudo mkdir -p /usr/hadoop

Copy Hadoop installer to installation directory

$ sudo cp -r ~/Downloads/hadoop-2.2.0.tar.gz /usr/hadoop

Extract Hadoop installer

$ cd /usr/hadoop
$ sudo tar xvzf hadoop-2.2.0.tar.gz

Rename it to hadoop

$ sudo mv hadoop-2.2.0 hadoop

Change owner to hdpuser for this folder

$ sudo chown -R hdpuser:hadoop hadoop

Update .bashrc with Hadoop-related environment variables

$ sudo nano ~/.bashrc
Add following lines at the end:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/hadoop/hadoop
export HADOOP_PREFIX=/usr/hadoop/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
#Java path
# Important: use /usr/ only if you installed Java from apt-get;
# if Java was installed manually, use e.g. /usr/local/java/jdk1.7.0_51 (for installed version 1.7.0_51) instead
export JAVA_HOME='/usr/'
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin



Save & Exit

Reload bashrc

$ source ~/.bashrc


Update JAVA_HOME in hadoop-env.sh

$ cd /usr/hadoop/hadoop
$ sudo vi etc/hadoop/hadoop-env.sh

Add the line:
export JAVA_HOME=/usr/

or if Java is Installed Manually
export JAVA_HOME=/usr/local/java/jdk1.7.0_51

Save and exit

Create a Directory to hold Hadoop’s Temporary Files:

$ sudo mkdir -p /usr/hadoop/tmp

Provide hdpuser the rights to this directory

$ sudo chown hdpuser:hadoop /usr/hadoop/tmp


Hadoop Configurations

Modify core-site.xml – Core Configuration

$ sudo nano etc/hadoop/core-site.xml

Add the following lines between configuration tags
<property>
   <name>hadoop.tmp.dir</name>
   <value>/usr/hadoop/tmp</value>
   <description>Hadoop's temporary directory</description>
</property>
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
</property>

Modify mapred-site.xml – MapReduce configuration

$ sudo cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ sudo nano etc/hadoop/mapred-site.xml

Add the following lines between configuration tags
<property>
   <name>mapred.job.tracker</name>
   <value>localhost:54311</value>
   <description>The URI is used to monitor the status of MapReduce tasks</description>
</property>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

Modify yarn-site.xml – YARN

$ sudo nano etc/hadoop/yarn-site.xml

Add following lines between configuration tags:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

 Modify hdfs-site.xml – File Replication

$ sudo nano etc/hadoop/hdfs-site.xml

Add following lines between configuration tags and check file path:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/hadoop/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/hadoop/hadoop/yarn_data/hdfs/datanode</value>
</property>

Initializing the Single-Node Cluster


Formatting the Name Node:

While setting up the cluster for the first time, we need to initially format the Name Node in HDFS.
$ bin/hadoop namenode -format

Starting all daemons:

$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
$ sbin/mr-jobhistory-daemon.sh start historyserver

Check all daemon processes:

$ jps
4829 ResourceManager
4643 NameNode
4983 NodeManager
5224 JobHistoryServer
4730 DataNode
7918 Jps

You should now be able to browse the nameNode in your browser (after a short delay for startup) by browsing to the following URLs:

nameNode: http://localhost:50070/

Stopping all daemons:

$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/mr-jobhistory-daemon.sh stop historyserver

Now run the examples. If you are looking for examples you can run without changing your style of code, I am going to run Python MapReduce on the new version of Hadoop; wait for that post.

Monday, 3 March 2014

Running your First Example On hadoop using python


Overview

Although the Hadoop framework is written in Java, we can use other languages, such as Python and C++, to write MapReduce programs for Hadoop. Hadoop's documentation suggests translating your code into a Java JAR file using Jython, which is not very convenient and can even be problematic if you depend on Python features not provided by Jython.

Example

We will write a simple WordCount MapReduce program in pure Python. The input is text files and the output is a file with words and their counts. You can use other languages, such as Perl, in the same way.

Prerequisites

You should have a Hadoop cluster running. If you do not have a cluster ready yet, try this post to start with a single-node cluster.

MapReduce

The idea behind the Python code is to use the Hadoop streaming API to transfer data and results between our map and reduce code via STDIN (sys.stdin) and STDOUT (sys.stdout): we read input from STDIN and print output to STDOUT.

mapper.py


import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)


reducer.py

import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '%s\t%s' % (current_word, current_count)


Running Hadoop's Job

Download Example Data to home directory like /home/elite/Downloads/examples/
Book1
Book2
Book3



Start Cluster

$ bin/start-all.sh
Copy Data from Local to dfs File System
$ bin/hadoop dfs -copyFromLocal /home/elite/Downloads/examples/ /home/hdpuser/wordscount/

Check files on dfs
$ bin/hadoop dfs -ls /home/hdpuser/wordscount

Run MapReduce Job

I have both mapper.py and reducer.py in /home/hdpuser/; here is the command to run the job.
$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
-input /home/hdpuser/wordscount/* -output /home/hdpuser/wordscount.out

You can check the status from the terminal or from the web page http://localhost:50030/ configured in your cluster setup. After the job is complete, we can get the results back by copying the output file from the Hadoop file system to the local one:

$ bin/hadoop dfs -copyToLocal /home/hdpuser/wordscount.out /home/hdpuser/

Check Result

$ vi /home/hdpuser/wordscount.out/part-00000

Stop running cluster

$ bin/stop-all.sh

Sunday, 2 March 2014

Installing Hadoop Single Node - 1.2.1

Get Started

Now we will look at how to install a stable version of Apache Hadoop on a laptop running Linux Mint 15; the steps will work on all Debian-based systems, including Ubuntu. To start we need to acquire the Hadoop package and have Java installed. To install Java, if it is not already installed, follow my install-Java post; to check which versions of Java are supported by Hadoop, see Hadoop Java Versions. The next step is to acquire Hadoop, which can be downloaded from the Hadoop webpage. We opted for hadoop-1.2.1 in this post.

Create Dedicated Hadoop User

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hdpuser

Give user sudo rights

$ sudo nano /etc/sudoers
add this to end of file
hdpuser ALL=(ALL:ALL) ALL

Configuring Secure Shell (SSH)   

Communication between master and slave nodes uses SSH, so we need to ensure that an SSH server is installed and the SSH daemon is running.

Install the server with the following command:

$ sudo apt-get install openssh-server

You can check the status of the server with this command:

$ /etc/init.d/ssh status

To start ssh server use:

$ /etc/init.d/ssh start

Now that the SSH server is running, we need to set up a local SSH connection without a password. To enable passphraseless SSH, use:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

to check ssh

$ ssh localhost
$ exit

Disabling IPv6

We need to make sure IPv6 is disabled; it is best to disable it because all Hadoop communication between nodes is IPv4-based.

For this, first access the file /etc/sysctl.conf

$ sudo nano /etc/sysctl.conf
add following lines to end
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and exit

Reload sysctl for changes to take effect

$ sudo sysctl -p /etc/sysctl.conf

If the following command returns 1 (after reboot), it means IPv6 is disabled.

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Install Hadoop

Download Version 1.2.1 (Stable Version)

Make Hadoop installation directory

$ sudo mkdir -p /usr/hadoop

Copy Hadoop installer to installation directory

$ sudo cp -r ~/Downloads/hadoop-1.2.1.tar.gz /usr/hadoop

Extract Hadoop installer

$ cd /usr/hadoop
$ sudo tar xvzf hadoop-1.2.1.tar.gz

Rename it to hadoop

$ sudo mv hadoop-1.2.1 hadoop

Change owner to hdpuser for this folder

$ sudo chown -R hdpuser:hadoop hadoop

Update .bashrc with Hadoop-related environment variables

$ sudo nano ~/.bashrc
Add following lines at the end:
# Set HADOOP_HOME
export HADOOP_HOME=/usr/hadoop/hadoop
# Set JAVA_HOME
# Important: if you installed Java from apt-get,
# use /usr instead of /usr/local/java/jdk1.7.0_51
export JAVA_HOME=/usr/local/java/jdk1.7.0_51
# Add Hadoop bin directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Save & Exit

Reload bashrc

$ source ~/.bashrc


Update JAVA_HOME in hadoop-env.sh

$ cd /usr/hadoop/hadoop
$ sudo nano conf/hadoop-env.sh

Add the line:
export JAVA_HOME=/usr/local/java/jdk1.7.0_51

Save and exit

Create a Directory to hold Hadoop’s Temporary Files:

$ sudo mkdir -p /usr/hadoop/tmp

Provide hdpuser the rights to this directory

$ sudo chown hdpuser:hadoop /usr/hadoop/tmp


Hadoop Configurations

Modify conf/core-site.xml – Core Configuration

$ sudo nano conf/core-site.xml

Add the following lines between configuration tags
<property>
   <name>hadoop.tmp.dir</name>
   <value>/usr/hadoop/tmp</value>
   <description>Hadoop's temporary directory</description>
</property>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:54310</value>
   <description>Specifying HDFS as the default file system.</description>
</property>

Modify conf/mapred-site.xml – MapReduce configuration

$ sudo nano conf/mapred-site.xml

Add the following lines between configuration tags
<property>
   <name>mapred.job.tracker</name>
   <value>localhost:54311</value>
   <description>The URI is used to monitor the status of MapReduce tasks</description>
</property>

Modify conf/hdfs-site.xml – File Replication

$ sudo nano conf/hdfs-site.xml

Add following lines between configuration tags:
<property>
   <name>dfs.replication</name>
   <value>1</value>
   <description>Default block replication.</description>
</property>

Initializing the Single-Node Cluster


Formatting the Name Node:

While setting up the cluster for the first time, we need to initially format the Name Node in HDFS.
$ bin/hadoop namenode -format

Starting all daemons:

$ bin/start-all.sh

You should now be able to browse the nameNode and JobTracker in your browser (after a short delay for startup) by browsing to the following URLs:

nameNode: http://localhost:50070/
JobTracker: http://localhost:50030/

Stopping all daemons:

$ bin/stop-all.sh

You can also start and stop the daemons separately:

hdfs:

$ bin/start-dfs.sh
$ bin/stop-dfs.sh

mapred:

$ bin/start-mapred.sh
$ bin/stop-mapred.sh


Now run the Java WordCount example. If you are looking for examples you can run without changing your style of code, I am going to run Python MapReduce; wait for that post.

Saturday, 1 March 2014

Installing Sun-Java JDK 7

Install Java JDK

OS: Mint 15; will work on all Debian-based systems, including Ubuntu.

Easy Way:

The simple and easy way to install the JDK is via the apt-get repository, but note that the PPA sometimes becomes outdated.
This installs JDK 7 (which includes Java JDK, JRE and the Java browser plugin).

Remove any installed version of OpenJDK:
sudo apt-get purge openjdk-\*

Add PPA and update apt-get repo

$ sudo apt-get update
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update

Install it:
$ sudo apt-get install oracle-java7-installer

Check version and process id
Check your Java version to ensure installations and settings:
$ java -version
Verify that JPS (JVM Process Status tool) is up and running
$ jps

Manual way:

1) Remove any previous OpenJDK installations
$ sudo apt-get purge openjdk-\*

2) Make directory to hold Sun Java
$ sudo mkdir -p /usr/local/java

3) Download Oracle Java Sun (JDK/JRE) from Oracle’s website:

JDK Download and JRE Download. Normally downloaded files will be

placed in /home/<your_user_name>/Downloads folder.

4) Copy the downloaded files to the Java directory
$ cd /home/<your_user_name>/Downloads
$ sudo cp -r jdk-7u51-linux-x64.tar.gz /usr/local/java
$ sudo cp -r jre-7u51-linux-x64.tar.gz /usr/local/java

5) Unpack the compressed binaries
$ cd /usr/local/java
$ sudo tar xvzf jdk-7u51-linux-x64.tar.gz
$ sudo tar xvzf jre-7u51-linux-x64.tar.gz

6) Cross-check the extracted binaries:
$ ls -a
The following two folders should be created: jdk1.7.0_51 and jre1.7.0_51

7) To add the JDK/JRE paths to the system PATH (set in /etc/profile), first open the file:

$ sudo nano /etc/profile

and add the following lines at the end:
JAVA_HOME=/usr/local/java/jdk1.7.0_51
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=/usr/local/java/jre1.7.0_51
PATH=$PATH:$HOME/bin:$JRE_HOME/bin
export JAVA_HOME
export JRE_HOME
export PATH

Save and exit (CTRL+O then Enter, then press CTRL+X then Enter)

8) Inform OS about Oracle Sun Java location to signal that it is ready for use:

JDK is available:
$ sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_51/bin/javac" 1

JRE is available:
$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jre1.7.0_51/bin/java" 1

Java Web Start is available:
$ sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jre1.7.0_51/bin/javaws" 1

9) Make Oracle Sun JDK/JRE the default on your system:
Set JRE:
$ sudo update-alternatives --set java /usr/local/java/jre1.7.0_51/bin/java

Set javac Compiler:
$ sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_51/bin/javac

Set Java Web Start:
$ sudo update-alternatives --set javaws /usr/local/java/jre1.7.0_51/bin/javaws

10) Re-load the /etc/profile
$ source /etc/profile

11) Check your Java version to ensure installations and settings:
$ java -version

12) Verify that JPS (JVM Process Status tool) is up and running
$ jps

This will show the process ID of the jps process.