Monday 3 March 2014

Running Your First Example on Hadoop Using Python


Overview

Even though the Hadoop framework is written in Java, we can use other languages such as Python and C++ to write MapReduce programs for Hadoop. However, Hadoop's documentation suggests translating such code into a Java jar file using Jython, which is not very convenient and can even be problematic if you depend on Python features not provided by Jython.

Example

We will write a simple WordCount MapReduce program in pure Python. The input is a set of text files and the output is a file listing each word and its count. The same approach works with other languages such as Perl.

Prerequisites

You should have a Hadoop cluster running. If you do not have a cluster ready yet, try this guide to start with a single-node cluster.

MapReduce

The idea behind the Python code is to use the Hadoop streaming API to pass data between our map and reduce code through STDIN (sys.stdin) and STDOUT (sys.stdout): each script reads its input from STDIN and prints its results to STDOUT as tab-separated key/value pairs. Hadoop takes care of sorting the mapper's output by key before it is handed to the reducer.

mapper.py


#!/usr/bin/env python
import sys

# read lines from STDIN, split them into words and
# emit one "word<TAB>1" pair per word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
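
You can give the mapper a quick test from the shell before involving Hadoop at all. Assuming python is on your PATH and mapper.py is in the current directory:

$ echo "foo foo bar" | python mapper.py
foo	1
foo	1
bar	1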


reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Hadoop streaming sorts the mapper output by key before it reaches the
# reducer, so all counts for a given word arrive on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # the key changed, emit the total for the previous word
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
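
With both scripts in place you can simulate the whole job locally; the sort step stands in for Hadoop's shuffle/sort phase, and the sample input is only illustrative:

$ echo "foo foo bar" | python mapper.py | sort -k1,1 | python reducer.py
bar	1
foo	2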


Running the Hadoop Job

Download the example data to a local directory such as /home/elite/Downloads/examples/:
Book1
Book2
Book3



Start Cluster

$ bin/start-all.sh

Copy Data from the Local File System to HDFS
$ bin/hadoop dfs -mkdir /home/hdpuser/wordscount
$ bin/hadoop dfs -copyFromLocal /home/elite/Downloads/examples/ /home/hdpuser/wordscount/

Here we have created a directory named wordscount in the Hadoop file system and copied our local directory containing the test data into HDFS. We can check that the files have been copied properly by listing the directory's contents, as shown below.

Check files on dfs
$ bin/hadoop dfs -ls /home/hdpuser/wordscount

Run MapReduce Job

Both mapper.py and reducer.py are in /home/hdpuser/; here is the command to run the job. The -file options ship the two scripts with the job so every node in the cluster can execute them.
$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-file /home/hdpuser/mapper.py -mapper /home/hdpuser/mapper.py \
-file /home/hdpuser/reducer.py -reducer /home/hdpuser/reducer.py \
-input /home/hdpuser/wordscount/* -output /home/hdpuser/wordscount.out
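
If the job complains that it cannot run the scripts, make sure both files are executable (chmod +x mapper.py reducer.py, which is why the scripts start with a #!/usr/bin/env python line), or pass them through the interpreter explicitly, e.g. -mapper "python mapper.py" together with the matching -file option.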

You can check the job status from the terminal or from the JobTracker web page at http://localhost:50030/ (as configured in your cluster setup). After the job completes, we can get the results back by copying the output directory from the Hadoop file system to the local one.

$ bin/hadoop dfs -copyToLocal /home/hdpuser/wordscount.out /home/hdpuser/

Check Result

$ vi /home/hdpuser/wordscount.out/part-00000
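
If you just want to peek at the result without copying it out of HDFS first, you can also print the output file directly:

$ bin/hadoop dfs -cat /home/hdpuser/wordscount.out/part-00000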

Stop running cluster

$ bin/stop-all.sh
