Thursday 23 April 2015

Apache Hive :: Compile Hive on Ubuntu 14 :: HADOOP 2.6.0

Getting Started with Hive

Apache Hive™ is data warehouse infrastructure built on top of Hadoop that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, the language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL. For details, visit the Hive wiki.

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL and PostgreSQL can optionally be used. While Yahoo was working with Pig for deployment on Hadoop, Facebook started its own warehouse solution on Hadoop, which resulted in Hive. The motivation was that traditional warehousing solutions were getting expensive. Though initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.

Components of Hive:

HCatalog is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools — including Pig and MapReduce — to more easily read and write data on the grid.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface.
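As a minimal sketch, assuming WebHCat is running on its default port 50111 on the local machine, its status endpoint can be queried over REST like this:
 ~$ curl -s 'http://localhost:50111/templeton/v1/status'  

The same interface exposes endpoints for submitting Pig, Hive, and MapReduce jobs.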

Hive is not built for quick-response applications and thus cannot be compared with systems designed for low response times; it is built for data-mining applications that post-process data distributed over a Hadoop cluster.

Features of Hive include:
  • Indexing to provide acceleration
  • Different storage types such as plain text, RCFile, HBase, ORC, and others (see the example after this list).
  • Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
  • Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining primitives. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
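To illustrate two of these features, here is a small HiveQL sketch, using made-up table and column names, that stores a table in the ORC format and queries it with a couple of built-in UDFs:
 CREATE TABLE page_views (  
   user_id   INT,  
   url       STRING,  
   view_time TIMESTAMP  
 )  
 STORED AS ORC;  
   
 -- to_date() and upper() are built-in Hive UDFs  
 SELECT to_date(view_time), upper(url) FROM page_views;  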

HiveQL

HiveQL is based on SQL, but it does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited subquery support. Support for INSERT, UPDATE, and DELETE with full ACID functionality was made available with release 0.14.
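For example, a CREATE TABLE AS SELECT statement (shown here against the mydb.testdata table created later in this post, with an arbitrary filter) looks like this:
 -- CTAS: create and populate a new table from a query in one step  
 CREATE TABLE mydb.odd_rows AS  
 SELECT id, data  
 FROM mydb.testdata  
 WHERE id % 2 = 1;  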

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
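The plan for any statement can be inspected with EXPLAIN, which prints the stages of that DAG; for example, against the test table created later in this post:
 EXPLAIN SELECT count(*) FROM mydb.testdata;  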

Data in Hive is organized into three units (a DDL sketch combining them follows below).
Tables: They are very similar to RDBMS tables and contain rows and columns. Since Hive is layered over the Hadoop Distributed File System (HDFS), tables are mapped directly to directories of the file system. Hive also supports tables stored in other native file systems.

Partitions: A Hive table can have one or more partitions. They are mapped to sub-directories of the table's directory in the file system.

Buckets: In Hive, data may further be divided into buckets. Buckets are stored as files inside the partition directories in the underlying file system.
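Here is a minimal DDL sketch showing all three units together; the table name, columns, partition key, and bucket count are hypothetical:
 -- One directory for the table, one sub-directory per log_date value,  
 -- and 8 bucket files inside each partition directory  
 CREATE TABLE logs (  
   user_id INT,  
   message STRING  
 )  
 PARTITIONED BY (log_date STRING)  
 CLUSTERED BY (user_id) INTO 8 BUCKETS;  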

Hive stores its metastore in a relational database containing metadata about Hive schemas.
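By default this is the embedded Derby database mentioned earlier. To point the metastore at an external database such as MySQL instead, hive-site.xml would carry connection properties along these lines (the URL, driver, and credentials below are placeholders):
 <property>  
   <name>javax.jdo.option.ConnectionURL</name>  
   <value>jdbc:mysql://localhost/hive_metastore</value>  
 </property>  
 <property>  
   <name>javax.jdo.option.ConnectionDriverName</name>  
   <value>com.mysql.jdbc.Driver</value>  
 </property>  
 <property>  
   <name>javax.jdo.option.ConnectionUserName</name>  
   <value>hiveuser</value>  
 </property>  
 <property>  
   <name>javax.jdo.option.ConnectionPassword</name>  
   <value>hivepassword</value>  
 </property>  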

For details on how to use HiveQL, check the Hive language manual wikis.

Prerequisites for Hive

Since Hive is built on Hadoop and uses HDFS, Hadoop must be installed prior to installing Hive; we have used Hadoop 2.6.0, the latest release to date. If Hadoop is not configured already, follow my post on How to Install Hadoop 2.6.0.
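To double-check which Hadoop release is on the machine before continuing (assuming HADOOP_HOME was set during the Hadoop installation):
 ~$ $HADOOP_HOME/bin/hadoop version  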

Next we require Apache Maven and Subversion. Apache Maven is required to build Apache Hive, while Subversion is required to check out the source for compilation. To avoid an Apache Maven version conflict, we remove maven2, if installed, before installing Maven.
 ~$ sudo apt-get update  
 ~$ sudo apt-get remove maven2  
 ~$ sudo apt-get install maven  
   
 ~$ sudo apt-get install subversion  
   

Compile Hive

After the prerequisites are installed, we can move forward and compile Hive. We will check out the source from the repository using Subversion and build it using Maven.

 svn co http://svn.apache.org/repos/asf/hive/trunk hive  
 cd hive  
   
 mvn clean install -Phadoop-2,dist -e -DskipTests  

Once compilation completes successfully, we will export the Hive path. Change the version number in the export command if it differs; the exact directory name can be verified by listing the packaging/target directory.
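For example, assuming the source was checked out under /root/hive as in the export line below:
 ~$ ls /root/hive/packaging/target/  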

 ~$ nano ~/.bashrc  

Add the following line:
 export HIVE_HOME=/root/hive/packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin  


Save and close the file, then reload .bashrc using the source command.
 ~$ source ~/.bashrc  

Now we will create the warehouse and temp directories for Hive in HDFS and set group write permissions on them.
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /tmp  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/hive/  
 ~$ $HADOOP_HOME/bin/hadoop fs -mkdir    /user/hive/warehouse  
   
 ~$ $HADOOP_HOME/bin/hadoop fs -chmod g+w  /tmp  
 ~$ $HADOOP_HOME/bin/hadoop fs -chmod g+w  /user/hive/warehouse  

To start the Beeline CLI, use the following.
 ~$ $HIVE_HOME/bin/beeline -u jdbc:hive2://  


Now we are ready to use Hive. For testing, we ran a few statements; a snapshot of the shell session is provided below. This was done on a machine with 8 GB of RAM, 4 CPU cores, and SSD drives.
 0: jdbc:hive2://> create database mydb;  
 15/04/22 08:31:23 [HiveServer2-Background-Pool: Thread-32]: WARN metastore.ObjectStore: Failed to get database mydb, returning NoSuchObjectException  
 OK  
 No rows affected (1.565 seconds)  
 0: jdbc:hive2://> CREATE TABLE mydb.testdata (  
 0: jdbc:hive2://>   id  INT,  
 0: jdbc:hive2://>   data VARCHAR(30)  
 0: jdbc:hive2://> );  
 OK  
 No rows affected (0.561 seconds)  
 0: jdbc:hive2://> INSERT into mydb.testdata(id, data) values(1, 'Testing 1');  
 Query ID = root_20150422083821_5e2e9f2c-dc47-4650-9280-a9e52cb61c7c  
 Total jobs = 3  
 Launching Job 1 out of 3  
 Number of reduce tasks is set to 0 since there's no reduce operator  
 15/04/22 08:38:22 [HiveServer2-Background-Pool: Thread-59]: ERROR mr.ExecDriver: yarn  
 15/04/22 08:38:23 [HiveServer2-Background-Pool: Thread-59]: WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.  
 Starting Job = job_1429700414970_0001, Tracking URL = http://hadoop:8088/proxy/application_1429700414970_0001/  
 Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1429700414970_0001  
 WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.  
 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
 15/04/22 08:38:31 [HiveServer2-Background-Pool: Thread-59]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead  
 2015-04-22 08:38:31,952 Stage-1 map = 0%, reduce = 0%  
 2015-04-22 08:38:39,433 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.24 sec  
 MapReduce Total cumulative CPU time: 2 seconds 240 msec  
 Ended Job = job_1429700414970_0001  
 Stage-4 is selected by condition resolver.  
 Stage-3 is filtered out by condition resolver.  
 Stage-5 is filtered out by condition resolver.  
 Moving data to: hdfs://localhost:9000/user/hive/warehouse/mydb.db/testdata/.hive-staging_hive_2015-04-22_08-38-21_684_4213030447536309872-1/-ext-10000  
 Loading data to table mydb.testdata  
 Table mydb.testdata stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]  
 MapReduce Jobs Launched:  
 Stage-Stage-1: Map: 1  Cumulative CPU: 2.24 sec  HDFS Read: 3732 HDFS Write: 77 SUCCESS  
 Total MapReduce CPU Time Spent: 2 seconds 240 msec  
 OK  
 No rows affected (19.328 seconds)  
 0: jdbc:hive2://> SELECT * FROM mydb.testdata;  
 OK  
 +----------+------------+--+  
 | testdata.id | testdata.data |  
 +----------+------------+--+  
 | 1    | Testing 1 |  
 +----------+------------+--+  
 1 row selected (0.223 seconds)  
   
 0: jdbc:hive2://> SELECT * FROM mydb.testdata WHERE testdata.id%2==1;  
 OK  
 +----------+------------+--+  
 | testdata.id | testdata.data |  
 +----------+------------+--+  
 | 1    | Testing 1 |  
 +----------+------------+--+  
 1 row selected (0.129 seconds)  
 0: jdbc:hive2://>  
   


Enjoy.
