Hive

Hadoop
------------------------------
MapReduce | Pig | Hive

SPARK
------------------------------
SQL & Hive | Streaming | ML | GraphX

NOSQL
------------------------------
MongoDB | HBase

Data Ingestion Tools
------------------------------
Sqoop | Flume

As we know, Hadoop is an open source framework to store and process Big Data in a distributed environment. It contains two modules: MapReduce and HDFS.

Pig and Hive are open source platforms used mainly for the same purpose: they ease the complexity of writing Java-based MapReduce programs. Hive is a data warehouse layer that uses MapReduce to analyze data stored on HDFS, and it provides a query language called HiveQL that closely resembles the Structured Query Language (SQL) standard. Hive originated at Facebook; it was created for analysts with strong SQL skills but little Java programming experience who needed to run queries on the large volumes of data that Facebook stored in HDFS. Apache Pig and Hive are two projects that sit as a layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library.

  • Hive stores its schema in a database and the processed data in HDFS
  • It is designed for Online Analytical Processing (OLAP)
  • It provides an SQL-like query language called HiveQL

Hive is not designed for:
  • Use as a relational database
  • Online Transaction Processing (OLTP)
  • Real-time queries and row-level updates
Why Hive?
Hive provides a query language based on standard SQL instead of requiring the rapid development of map and reduce tasks. Hive takes HiveQL statements and automatically transforms each query into one or more MapReduce jobs. It then runs the jobs and returns the output to the user. Whereas Hadoop Streaming reduces the required code, compile, and submit cycle, Hive removes it completely and instead requires only the composition of HiveQL statements.

This interface to Hadoop not only accelerates the time required to produce results from data analysis, it also significantly expands the audience for whom Hadoop and MapReduce are useful.
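
For example, a single HiveQL statement like the one below replaces what would otherwise be a hand-written MapReduce job (the employee table and its dept column are hypothetical, used only to illustrate the idea):

-- Hypothetical example: count employees per department.
-- Hive compiles this query into one or more MapReduce jobs behind the scenes.
SELECT dept, COUNT(*) AS emp_count
FROM employee
GROUP BY dept;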


What makes Hive popular?
  • Users are provided with strong and powerful statistical functions.
  • It is similar to SQL, so the concepts are easy to understand.
  • It can be combined with HBase for querying data stored in HBase. This capability is not available in Pig; Pig's HbaseStorage() function is mainly used for loading data from HBase.
  • It is supported by Hue.
  • It is used by various groups such as CNET, Last.fm, Facebook, and Digg.
  • Apache Hive is data warehouse software that lets you read, write, and manage huge volumes of data stored in a distributed environment using SQL. Structure can be projected onto data already in storage. Users can connect to Hive using a JDBC driver or a command line tool.
  • Hive is an open source system. It is used for analyzing and querying large datasets stored in Hadoop files, and it is similar to SQL programming. The current version of Hive is 0.13.1.
  • Hive supports ACID transactions (Atomicity, Consistency, Isolation, and Durability). ACID semantics are provided at the row level for the Insert, Delete, and Update operations.
  • Hive is not a complete database. The design rules and limitations of Hadoop and HDFS put restrictions on what Hive can do.
Hive is most suitable for data warehouse applications where:
  • The data being analyzed is largely static
  • Fast response times are not required
  • The datasets do not change rapidly
Hive does not provide the fundamental features required for OLTP (Online Transaction Processing). It is best suited to data warehouse applications over large datasets.

Architecture of Hive:
The Hive architecture contains the following units:

User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Metastore: The Hive metastore is the component that stores all the structure information (metadata) of objects such as tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored. Any JDBC-compliant database can be used as the metastore; the Derby database is the default, and it supports only a single user, so only one shell can be open at a time.

HiveQL Process Engine: HiveQL is similar to SQL and queries the schema information held in the metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing a MapReduce program in Java, we write a query, and Hive turns it into a MapReduce job.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce would. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System (HDFS) or HBase is the data storage layer used to store the data.
Working of Hive:
The following steps describe how Hive interacts with the Hadoop framework:

Step 1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan and the requirements of the query.
Step 3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Step 4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6. Execute Plan: The driver sends the execution plan to the execution engine.
Step 7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the Name node, and the JobTracker assigns the job to TaskTrackers, which run on the Data nodes. Here, the query executes as a MapReduce job.
Step 7.1. Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
Step 8. Fetch Result: The execution engine receives the results from the Data nodes.
Step 9. Send Results: The execution engine sends those result values to the driver.
Step 10. Send Results: The driver sends the results to the Hive interfaces.

There are two types of tables in Hive (see the sketch after this list):
  • Managed table
  • External table
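
A minimal sketch contrasting the two (table names, columns, and the HDFS path are made up for illustration): a managed table's data is owned by Hive and is deleted when the table is dropped, while an external table only registers metadata, so dropping it leaves the underlying files in place.

-- Managed table: data lives under Hive's warehouse directory
-- and is removed when the table is dropped.
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: data stays at the given HDFS location even after DROP TABLE.
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/logs';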
Data Types:

Hive supports different data types for table columns. The data types supported by Hive can be broadly classified into primitive and complex data types.
The primitive data types supported by Hive are listed below:
  1.  Numeric Types
    TINYINT (1-byte signed integer, from -128 to 127)
    SMALLINT (2-byte signed integer, from -32,768 to 32,767)
    INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
    BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
    FLOAT (4-byte single precision floating point number)
    DOUBLE (8-byte double precision floating point number)
    DECIMAL (Hive 0.13.0 introduced user definable precision and scale)
  2. Date/Time Types
    TIMESTAMP: Supports traditional UNIX timestamps with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".

    DATE:
    DATE values are described in year/month/day format in the form YYYY-MM-DD.

    We can also cast String or Timestamp values to the Date format if they match the pattern but were declared as the String or Timestamp data type.

    Cast Type                    Result
    cast(date as date)           Same date value.
    cast(date as string)         The date is formatted as a string in the form 'YYYY-MM-DD'.
    cast(date as timestamp)      Midnight of the year/month/day of the date value is returned as a timestamp.
    cast(string as date)         If the string is in the form 'YYYY-MM-DD', a corresponding date value is returned. If the string does not match this format, NULL is returned.
    cast(timestamp as date)      The year/month/day of the timestamp is returned as a date value.
  3.  String Types
    STRING
    VARCHAR: Supports lengths from 1 to 65535 characters
    CHAR: Supports lengths from 1 to 255 characters

  4.  Misc Types
    BOOLEAN
    BINARY
Complex Types:
In addition to primitive data types, Hive also supports complex data types (also known as collection data types), which are not available in many RDBMSs.

Complex types can be built up from primitive types and other composite types. The data type of the fields in a collection is specified using angle bracket notation. Currently Hive supports four complex data types. They are:
  1. arrays:   Ordered sequences of elements of the same type, indexable using zero-based integers. Similar to arrays in Java.
    Exp:
    array('siva', 'bala', 'praveen');
  2. maps:  Collection of key-value pairs. Fields are accessed using array notation of keys (e.g., [‘key’])
    Exp:
    'first' -> 'bala', 'last' -> 'PG' is represented as map('first', 'bala', 'last', 'PG').
    Now 'bala' can be accessed with map['first'].
  3. structs:  It is similar to STRUCT in C language. It is a record type which encapsulates a set of named fields that can be any primitive data type. Elements in STRUCT type are accessed using the DOT (.) notation.
    Exp:
    For a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a
  4. union: A union is a collection of heterogeneous data types. You can create an instance using the create_union UDF.
    UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

    Exp:
    {0:1} 
    {1:2.0} 
    {2:["three","four"]} 
    {3:{"a":5,"b":"five"}}

Creating a Table in Hive using complex datatype:
CREATE TABLE user (
     name      STRING,
     id        BIGINT,
     isFTE     BOOLEAN,
     role      VARCHAR(64),
     salary    DECIMAL(8,2),
     phones    ARRAY<INT>,
     deductions MAP<STRING, FLOAT>,
     address   STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
     others    UNIONTYPE<FLOAT,BOOLEAN,STRING>,
     misc      BINARY
     )
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n';

Below are examples of the complex data types.

Array:
$ cat >arrayfile
1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang
hive> create table tab7(id int,name string,sal bigint,sub array<string>,city string)
> row format delimited
> fields terminated by ','
> collection items terminated by '$';
hive> load data local inpath '/home/training/arrayfile' overwrite into table tab7;
hive> select sub[2] from tab7 where id=1;
hive> select sub[0] from tab7;

MAP:
$ cat >mapfile
1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang
hive> create table tab10(id int,name string,sal bigint,sub array<string>,dud map<string,int>,city string)
> row format delimited
> fields terminated by ','
> collection items terminated by '$'
> map keys terminated by '#';
hive> load data local inpath '/home/training/mapfile' overwrite into table tab10;
hive> select dud["pf"] from tab10;
hive> select dud["pf"],dud["epf"] from tab10;

STRUCT:
$ cat >structfile
1,abc,40000,a$b$c,pf#500$epf#200,hyd$ap$500001
2,def,3000,d$f,pf#500,bang$kar$600038
hive> create table tab11(id int,name string,sal bigint,sub array<string>,dud map<string,int>,addr struct<city:string,state:string,pin:bigint>)
> row format delimited
> fields terminated by ','
> collection items terminated by '$'
> map keys terminated by '#';
hive> load data local inpath '/home/training/structfile' into table tab11;
hive> select addr.city from tab11;



Let's go through the main syntax used in Hive:

Create Database:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
hive> CREATE SCHEMA userdb;


Show Database:
hive> SHOW DATABASES;
default
userdb
Drop Database:
hive> DROP DATABASE IF EXISTS userdb;
-- The following query drops the database using CASCADE, which drops its tables before dropping the database itself.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
Create table:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Exp:
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Alter Table:

ALTER TABLE tblname RENAME TO new_name
ALTER TABLE tblname ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE tblname DROP [COLUMN] column_name
ALTER TABLE tblname CHANGE column_name new_name new_type
ALTER TABLE tblname REPLACE COLUMNS (col_spec[, col_spec ...])
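
As a quick sketch, here is how these statements might be applied to the employee table created earlier (the new column and table names are made up for illustration):

-- Rename the table
ALTER TABLE employee RENAME TO emp;
-- Add a new column
ALTER TABLE emp ADD COLUMNS (dept STRING COMMENT 'department name');
-- Rename an existing column and (optionally) change its type
ALTER TABLE emp CHANGE destination designation STRING;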

Show Tables:
show tables;

Loading File data into Hive table:

LOAD DATA LOCAL INPATH 'FILEPATH' OVERWRITE INTO TABLE table_name
We can change settings within a Hive session using the SET command. It is used to adjust Hive job settings for a query in order to get the desired results; a couple of examples follow.
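
For instance, session-level settings can be changed and inspected from the Hive shell as shown below (the chosen properties and values are just illustrative):

hive> SET hive.cli.print.header=true;
hive> SET mapred.reduce.tasks=8;
hive> SET hive.cli.print.header;

The first two commands change settings for the current session (printing column headers in query output, hinting the reducer count). Calling SET with just a property name displays its current value, and SET with no arguments lists the settings that have been overridden.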

Hive Operators:

  • Relational Operators
  • Arithmetic Operators
  • Logical Operators
  • String Operators
  • Operators on Complex Types
Hive Arithmetic Operators:
These operators perform common arithmetic operations on the operands; all of them return number types. The table below describes the arithmetic operators available in Hive:

Operator    Operand types       Description
A + B       All number types    Gives the result of adding A and B
A - B       All number types    Gives the result of subtracting B from A
A * B       All number types    Gives the result of multiplying A and B
A / B       All number types    Gives the result of dividing A by B
A % B       All number types    Gives the remainder resulting from dividing A by B
A & B       All number types    Gives the result of bitwise AND of A and B
A | B       All number types    Gives the result of bitwise OR of A and B
A ^ B       All number types    Gives the result of bitwise XOR of A and B

Hive Logical Operators:
Operator    Operand types    Description
A AND B     Boolean          TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL.
A OR B      Boolean          TRUE if either A or B or both are TRUE; FALSE OR NULL is NULL; otherwise FALSE.
NOT A       Boolean          TRUE if A is FALSE, NULL if A is NULL, otherwise FALSE.
!A          Boolean          Same as NOT A.

Operators on Complex Types:
Operator    Operand types                          Description
A[n]        A is an Array and n is an int          Returns the nth element of the array A. The first element has index 0.
M[key]      M is a Map<K, V> and key has type K    Returns the value corresponding to the key in the map.
S.x         S is a struct                          Returns the x field of S.
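
Applied to the user table created in the complex data types section above, these operators can be combined in a single query; the 'pf' map key is an assumed example value:

SELECT name,
       phones[0],          -- first element of the phones array
       deductions['pf'],   -- map lookup by key (assumed key)
       address.city        -- struct field access
FROM user;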


Hive Built-In Functions:
Hive supports the built-in functions listed below (signature -> return type, followed by a description):

round(double a) -> BIGINT
    Returns the rounded BIGINT value of the double.
floor(double a) -> BIGINT
    Returns the maximum BIGINT value that is equal to or less than the double.
ceil(double a) -> BIGINT
    Returns the minimum BIGINT value that is equal to or greater than the double.
rand(), rand(int seed) -> double
    Returns a random number that changes from row to row.
concat(string A, string B, ...) -> string
    Returns the string resulting from concatenating B after A.
substr(string A, int start) -> string
    Returns the substring of A starting from the start position to the end of string A.
substr(string A, int start, int length) -> string
    Returns the substring of A starting from the start position with the given length.
upper(string A) -> string
    Returns the string resulting from converting all characters of A to upper case.
ucase(string A) -> string
    Same as above.
lower(string A) -> string
    Returns the string resulting from converting all characters of A to lower case.
lcase(string A) -> string
    Same as above.
trim(string A) -> string
    Returns the string resulting from trimming spaces from both ends of A.
ltrim(string A) -> string
    Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
rtrim(string A) -> string
    Returns the string resulting from trimming spaces from the end (right-hand side) of A.
regexp_replace(string A, string B, string C) -> string
    Returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
size(Map<K.V>) -> int
    Returns the number of elements in the map type.
size(Array<T>) -> int
    Returns the number of elements in the array type.
cast(<expr> as <type>) -> value of <type>
    Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
from_unixtime(int unixtime) -> string
    Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
to_date(string timestamp) -> string
    Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
year(string date) -> int
    Returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
month(string date) -> int
    Returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
day(string date) -> int
    Returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
get_json_object(string json_string, string path) -> string
    Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
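
A few of these functions in use against the employee table from the earlier CREATE TABLE example (salary is cast because it was declared there as a string):

SELECT upper(name),
       round(cast(salary AS DOUBLE)),
       concat(name, ' - ', destination),
       year('2015-06-01')              -- returns 2015
FROM employee
LIMIT 5;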
Creating View in Hive:
Hive also has the concept of a view. A view is not a physical copy of the data; it is a saved query that presents a simpler, flatter structure over a larger table. For instance, if you have a table with 100 columns but you are only interested in 5, you can create a view with just those 5 columns. The advantage of a view over a table is query simplification: the view normally has fewer columns than the original table, less complexity, and is flatter in nature.
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Exp:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
Dropping view:
DROP VIEW view_name

Creating an Index

An Index is nothing but a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Exp:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
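
Hive versions that support indexes also let you rebuild and drop them; a brief sketch using the index created above:

-- Populate or refresh the index (required when WITH DEFERRED REBUILD was used)
ALTER INDEX index_salary ON employee REBUILD;

-- Remove the index when it is no longer needed
DROP INDEX IF EXISTS index_salary ON employee;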


Select Query with all options:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition]
[ORDER BY col_list] 
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] 
[LIMIT number];
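
Putting several of these clauses together on the employee table (illustrative only):

SELECT destination, COUNT(*) AS headcount
FROM employee
GROUP BY destination
HAVING COUNT(*) > 1
ORDER BY headcount DESC
LIMIT 10;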

Hive Partitioning & Bucketing:
Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values. Generally, Hive users know about the domain of the data that they deal with. With this knowledge they can identify common columns that are frequently queried in order to identify columns with low cardinality which can be used to organize data using the partitioning feature of Hive. In non-partitioned tables, Hive would have to read all the files in a table’s data directory and subsequently apply filters on it. This is slow and expensive—especially in cases of large tables.
The concept of partitioning is not new for folks who are familiar with relational databases. Partitions are essentially horizontal slices of data which allow larger sets of data to be separated into more manageable chunks. In Hive, partitioning is supported for both managed and external tables in the table definition as seen below.

Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.
In Hive's implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value (or combination of values) of the partition column(s) and is stored as a subdirectory within the table's directory on HDFS. When the table is queried, only the required partitions are read where applicable, reducing the I/O and the time required by the query.

Tables or partitions can be further sub-divided into buckets to provide extra structure that may be used for more efficient querying. Bucketing works by hashing the value of a column of the table, as sketched below.
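
Here is a brief sketch of a bucketed table (the table and column names are hypothetical); rows are distributed into buckets by hashing the id column:

-- Older Hive versions need this so inserts honor the bucket definition
set hive.enforce.bucketing=true;

CREATE TABLE emp_bucketed (
    id   INT,
    name STRING,
    dept STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;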

For example, suppose a table named Tab1 contains employee data such as id, name, dept, and yoj (year of joining), and you need to retrieve the details of all employees who joined in 2012. Without partitioning, a query searches the whole table for the required information. If instead you partition the employee data by year, each year's data is stored in its own directory and the query reads only that portion, which reduces query processing time. The following example shows how to create a partitioned table:


CREATE TABLE REGISTRATION_DATA (
    userid             BIGINT,
    First_Name        STRING,
    Last_Name         STRING,
    address1           STRING,
    address2           STRING,
    city               STRING,
    zip_code           STRING,
    state              STRING
 
)
PARTITIONED BY (
    REGION             STRING,
    COUNTRY            STRING
);
As you can see, multi-column partitioning is supported (REGION/COUNTRY). You do not need to include the partition columns in the main column list of the table definition, yet you can still use them in your query projections. The PARTITIONED BY clause lets Hive alter the way it manages the underlying structures of the table's data directory. For a non-partitioned table, all the data files are written directly into the table's data directory. For partitioned tables, subdirectories are created under the table's data directory for each unique value of a partition column. If the table is partitioned on multiple columns, Hive creates nested subdirectories based on the order of the partition columns in the table definition. For instance, for the registration data table above, the subdirectories look like the example below:


/quaero.db/registration_data/region=South America/country=BR
/quaero.db/registration_data/region=South America/country=ME
/quaero.db/registration_data/region=North America/country=US
/quaero.db/registration_data/region=North America/country=CA
When a partitioned table is queried with one or both partition columns in the criteria or the WHERE clause, Hive effectively performs partition elimination by scanning only the data directories that are needed. If no partition columns are used, then all the directories are scanned (a full table scan) and partitioning has no effect.
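
For example, against the REGISTRATION_DATA table above, the following query reads only the region=North America/country=US subdirectory, whereas a filter on a non-partition column such as city would scan every partition:

-- Partition elimination: only one subdirectory is scanned
SELECT userid, city
FROM registration_data
WHERE region = 'North America' AND country = 'US';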
There are two types of partitioning in Hive:
  1. Static partitioning
  2. Dynamic partitioning

Static Partitioning: In static partitioning, we know the partition values in advance; the difference shows up when we load the data.


LOAD DATA LOCAL INPATH [path_name] OVERWRITE INTO TABLE [table_name] PARTITION(partition_column='value'....)
Here we have to give the partition column values explicitly whenever we want to create a new partition, as shown below:
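
For instance, a sketch loading one file per partition of the REGISTRATION_DATA table (the local file paths are made up):

-- Each load targets a single, explicitly named partition
LOAD DATA LOCAL INPATH '/home/training/reg_us.txt'
OVERWRITE INTO TABLE registration_data
PARTITION (region='North America', country='US');

LOAD DATA LOCAL INPATH '/home/training/reg_br.txt'
OVERWRITE INTO TABLE registration_data
PARTITION (region='South America', country='BR');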


DYNAMIC PARTITIONING: By default, dynamic partitioning is disabled (until Hive 2.0) to prevent the accidental creation of partitions. To use dynamic partitioning, we need to set the following four properties, either through the Hive shell or in hive-site.xml:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
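
With those settings in place, partitions can be created on the fly from the data itself, typically with an INSERT ... SELECT from an unpartitioned staging table (registration_stage is a hypothetical staging table holding the same columns plus region and country):

-- The partition columns (region, country) must come last in the SELECT list;
-- Hive creates a partition for every distinct (region, country) pair it encounters.
INSERT OVERWRITE TABLE registration_data
PARTITION (region, country)
SELECT userid, first_name, last_name, address1, address2,
       city, zip_code, state, region, country
FROM registration_stage;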

Note that loading data with LOAD DATA is an instantaneous file operation and does not trigger a MapReduce job (unlike INSERT ... SELECT). That is why the file keeps its original name, UserLog.txt, instead of a MapReduce output name such as 000000_0.
Pointers
A few things to keep in mind when using partitioning:
  • It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data and put strain on the name node to manage all the underlying structures in HDFS.
  • Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
  • Partitioning columns should be selected such that it results in roughly similar size partitions in order to prevent a single long running thread from holding up things.
  • If hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic.
  • If your partitioned table is very large, you could block any full table scan queries by putting Hive into strict mode using the set hive.mapred.mode=strict command. In this mode, when users submit a query that would result in a full table scan (i.e. queries without any partitioned columns) an error is issued.

Using Hive partitioning in the right context and on appropriate columns will help make a data management platform much more efficient.


Difference between Hive and Pig:

Hive                                  Pig
Used for data analysis                Used for data and programs
Works on structured data              Works on semi-structured data
Uses HiveQL                           Uses Pig Latin
Used for creating reports             Used for programming
Works on the server side              Works on the client side
Does not support Avro                 Supports Avro

