NOSQL-HBASE Database


HBase is a data model, similar to Google's Bigtable, designed to provide random access to huge amounts of structured data.
Hadoop can perform only batch processing, and data is accessed only in a sequential manner. This means one has to search the entire dataset even for the simplest job.
A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).

Databases such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB store huge amounts of data and allow the data to be accessed in a random manner.

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. It leverages the fault tolerance provided by the Hadoop file system (HDFS) and provides random, real-time read/write access to data stored in HDFS.

How HBASE/BigTable works:
  • Cluster Storage(i.e. HDFS or GFS)
  • Data is structured as a giant "table" in which each row has a (primary) key.
  • Key lookup is the only way to retrieve a row - no indexes are natively provided.
  • Data is bytes.
  • The Table's schema is defined by its name and the names of 'Column Families'.
  • All rows/cells are versioned and timestamped.
  • The Table is sorted, and partitioned horizontally.
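
The last two points can be pictured with a toy Python sketch (not HBase code): because the table is sorted by row key and partitioned horizontally, a row key can be routed to its region with a simple range lookup. The split points and region names below are made up for illustration.

```python
import bisect

# Toy sketch: a table sorted by row key is split horizontally into
# regions, each covering a contiguous key range. These split points
# and names are hypothetical.
region_starts = ["", "row400", "row800"]   # first region starts at the empty key
region_names = ["region-1", "region-2", "region-3"]

def locate_region(row_key):
    """Route a row key to the region whose start key precedes it."""
    idx = bisect.bisect_right(region_starts, row_key) - 1
    return region_names[idx]
```

Because the keys are sorted, routing is a binary search over the region start keys rather than a scan of the whole table.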



Internally, HBase stores data as sorted, indexed files (HFiles) on HDFS, which is what makes random lookups fast.


Storage Mechanism in HBase:

HBase is a column-oriented database and its tables are sorted by row key. The table schema defines only column families. A table can have multiple column families, and each column family can have any number of columns, stored as key-value pairs. Subsequent column values are stored contiguously on disk, and each cell value of the table has a timestamp. In short, in HBase:
  • Table is a collection of rows.
  • Row is a collection of column families.
  • Column family is a collection of columns.
  • Column is a collection of key value pairs.
  • Each column/cell has a timestamp.
 The following image shows column families in a column-oriented database:

Data is accessed and stored together by RowKey. Similar data is grouped and stored in column families.
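
The logical model described above can be sketched as a nested, versioned map in Python (a toy illustration, not HBase's real storage format); the table, family, column, and timestamp values are hypothetical.

```python
# Toy sketch of HBase's logical data model:
# row key -> column family -> column -> {timestamp: value}
table = {}

def put(row, family, column, value, ts):
    """Store a versioned cell value."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, {})
    cell[ts] = value

def get_latest(row, family, column):
    """Return the value with the highest timestamp (the newest version)."""
    versions = table[row][family][column]
    return versions[max(versions)]

put("row1", "CF1", "col1", "old", ts=1)
put("row1", "CF1", "col1", "new", ts=2)
```

This is why cells are "versioned and timestamped": a put never overwrites in place, it just adds a newer timestamped value.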




Key Features of HBase:
  • Near real-time speed: performs fast, random reads/writes.
  • Column families can be optimised differently for performance.
  • Keeps multiple versions of data.
  • No native join or secondary-index capability; this is rarely needed, as an HBase table can handle millions of columns.
  • Scalable: designed for massive scalability to handle nearly unlimited amounts of data (as it sits on top of Hadoop).
  • Flexibility: can store data of any type (structured / semi-structured / unstructured).
  • Reliability: automatic replication and built-in fault tolerance.
  • Automatic sharding and load balancing.
  • Ease of use: supports Java APIs, plus Thrift and REST gateway APIs.

Applications of HBase:

HBase is used for write-heavy applications, and whenever we need to provide fast random access to available data. Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.





HBASE Architecture:

HBase has three major components: the Client, the HBase Master, and the Region Servers.


    • Regions: A subset of a table's rows.
    • RegionServer(slave): Serves data for reads and writes.
    • Master: Coordinates the slaves.




















Region Server Components:
  • WAL: write-ahead log on disk (commit log), used for recovery.
  • BlockCache: read cache; least-recently-used blocks are evicted.
  • MemStore: write cache; holds sorted key-value updates.
  • HFile: sorted key-values on disk.
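
The BlockCache's least-recently-used eviction can be sketched with a small Python class (a toy model; the real BlockCache is more elaborate, with block priorities and size-based accounting).

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache: evicts the least-recently-used block when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)         # mark as recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
```

Frequently read blocks keep getting moved to the "recent" end, so cold blocks are the ones that fall out when capacity is exceeded.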


HBASE Read/Write Path:
WRITE:
  • Every write/update operation updates the in-memory data (MemStore) and appends to a local disk file (the WAL, or Write-Ahead Log).
  • Over a period of time, the data in the MemStore is flushed to HDFS, and the MemStore and WAL are cleared.

MemStore:

  • The MemStore is in-memory.
  • It keeps a sorted list of key->value updates.
  • In each region, there is one MemStore per column family.
  • Updates are quickly sorted in memory.

HBase Region Flush: when one MemStore is full:
  • All MemStores in the region are flushed to new HFiles on disk.
  • HFile: a sorted list of key->values on disk. Each flush writes one HFile per column family. These HFiles reside in HDFS.
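
The write path and flush described above can be condensed into a toy Python sketch (not HBase internals): every write hits both the WAL and the MemStore, and a flush turns the MemStore into a sorted, immutable file.

```python
# Toy write-path sketch with hypothetical in-memory stand-ins.
wal = []          # append-only commit log on disk, used for recovery
memstore = {}     # in-memory write cache: key -> value
hfiles = []       # each flush produces one immutable, sorted file

def write(key, value):
    wal.append((key, value))   # logged first, so a crash can be replayed
    memstore[key] = value

def flush():
    # MemStore contents go out as a sorted list of key->values (an "HFile"),
    # after which both the MemStore and the WAL can be cleared.
    hfiles.append(sorted(memstore.items()))
    memstore.clear()
    wal.clear()
```

The WAL is what makes the in-memory MemStore safe: if the region server dies before a flush, the log can be replayed to rebuild it.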

READ:
  • Every read operation first looks up the given data in the BlockCache (RAM lookup).
  • Next, the data is searched in the MemStore (memory).
  • Only if the data is not found in the BlockCache or MemStore is it searched for in the HFiles residing in HDFS.
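
The three-step read path can be sketched in Python (a toy model with hypothetical row keys): check the BlockCache, then the MemStore, then fall back to the HFiles.

```python
# Toy read-path sketch: fastest store first, disk last.
block_cache = {"row1": "cached"}       # recently read blocks (RAM)
memstore = {"row2": "in-memory"}       # recent writes not yet flushed (RAM)
hfiles = [{"row3": "on-disk"}]         # flushed, sorted files (HDFS)

def read(key):
    if key in block_cache:
        return block_cache[key]
    if key in memstore:
        return memstore[key]
    for hfile in hfiles:               # real HBase merges across HFiles
        if key in hfile:
            return hfile[key]
    return None
```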

HBase Use Cases: Here are 3 main use cases where HBase can be used.
  1. Capturing Incremental Data -- time-series data (stuff with a timestamp)
        -- Sensors, system metrics, events, log files
        -- High-volume, high-velocity writes
  2. Information Exchange, Messaging
        -- Email, chat, inbox: Facebook
        -- High-volume, high-velocity writes/reads
  3. Content Serving, Web Application Backend
        -- Online catalog: Gap, World Library Catalog
        -- Search index: eBay
        -- Online pre-computed views: Groupon, Pinterest
        -- High-volume, high-velocity reads
So we now know that HBase is the Hadoop database, providing random, real-time read/write access to very large data.

Let's learn some HBase shell commands now, which help with CRUD operations:

========================================================================
//Creating a table with column families CF1 & CF2
create 'myTestTable', {NAME=>'CF1'}, {NAME=>'CF2'}

//Check the list of tables
 list

//Inserting some data
 // put 'tablename', 'rowid', 'Column-family:Column-name', 'Value'
put 'myTestTable', 'row1', 'CF1:col1', 'R1CF1col1v1'
put 'myTestTable', 'row1', 'CF2:col1', 'R1CF2col1v1'

// If you want to get data based on particular rowid
   // Get all column for rowid :row1
     get 'myTestTable' , 'row1'
  //Get all column of column family CF1   
    get 'myTestTable' , 'row1' ,{COLUMNS=>['CF1']}
  //Get specific column of  column families
   get 'myTestTable' , 'row1' ,{COLUMNS=>['CF1:col1','CF2:col1']}
  //Get all column of column family CF1  & CF2
   get 'myTestTable' , 'row1' ,{COLUMNS=>['CF1','CF2']}

// If you want all records not specific to any rowid
   //Get all records 
      scan 'myTestTable'
 //Get all records with all columns of column-family CF1
    scan 'myTestTable',{COLUMNS=>['CF1']}
 //Get all records with  specific columns of column-family CF1
    scan 'myTestTable',{COLUMNS=>['CF1:col1']}
//Get all records with all columns of column-family CF1 where rowid starts from 'row2' (STARTROW is inclusive)
    scan 'myTestTable',{COLUMNS=>['CF1'], STARTROW=>'row2'}
//Get all records with all columns of column-family CF1 where rowid is before 'row3' (STOPROW is exclusive)
   scan 'myTestTable',{COLUMNS=>['CF1'], STOPROW=>'row3'}
//Get all records with all columns of column-family CF1 from rowid 'row2' (inclusive) up to 'row3' (exclusive)
  scan 'myTestTable',{COLUMNS=>['CF1'], STARTROW=>'row2', STOPROW=>'row3'}
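
The range semantics of these scans can be illustrated with a toy Python sketch (not HBase code): rows live in lexicographic order, STARTROW is inclusive, and STOPROW is exclusive. The row names are hypothetical.

```python
# Toy scan sketch over rows already sorted by row key.
rows = ["row1", "row2", "row3", "row4"]

def scan(startrow=None, stoprow=None):
    """Return row keys in [startrow, stoprow) -- start inclusive, stop exclusive."""
    result = []
    for r in rows:
        if startrow is not None and r < startrow:
            continue
        if stoprow is not None and r >= stoprow:
            break                      # sorted order lets us stop early
        result.append(r)
    return result
```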

//Get details of Tables
   describe 'myTestTable'

//Modify a property of the table
   //change the version setting
     alter 'myTestTable', NAME=>'CF1', VERSIONS=>5
 //Get records for a specific rowid with as many versions as you need
   get 'myTestTable', 'row1', {COLUMNS=>'CF1:col1', VERSIONS=>5}

//Count the records of tables
   count 'myTestTable'

//Delete specific columns 
  delete 'myTestTable' ,'row1' ,'CF1:col1'
//Delete entire row
  deleteall 'myTestTable','row1'

//To delete a table : first disable it and then delete it.
 disable 'myTestTable'
 drop 'myTestTable'

------------------------------------------------------------------------------------------------------------------


Let's do a small assignment to practice these HBase commands
========================================================================
Problem Assignment:
Create a table called dataScienceBooks whose schema is Book Title, Description, and Author. The book's title and description should be grouped so that they are saved/retrieved together.
Here is Dataset:

ID | Title                                                                 | Description                                             | Author
---|-----------------------------------------------------------------------|---------------------------------------------------------|--------------
1  | Elements of Statistical Learning                                      | Basics of statistics and machine learning               | Trevor Hastie
2  | The Signal and the Noise: Why So Many Predictions Fail-but Some Don't | Distinguish a true signal from a universe of noisy data | Nate Silver
3  | Machine Learning for Hackers                                          | Practical manual for learning machine learning          | Drew Conway
                         :: Let's create HBase statements to solve the requests given below ::
1) Create a table dataScienceBooks with schema shown above
create 'dataScienceBooks', {NAME=>'info'}, {NAME=>'author'}
2) Fill the table with data as given in table above
put 'dataScienceBooks', '1', 'info:title', 'Elements of Statistical Learning'
put 'dataScienceBooks', '1', 'info:description', 'Basics of statistics and machine learning'
put 'dataScienceBooks', '1', 'author:name', 'Trevor Hastie'
put 'dataScienceBooks', '2', 'info:title', "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't"
put 'dataScienceBooks', '2', 'info:description', 'Distinguish a true signal from a universe of noisy data'
put 'dataScienceBooks', '2', 'author:name', 'Nate Silver'
put 'dataScienceBooks', '3', 'info:title', 'Machine Learning for Hackers'
put 'dataScienceBooks', '3', 'info:description', 'Practical manual for learning machine learning'
put 'dataScienceBooks', '3', 'author:name', 'Drew Conway'
3) Count the number of rows. Make sure that every row is printed to the screen as it is being counted.
count 'dataScienceBooks', INTERVAL => 1

4) Retrieve an entire record with ID 1
get 'dataScienceBooks', '1'

5) Only retrieve title and description for record with ID 3.
get 'dataScienceBooks', '3', {COLUMNS => ['info:title', 'info:description']}

6) Display all the records to the screen.
scan 'dataScienceBooks'

7) Display title and author's name for all the records.
scan 'dataScienceBooks', {COLUMNS => ['info:title', 'author:name']}

8) Display title and description for the first 2 records.
scan 'dataScienceBooks', {COLUMNS=>['info:title','info:description'], LIMIT => 2}

9) Drop the table
disable 'dataScienceBooks'
drop 'dataScienceBooks'
