NoSQL - HBase Database
HBase is a data model similar to Google's Bigtable, designed to provide random access to huge amounts of structured data. Hadoop can perform only batch processing, and data is accessed only in a sequential manner. This means the entire dataset must be searched even for the simplest job.
A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially. At this point a new solution is needed: one that can access any point of the data in a single unit of time (random access).
Database applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB store huge amounts of data and access the data in a random manner.
HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. It leverages the fault tolerance provided by the Hadoop file system and provides random, real-time read/write access to the data stored there.
How HBase/BigTable works:
- Cluster storage (i.e., HDFS or GFS).
- Data is structured as a giant "table" in which each row has a (primary) key.
- Key lookup is the only way to retrieve a row; no indexes are provided natively.
- Data is just bytes.
- A table's schema is defined by its name and the names of its column families.
- All rows/cells are versioned and timestamped.
- The table is sorted, and partitioned horizontally.
Internally, HBase uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups.
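The "sorted and partitioned horizontally" point above is what makes key lookups cheap: each region serves a contiguous, sorted range of row keys, so finding the right region is a binary search over region start keys. Here is a toy Python sketch of that idea (not HBase code; the region boundaries are made up):

```python
import bisect

# Hypothetical region start keys, kept sorted. A region serves all rows
# from its start key (inclusive) up to the next region's start key.
region_starts = ["", "row400", "row800"]  # "" marks the first region

def region_for(row_key):
    """Return the index of the region that serves row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the serving region is the one just before that position.
    return bisect.bisect_right(region_starts, row_key) - 1

print(region_for("row123"))  # 0
print(region_for("row555"))  # 1
print(region_for("row999"))  # 2
```

Because the ranges are sorted, this lookup stays logarithmic no matter how many regions the table is split into.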
Storage Mechanism in HBase:
HBase is a column-oriented database, and its tables are sorted by row. The table schema defines only column families, which contain key-value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk, and each cell value of the table has a timestamp. In short, in HBase:
- A table is a collection of rows.
- A row is a collection of column families.
- A column family is a collection of columns.
- A column is a collection of key-value pairs.
- Each column/cell has a timestamp.
Data is stored and accessed together by row key, and similar data is grouped and stored in column families.
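The nesting above (table → row → column family → column → timestamped versions) can be sketched as a toy Python model. This is an illustration only, not HBase code, and all names and values are made up:

```python
# A "table" maps row key -> column family -> column -> a list of
# (timestamp, value) versions for that cell.
table = {}

def put(row, family, column, value, ts):
    """Add a new timestamped version to one cell."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
    cell.append((ts, value))

def get(row, family, column):
    """Return the value with the newest timestamp, like a default HBase get."""
    return max(table[row][family][column])[1]

put("row1", "CF1", "col1", "v1", ts=1)
put("row1", "CF1", "col1", "v2", ts=2)     # a newer version of the same cell
put("row1", "CF2", "col1", "other", ts=1)  # same row, different column family
print(get("row1", "CF1", "col1"))  # v2
```

Note how both versions of the cell survive the second put; a plain get simply returns the newest one.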
Key features of HBase:
- Near-real-time speed: performs fast, random reads and writes.
- Column families can be optimized differently for performance.
- Keeps versions of data.
- It has no native join or index capability, and this is not needed, as HBase can handle millions of columns.
- Scalable: designed for massive scalability to handle nearly unlimited amounts of data (as it sits on top of the Hadoop system).
- Flexible: can store data of any type (structured, semi-structured, or unstructured).
- Reliable: automatic replication and built-in fault tolerance.
- Automatic sharding and load balancing.
- Easy to use: supports Java APIs, plus Thrift and REST gateway APIs.
Applications of HBase:
HBase is used for write-heavy applications, and whenever we need fast random access to available data. Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase Architecture:
HBase has three major components: the Client, the HBase Master, and the Region Servers.
- Regions: a subset of a table's rows.
- RegionServer (slave): serves data for reads and writes.
- Master: coordinates the slaves.
Region Server components:
- WAL: the write-ahead log on disk (commit log), used for recovery.
- BlockCache: the read cache; least recently used entries are evicted.
- MemStore: the write cache; holds sorted KeyValue updates.
- HFile: sorted KeyValues on disk.
HBase Read/Write Path:
WRITE:
- Every write/update operation updates the in-memory data (MemStore) and a local on-disk file (the WAL, or write-ahead log).
- Over time, the data in the MemStore is flushed to HDFS, and the MemStore and WAL are cleared.
MemStore:
- The MemStore lives in memory.
- It keeps a sorted list of key-to-value updates.
- In each region, there is one MemStore per column family.
- Updates are sorted quickly in memory.
HBase Region Flush: when one MemStore is full:
- All MemStores in the region are flushed to new HFiles on disk.
- An HFile is a sorted list of key-value pairs on disk, with one HFile per column family. These HFiles reside in HDFS.
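The write path and region flush described above can be sketched in a few lines of Python. This is a toy model, not HBase code: the flush threshold here is a made-up count, whereas real flushes are triggered by MemStore size:

```python
FLUSH_LIMIT = 3  # hypothetical threshold; real HBase flushes by size

wal = []        # write-ahead log (would live on disk, for recovery)
memstore = {}   # in-memory write cache, key -> value
hfiles = []     # each flush produces one sorted, immutable "file"

def write(key, value):
    wal.append((key, value))          # 1. persist to the WAL for recovery
    memstore[key] = value             # 2. update the in-memory MemStore
    if len(memstore) >= FLUSH_LIMIT:  # 3. flush when the MemStore fills up
        hfiles.append(sorted(memstore.items()))  # sorted key-values on "disk"
        memstore.clear()
        wal.clear()                   # flushed data no longer needs the WAL

write("b", "1"); write("a", "2"); write("c", "3")
print(hfiles)    # [[('a', '2'), ('b', '1'), ('c', '3')]]
print(memstore)  # {} - cleared after the flush
```

Note that the flushed file comes out sorted by key even though the writes arrived out of order; that sortedness is what makes the later HFile lookups efficient.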
READ:
- Every read operation first looks up the requested data in the BlockCache (a RAM lookup).
- Next, the data is searched in the MemStore (memory).
- Only if the data is found in neither the BlockCache nor the MemStore is it searched for in the HFiles residing in HDFS.
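The three-step read path above can be sketched as a simple fallback chain in Python. This is a toy illustration with made-up data (a real read also has to merge versions across the MemStore and multiple HFiles):

```python
blockcache = {"k1": "cached"}     # 1. read cache in RAM
memstore = {"k2": "in-memory"}    # 2. recent, not-yet-flushed writes
hfiles = [{"k3": "on-disk"}]      # 3. sorted files in HDFS

def read(key):
    if key in blockcache:                 # RAM lookup first
        return blockcache[key]
    if key in memstore:                   # then the write cache
        return memstore[key]
    for hfile in hfiles:                  # finally, fall back to "disk"
        if key in hfile:
            blockcache[key] = hfile[key]  # warm the cache for next time
            return hfile[key]
    return None

print(read("k3"))  # on-disk (and "k3" is now in the BlockCache)
```

The cache-warming step at the end is why repeated reads of hot rows stay fast: the second read of "k3" never touches the HFiles.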
HBase Use Cases: Here are three main use cases where HBase fits well.
- Capturing incremental data: time-series data (anything with a timestamp) such as sensor readings, system metrics, events, and log files. High-volume, high-velocity writes.
- Information exchange / messaging: email, chat, inbox (e.g., Facebook). High-volume, high-velocity writes and reads.
- Content serving / web application backends: online catalogs (Gap, World Library Catalog), search indexes (eBay), online pre-computed views (Groupon, Pinterest). High-volume, high-velocity reads.
So we now know that HBase is the Hadoop database, providing random, real-time read/write access to very large data sets.
Let's learn some HBase shell commands that help with CRUD operations:
========================================================================
# Create a table with column families CF1 and CF2
create 'myTestTable', {NAME=>'CF1'}, {NAME=>'CF2'}
# Check the list of tables
list
# Insert some data
# put 'table-name', 'row-id', 'column-family:column-name', 'value'
put 'myTestTable', 'row1', 'CF1:col1', 'R1CF1col1v1'
put 'myTestTable', 'row1', 'CF2:col1', 'R1CF2col1v1'
# To fetch data for a particular row id:
# Get all columns for row id 'row1'
get 'myTestTable', 'row1'
# Get all columns of column family CF1
get 'myTestTable', 'row1', {COLUMNS=>['CF1']}
# Get specific columns of the column families
get 'myTestTable', 'row1', {COLUMNS=>['CF1:col1','CF2:col1']}
# Get all columns of column families CF1 and CF2
get 'myTestTable', 'row1', {COLUMNS=>['CF1','CF2']}
# To fetch all records, not specific to any row id:
# Get all records
scan 'myTestTable'
# Get all records with all columns of column family CF1
scan 'myTestTable', {COLUMNS=>['CF1']}
# Get all records with specific columns of column family CF1
scan 'myTestTable', {COLUMNS=>['CF1:col1']}
# Get all records with all columns of column family CF1, starting from row id 'row2'
scan 'myTestTable', {COLUMNS=>['CF1'], STARTROW=>'row2'}
# Get all records with all columns of column family CF1, up to (but excluding) row id 'row3'
scan 'myTestTable', {COLUMNS=>['CF1'], STOPROW=>'row3'}
# Combine both bounds: from 'row2' up to (but excluding) 'row3'
scan 'myTestTable', {COLUMNS=>['CF1'], STARTROW=>'row2', STOPROW=>'row3'}
# Get the details of a table
describe 'myTestTable'
# Modify a table property:
# change the version setting so CF1 keeps up to 5 versions per cell
alter 'myTestTable', NAME=>'CF1', VERSIONS=>5
# Get records for a specific row id, with as many versions as you need
get 'myTestTable', 'row1', {COLUMNS=>'CF1:col1', VERSIONS=>5}
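The retention behavior behind the VERSIONS setting above can be illustrated with a small Python sketch. This is toy code only; in real HBase, excess old versions are physically removed at flush/compaction time rather than on every write:

```python
MAX_VERSIONS = 5  # mirrors VERSIONS=>5 in the alter statement above

cell = []  # (timestamp, value) pairs for one cell, newest first

def put_version(ts, value):
    cell.insert(0, (ts, value))
    del cell[MAX_VERSIONS:]  # keep only the newest MAX_VERSIONS entries

for ts in range(1, 8):         # write 7 versions of the same cell
    put_version(ts, "v%d" % ts)

print(cell[0])    # (7, 'v7') - a plain get returns the newest version
print(len(cell))  # 5 - only the last five versions are retained
```

A get with VERSIONS=>5 would return this whole retained list, newest first; a plain get returns just the head.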
# Count the records in a table
count 'myTestTable'
# Delete a specific column
delete 'myTestTable', 'row1', 'CF1:col1'
# Delete an entire row
deleteall 'myTestTable', 'row1'
# To delete a table, first disable it and then drop it
disable 'myTestTable'
drop 'myTestTable'
------------------------------------------------------------------------------------------------------------------
Let's try a small assignment to practice these HBase commands.
========================================================================
Problem Assignment:
Create a table called dataScienceBooks whose schema is book title, description, and author. The book's title and description should be grouped so that they are saved/retrieved together.
Here is the dataset:

| ID | Title | Description | Author |
|----|-------|-------------|--------|
| 1 | Elements of Statistical Learning | Basics of statistics and machine learning | Trevor Hastie |
| 2 | The Signal and the Noise: Why So Many Predictions Fail-but Some Don't | Distinguish a true signal from a universe of noisy data | Nate Silver |
| 3 | Machine Learning for Hackers | Practical manual for learning machine learning | Drew Conway |
Let's write HBase statements to solve the requests below:
1) Create the table dataScienceBooks with the schema shown above.
create 'dataScienceBooks', {NAME=>'info'}, {NAME=>'author'}
2) Fill the table with the data given in the table above.
put 'dataScienceBooks', '1', 'info:title', 'Elements of Statistical Learning'
put 'dataScienceBooks', '1', 'info:description', 'Basics of statistics and machine learning'
put 'dataScienceBooks', '1', 'author:name', 'Trevor Hastie'
put 'dataScienceBooks', '2', 'info:title', "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't"
put 'dataScienceBooks', '2', 'info:description', 'Distinguish a true signal from a universe of noisy data'
put 'dataScienceBooks', '2', 'author:name', 'Nate Silver'
put 'dataScienceBooks', '3', 'info:title', 'Machine Learning for Hackers'
put 'dataScienceBooks', '3', 'info:description', 'Practical manual for learning machine learning'
put 'dataScienceBooks', '3', 'author:name', 'Drew Conway'
3) Count the number of rows. Make sure that every row is printed to the screen as it is being counted.
count 'dataScienceBooks', INTERVAL => 1
4) Retrieve an entire record with ID 1
get 'dataScienceBooks', '1'
5) Only retrieve title and description for record with ID 3.
get 'dataScienceBooks', '3', {COLUMNS => ['info:title', 'info:description']}
6) Display all the records to the screen.
scan 'dataScienceBooks'
7) Display title and author's name for all the records.
scan 'dataScienceBooks', {COLUMNS => ['info:title', 'author:name']}
8) Display title and description for the first 2 records.
scan 'dataScienceBooks', {COLUMNS=>['info:title','info:description'], LIMIT => 2}
9) Drop the table
disable 'dataScienceBooks'
drop 'dataScienceBooks'