NoSQL Databases

Over the last few years we have seen the rise of a new type of database, known as NoSQL, that is challenging the dominance of relational databases. Relational databases have dominated the software industry for a long time, providing persistent data storage, concurrency control, transactions, mostly standard interfaces, and mechanisms for integrating application data and reporting. That dominance, however, is cracking. Relational databases were never built for distributed applications. With data volumes now growing into the petabyte range, a traditional RDBMS struggles to keep up. An RDBMS also handles only well-structured data well.

Databases can be divided into 3 types:
  • RDBMS (Relational Database Management System)
  • OLAP (Online Analytical Processing)
  • NoSQL (a more recently developed family of databases)

Relational Database Schema (Limitations in Handling Big Data):

  • You can't add a record that does not fit the schema.
  • You need to store nulls for unused items in a row (wasted storage).
  • You must respect the data types, i.e. you can't add a string to an integer field.
  • You can't add multiple items to a single field (you have to create another table with a primary key/foreign key relationship, joins, normalisation, and so on).
  • It is a poor fit for a distributed model: if the data is spread over a network, joins and other cross-node operations become very costly and performance suffers badly.
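
The first two limitations can be seen in a small sketch using Python's built-in sqlite3 module. The `users` table and its columns are made up for illustration; other relational databases report the error differently, but the behaviour is the same in spirit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT, fax TEXT)"
)

# A row that leaves optional columns unused still carries NULLs for them.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Asha')")
row = conn.execute("SELECT id, name, phone, fax FROM users WHERE id = 1").fetchone()
print(row)  # (1, 'Asha', None, None)

# A record with a field the schema does not know about is rejected.
try:
    conn.execute("INSERT INTO users (id, name, twitter) VALUES (2, 'Ravi', '@ravi')")
    rejected = False
except sqlite3.OperationalError as exc:
    rejected = True
    print("rejected:", exc)
```

Storing several phone numbers per user would need a second table and a join, which is the fourth limitation in the list above.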

NoSQL means Not Only SQL, implying that when designing a software solution or product, there is more than one storage mechanism that could be used, depending on the needs. The name began as a hashtag (#nosql) chosen for a meetup to discuss these new databases. The most important result of the rise of NoSQL is Polyglot Persistence: using different data stores for different kinds of data within the same application. NoSQL does not have a prescriptive definition, but we can make a set of common observations, such as:
  • Not using the relational model
  • Running well on clusters
  • Mostly open-source
  • Built for the 21st century web estates
  • Schema-less

NoSQL provides a mechanism for storing and retrieving data that is modelled in ways other than the tabular relations used in relational databases. NoSQL databases generally do not store data in relational tables. They are typically used for big data and real-time web applications. NoSQL databases run in a distributed fashion, can handle large amounts of data that grow day by day without a major impact on performance, and can handle semi-structured data.

Schema in NoSQL Databases:

  • There is no fixed schema to consider.
  • There are no unused cells (no wasted storage).
  • There are no declared data types; a value's type is implicit in the value itself.
  • Most checks (data validation and logical checks) are done in the application layer.
  • We gather all related items in an aggregate (document).
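
As a rough illustration of these points, the sketch below uses plain Python dictionaries to stand in for documents. The `users` collection and the `validate_user` helper are hypothetical names, not part of any particular database API.

```python
# Two "documents" in the same collection with different sets of fields.
users = [
    {"_id": 1, "name": "Asha", "phones": ["555-0100", "555-0101"]},
    {"_id": 2, "name": "Ravi", "twitter": "@ravi"},
]


def validate_user(doc):
    """Application-layer check, since the store itself enforces no schema."""
    errors = []
    if not isinstance(doc.get("name"), str) or not doc.get("name"):
        errors.append("name must be a non-empty string")
    if not all(isinstance(p, str) for p in doc.get("phones", [])):
        errors.append("phones must be a list of strings")
    return errors


for doc in users:
    print(doc["_id"], validate_user(doc))
```

Neither document stores placeholders for fields it does not use, and the list of phone numbers lives inside the document rather than in a separate table.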

One of the most fundamental choices to make when developing an application is whether to use a SQL or NoSQL database to store the data. Conventional SQL (i.e. relational) databases are the product of decades of technology evolution, good practice, and real-world stress testing. They are designed for reliable transactions and ad hoc queries, the staples of line of business applications. But they also come burdened with restrictions—such as rigid schema—that make them less suitable for other kinds of apps.

NoSQL databases arose in response to those limitations. NoSQL systems store and manage data in ways that allow for high operational speed and great flexibility on the part of the developers. Many were developed by companies like Google, Amazon, Yahoo, and Facebook that sought better ways to store content or process data for massive websites. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.

NoSQL databases provide a few things:
  1. Easy and frequent changes to the database
  2. Horizontal scaling
  3. A solution to the impedance mismatch
  4. Fast development

But NoSQL databases avoid a few things:
  1. The overhead of ACID transactions
  2. The complexity of SQL queries
  3. The burden of up-front schema design

Aggregation:
  • Related data is kept together and stored as a single unit.
  • A transaction should not cross aggregate boundaries. Because all the related data is stored in one record rather than in multiple tables, you should be working with a single record at a time instead of updating several records together.
  • It minimises join operations, as the data is stored as a complete unit in one record.
  • This helps in scaling out.
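
A minimal sketch of an aggregate, assuming a hypothetical order record with made-up field names. In a relational design the customer, the order, and the line items would typically live in separate tables.

```python
# One aggregate: the order, its customer details, and its line items together.
order = {
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Pune"},
    "items": [
        {"sku": "BOOK-1", "qty": 2, "price": 250},
        {"sku": "PEN-7", "qty": 10, "price": 10},
    ],
}


def order_total(order_doc):
    """Computed from a single record; no join is needed."""
    return sum(item["qty"] * item["price"] for item in order_doc["items"])


def add_item(order_doc, sku, qty, price):
    """The update stays inside one aggregate boundary."""
    order_doc["items"].append({"sku": sku, "qty": qty, "price": price})


print(order_total(order))  # 600
add_item(order, "INK-2", 1, 50)
print(order_total(order))  # 650
```

Because everything needed to answer the query sits in one record, the record can be placed on any single node of a cluster.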


There are more than 150 NoSQL databases (nosql-database.org). With NoSQL, data can be stored in a schema-less or free-form fashion. Any data can be stored in any record. Among the NoSQL databases, you will find four common models for storing data, which lead to four common types of NoSQL systems:





Document databases (e.g. CouchDB, MongoDB). Inserted data is stored in the form of free-form JSON structures or “documents,” where the data could be anything from integers to strings to free form text. There is no inherent need to specify what fields, if any, a document will contain.

Key-value stores (e.g. Redis, Riak). Free-form values—from simple integers or strings to complex JSON documents—are accessed in the database by way of keys.

Wide column stores (e.g. HBase, Cassandra). Data is stored in columns instead of rows as in a conventional SQL system. Any number of columns (and therefore many different types of data) can be grouped or aggregated as needed for queries or data views.

Graph databases (e.g. Neo4j). Data is represented as a network or graph of entities and their relationships, with each node in the graph a free-form chunk of data.
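
To make the four models more concrete, here is a simplified sketch of how the same small piece of data might be shaped in each of them, using only plain Python structures. The keys, the column family names, and the `followed_by` helper are assumptions for illustration; real systems have their own storage formats and query interfaces.

```python
import json

# Document: a free-form, possibly nested record.
document = {"_id": "u1", "name": "Asha", "follows": ["u2"]}

# Key-value: the store only knows the key; the value is opaque to it.
kv_store = {"user:u1": json.dumps(document)}

# Wide column: row key -> column family -> column name -> value.
wide_row = {"u1": {"info": {"name": "Asha"}, "follows": {"u2": "1"}}}

# Graph: nodes plus explicit relationships between them.
nodes = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
edges = [("u1", "FOLLOWS", "u2")]


def followed_by(user_id):
    """Walk the FOLLOWS edges that start at user_id."""
    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if src == user_id and rel == "FOLLOWS"
    ]


print(json.loads(kv_store["user:u1"])["name"])  # Asha
print(followed_by("u1"))  # ['Ravi']
```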


Each NoSQL database tends to have its own syntax for querying and managing the data. CouchDB, for instance, uses requests in the form of JSON, sent via HTTP, to create or retrieve documents from its database. MongoDB sends JSON-like (BSON) documents over a binary protocol, by way of a command-line interface or a language library.

These NoSQL databases are designed around the trade-offs described by the CAP theorem. Let's look at what the CAP theorem says.

CAP Theorem:

The CAP theorem states that there are 3 basic requirements that exist in a special relation when designing applications for a distributed architecture: Consistency (every read sees the most recent write), Availability (every request receives a response), and Partition tolerance (the system keeps working when the network between nodes is split).

In theory, it is impossible to fulfil all 3 requirements at the same time. Therefore, current NoSQL databases follow different combinations of C, A, and P from the CAP theorem. A combination of 2 must be chosen, and this is usually the deciding factor in what technology is used.

When it comes to distributed databases, the two choices are only AP or CP because if it's not partition tolerant, it's not really a reliable distributed database. So the choice is simpler: if a network split happens, do you want the database to keep answering but with possibly old/bad data (AP)? Or should it just stop responding unless you can get the absolute latest copy (CP)?
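
The AP versus CP choice can be sketched with a toy replica. This is only an illustration of the trade-off under stated assumptions, not how any real database implements it; the `Replica` class, its `mode` flag, and the `Unavailable` exception are all hypothetical.

```python
class Unavailable(Exception):
    """Raised by a CP-style replica that cannot confirm it has the latest data."""


class Replica:
    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP"
        self.value = "v1"         # last value this replica has seen
        self.partitioned = False  # True when cut off from the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning old data.
            raise Unavailable("cannot reach the other replicas")
        # AP: answer anyway, possibly with stale data.
        return self.value


def demo(mode):
    replica = Replica(mode)
    replica.partitioned = True  # network split; a newer "v2" may exist elsewhere
    try:
        return replica.read()
    except Unavailable:
        return "unavailable"


print(demo("AP"))  # v1
print(demo("CP"))  # unavailable
```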

NoSQL databases generally follow the BASE model rather than ACID. A BASE system gives up strong, immediate consistency in exchange for availability.

  • Basically available indicates that the system does guarantee availability, in terms of the CAP theorem.
  • Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual consistency indicates that the system will become consistent over time, given that the system doesn't receive input during that time.
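
Eventual consistency can be illustrated with a toy in-memory store in which a write reaches one replica first and the others only after a later sync step. The `EventuallyConsistentStore` class and its explicit `sync` method are assumptions for this sketch; real systems replicate in the background with far more machinery.

```python
class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Acknowledge the write as soon as the first replica has it.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def sync(self):
        # Replay the pending writes, in order, on the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart", ["book"])
stale = store.read("cart", replica=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart", replica=2)  # all replicas now agree
print(stale, fresh)  # None ['book']
```
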
Another term used a lot with NoSQL is sharding. Let's see what sharding is.

What is sharding:

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
In the simplest sense, sharding your database involves breaking up your big database into many much smaller databases that share nothing and can be spread across multiple servers. Technically, sharding is a synonym for horizontal partitioning. In practice, the term is often used to refer to any database partitioning that is meant to make a very large database more manageable.
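
One simple way to decide which shard a record belongs to is to hash its key and take the result modulo the number of shards. The sketch below assumes four in-memory dictionaries standing in for four servers; `shard_for`, `put`, and `get` are hypothetical helper names. Production systems usually use more elaborate schemes such as range-based or consistent hashing.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Four independent dictionaries stand in for four database servers.
shards = [{} for _ in range(4)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


for user in ["asha", "ravi", "meena", "john", "li"]:
    put(user, {"name": user})

print(get("meena"))
print([len(s) for s in shards])  # how the five keys were spread out
```

A stable hash such as MD5 is used here instead of Python's built-in `hash()`, which is randomised per process for strings and so would route the same key to different shards on different runs.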

In subsequent pages, we will learn MongoDB, HBase...
