Flume

Generally, most of the data to be analyzed is produced by various data sources such as application servers, social networking sites, cloud servers, and enterprise servers. This data arrives in the form of log files and events.


Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.


Advantages of Apache Flume:
There are several advantages of Apache Flume that make it a better choice over other tools:


  • Flume is scalable, reliable, fault-tolerant, and customizable for different sources and sinks.
  • Apache Flume can deliver data to centralized stores (i.e., data is collected into a single store) such as HBase and HDFS.
  • Flume is horizontally scalable.
  • When the rate at which data is read from sources exceeds the rate at which it can be written to the destination, Flume mediates between the two and provides a steady flow of data.
  • Flume provides reliable message delivery. The transactions in Flume are channel-based, where two transactions (one at the sender and one at the receiver) are maintained for each message.
  • Using Flume, we can ingest data from multiple servers into Hadoop.
  • It gives us a reliable and distributed solution that helps in collecting, aggregating, and moving large data sets from sources such as Facebook, Twitter, and e-commerce websites.
  • It helps us ingest online streaming data from various sources such as network traffic, social media, email messages, and log files into HDFS.
  • It supports a large set of source and destination types.
Flume Architecture
In Flume's basic architecture, data generators (such as Facebook and Twitter) produce data that is collected by individual Flume agents running on them. A data collector (which is also an agent) then gathers the data from these agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.
Event:
An event is the basic unit of data transported by Flume from source to destination.
  • The payload is opaque to Flume.
  • Events are accompanied by optional headers.
Headers:
  • Headers are collections of unique key-value pairs.
  • Headers are used for contextual routing.
A typical Flume event has the following structure −
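Conceptually, an event is an optional map of string headers attached to an opaque byte-array body. As a minimal sketch, the Java snippet below builds such an event with Flume's EventBuilder API (the header keys and the sample log line are illustrative, not required by Flume):

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSketch {
    public static void main(String[] args) {
        // Optional headers: plain key-value pairs, used for contextual routing.
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        headers.put("host", "webserver-01"); // illustrative value

        // The body (payload) is an opaque byte array; Flume never inspects it.
        Event event = EventBuilder.withBody(
                "127.0.0.1 - GET /index.html 200", // a sample log line
                StandardCharsets.UTF_8,
                headers);

        System.out.println(event.getHeaders() + " | body: "
                + event.getBody().length + " bytes");
    }
}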
Client:
A client is an entity that generates events and passes them on to one or more agents.
Agent:
An agent is an independent daemon process (JVM) in Flume. It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may contain more than one agent.
A Flume agent has three main components: the source, the channel, and the sink.
The source is the component of an agent that receives data from the data generators and transfers it to one or more channels in the form of Flume events.
Apache Flume supports several types of sources and each source receives events from a specified data generator.
Example − Avro source, Thrift source, Twitter 1% source, etc.
The channel is a transient store that receives events from the source and buffers them until they are consumed by sinks. It acts as a bridge between the sources and the sinks.
These channels are fully transactional and they can work with any number of sources and sinks.
Example − JDBC channel, File system channel, Memory channel, etc.
The sink stores the data in centralized stores such as HBase and HDFS. It consumes the data (events) from the channels and delivers it to the destination. The destination of the sink may be another agent or a central store.
Example − HDFS sink
Note − A Flume agent can have multiple sources, sinks, and channels. The full list of supported sources, sinks, and channels is given in the Flume user guide.
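To make the source-channel-sink wiring concrete, here is a minimal single-agent configuration sketch, adapted from the standard example in the Flume user guide: a netcat source feeds a logger sink through a memory channel (the agent name a1 and the port are arbitrary).

# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: logs events at INFO level (useful for testing)
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Saved as conf/example.conf, this agent could be started with bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1, after which lines typed into telnet localhost 44444 would appear in the agent's log.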


Let's get some hands-on experience by fetching data from Twitter using Flume.
To fetch Twitter data, we will have to follow the steps given below −
    • Create a Twitter application
    • Install / start HDFS (a command sketch follows this list)
    • Configure Flume
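For the HDFS step, assuming a standard Hadoop installation whose bin and sbin directories are on the PATH, a rough sketch of the commands is:

# Start the NameNode and DataNodes
start-dfs.sh

# Create a target directory for the tweets (the path is just an example)
hdfs dfs -mkdir -p /user/flume/twitter_data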

Creating a Twitter Application:
In order to get tweets from Twitter, you need to create a Twitter application.
  • To create a Twitter application, open the following link: https://apps.twitter.com/
  • Click on the Create New App button.
  • Fill in the details, accept the Developer Agreement, and when finished click on the Create your Twitter application button.
  • Under the Keys and Access Tokens tab, at the bottom of the page, you can see a button named Create my access token. Click on it to generate the access token.
  • Finally, click on the Test OAuth button at the top right of the page. This leads to a page that displays your Consumer key, Consumer secret, Access token, and Access token secret. Copy these details; they are needed to configure the agent in Flume.

Configuring Flume:
We have to configure the source, the channel, and the sink using a configuration file placed in Flume's conf folder.
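As a sketch of what that file might look like, the configuration below (say, conf/twitter.conf) wires Flume's bundled experimental TwitterSource to an HDFS sink through a memory channel. The four OAuth values are the ones copied from the Twitter application above; the HDFS path, capacities, and roll settings are example values to adapt to your cluster.

# Name the components of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Source: Flume's experimental Twitter source (fill in your own credentials)
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>

# Channel: buffer events in memory between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Sink: write the incoming tweets into HDFS (the path is an example)
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Wire the source and the sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

The agent is then launched with Flume's flume-ng script; the --name argument must match the prefix used in the file:

bin/flume-ng agent --conf conf --conf-file conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console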
