PIG


Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

In a nutshell, we can say that Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. The language used in Pig is called Pig Latin. It is very similar to SQL. It is used to load the data, apply the required filters and dump the data in the required format.
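As a sketch of that load-filter-dump flow (the file name and fields here are hypothetical, not from any real dataset), a minimal Pig Latin session might look like:

```pig
grunt> emp = LOAD 'employees.txt' USING PigStorage('\t') AS (name:chararray, salary:int);
grunt> high = FILTER emp BY salary > 50000;   -- keep only the rows we want
grunt> DUMP high;                             -- print the result to the console
```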

Pig was created to ease the burden of writing complex Java code to perform MapReduce jobs. Earlier, Hadoop developers had to write complex Java code in order to perform data analysis.

In order to perform analysis using Apache Pig, programmers have to write scripts using the Pig Latin language to process data stored in the Hadoop Distributed File System. Internally all these scripts are converted to Map and Reduce tasks. A component known as the Pig Engine is present inside Apache Pig; it takes Pig Latin scripts as input and converts them into MapReduce jobs.

Why we need PIG/What makes Pig Hadoop popular?
  • Programmers can perform MapReduce tasks easily without having to type complex Java code.
  • Easy to learn, read, write and implement if you know SQL.
  • Apache Pig uses a multi-query approach, thereby reducing the length of code; ultimately, Pig reduces development time by almost 10 times.
  • Provides a large number of nested data types such as maps, tuples and bags which are not easily available in MapReduce, along with other data operations like filters, ordering and joins.
  • It has many user groups; for instance, up to 90% of Yahoo's MapReduce is done by Pig and up to 80% of Twitter's MapReduce is also done by Pig, and various other companies such as Salesforce, LinkedIn and Nokia are major users of Pig.
Features of PIG:
  • Rich set of operators: like join, sort, filter, etc.
  • Ease of programming: Pig is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
  • Optimisation opportunities: The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language, not on the optimisation part.
  • Extensibility: Using the existing operators, users can develop their own functions to read, process and write data.
  • User Defined Functions (UDFs): With the facility Pig provides for creating UDFs, we can easily create User Defined Functions in a number of programming languages such as Java and invoke or embed them in Pig scripts.
  • All types of data handling: Apache Pig provides analysis of all types of data (i.e. both structured and unstructured), and the results are stored in HDFS.

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping.

Apache Pig is used-
  • To process huge data sources such as web logs.
  • To perform data processing for search platforms.
  • To process time-sensitive data loads, and in many other use cases of batch data processing.
Apache Pig Components:
There are various components in the Apache Pig framework. Let us take a look at the major components.

Parser:
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer:
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler:
The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine:
Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed, producing the desired results.

History Behind PIG Development:
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via the Apache Incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.

Apache Pig Execution Model/Mechanism:
You can run Apache Pig in two modes-
1) Local Mode   (syntax to launch in local mode:   $ ./pig -x local)
2) HDFS (MapReduce) Mode   (syntax to launch in MapReduce mode:   $ ./pig -x mapreduce, or simply $ ./pig)

There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
  • Script: Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.
  • Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
  • Embedded: You can run Pig programs from Java using the PigServer class, much as you can use JDBC to run SQL programs from Java.
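For illustration, the first two mechanisms might be invoked like this (assuming a script file named wordcount.pig exists; these commands require a Pig installation):

```shell
$ ./pig -x local wordcount.pig      # script mode: run a script file
$ ./pig -e 'fs -ls /'               # -e: run a one-line script from the command line
$ ./pig -x local                    # no file and no -e: starts the Grunt shell
grunt> run wordcount.pig            # run a script from inside Grunt
```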
Example: Word count in Pig
grunt> Lines = LOAD 'input/hadoop.log' AS (line:chararray);
grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> Groups = GROUP Words BY word;
grunt> Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
grunt> Results = ORDER Counts BY cnt DESC;
grunt> Top5 = LIMIT Results 5;
grunt> STORE Top5 INTO '/output/top5words';

Data Types Available in PIG-Latin:
The scalar data types in Pig are int, long, float, double, chararray, and bytearray. The complex data types in Pig are the map, tuple, and bag.
Map:
A Map is a set of key-value pairs. The key needs to be of type chararray and should be unique; the value may be of any type. A map is represented by [].
Example- ['city'#'Bangalore','pin'#560001]
Tuple:
A record formed by an ordered set of fields is known as a Tuple. The fields can be of any type.
Example- (Rajiv,30)
Bag:
A Bag is an unordered set of tuples; in other words, a collection of tuples is known as a bag.
Each tuple can have any number of fields. A bag is represented by {}.
It is not necessary that every tuple contain the same number of fields, or that fields in the same position (column) have the same data type.
So it is a large, unordered collection of tuples, with the tuples in the bag separated by commas.
Example: {('Bangalore',560001),('Mysore',570001),('Mumbai',400001)}
Relation:
A relation is a bag of tuples. The relations in Pig Latin are unordered.
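Putting these together (the file and field names below are purely illustrative), a single LOAD statement can declare a schema that mixes all three complex types:

```pig
grunt> emp = LOAD 'employees.txt'
       AS (name:chararray,                                      -- scalar field
           address:map[],                                       -- map field
           projects:bag{t:tuple(pname:chararray, hours:int)});  -- bag of tuples
```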

LOAD function

The LOAD function helps to load data from the file system. It is known as a relational operator. The first step in a data-flow language is to specify the input, which is done using the keyword LOAD.
The LOAD syntax is
LOAD 'mydata' [USING function] [AS schema];
Example- A = LOAD 'intellipaat.txt';
A = LOAD 'intellipaat.txt' USING PigStorage('\t');
The relational operators in Pig Latin are as follows:
foreach, order by, filter, group, distinct, join, limit.
foreach:
Takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.
A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);
B = foreach A generate emp_name, emp_id;
Filter:
It contains a predicate and lets us select which records will be retained in our data pipeline.
Syntax: alias = FILTER alias BY expression;
Here alias indicates the name of the relation, BY is a required keyword, and the expression is Boolean.
Example: M = FILTER N BY F5 == 4;
Grouping:
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
grunt> Group_bySingleColumn = GROUP Relation_name BY col1;
grunt> Group_byMultiColumn = GROUP Relation_name BY col1, col2;
grunt> Group_byAllColumn = GROUP Relation_name ALL;
Co-Grouping:
The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.
Exp: Suppose we have two files of data (having the same schema/columns) loaded into two different relations. If we want to group both files' data on the same column, we can use the COGROUP operator.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Joins:
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one field (or a group of fields) from each relation as keys. When these keys match, the two tuples are matched; otherwise the records are dropped. Joins can be of the following types:
        Self Join:
When both relations come from the same file, the join is called a self join. To perform it, the same file is loaded into two differently named relations.
grunt> Rel_Name3 = JOIN RelName1 BY col1, RelName2 BY col2;
     Inner Join: The inner join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both relations.
grunt> Rel_Name3 = JOIN RelName1 BY RN1col, RelName2 BY RN2col;
     Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways.
        Left Outer Join:
The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.
grunt> Relation3_name = JOIN Relation1_name BY RN1col LEFT OUTER, Relation2_name BY RN2col;
        Right Outer Join: The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.
grunt> Relation3_name = JOIN Relation1_name BY RN1col RIGHT OUTER, Relation2_name BY RN2col;
        Full Outer Join: The full outer join operation returns all rows from both relations, whether or not there is a match.
grunt> Relation3_name = JOIN Relation1_name BY RN1col FULL OUTER, Relation2_name BY RN2col;
CROSS:
The CROSS operator computes the cross-product of two or more relations.      
Ex:   if we have 5 records in rel1 and 4 records in rel2, then rel3 will have 20 records after applying CROSS to rel1 and rel2.

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
UNION:
The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.
grunt> Relation3_name = UNION Relation1_name, Relation2_name;
SPLIT:
The SPLIT operator is used to split a relation into two or more relations.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example: splitting the student_details relation into student_details1 and student_details2 based on the given conditions
grunt> SPLIT student_details INTO student_details1 IF age<23, student_details2 IF (age>=23 AND age<=25);
DISTINCT:
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
grunt> Relation_name2 = DISTINCT Relation_name1;
ORDER BY:
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.
grunt> Relation_name2 = ORDER Relation_name1 BY col ASC|DESC;
DESCRIBE:
The describe operator is used to view the schema of a relation.
grunt> DESCRIBE Relation_name;
LIMIT:
The LIMIT operator is used to get a limited number of tuples from a relation.
grunt> Result = LIMIT Relation_name N;   -- N is the required number of tuples
EXPLAIN:
The EXPLAIN operator is used to display the logical, physical, and MapReduce execution plans of a relation.
grunt> EXPLAIN Relation_name;
ILLUSTRATE:
The ILLUSTRATE operator gives you the step-by-step execution of a sequence of statements.
grunt> ILLUSTRATE Relation_name;

Apache Pig - Eval Functions:

Apache Pig provides various built-in functions, namely eval, load, store, math, string, bag and tuple functions.

AVG: The Pig Latin AVG() function is used to compute the average of the numerical values within a bag. While calculating the average value, the AVG() function ignores the NULL values.
grunt>AVG(expression) 
Note:
  • To get the average value of a group, we need to group it using the GROUP BY operator and proceed with the average function.
  • To get the global average value, we need to perform a GROUP ALL operation and calculate the average value using the AVG() function.
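For example (assuming a hypothetical comma-delimited file student.txt with name and gpa columns), a global average could be computed as:

```pig
grunt> students = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, gpa:float);
grunt> all_students = GROUP students ALL;                       -- GROUP ALL for a global aggregate
grunt> avg_gpa = FOREACH all_students GENERATE AVG(students.gpa);
```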

BagToString:
The Pig Latin BagToString() function is used to concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
Generally bags are unordered and can be arranged by using the ORDER BY operator.
grunt> BagToString(vals:bag [, delimiter:chararray])
CONCAT: The CONCAT() function of Pig Latin is used to concatenate two or more expressions of the same type.
grunt> CONCAT (expression, expression, [...expression])
COUNT: The COUNT() function of Pig Latin is used to get the number of elements in a bag. While counting the number of tuples in a bag, the COUNT() function ignores (will not count) the tuples having a NULL value in the FIRST FIELD.
grunt> COUNT(expression)
COUNT_STAR: The COUNT_STAR() function of Pig Latin is similar to the COUNT() function. It is used to get the number of elements in a bag. While counting the elements, the COUNT_STAR() function includes the NULL values.
grunt> COUNT_STAR(expression)
DIFF: The DIFF() function of Pig Latin is used to compare two bags (fields) in a tuple. It takes two fields of a tuple as input and matches them. If they match, it returns an empty bag. If they do not match, it finds the elements that exist in one field (bag) and not found in the other, and returns these elements by wrapping them within a bag.
grunt> DIFF (expression, expression)
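As an illustrative sketch (the relation and field names are assumptions), DIFF is typically applied after a COGROUP, which places two bags side by side in each tuple:

```pig
grunt> cogrouped = COGROUP rel1 BY id, rel2 BY id;     -- each tuple now holds two bags: rel1 and rel2
grunt> diffs = FOREACH cogrouped GENERATE group, DIFF(rel1, rel2);
```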

IsEmpty: The IsEmpty() function of Pig Latin is used to check if a bag or map is empty.
grunt> IsEmpty(expression)

MAX: The Pig Latin MAX() function is used to calculate the highest value for a column (numeric values or chararrays) in a single-column bag. While calculating the maximum value, the Max() function ignores the NULL values.
grunt> Max(expression)
MIN: The MIN() function of Pig Latin is used to get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag. While calculating the minimum value, the MIN() function ignores the NULL values.
grunt> MIN(expression)
PluckTuple: After performing an operation such as a join, we use the PluckTuple() function to differentiate the columns of the two schemas. To use this function, we first DEFINE it with a string prefix, and it then filters for the columns in a relation whose names begin with that prefix.
grunt> DEFINE pluck PluckTuple(expression1);
grunt> pluck(expression2)
SIZE:
The SIZE() function of Pig Latin is used to compute the number of elements based on any Pig data type.
grunt> SIZE(expression)
SUBTRACT:
The SUBTRACT() function of Pig Latin is used to subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.

grunt> SUBTRACT(expression, expression)
SUM: You can use the SUM() function of Pig Latin to get the total of the numeric values of a column in a single-column bag. While computing the total, the SUM() function ignores the NULL values.

grunt> SUM(expression)

TOKENIZE:The TOKENIZE() function of Pig Latin is used to split a string (which contains a group of words) in a single tuple and returns a bag which contains the output of the split operation.
grunt> TOKENIZE(expression [, 'field_delimiter']) 
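For instance (assuming a hypothetical file sentences.txt with one line of text per record), each line can be split into a bag of words:

```pig
grunt> lines = LOAD 'sentences.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE TOKENIZE(line);   -- one bag of word tuples per input line
```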

STRING functions

ENDSWITH – Compares two strings to find whether the first string ends with the second string.
grunt> ENDSWITH(first_string, test_string);

STARTSWITH – Compares two strings to find whether the first string starts with the second string.
grunt> STARTSWITH(first_string, test_string);

SUBSTRING – Returns a substring from string. It takes string, startIndex and stopIndex as parameters.
grunt> SUBSTRING(string, start_index, stop_index);

UPPER – Converts all characters in a string to uppercase
grunt> UPPER(string);

LOWER – Converts all characters in a string to lowercase
grunt> LOWER(string);

REPLACE – Replaces characters in a string with new characters. It accepts a string, a regular expression and the new characters as arguments.
grunt> REPLACE(string, regex, newCharacters);

STRSPLIT – Splits the string as per the delimiter
grunt> STRSPLIT(string, regex_expression, limit);
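A combined sketch of a few of these functions (assuming a hypothetical file emails.txt containing one address per line):

```pig
grunt> emails = LOAD 'emails.txt' AS (addr:chararray);
grunt> checks = FOREACH emails GENERATE
           UPPER(addr),              -- the address in uppercase
           ENDSWITH(addr, '.com'),   -- true if the address ends with '.com'
           SUBSTRING(addr, 0, 5);    -- the first five characters
```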


PigStorage:
PigStorage() is the default load/store function in pig. PigStorage expects data to be formatted using field delimiters and the default delimiter is ‘\t’. PigStorage() itself can be used for both Load and Store functions. It Loads/stores data as structured text files. All Pig simple and complex data types can be read/written using this function. The input data to the load can be a file, a directory or a glob.
PigStorage( ['delimiter'] , ['options'] )
grunt> A = LOAD '/in/passwd' USING PigStorage(':', 'schema tagFile');

Here, the default delimiter is '\t', but we can provide any other character/symbol as the delimiter, as in the example above where ':' is used. Multiple options (such as 'schema' and 'tagFile') can be space-separated but must be enclosed together in single quotes.

TextLoader:
TextLoader works with unstructured text files in UTF-8 format. It is mainly useful when we are not able to impose any structure (schema) on the records in the input text files; we can load such files with this loader, and each line of text is treated as a single field of type bytearray, the default data type in Pig. These relations will not have any schema.

TextLoader supports only loading of files and cannot be used to store data. TextLoader doesn't accept any parameters.
grunt> A = LOAD '/in/hadooplogs' USING TextLoader();

BinStorage:
BinStorage is used to both load and store binary files. Users rarely use this but Pig internally uses this to load and store the temporary data that is generated between multiple MapReduce jobs.


When we save text data into binary files with BinStorage, we need to be careful to specify a custom converter to convert the bytearray fields back into the correct data types when loading the binary files created by previous STORE statements.


grunt> STORE B INTO '/out/binary' USING BinStorage();
grunt> C = LOAD '/out/binary/part-*' USING BinStorage();


In comparison to SQL, Pig:
  • uses lazy evaluation,
  • uses extract, transform, load (ETL),
  • is able to store data at any point during a pipeline,
  • declares execution plans,
  • supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines
Difference Between Hive and Pig:
Hive can be treated as a competitor for Pig in some cases, and Hive also operates on HDFS similarly to Pig, but there are some significant differences. HiveQL is a query language based on SQL, but Pig Latin is not a query language; it is a data-flow scripting language.
Since Pig Latin is procedural, it fits very naturally in the pipeline paradigm. HiveQL, on the other hand, is declarative.

Projects:
  1. Zika Virus Analysis and Outbreak Using PIG

