Zika Virus Outbreak Analysis Using Pig


Project Introduction:
Domain - Health Care
Technology Used - Pig Latin on a Hadoop cluster
Dataset - cdc_zika.csv (Dataset link: https://www.kaggle.com/cdc/zika-virus-epidemic)

An outbreak of the Zika virus, an infection transmitted mostly by Aedes species mosquitoes (Ae. aegypti and Ae. albopictus), has been sweeping across the Americas and the Pacific since mid-2015. This dataset shares publicly available data related to the ongoing Zika epidemic.
With the help of this dataset, we are going to generate a few reports that help us understand how many countries are affected by the Zika virus, whether cases are increasing or decreasing in the affected countries, which age groups are affected most, and more.

DATASET DESCRIPTION:
  • report_date :: The date on which the report was published.
  • location :: The location of each observation, using the names specified in the country place name database.
  • location_type :: The type of location: city, district, municipality, county, state, province, or country.
  • data_field :: A short description of the data represented in the row, tied to a specific definition in the report from which it comes.
  • data_field_code :: A code defined in the country data guide. It consists of a two-letter country code (ISO 3166 alpha-2) followed by a 4-digit number corresponding to a specific report type and data type.
  • time_period :: Optional.
  • time_period_type :: Required only if 'time_period' is specified.
  • value :: The observation for the specific 'report_date', 'location', and 'data_field'.
  • unit :: The unit of measurement for the 'data_field'.
Now let's do some analysis on this dataset.

Q1:: Find the most affected country in terms of Zika confirmed and Zika suspected cases, as well as the total number of Zika confirmed and Zika suspected cases across all the countries available in the dataset.

-- Load the data file with the schema below

raw = LOAD 'cdc_zika.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (report_date:chararray, location:chararray, location_type:chararray, data_field:chararray, data_field_code:chararray, time_period:chararray, time_period_type:chararray, value:int, unit:chararray);
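
CSVExcelStorage is part of Piggybank, so the jar usually has to be registered before the LOAD can run. The sketch below is one possible drop-in alternative to the LOAD above. It assumes the jar lives at /usr/lib/pig/piggybank.jar (a hypothetical path; adjust it for your cluster) and uses the four-argument form of the loader, which newer Piggybank releases provide, to skip the CSV header row.

```pig
-- Assumption: piggybank.jar is at this path; change it to match your installation.
REGISTER /usr/lib/pig/piggybank.jar;

-- Loader arguments: delimiter, multiline handling, line-ending style, header handling.
-- SKIP_INPUT_HEADER drops the first (header) line of the CSV file.
raw = LOAD 'cdc_zika.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
      AS (report_date:chararray, location:chararray, location_type:chararray,
          data_field:chararray, data_field_code:chararray, time_period:chararray,
          time_period_type:chararray, value:int, unit:chararray);

-- Quick sanity check: print the schema and a handful of rows.
DESCRIBE raw;
first_rows = LIMIT raw 5;
DUMP first_rows;
```

Without SKIP_INPUT_HEADER the header line is loaded as an ordinary row. In the reports here it is dropped anyway, because its location_type is not 'country' and its data_field does not contain 'zika'.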

-- Keep only country-level rows for 'zika confirmed' and 'zika suspected' cases
filter_reportData = FILTER raw BY (location_type == 'country' and ((data_field matches '.*(zika).*' and data_field matches '.*(confirmed).*') or (data_field matches '.*(zika).*' and data_field matches '.*(suspected).*')));

-- Keep only the fields required for the report
reqReportField = FOREACH filter_reportData GENERATE location, value;

-- Group by location to get the total Zika cases per country
grp_reportData = GROUP reqReportField BY location;

-- Get each country and the total Zika cases found for that country
grp_reportData_count = FOREACH grp_reportData GENERATE group AS Country, SUM(reqReportField.value) AS TotalZikaCases;

-- Sort the results by total Zika cases, highest first
Zika_AffectedCases = ORDER grp_reportData_count BY TotalZikaCases DESC;

-- Store the results on the Hadoop cluster
STORE Zika_AffectedCases INTO 'Zika_AffectedCases';

-- Results of this report: country name and total 'zika confirmed' and 'zika suspected' cases
(Dominican_Republic,48025)
(Ecuador,7146)
(Guatemala,4923)
(Nicaragua,4863)
(Haiti,329)

-- Limit the result set to 1 row to get the country most affected by 'zika confirmed' and 'zika suspected' cases
most_Zika_AffectedCountry = LIMIT Zika_AffectedCases 1;

-- Store the results on the Hadoop cluster
STORE most_Zika_AffectedCountry INTO 'most_Zika_AffectedCountry';

-- Result of this report: the most affected country with its total 'zika confirmed' and 'zika suspected' cases
(Dominican_Republic,48025)
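
Q1 also mentions a total across all the countries, while the script above stops at per-country totals. If a single overall figure is wanted as well, one possible approach is GROUP ... ALL. This is only a sketch: it reuses the same filter, the aliases zika_cases, all_cases and grand_total are made up for this example, and the piggybank.jar path is an assumption.

```pig
-- Assumption: piggybank.jar is at this path; change it to match your installation.
REGISTER /usr/lib/pig/piggybank.jar;

raw = LOAD 'cdc_zika.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (report_date:chararray, location:chararray, location_type:chararray,
          data_field:chararray, data_field_code:chararray, time_period:chararray,
          time_period_type:chararray, value:int, unit:chararray);

-- Same filter as in the Q1 script: country-level confirmed or suspected rows.
zika_cases = FILTER raw BY location_type == 'country'
             AND ((data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(confirmed).*')
               OR (data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(suspected).*'));

-- GROUP ... ALL puts every remaining row into one group,
-- so SUM yields a single total over all countries.
all_cases   = GROUP zika_cases ALL;
grand_total = FOREACH all_cases GENERATE SUM(zika_cases.value) AS TotalZikaCases;

DUMP grand_total;
```

SUM skips null values, so rows whose value could not be read as an int do not affect the total.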

Here is the data visualisation using Tableau: [figure]


Q2:: The reports in the dataset are published weekly or twice a week. Analyze, by reported date, which country has the most Zika confirmed and Zika discarded cases. This helps in understanding in which month or season a country is most affected by the disease, which in turn helps in understanding the temperature or climate conditions that favour it.

-- Load the data file with the schema below
raw = LOAD 'cdc_zika.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (report_date:chararray, location:chararray, location_type:chararray, data_field:chararray, data_field_code:chararray, time_period:chararray, time_period_type:chararray, value:int, unit:chararray);

-- Keep only country-level rows for 'zika confirmed' and 'zika discarded' cases
filter_reportData = FILTER raw BY (location_type == 'country' and ((data_field matches '.*(zika).*' and data_field matches '.*(confirmed).*') or (data_field matches '.*(zika).*' and data_field matches '.*(discarded).*')));

-- Keep only the fields required for the report
getReportField = FOREACH filter_reportData GENERATE report_date, location, value;

-- Group by report_date and location to get the total Zika cases per reported date and country
grpBY_reportData = GROUP getReportField BY (report_date, location);

-- Get the reported date, the country, and the total Zika cases for that date and country
totalCase_By_ReportedDate_n_Country = FOREACH grpBY_reportData GENERATE group.report_date, group.location, SUM(getReportField.value) AS TotalZikaCases;

-- Group the per-country totals by report_date
grpByCases_By_reportedDate = GROUP totalCase_By_ReportedDate_n_Country BY report_date;

-- Get the most affected country for each reported date
mostAfectedCounty_OnReporteDates = FOREACH grpByCases_By_reportedDate {
  inner_sorted = ORDER totalCase_By_ReportedDate_n_Country BY TotalZikaCases DESC;
  mostAfectedcountry_Record = LIMIT inner_sorted 1;
  GENERATE BagToString(mostAfectedcountry_Record) AS MostAfectedcountryOnReportedDate;
}

-- Store the results on the Hadoop cluster
STORE mostAfectedCounty_OnReporteDates INTO 'mostAfectedCounty_OnReporteDates';
-- Below is the result from the above relation
(2015-12-09_Guatemala_29)
(2015-12-16_Guatemala_29)
(2015-12-23_Guatemala_29)
(2015-12-29_Guatemala_29)
(2016-01-14_Guatemala_68)
(2016-01-23_Dominican_Republic_10)
(2016-01-26_Guatemala_37)
(2016-02-06_Dominican_Republic_16)
(2016-02-09_Nicaragua_44)
(2016-02-11_Nicaragua_55)
(2016-02-12_Nicaragua_63)
(2016-02-13_Dominican_Republic_34)
(2016-02-15_Nicaragua_92)
(2016-02-16_Guatemala_146)
(2016-02-20_Dominican_Republic_18)
(2016-02-22_Nicaragua_89)
(2016-02-23_Nicaragua_81)
(2016-02-29_Nicaragua_98)
(2016-03-01_Guatemala_167)
(2016-03-07_Nicaragua_128)
(2016-03-11_Nicaragua_112)
(2016-03-14_Nicaragua_145)
(2016-03-16_Nicaragua_140)
(2016-03-19_Dominican_Republic_18)
(2016-03-26_Dominican_Republic_40)
(2016-03-28_Nicaragua_148)
(2016-03-30_Ecuador_184)
(2016-04-02_Dominican_Republic_70)
(2016-04-04_Nicaragua_147)
(2016-04-06_Ecuador_188)
(2016-04-09_Dominican_Republic_73)
(2016-04-13_Ecuador_191)
(2016-04-16_Dominican_Republic_104)
(2016-04-18_Nicaragua_158)
(2016-04-20_Ecuador_197)
(2016-04-23_Dominican_Republic_104)
(2016-04-25_Nicaragua_178)
(2016-04-27_Ecuador_202)
(2016-04-30_Dominican_Republic_104)
(2016-05-03_Nicaragua_196)
(2016-05-04_Nicaragua_226)
(2016-05-05_Nicaragua_217)
(2016-05-07_Dominican_Republic_104)
(2016-05-09_Nicaragua_217)
(2016-05-11_Nicaragua_218)
(2016-05-14_Dominican_Republic_104)
(2016-05-16_Nicaragua_196)
(2016-05-18_Ecuador_276)
(2016-05-19_Nicaragua_234)
(2016-05-21_Dominican_Republic_104)
(2016-05-23_Nicaragua_246)
(2016-05-25_Ecuador_338)
(2016-05-26_Nicaragua_9)
(2016-05-27_Nicaragua_254)
(2016-05-28_Dominican_Republic_159)
(2016-06-01_Ecuador_426)
(2016-06-03_Nicaragua_47)
(2016-06-04_Dominican_Republic_161)
(2016-06-06_Nicaragua_232)
(2016-06-07_Nicaragua_12)
(2016-06-13_Nicaragua_328)
(2016-06-15_Ecuador_550)
(2016-06-22_Ecuador_1174)
(2016-06-29_Ecuador_1316)

Here is the data visualisation using Tableau:
The graph below shows, by reported date, the total Zika confirmed and discarded cases for each country.
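
Since Q2 is ultimately about months and seasons, it may help to roll the same data up by month instead of by individual report date. The sketch below is one way to do that, assuming report_date is in the yyyy-MM-dd form seen in the results above. The aliases and the piggybank.jar path are hypothetical. Whether a plain sum per month is meaningful depends on the data_field definitions: if a country reports cumulative counts, adding up several reports from the same month will overstate the cases.

```pig
-- Assumption: piggybank.jar is at this path; change it to match your installation.
REGISTER /usr/lib/pig/piggybank.jar;

raw = LOAD 'cdc_zika.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (report_date:chararray, location:chararray, location_type:chararray,
          data_field:chararray, data_field_code:chararray, time_period:chararray,
          time_period_type:chararray, value:int, unit:chararray);

-- Same filter as in the Q2 script: country-level confirmed or discarded rows.
zika_cases = FILTER raw BY location_type == 'country'
             AND ((data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(confirmed).*')
               OR (data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(discarded).*'));

-- The first 7 characters of a yyyy-MM-dd date are the year and month, e.g. 2016-05.
with_month = FOREACH zika_cases GENERATE SUBSTRING(report_date, 0, 7) AS report_month,
                                         location, value;

-- Total per month and country.
by_month = GROUP with_month BY (report_month, location);
monthly  = FOREACH by_month GENERATE FLATTEN(group) AS (report_month, location),
                                     SUM(with_month.value) AS TotalZikaCases;

-- Chronological order, highest totals first within each month.
monthly_sorted = ORDER monthly BY report_month ASC, TotalZikaCases DESC;

DUMP monthly_sorted;
```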

Q3:: Analyze each country's data by reported date to see whether the number of Zika confirmed and suspected cases is increasing or decreasing day by day. This helps in identifying the countries where the disease is growing fastest, as well as the countries that are taking precautions against it and where the numbers are controlled or decreasing.

-- Load the data file with the schema below
raw = LOAD 'cdc_zika.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (report_date:chararray, location:chararray, location_type:chararray, data_field:chararray, data_field_code:chararray, time_period:chararray, time_period_type:chararray, value:int, unit:chararray);

-- Keep only country-level rows for 'zika confirmed' and 'zika suspected' cases
filter_reportData = FILTER raw BY (location_type == 'country' and ((data_field matches '.*(zika).*' and data_field matches '.*(confirmed).*') or (data_field matches '.*(zika).*' and data_field matches '.*(suspected).*')));

-- Keep only the fields required for the report
getReportField = FOREACH filter_reportData GENERATE location, report_date, value;

-- Group by country and reported date
grpBY_reportData = GROUP getReportField BY (location, report_date);

-- For each country and reported date, keep one record from the group.
-- Note: location and report_date are constant within a group, so the ORDER has no effect
-- and LIMIT 1 returns a single row of the group rather than a sum of its values.
totalCase_By_Country_n_ReportedDate = FOREACH grpBY_reportData {
  sort = ORDER getReportField BY location ASC, report_date ASC;
  latest = LIMIT sort 1;
  GENERATE FLATTEN(latest);
};

-- Store the result on the Hadoop cluster
STORE totalCase_By_Country_n_ReportedDate INTO 'TotalZikaCaseBy_Country_n_ReportedDate';

-- Below is a sample of the results
(Haiti,2016-02-03,329)
(Ecuador,2016-03-30,0)
(Ecuador,2016-04-06,3)
(Ecuador,2016-04-13,0)
(Ecuador,2016-04-20,3)
(Ecuador,2016-04-27,1)
(Ecuador,2016-05-04,1)
(Ecuador,2016-05-18,9)
(Ecuador,2016-05-25,1)
(Ecuador,2016-06-01,16)
(Ecuador,2016-06-15,2)
(Ecuador,2016-06-22,5)
(Ecuador,2016-06-29,12)
(Guatemala,2015-12-09,29)
(Guatemala,2015-12-16,10)
(Guatemala,2015-12-23,29)
(Guatemala,2015-12-29,21)
(Guatemala,2016-01-14,68)
(Guatemala,2016-01-19,25)
(Guatemala,2016-01-26,78)
(Guatemala,2016-02-02,112)
(Guatemala,2016-02-09,57)
(Guatemala,2016-02-16,318)
(Guatemala,2016-02-23,127)
(Guatemala,2016-03-01,25)
(Guatemala,2016-03-08,41)
(Guatemala,2016-03-15,29)
(Guatemala,2016-03-26,5)
(Nicaragua,2016-02-09,5)
(Nicaragua,2016-02-11,17)
(Nicaragua,2016-02-12,4)
(Nicaragua,2016-02-15,27)
(Nicaragua,2016-02-16,1)
(Nicaragua,2016-02-22,77)
(Nicaragua,2016-02-23,79)
(Nicaragua,2016-02-29,6)
(Nicaragua,2016-03-01,93)
(Nicaragua,2016-03-07,104)
(Nicaragua,2016-03-11,110)
(Nicaragua,2016-03-14,11)
(Nicaragua,2016-03-16,12)
(Nicaragua,2016-03-28,129)
(Nicaragua,2016-04-04,131)
(Nicaragua,2016-04-18,139)
(Nicaragua,2016-04-25,17)
(Nicaragua,2016-05-03,29)
(Nicaragua,2016-05-04,179)
(Nicaragua,2016-05-05,32)
(Nicaragua,2016-05-09,33)
(Nicaragua,2016-05-11,185)
(Nicaragua,2016-05-16,196)
(Nicaragua,2016-05-19,196)
(Nicaragua,2016-05-23,206)
(Nicaragua,2016-05-26,9)
(Nicaragua,2016-05-27,212)
(Nicaragua,2016-06-03,42)
(Nicaragua,2016-06-06,232)
(Nicaragua,2016-06-07,12)
(Nicaragua,2016-06-13,14)
(Dominican_Republic,2016-01-09,0)
(Dominican_Republic,2016-01-16,0)
(Dominican_Republic,2016-01-23,10)
(Dominican_Republic,2016-01-30,32)
(Dominican_Republic,2016-02-06,8)
(Dominican_Republic,2016-02-13,82)
(Dominican_Republic,2016-02-20,395)
(Dominican_Republic,2016-02-27,101)
(Dominican_Republic,2016-03-05,87)
(Dominican_Republic,2016-03-12,9)
(Dominican_Republic,2016-03-19,991)
(Dominican_Republic,2016-03-26,36)
(Dominican_Republic,2016-04-02,12)
(Dominican_Republic,2016-04-09,208)
(Dominican_Republic,2016-04-16,235)
(Dominican_Republic,2016-04-23,31)
(Dominican_Republic,2016-04-30,95)
(Dominican_Republic,2016-05-07,265)
(Dominican_Republic,2016-05-14,77)
(Dominican_Republic,2016-05-21,73)
(Dominican_Republic,2016-05-28,50)
(Dominican_Republic,2016-06-04,216)

Data visualisation using Tableau:
The graph below shows each country's Zika confirmed and suspected cases by reported date, which helps in understanding whether cases are increasing or decreasing day by day in that country.
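
The nested FOREACH in the Q3 script keeps a single row for each country and reported date. If the goal is instead the total of all confirmed and suspected rows for each country and date, a SUM over each group is one option. The following is a sketch under that assumption, not a replacement for the results listed above (its output will differ from them). The aliases, the output directory name and the piggybank.jar path are hypothetical.

```pig
-- Assumption: piggybank.jar is at this path; change it to match your installation.
REGISTER /usr/lib/pig/piggybank.jar;

raw = LOAD 'cdc_zika.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (report_date:chararray, location:chararray, location_type:chararray,
          data_field:chararray, data_field_code:chararray, time_period:chararray,
          time_period_type:chararray, value:int, unit:chararray);

-- Same filter as in the Q3 script: country-level confirmed or suspected rows.
zika_cases = FILTER raw BY location_type == 'country'
             AND ((data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(confirmed).*')
               OR (data_field MATCHES '.*(zika).*' AND data_field MATCHES '.*(suspected).*'));

report_rows = FOREACH zika_cases GENERATE location, report_date, value;

-- One group per country and reported date; SUM adds up every row in the group.
by_country_date = GROUP report_rows BY (location, report_date);
totals = FOREACH by_country_date GENERATE FLATTEN(group) AS (location, report_date),
                                          SUM(report_rows.value) AS TotalZikaCases;

-- Sort so each country's dates appear in chronological order.
totals_sorted = ORDER totals BY location ASC, report_date ASC;

-- Hypothetical output directory, chosen so the original Q3 output is not overwritten.
STORE totals_sorted INTO 'TotalZikaCaseBy_Country_n_ReportedDate_sum';
```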


Q4:: Analyze the number of Zika confirmed and suspected cases by age group, which helps in understanding which age groups are most at risk, as outcomes vary across age groups.

-- Load the data file with the schema below
raw = LOAD 'cdc_zika.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (report_date:chararray, location:chararray, location_type:chararray, data_field:chararray, data_field_code:chararray, time_period:chararray, time_period_type:chararray, value:int, unit:chararray);

-- Keep only 'zika confirmed' and 'zika suspected' rows that are broken down by age group
filter_reportData = FILTER raw BY (((data_field matches '.*(zika).*' and data_field matches '.*(confirmed).*') or (data_field matches '.*(zika).*' and data_field matches '.*(suspected).*')) and (data_field matches '.*(ages).*'));

-- Extract the age group from data_field: the text after 'ages_', dropping the last two characters of the field
getReportField = FOREACH filter_reportData GENERATE SUBSTRING(data_field, INDEXOF(data_field, 'ages_', 0) + 5, (int)SIZE(data_field) - 2) AS AgeGrp, value;

-- Group by age group
grp_reportData = GROUP getReportField BY AgeGrp;

-- Get each age group and the total Zika cases for that age group
totalCase_By_AgeGroup = FOREACH grp_reportData GENERATE group AS AgeGrp, SUM(getReportField.value) AS TotalZikaCases;

-- Store the results on the Hadoop cluster
STORE totalCase_By_AgeGroup INTO 'TotalZikaCase_By_AgeGroup';

-- Below is the result from this report
(0-11mo,45)
(1-4yrs,160)
(5-9yrs,154)
(over65,49)
(10-14yrs,235)
(15-19yrs,259)
(20-49yrs,1358)
(50-64yrs,321)

Data visualisation using Tableau:
This graph shows the Zika cases by age group.


Q5:: Predicting the outbreak of the Zika virus based on the existing data

The Q1 report shows that the Dominican Republic is the most affected country.

The Q2 report shows that May is the month in which countries are most affected by the Zika virus.

The Q3 report shows that Zika virus cases are increasing day by day.

The Q4 report shows that the 20-49yrs age group is the most affected by the Zika virus.
