Analysis on Dataset available with baby name using SPARK -Core

Hadoop
------------------------------
Map Reduce | Pig | Hive
SPARK
------------------------------------------------SQL & Hive | StreamingML | GraphX
NOSQL
-----------------------
MongoDB HBase
Data Ingestion Tools
--------------------------
Sqoop Flume

Project Introduction:

Domain -
Technology Use -  SPARK CORE (using scala)
DataSet -  part-r-00001_gz.parquet (DataSet Link: https://drive.google.com/open?id=1Tr4sDNE64BTm0Qf7wg9AanrQizyCHOCT)

This data comes from the United States Social Security Administration.Here in this dataset, for each year,we have baby names and total how manytimes that name was used in that year.This data has following fields in file.
FirstName,Gender,Year,Total 
Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.

// Lets load this file in dataframe
val df = sqlContext.read.parquet("part-r-00001_gz.parquet")
//we need to define a utility function. The DataFrame SCala API doesn't support the SQL `LOWER` function. To ensure that our data matches up properly, it'd be nice to force the names to lower case before doing the match. Fortunately, it's easy to define our own `LOWER` function:

val lower = sqlContext.udf.register("lower", (s: String) => s.toLowerCase)

Q-1)How many distinct name available in our dataset
   
df.select(lower($"firstName")).distinct.count
Result:
res10: Long = 45019

2) Let's use the original data frame to find the five most popular names for girls born in 1980
(df.filter(df("year") === 1980).filter(df("gender") === "F").orderBy(df("total").desc, df("firstName")).select("firstName").limit(5)).show
Result:+---------+|firstName|+---------+| Jennifer|| Amanda|| Jessica|| Melissa|| Sarah|+---------+


3)How popular were the top 10 female names of 1890 back in 1880

//Let's use two views of our data to answer this question: How popular were the top 10 female names of 1890 back in 1880?

val ssn1890 = df.filter($"year" === 1890).select($"total".as("total1890"),$"gender".as("gender1890"),lower($"firstName").as("name1890"))
val ssn1880 = df.filter($"year" === 1880).select($"total".as("total1880"),$"gender".as("gender1880"),lower($"firstName").as("name1880"))

// Now, let's find out how popular the top 10 1890 girls' names were in 1880.

val joined = ssn1890.join(ssn1880, ($"name1890" === $"name1880") && ($"gender1890" === $"gender1880")).filter($"gender1890" === "F").orderBy($"total1890".desc).limit(10).select($"name1890".as("BabyName"), $"total1880", $"total1890")

joined.show()

Result:
+---------+---------+---------+| BabyName|total1880|total1890|+---------+---------+---------+| mary| 7065| 12078|| anna| 2604| 5233||elizabeth| 1939| 3112|| margaret| 1578| 3100|| emma| 2003| 2980|| florence| 1063| 2745|| ethel| 633| 2718|| minnie| 1746| 2650|| clara| 1226| 2496|| bertha| 1320| 2388|+---------+---------+---------+

Comments

Popular posts from this blog

Exploring BigData Analytics Using SPARK in BigData World