Analysis on EBay Auction Dataset Using Spark-SQL

Hadoop
------------------------------
Map Reduce | Pig | Hive
SPARK
------------------------------------------------SQL & Hive | StreamingML | GraphX
NOSQL
-----------------------
MongoDB HBase
Data Ingestion Tools
--------------------------
Sqoop Flume

Project Introduction:

Domain -
Technology Use -  SPARK SQL (using scala)
DataSet - ebay.csv (DataSet Link: https://drive.google.com/open?id=1-qvdt8LIomv8ZIZy0LwwGH0A-91HULXz)

we will see how to perform analysis on ebay Auction data using Spark SQL.The data present in the ebay file is  Aucid, bid, bidtime, bidder, bidderrate, openbid, price, itemtype ,  daystolive.

Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.

//Launch sprak shell with data bricks package  using below command
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

//Import below package
import org.apache.spark.sql.types.{StringType, StructField, StructType}

//Create schema using below command
val schema = StructType(Array(StructField("auctionid", StringType, false),StructField("bid", StringType, false),StructField("bidder", StringType, false),StructField("bidderrate", StringType, false),StructField("openbid", StringType, false),StructField("price", StringType, false),StructField("item", StringType, false),StructField("daystolive", StringType, false)))

// Load the file into dataframe using databricks package
val ebay_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").schema(schema).load("ebay.csv")

//Lets see,how this data looks
ebay_df.show(5)
+----------+-----+--------+--------------+-------+-----+-----+----------+| auctionid| bid| bidder| bidderrate|openbid|price| item|daystolive|+----------+-----+--------+--------------+-------+-----+-----+----------+|8213034705| 95|2.927373| jake7870| 0| 95|117.5| xbox||8213034705| 115|2.943484| davidbresler2| 1| 95|117.5| xbox||8213034705| 100|2.951285|gladimacowgirl| 58| 95|117.5| xbox||8213034705|117.5|2.998947| daysrus| 10| 95|117.5| xbox||8213060420| 2|0.065266| donnie4814| 5| 1| 120| xbox|+----------+-----+--------+--------------+-------+-----+-----+----------+

// Register this dataframe as tempTable
ebay_df.registerTempTable("ebay")

//Now lets analyse this data by queering this sql table
//Q-1.How many auctions were held?-----------------------------------------------------------------------------------------------------------
sqlContext.sql("select count(distinct(auctionid)) as auctionsCount from ebay").show

Result:

+-------------+|auctionsCount|+-------------+| 627|+-------------+

//Q-2.How many bids per item?
-----------------------------------------------------------------------------------------------------------
sqlContext.sql("select auctionid,item,count(*) from ebay group by auctionid,item").show

Result: This is partial result showing only first 20 rows.
+----------+------+---+| auctionid| item|_c2|+----------+------+---+|3016651485| 212.5| 20||3019271858| 245| 28||8213511406| 89| 23||1641722275| 155| 20||8212755285| 182.5| 26||3025778985|229.51| 8||3020805007| 232.5| 18||3023004423| 227.5| 16||1648968984| 960| 12||1649718196| 1799| 26||3014835507| 207.5| 27||3022527291| 222.5| 19||3014792711|233.02| 7||3016771147| 255| 16||3020143804| 232.5| 19||3025677685| 202.5| 6||3019428008| 240.5| 32||3023639316| 212.5| 31||8213188385| 106.5| 38||1646353713| 1735| 11|+----------+------+---+only showing top 20 rows

 //Q-3. What's the min number of bids per item? what's the average? what's the max?---------------------------------------------------------------------------------------------------------------
 sqlContext.sql("select min(a.count) as min ,max(a.count) as max ,avg(a.count)as avg from (select auctionid,item,count(*) as count from ebay group by auctionid,item ) as a ").show

Result:
+---+---+------------------+|min|max| avg|+---+---+------------------+| 1| 75|16.992025518341308|+---+---+------------------+

//Q-4.Get the auctions with closing price > 100
---------------------------------------------------------------------------------------------

sqlContext.sql("select * from ebay where cast(price as int)>100").show

Result: This is partial result showing only first 20 rows.
+----------+------+--------+------------+-------+-----+------+----------+| auctionid| bid| bidder| bidderrate|openbid|price| item|daystolive|+----------+------+--------+------------+-------+-----+------+----------+|8214435808| 122.5| 2.87375|darkaglmax84| 4| 110| 122.5| xbox||8214435808| 120|2.961817| elai2005| 26| 110| 122.5| xbox||8212830525|386.87|0.577431|sybercool123| 536| 150|501.77| xbox||8212830525| 153|0.811042| adavisa1| 0| 150|501.77| xbox||8212830525| 160|0.811551| adavisa1| 0| 150|501.77| xbox||8212830525| 165|1.167859| donna4108| 0| 150|501.77| xbox||8212830525| 175|1.168252| donna4108| 0| 150|501.77| xbox||8212830525| 180|1.212951| donna4108| 0| 150|501.77| xbox||8212830525| 185|1.213438| donna4108| 0| 150|501.77| xbox||8212830525| 190| 1.21375| donna4108| 0| 150|501.77| xbox||8212830525| 198|1.214051| donna4108| 0| 150|501.77| xbox||8212830525| 203|1.518009| adavisa1| 0| 150|501.77| xbox||8212830525| 210|1.518148| adavisa1| 0| 150|501.77| xbox||8212830525| 216|1.527442| adavisa1| 0| 150|501.77| xbox||8212830525| 222|1.570139| adavisa1| 0| 150|501.77| xbox||8212830525| 228|1.570266| adavisa1| 0| 150|501.77| xbox||8212830525| 239|1.570417| adavisa1| 0| 150|501.77| xbox||8212830525| 245| 1.57059| adavisa1| 0| 150|501.77| xbox||8212830525| 250|1.571354| adavisa1| 0| 150|501.77| xbox||8212830525| 260|1.571539| adavisa1| 0| 150|501.77| xbox|+----------+------+--------+------------+-------+-----+------+----------+only showing top 20 rows



Note: You can download this dataset from above given link and follow all the above command and you can get full results and even can do more different other analysis on this dataset.

Comments

Popular posts from this blog

Exploring BigData Analytics Using SPARK in BigData World