Analysis on EBay Auction Dataset Using Spark-SQL
Project Introduction:
Domain -
Technology Use - SPARK SQL (using scala)
DataSet - ebay.csv (DataSet Link: https://drive.google.com/open?id=1-qvdt8LIomv8ZIZy0LwwGH0A-91HULXz)
we will see how to perform analysis on ebay Auction data using Spark SQL.The data present in the ebay file is Aucid, bid, bidtime, bidder, bidderrate, openbid, price, itemtype , daystolive.
Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.
//Launch sprak shell with data bricks package using below command
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
//Import below package
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//Create schema using below command
val schema = StructType(Array(StructField("auctionid", StringType, false),StructField("bid", StringType, false),StructField("bidder", StringType, false),StructField("bidderrate", StringType, false),StructField("openbid", StringType, false),StructField("price", StringType, false),StructField("item", StringType, false),StructField("daystolive", StringType, false)))
// Load the file into dataframe using databricks package
val ebay_df = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").schema(schema).load("ebay.csv")
//Lets see,how this data looks
ebay_df.show(5)
// Register this dataframe as tempTable
ebay_df.registerTempTable("ebay")
//Now lets analyse this data by queering this sql table
//Q-1.How many auctions were held?-----------------------------------------------------------------------------------------------------------
sqlContext.sql("select count(distinct(auctionid)) as auctionsCount from ebay").show
Result:
//Q-2.How many bids per item?-----------------------------------------------------------------------------------------------------------
sqlContext.sql("select auctionid,item,count(*) from ebay group by auctionid,item").show
Result: This is partial result showing only first 20 rows.
//Q-3. What's the min number of bids per item? what's the average? what's the max?---------------------------------------------------------------------------------------------------------------
sqlContext.sql("select min(a.count) as min ,max(a.count) as max ,avg(a.count)as avg from (select auctionid,item,count(*) as count from ebay group by auctionid,item ) as a ").show
Result:
//Q-4.Get the auctions with closing price > 100
---------------------------------------------------------------------------------------------
sqlContext.sql("select * from ebay where cast(price as int)>100").show
Result: This is partial result showing only first 20 rows.
Note: You can download this dataset from above given link and follow all the above command and you can get full results and even can do more different other analysis on this dataset.
Comments
Post a Comment