Analysis on Movie Rating DataSet Using Spark Core/SQL

Hadoop
------------------------------
Map Reduce | Pig | Hive
SPARK
------------------------------------------------SQL & Hive | StreamingML | GraphX
NOSQL
-----------------------
MongoDB HBase
Data Ingestion Tools
--------------------------
Sqoop Flume

Project Introduction:

Domain - Entertainment 
Technology Use -  SPARK CORE/SQL (using scala)
DataSet - movies.dat , ratings.dat , users.dat (DataSet Link: https://drive.google.com/open?id=1ovZIzQ9p4sKgLQE_ka8RO6kGezYk9wU1)
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.
RATINGS FILE DESCRIPTION:All ratings are contained in the file "ratings.dat" and are in the following format:
  UserID :: MovieID :: Rating :: Timestamp
- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings

MOVIES FILE DESCRIPTION:Movie information is in the file "movies.dat" and is in the following format:
  MovieID :: Title :: Genres
- Titles are identical to titles provided by the IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:
     * Action , * Adventure, * Animation, * Children's, * Comedy, * Crime,
     * Documentary, * Drama,     * Fantasy, * Film-Noir, * Horror, * Musical,
     * Mystery, * Romance, * Sci-Fi, * Thriller, * War, * Western

USERS FILE DESCRIPTION:User information is in the file "users.dat" and is in the following format:
   UserID::Gender::Age::Occupation::Zip-code
Gender is denoted by a "M" for male and "F" for female.Age is chosen from the following ranges:
 *  1:  "Under 18"
 * 18:  "18-24"
 * 25:  "25-34"
 * 35:  "35-44"
 * 45:  "45-49"
 * 50:  "50-55"
 * 56:  "56+"

Occupation is chosen from the following choices:
   0:  "other" or not specified  ,   1:  "academic/educator" ,   2:  "artist"
   3:  "clerical/admin" ,   4:  "college/grad student",   5:  "customer service"
   6:  "doctor/health care" ,  7:  "executive/managerial" ,  8:  "farmer",
   9:  "homemaker" ,  10:  "K-12 student",  11:  "lawyer", 12:  "programmer"
  13:  "retired" , 14:  "sales/marketing",  15:  "scientist",  16:  "self-employed"
  17:  "technician/engineer",  18:  "tradesman/craftsman",  19:  "unemployed"

Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.
// Load all 3 files into RDD"rawMovies " ,"rawRatings " ,"rawUsers "
   val rawMovies = sc.textFile("movies.dat");
   val rawRatings = sc.textFile("ratings.dat");
   val rawUsers    = sc.textFile("users.dat");

//Create case class for creating schema for all above RDD's
   case class clsMovies(movieID:Int,title:String, genres:String);
   case class clsRatings(userid:Int,movieID:Int,rating:Int,timestamp:String);
   case class clsUser(userid:Int,gender:String,age:Int,occupation:Int,zipcode:String);

//Transforming above RDD's to DataFrame using above schema and register as Temp table.
 val moviesWithSchema = rawMovies.map(x=>x.split("::")).map(x=>clsMovies(x(0).toInt,x(1),x(2)));
 val moviesDF = moviesWithSchema .toDF;
 moviesDF.registerTempTable("movies");
val ratingsWithSchema = rawRatings.map(x=>x.split("::")).map(x=> clsRatings(x(0).toInt,x(1).toInt, x(2).toInt,x(3)));
val ratingsDF = ratingsWithSchema .toDF;
ratingsDF.registerTempTable("ratings");

val usersWithSchema = rawUsers.map(x=>x.split("::")).map(x=> clsUser(x(0).toInt,x(1),x(2).toInt, x(3).toInt,x(4)));
val usersDF = usersWithSchema .toDF;
usersDF.registerTempTable("users");

Q-1) Find out the animated movies that are rated 4 or above:
 val q1res=sqlContext.sql("select m.title,r.rating,m.genres from movies m inner join ratings r on m.movieID =r.movieID and r.rating >=4 and genres='Animation' ");
q1res.show()
Result: This is partial result showing only first 20 rows.
+--------------------+------+---------+| title|rating| genres|+--------------------+------+---------+|Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 4|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation||Wallace & Gromit:...| 5|Animation|+--------------------+------+---------+only showing top 20 rows

Q-2) Detect the gender bias on movie ratings for a genres:

val q2res1 = sqlContext.sql("select m.movieID,m.genres,r.userid,r.rating from movies m inner join ratings r on m.movieID =r.movieID ");

val q2myres1= q2res1.withColumn("genres", explode(split($"genres", "\\|")))
q2myres1.registerTempTable("tblMovRating");

val q2res2 = sqlContext.sql("select tmr.movieID,tmr.genres,tmr.rating,u.gender from tblMovRating tmr inner join users u on tmr.userid =u.userid ");

q2res2.registerTempTable("tblMovRatingfinal");

val q2finalResult = sqlContext.sql("select genres,round((sum(case when gender='M' then rating end) /count(case when gender='M' then 1 end) ),2) as male_rating,round((sum(case when gender='F' then rating end) /count(case  when gender='F' then 1 end)),2 ) as Female_rating from tblMovRatingfinal group by genres order by genres desc");

q2finalResult.show();

+-----------+-----------+-------------+| genres|male_rating|Female_rating|+-----------+-----------+-------------+| Western| 3.66| 3.55|| War| 3.89| 3.89|| Thriller| 3.57| 3.57|| Sci-Fi| 3.47| 3.45|| Romance| 3.57| 3.67|| Mystery| 3.66| 3.69|| Musical| 3.6| 3.81|| Horror| 3.22| 3.2|| Film-Noir| 4.09| 4.02|| Fantasy| 3.43| 3.51|| Drama| 3.77| 3.77||Documentary| 3.93| 3.95|| Crime| 3.71| 3.69|| Comedy| 3.5| 3.57|| Children's| 3.36| 3.57|| Animation| 3.66| 3.74|| Adventure| 3.47| 3.51|| Action| 3.49| 3.49|+-----------+-----------+-------------+


Q-3) Group the ratings by age:
val q3res1 = sqlContext.sql("select u.age,r.rating from users u inner join ratings r on u.userid = r.userid ");

q3res1.registerTempTable("tblMovRatingByAge");

val q3res2 = sqlContext.sql("select age,round(avg(rating),2)as Avg_rating from tblMovRatingByAge group by age order by age asc");

q3res2.show();

+---+----------+|age|Avg_rating|+---+----------+| 1| 3.55|| 18| 3.51|| 25| 3.55|| 35| 3.62|| 45| 3.64|| 50| 3.71|| 56| 3.77|+---+----------+


Q-4) Find out the average rating for movies:
val q4res1 = sqlContext.sql("select m.title,r.rating from movies m inner join ratings r on m.movieID =r.movieID ");

q4res1.registerTempTable("tblAvgMovRating");

val q4finalResult = sqlContext.sql(" select title , round(avg(rating),2) as AvgRating from tblAvgMovRating group by title order by AvgRating desc");

q4finalResult .show();

Result: This is partial result showing only first 20 rows.
+--------------------+---------+| title|AvgRating|+--------------------+---------+|Follow the Bitch ...| 5.0||Smashing Time (1967)| 5.0||Song of Freedom (...| 5.0||Bittersweet Motel...| 5.0|| Baby, The (1973)| 5.0||One Little Indian...| 5.0|| Lured (1947)| 5.0||Schlafes Bruder (...| 5.0||Gate of Heavenly ...| 5.0||Ulysses (Ulisse) ...| 5.0||I Am Cuba (Soy Cu...| 4.8|| Lamerica (1994)| 4.75||Apple, The (Sib) ...| 4.67|| Sanjuro (1962)| 4.61||Seven Samurai (Th...| 4.56||Shawshank Redempt...| 4.55||Godfather, The (1...| 4.52||Usual Suspects, T...| 4.52||Close Shave, A (1...| 4.52||Wrong Trousers, T...| 4.51|+--------------------+---------+only showing top 20 rows

Q-5) Find out the titles of the best rated movies:
val q5res1 = q4finalResult.toDF;

val q5finalResult=q5res1.filter($"AvgRating" === 5);

q5finalResult.select("title","AvgRating").show();

Result:
+--------------------+---------+| title|AvgRating|+--------------------+---------+|One Little Indian...| 5.0|| Lured (1947)| 5.0||Schlafes Bruder (...| 5.0||Gate of Heavenly ...| 5.0||Ulysses (Ulisse) ...| 5.0||Bittersweet Motel...| 5.0||Follow the Bitch ...| 5.0||Smashing Time (1967)| 5.0|| Baby, The (1973)| 5.0||Song of Freedom (...| 5.0|+--------------------+---------+



NoteYou can download this dataset from above given link and follow all the above command and you can get full results and even can do more different other analysis on this dataset.

Comments

Post a Comment

Popular posts from this blog

Exploring BigData Analytics Using SPARK in BigData World