Project: Airline on-time performance , Using Spark Core

Hadoop
------------------------------
Map Reduce | Pig | Hive
SPARK
------------------------------------------------SQL & Hive | StreamingML | GraphX
NOSQL
-----------------------
MongoDB HBase
Data Ingestion Tools
--------------------------
Sqoop Flume

Project Introduction:

Domain - Aviation
Technology Use -  SPARK CORE (using scala)
DataSet -  2008.csv (DataSet Link: http://stat-computing.org/dataexpo/2009/2008.csv.bz)

Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is your chance to find out.

This table contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

DataSet Description: Here is description of each field used in dataset.
NameDescription
1Year1987-2008
2Month1-12
3DayofMonth1-31
4DayOfWeek1 (Monday) - 7 (Sunday)
5DepTimeactual departure time (local, hhmm)
6CRSDepTimescheduled departure time (local, hhmm)
7ArrTimeactual arrival time (local, hhmm)
8CRSArrTimescheduled arrival time (local, hhmm)
9UniqueCarrierunique carrier code
10FlightNumflight number
11TailNumplane tail number
12ActualElapsedTimein minutes
13CRSElapsedTimein minutes
14AirTimein minutes
15ArrDelayarrival delay, in minutes
16DepDelaydeparture delay, in minutes
17Originorigin IATA airport code
18Destdestination IATA airport code
19Distancein miles
20TaxiIntaxi in time, in minutes
21TaxiOuttaxi out time in minutes
22Cancelledwas the flight cancelled?
23CancellationCodereason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24Diverted1 = yes, 0 = no
25CarrierDelayin minutes
26WeatherDelayin minutes
27NASDelayin minutes
28SecurityDelayin minutes
29LateAircraftDelayin minutes

Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.

//Launch sprak shell with data bricks package  using below command
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

// Load the file into dataframe using databricks package and cache the flights dataframe
val flights = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",",").load("2008.csv")
flights.cache()

//Create User defined function
import org.apache.spark.sql.functions.udf

val isDelayedUDF = udf((time: String) => if (time == "NA") 0 else if (time.toInt > 15) 1 else 0)

Q1) BY UniqueCarrier, Find what Percent of flights are delayed in their arrival. 

val flightsWithArrivalDelays = flights.select($"UniqueCarrier",isDelayedUDF($"ArrDelay").alias("IsArrivalDelayed"))

flightsWithArrivalDelays.groupBy($"UniqueCarrier").agg((sum("IsArrivalDelayed") * 100 / count("IsArrivalDelayed")).alias("Percentage of Arrival Flights which got delayed")).show()
Result:
+-------------+-----------------------------------------------+|UniqueCarrier|Percentage of Arrival Flights which got delayed|+-------------+-----------------------------------------------+| AA| 26.018664704861255|| AQ| 4.743589743589744|| HA| 8.483485912075826|| AS| 19.10034281478736|| UA| 24.95244875031979|| B6| 24.32799057580409|| NW| 21.21633127380254|| US| 17.48984212580111|| OH| 25.486445318232654|| OO| 17.814757413705856|| CO| 23.583119733293127|| WN| 17.53536913544702|| DL| 20.884382792948482|| XE| 22.66855357667352|| EV| 22.909382518043305|| F9| 19.485808567072535|| 9E| 16.777138760068343|| YV| 22.401443533519004|| FL| 21.271075037067607|| MQ| 22.391597190096455|+-------------+-----------------------------------------------+

Lets draw the results using Tableaue tool:

Q2) BY UniqueCarrier, Find what Percent of flights are delayed in their departure. val val flightsWithDepDelays = flights.select($"UniqueCarrier",isDelayedUDF($"DepDelay"). alias("IsDelayed"))
flightsWithDepDelays.groupBy($"UniqueCarrier").agg((sum("IsDelayed") * 100 / count("IsDelayed")).alias("Percentage of Departure Delayed Flights")).show()
Results:+-------------+---------------------------------------+|UniqueCarrier|Percentage of Departure Delayed Flights|+-------------+---------------------------------------+| AA| 22.277788339932385|| AQ| 4.564102564102564|| HA| 6.185100119690745|| AS| 16.169210202379848|| UA| 22.203930903306897|| B6| 20.373194078259584|| NW| 14.52486969728349|| US| 13.351293792397964|| OH| 19.152661595996094|| OO| 15.439409407238534|| CO| 21.145231274396476|| WN| 18.95962068776139|| DL| 15.64796395909995|| XE| 19.277989906811566|| EV| 20.49790608571683|| F9| 15.476911509784674|| 9E| 13.685699902367586|| YV| 19.482995332051935|| FL| 17.555142844040905|| MQ| 20.19633457171796|+-------------+---------------------------------------+
Lets draw this result using tableaue tool:





Q)Find Avg Taxi-in
--------------------------
flights.select("Origin", "Dest", "TaxiIn").groupBy("Origin", "Dest").agg(avg("TaxiIn").alias("AvgTaxiIn")).orderBy(desc("AvgTaxiIn")).show(10) Results:
+------+----+------------------+|Origin|Dest| AvgTaxiIn|+------+----+------------------+| BNA| FSD| 48.0|| VPS| LGA| 29.0|| RIC| RDU| 29.0|| XNA| SGF| 21.0|| CHA| DTW|20.732394366197184|| LAX| JAX| 20.5|| HSV| LGA|20.328358208955223|| MFE| ATL| 20.0|| BUR| RNO| 20.0|| MYR| PHL| 19.0|+------+----+------------------+only showing top 10 rows

Q-4)Calculate percentage of Arrival delayed flights using flights DataFrame

val flights_Percent_WithArrivalDelays = flightsWithArrivalDelays.agg((sum("IsArrivalDelayed") * 100 / count("IsArrivalDelayed")).alias("Percentage of Arrival Delayed Flights")).show()+-------------------------------------+|Percentage of Arrival Delayed Flights|+-------------------------------------+| 20.91651773078784|+-------------------------------------+

So, Percentage of Arrival with No Delayed Flights ==100-20.91651 = 79.08349


Q5)/Calculate percentage of Departure delayed flights using flights DataFrame
val flightsWithDepDelays = flights.select($"UniqueCarrier",isDelayedUDF($"DepDelay").alias("IsDelayed"))
flightsWithDepDelays .agg((sum("IsDelayed") * 100 / count("IsDelayed")).alias("Percentage of Departure Delayed Flights")).show()+---------------------------------------+|Percentage of Departure Delayed Flights|+---------------------------------------+| 18.20892337049312|+---------------------------------------+

So, Percentage of Flights with No Delayed in Departure ==100-18.2089 = 81.7911

Comments

Popular posts from this blog

Exploring BigData Analytics Using SPARK in BigData World