Project: Airline on-time performance , Using Spark Core
Project Introduction:
Domain - Aviation
Technology Use - SPARK CORE (using scala)
DataSet - 2008.csv (DataSet Link: http://stat-computing.org/dataexpo/2009/2008.csv.bz)
Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you'd had more data? This is your chance to find out.
This table contains on-time arrival data for non-stop domestic flights by major air carriers, and provides such additional items as departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
DataSet Description: Here is description of each field used in dataset.
Name | Description | |
---|---|---|
1 | Year | 1987-2008 |
2 | Month | 1-12 |
3 | DayofMonth | 1-31 |
4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |
5 | DepTime | actual departure time (local, hhmm) |
6 | CRSDepTime | scheduled departure time (local, hhmm) |
7 | ArrTime | actual arrival time (local, hhmm) |
8 | CRSArrTime | scheduled arrival time (local, hhmm) |
9 | UniqueCarrier | unique carrier code |
10 | FlightNum | flight number |
11 | TailNum | plane tail number |
12 | ActualElapsedTime | in minutes |
13 | CRSElapsedTime | in minutes |
14 | AirTime | in minutes |
15 | ArrDelay | arrival delay, in minutes |
16 | DepDelay | departure delay, in minutes |
17 | Origin | origin IATA airport code |
18 | Dest | destination IATA airport code |
19 | Distance | in miles |
20 | TaxiIn | taxi in time, in minutes |
21 | TaxiOut | taxi out time in minutes |
22 | Cancelled | was the flight cancelled? |
23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
24 | Diverted | 1 = yes, 0 = no |
25 | CarrierDelay | in minutes |
26 | WeatherDelay | in minutes |
27 | NASDelay | in minutes |
28 | SecurityDelay | in minutes |
29 | LateAircraftDelay | in minutes |
Lets write some Spark SQL code to analyse this data.Lets use Scala shell to write this code.
//Launch sprak shell with data bricks package using below command
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
// Load the file into dataframe using databricks package and cache the flights dataframe
val flights = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter",",").load("2008.csv")
flights.cache()
//Create User defined function
import org.apache.spark.sql.functions.udf
val isDelayedUDF = udf((time: String) => if (time == "NA") 0 else if (time.toInt > 15) 1 else 0)
val flightsWithArrivalDelays = flights.select($"UniqueCarrier",isDelayedUDF($"ArrDelay").alias("IsArrivalDelayed"))
flightsWithArrivalDelays.groupBy($"UniqueCarrier").agg((sum("IsArrivalDelayed") * 100 / count("IsArrivalDelayed")).alias("Percentage of Arrival Flights which got delayed")).show()
Result:
Lets draw the results using Tableaue tool:
Q2) BY UniqueCarrier, Find what Percent of flights are delayed in their departure. val val flightsWithDepDelays = flights.select($"UniqueCarrier",isDelayedUDF($"DepDelay"). alias("IsDelayed"))
flightsWithDepDelays.groupBy($"UniqueCarrier").agg((sum("IsDelayed") * 100 / count("IsDelayed")).alias("Percentage of Departure Delayed Flights")).show()
Results:
Lets draw this result using tableaue tool:
Q)Find Avg Taxi-in
--------------------------
flights.select("Origin", "Dest", "TaxiIn").groupBy("Origin", "Dest").agg(avg("TaxiIn").alias("AvgTaxiIn")).orderBy(desc("AvgTaxiIn")).show(10) Results:
Q-4)Calculate percentage of Arrival delayed flights using flights DataFrame
val flights_Percent_WithArrivalDelays = flightsWithArrivalDelays.agg((sum("IsArrivalDelayed") * 100 / count("IsArrivalDelayed")).alias("Percentage of Arrival Delayed Flights")).show()
So, Percentage of Arrival with No Delayed Flights ==100-20.91651 = 79.08349
Q5)/Calculate percentage of Departure delayed flights using flights DataFrame
val flightsWithDepDelays = flights.select($"UniqueCarrier",isDelayedUDF($"DepDelay").alias("IsDelayed"))
flightsWithDepDelays .agg((sum("IsDelayed") * 100 / count("IsDelayed")).alias("Percentage of Departure Delayed Flights")).show()
So, Percentage of Flights with No Delayed in Departure ==100-18.2089 = 81.7911
Comments
Post a Comment