United Airlines is leveraging big data at the enterprise level to help drive revenue, improve the customer experience, optimize operations, and support our employees in their day-to-day activities. At the center of our big data stack is Apache Hadoop, supported by many other emerging open source frameworks that must be integrated with the myriad of operational systems that support a 90-year-old transportation company with worldwide operations. In addition, learn how streaming data and streaming data analytics are helping to drive operational decisions in real time and how this is being architected to scale horizontally to take advantage of high availability and parallel processing. With the rapidly evolving Hadoop ecosystem, and so many new open source technologies at our disposal, the options for solving long-standing industry problems such as modeling how customers make decisions, making timely and meaningful real-time offers, and optimizing logistical operations have never been better. JOE OLSON, Senior Manager, Big Data Analytics, United Airlines and JONATHAN INGALLS, Sr. Solutions Engineer, Hortonworks
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Big data at United Airlines
1. Big Data At
United Airlines
Joe Olson
Senior Manager, Big Data Analytics
DataWorks Summit San Jose - June 2018
2. Agenda
Data Landscape at United
Current Big Data Analytics Environment
Target Big Data Analytics Environment
A Few Big Data Analytics Use Cases
3. About United Airlines
~ 750 aircraft, with 250+ on order (supply chain)
148M passengers in 2017
(public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
4500 daily departures (scheduling, operations, weather, route planning)
338 airports served, in 49 countries (baggage claim, check-ins)
86,000 employees (scheduling, pay)
Constantly in motion! Future (and past) always changing.
A data scientist's / data engineer's dream.
Source: https://hub.united.com/corporate-fact-sheet/
4. Goals Of The Enterprise Analytics Platform
Improve Customer Experience
- How can we reduce friction when booking a reservation? Maneuvering through an airport?
- How can we deliver a consistent message across all channels? (mobile app, web site, social media etc)
Improve Employee Experience
- How can we keep employees better informed of the current situation so they can relay it to the customers?
- What are we learning from our surveys about what the customer base says is / isn’t working?
Revenue Generation
- What personalized offers can we make to our customers?
- Are our offers competitive with the rest of the industry?
Improve Operational Reliability
- How can we better prepare for weather or other operational interruptions?
- How can we manage the fleet better and ensure spare parts are where they need to be?
6. Current Analytics Environment
Two Main Data Warehouse Platforms
- Teradata – mature data platform, in place for 20+ years. Dedicated team of 25+ people.
ACID compliance allowing for updates. Most ETL here tightly coupled with platform.
- Hortonworks Platform – emerging technology. Economical data science. Data lake
friendly. Community and support frameworks changing faster than more mature Teradata. Log
parsing. Unstructured data and streaming message friendly. Schema-on-read.
- How to get these to play together nicely?
Enterprise Analytics Team Skills
- Very comfortable with SQL – jobs and dashboarding.
- Not so comfortable with parallel processing and APIs.
- Dependency on Hive.
7. Current Analytics Environment
[Diagram: systems of record flow through ETL into the systems of truth]
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs (merch, seat browsing, etc)
Systems of Truth: the two warehouse platforms (Teradata, Hortonworks), fed by ETL
8. Challenge #1 – Data Analytics / Science Where The Data Ain’t
Bookings & flight schedule constantly in motion – all captured in real time in Teradata
- New state = current state + change
24 hr lagging snapshot refreshes for data science?
- Teradata not optimized for “give me what changed yesterday” – especially in <k,v> situations.
- Extra bookkeeping TD side to enable offload for data science?
Straight to the source into data lake?
- ACID tables Hortonworks side? Write optimization compromises read.
- Updates may not be able to keep up with the stream – Hive concurrency model
- Stream to raw, batch process after lands on disk? Introduces latency.
Pass-through queries?
- Still uses Teradata resources – Spool space.
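The "new state = current state + change" model above can be sketched as a key-based upsert applied to a lagging snapshot. A minimal sketch; the field names (`record_id`, `op`, `ts`) are illustrative, not the actual booking schema:

```python
# Sketch: rebuild current state from a snapshot plus a change feed.
# Field names (record_id, op, ts) are illustrative, not the real schema.

def apply_changes(snapshot, changes):
    """new state = current state + change: upsert/delete by key, in ts order."""
    state = {rec["record_id"]: rec for rec in snapshot}
    for chg in sorted(changes, key=lambda c: c["ts"]):
        if chg["op"] == "delete":
            state.pop(chg["record_id"], None)
        else:  # insert or update: latest change wins for that key
            state[chg["record_id"]] = {k: v for k, v in chg.items() if k != "op"}
    return list(state.values())

snapshot = [{"record_id": 1, "ts": 0, "seat": "12A"}]
changes = [
    {"record_id": 1, "ts": 5, "op": "update", "seat": "14C"},
    {"record_id": 2, "ts": 6, "op": "insert", "seat": "30B"},
]
current = apply_changes(snapshot, changes)
```

This is the bookkeeping the slide asks about on the Teradata side: if changes are captured with a key and timestamp, the data-science copy can be refreshed without rescanning the full table.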
9. Challenge #1A – Structuring Data on the Big Data Side
Bookings & flight schedule – mature relational model with (heavy) secondary indexing
- Needs to be queried from multiple directions
- LLAP cache of bookings and flight schedule? Enough space in RAM?
- De-normalized data model
• Not practical in a lot of cases.
- Partitioning, bucketing, ACID.
• Hive concurrency model: read blocks write and write blocks read. Complicates job scheduling.
10. So What’s Working?
Data sync Teradata -> Hive – QueryGrid (Teradata)
- Pass through queries vs data replication
- For replication, 4 – 5 patterns practical:
• ‘Small’ data sets
• ‘Large’ data sets where new data is append only and immutable
(Think appending yesterday as a new partition)
• ‘Large’ data sets where new data changes ‘small’ number of existing partitions
(Think yesterday’s changes can affect data going back a full year)
- Works even better if full year is partitioned by month, rather than by day. (create new)
• ‘Large’ data sets accessed in a <k,v> manner. (ACID)
- May need to re-partition a bucketed data set to allow time series queries
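The "small number of changed partitions" pattern above can be sketched as: scan yesterday's change feed for affected partition keys, then re-extract and overwrite only those partitions. A minimal sketch with illustrative field names; monthly partitioning mirrors the slide's note that month-level partitions mean fewer rewrites than day-level ones:

```python
from collections import defaultdict

def partitions_to_rewrite(changes, partition_of):
    """Group change records by the partition they land in; only these
    partitions need to be re-extracted and overwritten on the Hive side."""
    parts = defaultdict(list)
    for chg in changes:
        parts[partition_of(chg)].append(chg)
    return parts

# Partition by month ('YYYY-MM'): a late change to last September
# dirties one monthly partition instead of one daily partition per day touched.
by_month = lambda chg: chg["flight_date"][:7]

changes = [
    {"flight_date": "2018-06-11", "pnr": "ABC123"},
    {"flight_date": "2018-06-12", "pnr": "DEF456"},
    {"flight_date": "2017-09-03", "pnr": "GHI789"},  # late change, old month
]
dirty = partitions_to_rewrite(changes, by_month)
```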
11. Analytics Environment
[Diagram: systems of record (Bookings, Operations, Customer / Loyalty, Supply Chain, Logs (merch, seat browsing, etc)) flow through ETL into the systems of truth, as on slide 7]
QG Option #1 – replicate data: queries served using only HDP resources
12. Analytics Environment
[Diagram: same systems of record / ETL flow as slide 7]
QG Option #2 – database link: queries served using Teradata resources
13. So What’s Working?
Longer Term - Platform Independent ETL - Nifi
- Nifi – stateless streaming, and stateful streaming where latency can be tolerated.
• Append only to disk + consolidation job
- Common ingestion layer
- Need connectors from operational systems. Not always easy due to ‘operations’
[Nifi flow diagram annotations:]
- Option to buffer here, or run compaction job external to Nifi
- Cosmetic enrichment – can also be replaced with a custom (k,v) parser
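The "append only to disk + consolidation job" pattern above can be sketched as: the stream only ever appends raw records; a periodic compaction keeps the latest record per key so reads stay cheap. A minimal sketch with illustrative field names:

```python
def compact(append_log):
    """Consolidation job: keep only the latest version of each key from an
    append-only log (later entries win), restoring read efficiency."""
    latest = {}
    for rec in append_log:          # log is already in arrival order
        latest[rec["key"]] = rec    # later record overwrites earlier one
    return sorted(latest.values(), key=lambda r: r["key"])

# Illustrative append-only log of flight-status updates.
log = [
    {"key": "UA2032", "status": "scheduled"},
    {"key": "UA2032", "status": "boarding"},
    {"key": "UA0100", "status": "departed"},
    {"key": "UA2032", "status": "departed"},
]
compacted = compact(log)
```

The trade-off named on the slide shows up here: the append path never blocks on updates, but readers see stale keys until the next compaction run, which is why this fits the high-latency-tolerant path.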
14. So What’s NOT Working (yet)?
Data sync Teradata -> Hive – QueryGrid (Teradata)
- ‘Large’ data sets where new data changes ‘large’ number of existing partitions.
- Leveraging QG’s pass-through query abilities here.
Platform Independent ETL
- Streaming stateful messages
• Customized C++ code / Teradata
• Hortonworks Data Flow, Apache Apex, Apache Flink, Nifi + Hbase, Spark micro batching.
- Enterprise message bus - issues
• Not designed with analytics in mind
• No schema registry
15. Target Architecture – Other Considerations
Security
- Common Security strategy with Teradata - GDPR
• Groups defined in Active Directory based on access needs, user assigned to them.
• Groups and users replicated to Teradata and Apache Ranger
• Database roles / permissions defined and reviewed on each platform
Governance
- Looking for a (reasonably priced) solution covering both platforms.
- Apache Atlas – Traceability through Hive, Nifi, HDFS, and Spark (soon) is encouraging.
- May have to resort to custom development using APIs
16. Target Architecture – Data Lake / Curated Layer
[Architecture diagram:]
- Sources: batch (FTP, SCP) and the Enterprise Message Bus (JMS sources: Apache Kafka, IBM MQ Series, Tibco EMS)
- Common Ingestion Layer (Apache Nifi): stateless / stateful, high latency tolerant
- Stateful, Low Latency Ingestion Layer (with state store)
- Data Lake: Hortonworks (ORC on HDFS)
- Curated Layer: Teradata, Hortonworks (Spark ETL)
- Consumers: Advanced Analytics / ML / Data Science; Analytics / KPI Dashboards (SQL, Spark, SAS, R, etc)
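The split between the two ingestion paths in the target architecture can be sketched as a routing rule at the front of the common ingestion layer. The `stateful` / `low_latency` flags and path names are assumptions for illustration, not actual message metadata:

```python
def route(message):
    """Route a message to one of the two ingestion paths in the target
    architecture. Flags are illustrative, not real bus metadata."""
    if message.get("stateful") and message.get("low_latency"):
        return "stateful-low-latency-ingestion"
    # Stateless streams, and stateful ones that tolerate latency, go
    # through the common Nifi ingestion layer.
    return "common-ingestion-nifi"

msgs = [
    {"topic": "seat-browsing-logs", "stateful": False, "low_latency": False},
    {"topic": "booking-changes", "stateful": True, "low_latency": True},
]
routes = [route(m) for m in msgs]
```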
17. Analytics Environment (target)
[Architecture diagram:]
- Systems of Record: Logs, Operations, Customer / Loyalty, Supply Chain, Bookings
- Batch sources (FTP, SCP) and the Enterprise Message Bus feed two ingestion paths:
  • Stateless / Stateful, High Latency Tolerant Ingestion Layer
  • Stateful, Low Latency Ingestion Layer
- Platform Independent ETL (???) into the Raw Data Lake and Curated Layer
- Systems of Truth / curated products: Flight Narrative, Trip Narrative (Active), Trip Narrative (History)
18. Use Case: Flight Narrative
All events that can be tied to a unique flight are stored as time series JSON objects: <T, E, [<k,v>,<k,v>…]>
Example: LAX – ORD UA 2032 06/11/18 11:00pm
- 02/01/18 1:00pm – Added to schedule
- 05/01/18 2:30pm – Aircraft assigned (737-800) #0523
- 06/02/18 10:15am – Equipment change 737-800 #0215
- 06/02/18 10:20am – Seat reaccommodation (click to see impact)
- 06/09/18 11:20am – Crew schedule finalized
- 06/10/18 9:00pm – Gate assignment B22
- 06/11/18 5:00pm – Departure change 11:22pm (Late Inbound Crew)
- 06/11/18 8:00pm – MRD released
- 06/11/18 11:00pm – Boarding begins
- 06/11/18 11:25pm – Catering
- 06/11/18 11:27pm – Boarding ends
- 06/11/18 11:28pm – Last bag scanned
- 06/11/18 11:32pm – Out/Off/Taxi
- 06/12/18 5:30am – On/In/Taxi
- 06/12/18 6:05am – Bags delivered to claim
Event detail examples:
- Inflight Stats: Altitude, Temperature, Wind, Fuel
- Catering: Catering Arrival Time, Catering Inventory, Catering Sign off time
- Crew List: Pilot, Flight Attendants
- Bag Data: Gate Checked Bags (Predicted/Actual), Bulkhead Timeout, # of Checked Bags, First/Last Bag Scanned on board, First/Last Bag Scanned to baggage claim
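The <T, E, [<k,v>,<k,v>…]> structure above can be sketched as a JSON document per unique flight, with one timestamped tuple per event. The key format and field names here are illustrative assumptions, not the actual schema:

```python
import json

def narrative_event(ts, event, **attrs):
    """One <T, E, [<k,v>...]> tuple: timestamp, event name, key-value pairs."""
    return {"T": ts, "E": event, "attrs": attrs}

# Illustrative flight key; the real keying scheme is not shown on the slide.
flight = {
    "flight_key": "UA2032-2018-06-11-LAX-ORD",
    "events": [
        narrative_event("2018-02-01T13:00", "added_to_schedule"),
        narrative_event("2018-06-10T21:00", "gate_assignment", gate="B22"),
        narrative_event("2018-06-11T17:00", "departure_change",
                        new_departure="23:22", reason="Late Inbound Crew"),
    ],
}
doc = json.dumps(flight)
```

Because events only ever append in time order, this shape fits the append-only ingestion patterns described earlier in the deck.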
19. Use Case: Trip Narrative
• Trip Narrative is a chronological collection of events that define a customer’s experience, spanning Pre-Travel, Day-of-Travel, and Post-Travel. Example events:
Ticket Issued, Schedule Change, Itinerary Change, Ancillary Purchase, Return to Blocks, Flight Delayed / Cancelled, Denied Boarding, Cleared Standby, Upgrade Cleared, Flight Status Notification Sent, In/Out/On/Off, Mis-connect, Rebooked on OA, Bag Delivered to Claim, Bag File Opened, Satisfaction Survey Submitted
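The three phases above can be sketched as a simple classification of each narrative event against the trip's scheduled departure date. The rule (compare dates only) is an assumption for illustration; the real boundary between phases is not specified on the slide:

```python
from datetime import date

def travel_phase(event_date, departure_date):
    """Classify a trip event as Pre-Travel / Day-of-Travel / Post-Travel.
    Date-only comparison is an illustrative assumption."""
    if event_date < departure_date:
        return "Pre-Travel"
    if event_date == departure_date:
        return "Day-of-Travel"
    return "Post-Travel"

dep = date(2018, 6, 11)
events = [
    ("Ticket Issued", date(2018, 2, 1)),
    ("Cleared Standby", date(2018, 6, 11)),
    ("Satisfaction Survey Submitted", date(2018, 6, 14)),
]
phased = [(name, travel_phase(d, dep)) for name, d in events]
```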
20. Q & A
We’re hiring!
- Data Engineers
- Data Scientists