Big Data At
United Airlines
Joe Olson
Senior Manager, Big Data Analytics
DataWorks Summit San Jose - June 2018
Agenda
Data Landscape at United
Current Big Data Analytics Environment
Target Big Data Analytics Environment
A Few Big Data Analytics Use Cases
About United Airlines…
▪ ~750 aircraft, with 250+ on order (supply chain)
▪ 148M passengers in 2017 (public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
▪ 4,500 daily departures (scheduling, operations, weather, route planning)
▪ 338 airports served, in 49 countries (baggage claim, check-ins)
▪ 86,000 employees (scheduling, pay)
▪ Constantly in motion! Future (and past) always changing.
▪ A data scientist / data engineer dream.
Source: https://hub.united.com/corporate-fact-sheet/
Goals Of The Enterprise Analytics Platform
▪ Improve Customer Experience
- How can we reduce friction when booking a reservation? Maneuvering through an airport?
- How can we deliver a consistent message across all channels? (mobile app, web site, social media, etc.)
▪ Improve Employee Experience
- How can we keep employees better informed of the current situation so they can relay it to the customers?
- What are we learning from our surveys about what the customer base says is / isn’t working?
▪ Revenue Generation
- What personalized offers can we make to our customers?
- Are our offers competitive with the rest of the industry?
▪ Improve Operational Reliability
- How can we better prepare for weather or other operational interruptions?
- How can we manage the fleet better and ensure spare parts are where they need to be?
Industry Ideas – Customer Experience
Current Analytics Environment
▪ Two Main Data Warehouse Platforms
- Teradata – mature data platform, in place for 20+ years. Dedicated team of 25+ people. ACID compliance allowing for updates. Most ETL here tightly coupled with platform.
- Hortonworks Platform – emerging technology. Economical data science. Data lake friendly. Community and support frameworks changing faster than the more mature Teradata. Log parsing. Unstructured data and streaming message friendly. Schema-on-read.
- How to get these to play together nicely?
▪ Enterprise Analytics Team Skills
- Very comfortable with SQL – jobs and dashboarding.
- Not so comfortable with parallel processing and APIs.
- Dependency on Hive.
Current Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs (merch, seat browsing, etc.)
ETL
Systems of Truth:
ETL
Challenge #1 – Data Analytics / Science Where The Data Ain’t
▪ Bookings & flight schedule constantly in motion – all captured in real time in Teradata
- New state = current state + change
▪ 24 hr lagging snapshot refreshes for data science?
- Teradata not optimized for “give me what changed yesterday” – especially in <k,v> situations.
- Extra bookkeeping on the Teradata (TD) side to enable offload for data science?
▪ Straight from the source into the data lake?
- ACID tables on the Hortonworks side? Write optimization compromises read.
- Updates may not be able to keep up with the stream – Hive concurrency model.
- Stream to raw, batch process after it lands on disk? Introduces latency.
▪ Pass-through queries?
- Still uses Teradata resources – spool space.
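The "new state = current state + change" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach used at United: the change-record layout (key, operation, value), the operation names, and the function name `apply_changes` are all invented for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change record: (key, operation, value).
# "upsert" inserts or overwrites a key; "delete" removes it.
Change = Tuple[str, str, Optional[dict]]


def apply_changes(current: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Return the new state: the current state plus an ordered batch of changes."""
    new_state = dict(current)  # shallow copy, so the previous snapshot is left untouched
    for key, op, value in changes:
        if op == "upsert":
            new_state[key] = value
        elif op == "delete":
            new_state.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return new_state


# Yesterday's snapshot of bookings, keyed by a made-up record locator.
snapshot = {
    "ABC123": {"flight": "UA2032", "seat": "12A"},
    "DEF456": {"flight": "UA0100", "seat": "3C"},
}

# Changes captured since the snapshot, in arrival order (all values invented).
delta = [
    ("ABC123", "upsert", {"flight": "UA2032", "seat": "14C"}),  # seat change
    ("DEF456", "delete", None),                                 # cancellation
    ("GHI789", "upsert", {"flight": "UA0555", "seat": "22F"}),  # new booking
]

today = apply_changes(snapshot, delta)
print(sorted(today))  # ['ABC123', 'GHI789']
```

Answering "give me what changed yesterday" amounts to producing `delta` cheaply, which is the part the slide flags as hard on the Teradata side.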
Challenge #1A – Structuring Data Big Data Side
▪ Bookings & flight schedule – mature relational model with (heavy) secondary indexing
- Needs to be queried from multiple directions
- LLAP cache of bookings and flight schedule? Enough space in RAM?
- De-normalized data model
• Not practical in a lot of cases.
- Partitioning, bucketing, ACID.
• Hive concurrency model: read blocks write and write blocks read. Complicates job scheduling.
So What’s Working?
▪ Data sync Teradata -> Hive – QueryGrid (Teradata)
- Pass-through queries vs data replication
- For replication, 4 – 5 patterns practical:
• ‘Small’ data sets
• ‘Large’ data sets where new data is append only and immutable
(Think appending yesterday as a new partition)
• ‘Large’ data sets where new data changes ‘small’ number of existing partitions
(Think yesterday’s changes can affect data going back a full year)
- Works even better if full year is partitioned by month, rather than by day. (create new)
• ‘Large’ data sets accessed in a <k,v> manner. (ACID)
- May need to re-partition a bucketed data set to allow time series queries
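The third replication pattern (yesterday's changes touching a small number of existing partitions) can be illustrated with a short Python sketch. It assumes each changed row carries the business date it belongs to; the row layout and the function names `month_partition` and `group_by_partition` are hypothetical. The point is that a monthly layout maps many changed days onto few partitions to rebuild.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Hypothetical changed row: (business date the row belongs to, row payload).
ChangedRow = Tuple[date, dict]


def month_partition(d: date) -> str:
    """Partition key for a monthly layout, e.g. '2018-06'."""
    return f"{d.year:04d}-{d.month:02d}"


def group_by_partition(rows: Iterable[ChangedRow]) -> Dict[str, List[dict]]:
    """Group changed rows by the monthly partition that would need rebuilding."""
    affected: Dict[str, List[dict]] = defaultdict(list)
    for business_date, payload in rows:
        affected[month_partition(business_date)].append(payload)
    return dict(affected)


# Yesterday's changes, reaching back several months (all values invented).
changes = [
    (date(2018, 6, 11), {"pnr": "ABC123"}),
    (date(2018, 6, 30), {"pnr": "DEF456"}),
    (date(2017, 12, 24), {"pnr": "GHI789"}),
]

to_rebuild = group_by_partition(changes)
print(sorted(to_rebuild))  # ['2017-12', '2018-06']
```

With daily partitions the same three rows would touch three partitions; with monthly partitions they touch two, each of which can be created new and swapped in.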
Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs (merch, seat browsing, etc.)
ETL
Systems of Truth:
ETL
QG Option #1: replicate data. Queries served using only HDP resources.
Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs (merch, seat browsing, etc.)
ETL
Systems of Truth:
ETL
QG Option #2: database link. Queries served using Teradata resources.
So What’s Working?
▪ Longer Term - Platform Independent ETL - Nifi
- Nifi – stateless streaming, and stateful streaming where latency can be tolerated.
• Append only to disk + consolidation job
- Common ingestion layer
- Need connectors from operational systems. Not always easy due to ‘operations’
[Diagram callouts: Option to buffer here, or run compaction job external to Nifi. Cosmetic enrichment, or can also be replaced with a custom (k,v) parser.]
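The "append only to disk + consolidation job" pattern can be sketched as follows. This is a plain-Python illustration of the compaction step only, not Nifi code: the record layout (key, ISO-8601 event time, value), the function name `compact`, and the sample gate values are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical appended record: (key, event time as a sortable ISO-8601 string, value).
Record = Tuple[str, str, dict]


def compact(batches: Iterable[List[Record]]) -> List[Record]:
    """Merge append-only batches, keeping only the latest record per key."""
    latest: Dict[str, Record] = {}
    for batch in batches:
        for record in batch:
            key, event_time, _ = record
            kept = latest.get(key)
            # >= so that, on equal timestamps, the later-arriving record wins.
            if kept is None or event_time >= kept[1]:
                latest[key] = record
    return sorted(latest.values(), key=lambda r: r[0])


# Three small batches as they landed on disk (all values invented).
batch_1 = [("UA2032", "2018-06-11T17:00:00", {"gate": "B22"})]
batch_2 = [
    ("UA2032", "2018-06-11T20:00:00", {"gate": "B24"}),
    ("UA0100", "2018-06-11T18:30:00", {"gate": "C1"}),
]
batch_3 = [("UA0100", "2018-06-11T18:00:00", {"gate": "C9"})]  # late-arriving, older record

compacted = compact([batch_1, batch_2, batch_3])
print([(key, ts) for key, ts, _ in compacted])
# [('UA0100', '2018-06-11T18:30:00'), ('UA2032', '2018-06-11T20:00:00')]
```

Comparing on event time rather than arrival order is a design choice here: it lets a late-arriving older message be ignored, at the cost of trusting the source timestamps.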
So What’s NOT Working (yet)?
▪ Data sync Teradata -> Hive – QueryGrid (Teradata)
- ‘Large’ data sets where new data changes ‘large’ number of existing partitions.
- Leveraging QG’s pass-through query abilities here.
▪ Platform Independent ETL
- Streaming stateful messages
• Customized C++ code / Teradata
• Hortonworks Data Flow, Apache Apex, Apache Flink, Nifi + HBase, Spark micro batching.
- Enterprise message bus - issues
• Not designed with analytics in mind
• No schema registry
Target Architecture – Other Considerations
▪ Security
- Common security strategy with Teradata - GDPR
• Groups defined in Active Directory based on access needs, users assigned to them.
• Groups and users replicated to Teradata and Apache Ranger
• Database roles / permissions defined and reviewed on each platform
▪ Governance
- Looking for a (reasonably priced) solution covering both platforms.
- Apache Atlas – Traceability through Hive, Nifi, HDFS, and Spark (soon) is encouraging.
- May have to resort to custom development using APIs
Target Architecture Data Lake / Curated Layer
[Architecture diagram; labels as extracted]
- Batch sources: FTP, SCP
- Enterprise Message Bus (JMS sources: Apache Kafka, IBM MQ Series, Tibco EMS)
- Stateless / Stateful High Latency Tolerant Common Ingestion Layer
- Stateful, Low Latency Ingestion Layer
- State Store
- Apache Nifi
- Data Lake: Hortonworks (ORC on HDFS)
- Spark ETL
- Curated Layer: Teradata, Hortonworks
- Advanced Analytics / ML / Data Science
- Analytics / KPI Dashboards
- SQL Spark, SAS, R, etc
Analytics Environment
Systems of Record:
- Logs
- Operations
- Customer / Loyalty
- Supply Chain
- Bookings
Systems of Truth:
[Diagram labels as extracted]
- Batch sources: FTP, SCP
- Enterprise Message Bus
- Stateless / Stateful High Latency Tolerant Ingestion Layer
- Stateful, Low Latency Ingestion Layer
- Platform Independent ETL
- ???
- Raw Data Lake
- Curated Layer
- Flight Narrative
- Trip Narrative Active
- Trip Narrative History
Use Case: Flight Narrative
LAX – ORD UA 2032 06/11/18 11:00pm
All events that can be tied to a unique flight are stored as time-series JSON objects:
<T, E, [<k,v>,<k,v>…]>
Event timeline (timestamps paired with events in slide order):
02/01/18 – 1:00pm    Added to schedule
05/01/18 – 2:30pm    Aircraft assigned (737-800) #0523
06/02/18 – 10:15am   Equipment change 737-800 #0215
06/02/18 – 10:20am   Seat reaccommodation (click to see impact)
06/09/18 – 11:20am   Crew schedule finalized
06/10/18 – 9:00pm    Gate assignment B22
06/11/18 – 5:00pm    Departure change 11:22pm (Late Inbound Crew)
06/11/18 – 8:00pm    MRD released
06/11/18 – 11:00pm   Boarding begins
06/11/18 – 11:25pm   Catering
06/11/18 – 11:27pm   Boarding ends
06/11/18 – 11:28pm   Last bag scanned
06/11/18 – 11:32pm   Out/Off/Taxi
06/12/18 – 5:30am    On/In/Taxi
06/12/18 – 6:05am    Bags delivered to claim
Inflight Stats: Altitude, Temperature, Wind, Fuel
Catering: Catering Arrival Time, Catering Inventory, Catering Sign off time
Crew List: Pilot, Flight Attendants
Bag Data: Gate Checked Bags (Predicted/Actual), Bulkhead Timeout, # of Checked Bags, First/Last Bag Scanned on board, First/Last Bag Scanned to baggage claim
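The <T, E, [<k,v>,<k,v>…]> shape described on this slide could be modeled roughly as below. This is a sketch under assumptions, not the stored format: the JSON field names ("T", "E", "kv"), the class name `FlightEvent`, and the timestamps attached to each event are illustrative only.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class FlightEvent:
    """One <T, E, [<k,v>, ...]> entry: timestamp, event name, key/value attributes."""
    timestamp: datetime
    event: str
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to a JSON object; the field names are assumptions."""
        return json.dumps(
            {"T": self.timestamp.isoformat(), "E": self.event, "kv": self.attributes},
            sort_keys=True,
        )


def narrative(events: List[FlightEvent]) -> List[FlightEvent]:
    """Return the events for one flight in chronological order."""
    return sorted(events, key=lambda e: e.timestamp)


# A few events for one flight, deliberately out of order (timestamps illustrative).
events = [
    FlightEvent(datetime(2018, 6, 11, 23, 0), "Boarding begins"),
    FlightEvent(datetime(2018, 2, 1, 13, 0), "Added to schedule"),
    FlightEvent(datetime(2018, 6, 10, 21, 0), "Gate assignment", {"gate": "B22"}),
]

for e in narrative(events):
    print(e.to_json())
```

Keeping the attributes as a free-form key/value map means new event types can be added without a schema change, which fits the schema-on-read style mentioned earlier in the deck.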
Use Case: Trip Narrative
• Trip Narrative is a chronological collection of events that define a customer’s experience:
Phases: Pre-Travel, Day-of-Travel, Post-Travel
Ticket Issued
Schedule Change
Itinerary Change
Ancillary Purchase
Return to Blocks
Denied Boarding
Bag Delivered to Claim
Rebooked on OA
Cleared Standby
In/Out/On/Off
Upgrade Cleared
Flight Status Notification Sent
Mis-connect
Satisfaction Survey Submitted
Bag File Opened
Flight Delayed / Cancelled
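A trip narrative, as a chronological collection of customer events split into Pre-Travel, Day-of-Travel, and Post-Travel, might be assembled along these lines. The sketch assumes a trip with a single travel day and classifies events purely by calendar date; the function names and sample timestamps are hypothetical, and a real multi-segment itinerary would need a richer rule.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple

TripEvent = Tuple[datetime, str]  # (timestamp, event name)


def phase(event_time: datetime, travel_day: date) -> str:
    """Classify an event relative to the (assumed single) day of travel."""
    if event_time.date() < travel_day:
        return "Pre-Travel"
    if event_time.date() == travel_day:
        return "Day-of-Travel"
    return "Post-Travel"


def trip_narrative(events: List[TripEvent], travel_day: date) -> Dict[str, List[str]]:
    """Order events chronologically and bucket them by travel phase."""
    buckets: Dict[str, List[str]] = {"Pre-Travel": [], "Day-of-Travel": [], "Post-Travel": []}
    for event_time, name in sorted(events):
        buckets[phase(event_time, travel_day)].append(name)
    return buckets


# Events for one customer trip, out of order (timestamps invented).
events = [
    (datetime(2018, 6, 12, 9, 0), "Satisfaction Survey Submitted"),
    (datetime(2018, 5, 1, 14, 30), "Ticket Issued"),
    (datetime(2018, 6, 11, 23, 27), "Upgrade Cleared"),
    (datetime(2018, 6, 11, 20, 0), "Flight Status Notification Sent"),
]

result = trip_narrative(events, date(2018, 6, 11))
print(result["Day-of-Travel"])
# ['Flight Status Notification Sent', 'Upgrade Cleared']
```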
Q & A
We’re hiring!
- Data Engineers
- Data Scientists

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 


Big data at United Airlines

  • 1. Big Data At United Airlines. Joe Olson, Senior Manager, Big Data Analytics. DataWorks Summit San Jose - June 2018
  • 2. Agenda
      - Data Landscape at United
      - Current Big Data Analytics Environment
      - Target Big Data Analytics Environment
      - A Few Big Data Analytics Use Cases
  • 3. About United Airlines
      - ~ 750 aircraft, with 250+ on order (supply chain)
      - 148M passengers in 2017 (public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
      - 4500 daily departures (scheduling, operations, weather, route planning)
      - 338 airports served, in 49 countries (baggage claim, check-ins)
      - 86,000 employees (scheduling, pay)
      - Constantly in motion! Future (and past) always changing.
      - A data scientist / data engineer dream.
      - Source: https://hub.united.com/corporate-fact-sheet/
  • 4. Goals Of The Enterprise Analytics Platform
      - Improve Customer Experience
          - How can we reduce friction when booking a reservation or maneuvering through an airport?
          - How can we deliver a consistent message across all channels (mobile app, web site, social media, etc.)?
      - Improve Employee Experience
          - How can we keep employees better informed of the current situation so they can relay it to customers?
          - What are we learning from our surveys about what the customer base says is / isn't working?
      - Revenue Generation
          - What personalized offers can we make to our customers?
          - Are our offers competitive with the rest of the industry?
      - Improve Operational Reliability
          - How can we better prepare for weather or other operational interruptions?
          - How can we manage the fleet better and ensure spare parts are where they need to be?
  • 5. Industry Ideas – Customer Experience
  • 6. Current Analytics Environment
      - Two Main Data Warehouse Platforms
          - Teradata – mature data platform, in place for 20+ years. Dedicated team of 25+ people. ACID compliance allowing for updates. Most ETL here is tightly coupled with the platform.
          - Hortonworks Platform – emerging technology. Economical data science. Data lake friendly. Community and support frameworks changing faster than the more mature Teradata. Log parsing. Friendly to unstructured data and streaming messages. Schema-on-read.
          - How to get these to play together nicely?
      - Enterprise Analytics Team Skills
          - Very comfortable with SQL: jobs and dashboarding.
          - Not so comfortable with parallel processing and APIs.
          - Dependency on Hive.
  • 7. Current Analytics Environment (diagram labels)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - ETL
      - Systems of Truth
      - ETL
  • 8. Challenge #1 – Data Analytics / Science Where The Data Ain't
      - Bookings & flight schedule constantly in motion, all captured in real time in Teradata
          - New state = current state + change
      - 24 hr lagging snapshot refreshes for data science?
          - Teradata is not optimized for "give me what changed yesterday", especially in <k,v> situations.
          - Extra bookkeeping on the Teradata side to enable offload for data science?
      - Straight from the source into the data lake?
          - ACID tables on the Hortonworks side? Write optimization compromises reads.
          - Updates may not be able to keep up with the stream (Hive concurrency model).
          - Stream to raw, then batch process after it lands on disk? Introduces latency.
      - Pass-through queries?
          - Still use Teradata resources (spool space).
  • 9. Challenge #1A – Structuring Data Big Data Side
      - Bookings & flight schedule: mature relational model with (heavy) secondary indexing
          - Needs to be queried from multiple directions
          - LLAP cache of bookings and flight schedule? Enough space in RAM?
          - De-normalized data model
              - Not practical in a lot of cases.
          - Partitioning, bucketing, ACID.
              - Hive concurrency model: read blocks write and write blocks read. Complicates job scheduling.
  • 10. So What's Working?
      - Data sync Teradata -> Hive: QueryGrid (Teradata)
          - Pass-through queries vs data replication
          - For replication, 4 – 5 patterns are practical:
              - 'Small' data sets
              - 'Large' data sets where new data is append only and immutable (think appending yesterday as a new partition)
              - 'Large' data sets where new data changes a 'small' number of existing partitions (think yesterday's changes can affect data going back a full year)
                  - Works even better if the full year is partitioned by month, rather than by day. (create new)
              - 'Large' data sets accessed in a <k,v> manner. (ACID)
                  - May need to re-partition a bucketed data set to allow time series queries
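The month-versus-day remark on slide 10 can be illustrated with a small sketch. It is not United's or QueryGrid's code: the function names and the sample dates are hypothetical, and it only counts how many partitions a batch of changed records would force a replication job to rebuild under each partitioning scheme.

```python
from datetime import date

def partition_key(d: date, by: str = "month") -> str:
    """Return the partition a record dated `d` falls into (hypothetical scheme)."""
    return d.strftime("%Y-%m") if by == "month" else d.isoformat()

def partitions_to_rebuild(changed_dates, by: str = "month"):
    """Distinct partitions touched by a batch of changed records, sorted."""
    return sorted({partition_key(d, by) for d in changed_dates})

# Made-up batch: yesterday's changes touch flights spread across a year.
changes = [
    date(2017, 7, 3), date(2017, 7, 19),
    date(2018, 1, 2),
    date(2018, 6, 10), date(2018, 6, 11),
]

by_day = partitions_to_rebuild(changes, by="day")      # one partition per changed day
by_month = partitions_to_rebuild(changes, by="month")  # changed days collapse into months
```

With daily partitions every changed date is its own partition to recreate; with monthly partitions several changed dates fall into the same partition, so fewer (though larger) partitions are rewritten.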
  • 11. Analytics Environment (diagram labels)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - ETL; Systems of Truth; ETL; QG
      - Option #1: replicate data. Queries served using only HDP resources.
  • 12. Analytics Environment (diagram labels)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - ETL; Systems of Truth; ETL; QG
      - Option #2: database link. Queries served using Teradata resources.
  • 13. So What's Working?
      - Longer Term
          - Platform Independent ETL: NiFi
          - NiFi: stateless streaming, and stateful streaming where latency can be tolerated.
              - Append only to disk + consolidation job
          - Common ingestion layer
          - Need connectors from operational systems. Not always easy due to 'operations'.
      - Diagram callouts
          - Option to buffer here, or run compaction job external to NiFi
          - Cosmetic enrichment, or can also be replaced with a custom (k,v) parser
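The "append only to disk + consolidation job" idea on slide 13 can be sketched as a compaction pass that keeps the latest record per key. This is a minimal illustration under stated assumptions, not the talk's implementation: the `Record` type, the use of `None` as a delete marker, and the sample booking keys are all hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional

class Record(NamedTuple):
    key: str                # e.g. a booking identifier (hypothetical)
    ts: int                 # event time, e.g. epoch seconds
    value: Optional[dict]   # None marks a delete (an assumption of this sketch)

def consolidate(records: Iterable[Record]) -> Dict[str, dict]:
    """Collapse an append-only log into the latest surviving value per key."""
    latest: Dict[str, Record] = {}
    for rec in records:
        cur = latest.get(rec.key)
        # Keep the newest record; on equal timestamps the later arrival wins.
        if cur is None or rec.ts >= cur.ts:
            latest[rec.key] = rec
    # Drop keys whose latest record is a delete marker.
    return {k: r.value for k, r in latest.items() if r.value is not None}

# Made-up append-only log, in arrival order.
log = [
    Record("PNR1", 1, {"seat": "12A"}),
    Record("PNR2", 2, {"seat": "3C"}),
    Record("PNR1", 3, {"seat": "14C"}),   # later update to PNR1
    Record("PNR2", 4, None),              # PNR2 cancelled
]
current = consolidate(log)
```

The streaming side stays cheap because it only appends; the cost of resolving updates moves into the batch consolidation, which is where the latency mentioned on the slide comes from.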
  • 14. So What's NOT Working (yet)?
      - Data sync Teradata -> Hive: QueryGrid (Teradata)
          - 'Large' data sets where new data changes a 'large' number of existing partitions.
          - Leveraging QG's pass-through query abilities here.
      - Platform Independent ETL
          - Streaming stateful messages
              - Customized C++ code / Teradata
              - Hortonworks Data Flow, Apache Apex, Apache Flink, NiFi + HBase, Spark micro-batching.
          - Enterprise message bus issues
              - Not designed with analytics in mind
              - No schema registry
  • 15. Target Architecture – Other Considerations
      - Security
          - Common security strategy with Teradata
          - GDPR
              - Groups defined in Active Directory based on access needs, users assigned to them.
              - Groups and users replicated to Teradata and Apache Ranger
              - Database roles / permissions defined and reviewed on each platform
      - Governance
          - Looking for a (reasonably priced) solution covering both platforms.
          - Apache Atlas: traceability through Hive, NiFi, HDFS, and Spark (soon) is encouraging.
          - May have to resort to custom development using APIs
  • 16. Target Architecture: Data Lake / Curated Layer (diagram labels)
      - Batch sources: FTP, SCP
      - Enterprise Message Bus (JMS sources: Apache Kafka, IBM MQ Series, Tibco EMS)
      - Stateless / Stateful High Latency Tolerant Common Ingestion Layer
      - Stateful, Low Latency Ingestion Layer
      - State Store
      - Apache NiFi
      - Spark ETL
      - Data Lake: Hortonworks (ORC on HDFS)
      - Curated Layer: Teradata, Hortonworks
      - Advanced Analytics / ML / Data Science: Spark, SAS, R, etc.
      - Analytics / KPI Dashboards: SQL
  • 17. Analytics Environment (diagram labels)
      - Systems of Record: Logs; Operations; Customer / Loyalty; Supply Chain; Bookings
      - Systems of Truth
      - Batch sources: FTP, SCP
      - Enterprise Message Bus
      - Stateless / Stateful High Latency Tolerant Ingestion Layer
      - Stateful, Low Latency Ingestion Layer
      - Platform Independent ETL ???
      - Raw Data Lake
      - Curated Layer: Flight Narrative; Trip Narrative Active; Trip Narrative History
  • 18. Use Case: Flight Narrative
      - Flight: LAX – ORD, UA 2032, 06/11/18 11:00pm
      - All events that can be tied to a unique flight are stored as time series JSON objects: <T, E, [<k,v>,<k,v>…]>
      - Timeline events (in slide order): Added to schedule; Aircraft assigned (737-800) #0523; Equipment change 737-800 #0215; Seat reaccommodation (click to see impact); Crew schedule finalized; Gate assignment B22; Departure change 11:22pm (Late Inbound Crew); MRD released; Boarding begins; Catering; Boarding ends; Last bag scanned; Out/Off/Taxi; On/In/Taxi; Bags delivered to claim
      - Timeline timestamps (in slide order): 02/01/18 1:00pm; 05/01/18 2:30pm; 06/02/18 10:15am; 06/02/18 10:20am; 06/09/18 11:20am; 06/10/18 9:00pm; 06/11/18 5:00pm; 06/11/18 8:00pm; 06/11/18 11:00pm; 06/11/18 11:25pm; 06/11/18 11:27pm; 06/11/18 11:28pm; 06/11/18 11:32pm; 06/12/18 5:30am; 06/12/18 6:05am
      - Inflight Stats: Altitude; Temperature; Wind; Fuel
      - Catering: Catering Arrival Time; Catering Inventory; Catering Sign off time
      - Crew List: Pilot; Flight Attendants
      - Bag Data: Gate Checked Bags (Predicted/Actual); Bulkhead Timeout; # of Checked Bags; First/Last Bag Scanned on board; First/Last Bag Scanned to baggage claim
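The <T, E, [<k,v>,<k,v>…]> shape on slide 18 can be sketched as a list of JSON-serializable events sorted by time. This is only an illustration: the field names `t`, `e`, and `kv`, the helper names, and the pairing of timestamps with events are assumptions (the slide's layout does not survive in this transcript), not the schema used at United.

```python
import json
from datetime import datetime

def make_event(ts: datetime, event: str, **attrs) -> dict:
    """Build one <T, E, [<k,v>...]> entry; field names are hypothetical."""
    return {"t": ts.isoformat(), "e": event, "kv": attrs}

def narrative(events):
    """Order events chronologically. ISO-8601 strings built from naive
    datetimes sort in time order, so a plain string sort is enough here."""
    return sorted(events, key=lambda ev: ev["t"])

# Event names and attribute values come from the slide; timestamps are illustrative.
events = [
    make_event(datetime(2018, 6, 11, 17, 0), "Departure change",
               new_departure="23:22", reason="Late Inbound Crew"),
    make_event(datetime(2018, 5, 1, 14, 30), "Aircraft assigned",
               type="737-800", tail="0523"),
    make_event(datetime(2018, 6, 10, 21, 0), "Gate assignment", gate="B22"),
]

flight = {"flight": "UA 2032", "route": "LAX-ORD", "events": narrative(events)}
as_json = json.dumps(flight, indent=2)
```

Keeping the per-event attributes in an open key/value map lets very different event types (gate changes, catering, bag scans) share one time series without a fixed relational schema.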
  • 19. Use Case: Trip Narrative
      - Trip Narrative is a chronological collection of events that define a customer's experience.
      - Phases: Pre-Travel; Day-of-Travel; Post-Travel
      - Events shown: Ticket Issued; Schedule Change; Itinerary Change; Ancillary Purchase; Return to Blocks; Denied Boarding; Bag Delivered to Claim; Rebooked on OA; Cleared Standby; In/Out/On/Off; Upgrade Cleared; Flight Status Notification Sent; Mis-connect; Satisfaction Survey Submitted; Bag File Opened; Flight Delayed / Cancelled
  • 20. Q & A. We're hiring!
      - Data Engineers
      - Data Scientists