1. Extending Data Lake using the Lambda Architecture
June 2015
Dr. William Kornfeld – R&D Director, Think Big, a Teradata company
Peyman Mohajerian – UDA Architecture COE, Teradata
3. Introduction
• What does it mean to be a real-time architecture?
• What are the use cases that real-time architecture serves?
• When would it be a mistake to use a real-time architecture?
• What are useful design patterns for implementing real-time architectures (including lambda)?
4. What is “Real Time”?
[Diagram: Data In → Data Store → Info Out]
Generally means something is happening in seconds, not minutes or hours.
5. What is “Real Time”?
[Diagram: Data In → Data Store → Info Out; label: Push or Pull]
Generally means something is happening in a second or so, not minutes or hours.
6. What is “Real Time”?
[Diagram: Data In → Data Store → Info Out; label: Push or Pull]
Generally means something is happening in a second, give or take, not minutes or hours.
For purposes of this talk, “Real Time” is measured from Data In through Info Out.
7. Two General Classes of Information for Storage and Retrieval
Atomic: The significant component of each individual message coming in is stored.
Example:
- Individual prescription records to be retrieved.
Aggregate: Each of the messages coming in contributes to one or more aggregates.
Example:
- Number of prescriptions for penicillin on June 9, 2015
8. Atomic Retrieval
• Question to ask: If a new message comes in, do I need to be able to see or react to it nearly immediately?
• Case 1: A message represents a doctor ordering a prescription.
• Case 2: A message represents a student completing the SAT with a certain score.
9. Aggregate Retrieval
• Some aggregate types make sense in real time as an instantaneous snapshot at the present moment.
• The “real time” value of some aggregate types is really an estimate of the value of something at some indeterminate time in the past.
• Some aggregate types lose their meaning as real-time values.
• Some real-time processes can be enabled by batch aggregates.
10. Aggregates with Instantaneous Meaning in Real Time
• Includes sums and counts.
• Examples:
− Dollars of revenue earned so far today
− Number of prescriptions for penicillin written today
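A minimal sketch of why sums and counts suit real-time updating: each incoming message changes the aggregate with a single increment, so the stored value is exact at every instant. The message fields (`drug`, `day`) and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical prescription messages; the field names are assumptions.
messages = [
    {"drug": "penicillin", "day": "2015-06-09"},
    {"drug": "amoxicillin", "day": "2015-06-09"},
    {"drug": "penicillin", "day": "2015-06-09"},
]

counts = Counter()

def on_message(msg):
    """Update the running count for (drug, day) as each message arrives."""
    counts[(msg["drug"], msg["day"])] += 1

for m in messages:
    on_message(m)

print(counts[("penicillin", "2015-06-09")])  # 2
```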
11. Aggregates Whose Current Value May Not Be an Accurate Reflection of What Is Happening NOW
• Includes aggregates which are ratios.
• Examples:
− Click-through rate on an ad
− Conversion rate on an email marketing campaign
− Percent of prescriptions filled
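One way to see the problem with ratios, under the assumption (not stated on the slide) that numerator events such as clicks arrive some time after the denominator events such as impressions. The impression times and click delays below are hypothetical.

```python
# Hypothetical impressions: (minute shown, minutes until click or None if never clicked).
impressions = [
    (0, 5), (1, None), (2, 8), (3, None),   # older impressions
    (9, 6), (10, 4), (11, None), (12, 7),   # recent impressions
]

def ctr_at(now, impressions):
    """Naive real-time click-through rate: clicks seen so far / impressions seen so far."""
    shown = [(t, d) for t, d in impressions if t <= now]
    clicked = [1 for t, d in shown if d is not None and t + d <= now]
    return len(clicked) / len(shown)

naive = ctr_at(12, impressions)     # clicks on recent impressions have not arrived yet
final = ctr_at(1000, impressions)   # every click has arrived

print(naive, final)  # 0.25 0.625
```

The value read at minute 12 counts every recent impression in the denominator but none of its eventual clicks, so it describes a mix of past states and not the present moment.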
13. Aggregates that Have no Instantaneous Meaning
• Includes Unique User Counts
• Well-defined meaning only on intervals
[Diagram: stream of user events: Joe, Ken, Sue, Fred, Jane, Bob, Joe, Ken, Joe, Fred, Joe]
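A short sketch using the names from the slide: a unique user count only has a value once an interval is chosen, and the counts for two adjacent intervals do not add up to the count for the combined interval. The split point between the two intervals is a hypothetical choice.

```python
# The event stream from the slide, in arrival order.
events = ["Joe", "Ken", "Sue", "Fred", "Jane", "Bob",
          "Joe", "Ken", "Joe", "Fred", "Joe"]

def unique_users(events, start, end):
    """Number of distinct users among events[start:end]."""
    return len(set(events[start:end]))

first = unique_users(events, 0, 6)    # first interval
second = unique_users(events, 6, 11)  # second interval
whole = unique_users(events, 0, 11)   # both intervals together

print(first, second, whole)  # 6 3 6
```

Because 6 + 3 is not 6, the per-interval results cannot be summed the way revenue or prescription counts can; the combined interval has to be recomputed from the underlying user identities.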
14. Real Time Aggregate Update Can be Significantly More Expensive Than Batch
[Diagram: Web Server feeding the aggregates PC/Male, PC/Female, Mac/Male, Mac/Female, PC, Mac, Male, Female, and Everyone]
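A sketch of where the extra cost can come from, assuming the nine aggregates in the diagram are all rollups of two dimensions (platform and gender). In real time, every event writes to each rollup it belongs to; in batch, the finest-grained cells are counted once and then rolled up. The `"*"` wildcard standing for "all" and the sample events are illustrative choices.

```python
from collections import Counter
from itertools import product

def rollup_keys(platform, gender):
    """Every aggregate a single (platform, gender) event contributes to."""
    return list(product((platform, "*"), (gender, "*")))

events = [("PC", "Male"), ("Mac", "Female"), ("PC", "Female"), ("PC", "Male")]

# Real time: each event triggers one write per rollup (4 writes per event here).
realtime = Counter()
writes = 0
for platform, gender in events:
    for key in rollup_keys(platform, gender):
        realtime[key] += 1
        writes += 1

# Batch: count the finest-grained cells once, then roll the cells up.
cells = Counter(events)
batch = Counter()
for (platform, gender), n in cells.items():
    for key in rollup_keys(platform, gender):
        batch[key] += n

print(writes)                # 16
print(realtime[("*", "*")])  # 4, the "Everyone" aggregate
print(realtime == batch)     # True
```

With d dimensions each event touches 2^d rollups in the real-time path, while the batch rollup work scales with the number of distinct cells and not with the number of events.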
16. Real Time Processes that Use Batch Aggregates
[Diagram: Data, Model (Periodically Rebuild), Web Server]
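A minimal sketch of the pattern in the diagram, assuming a model that is rebuilt from the full data set on a schedule and read by the web server on every request. The `PopularityModel` class, its methods, and the sample purchase log are hypothetical.

```python
from collections import Counter

class PopularityModel:
    """Hypothetical model: the most frequently purchased items."""

    def __init__(self):
        self.top_items = []

    def rebuild(self, purchase_log, k=2):
        """Batch step: recompute the model from the full data set."""
        self.top_items = [item for item, _ in Counter(purchase_log).most_common(k)]

    def recommend(self):
        """Real-time step: a cheap read of the last batch result."""
        return self.top_items

model = PopularityModel()
model.rebuild(["a", "b", "a", "c", "a", "b"])  # e.g. run on a schedule
print(model.recommend())                       # ['a', 'b']
```

The request path never touches the raw data, so response time stays low even though the model itself is only as fresh as the last rebuild.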
17. Suppose your Information Can be Real Time, Should You Use a Real Time Architecture?
[Diagram: Real World, Big Data System]
Do you need to know about or react to changes in the Real World within a couple of minutes of the changes?
18. Real Time vs. Batch
• There are use cases for both batch and real-time data processing.
• Batch tools are more stable and less subject to frequent revision.
• Real-time architectures can be significantly more expensive.
• Many systems will have some of each.
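A toy sketch of how a system can have "some of each" in the lambda style: a batch view recomputed periodically from an append-only log, a speed view covering only events since the last batch run, and a query that merges the two. The `LambdaCounter` class is hypothetical and deliberately ignores events that arrive while a batch run is in progress.

```python
from collections import Counter

class LambdaCounter:
    """Toy lambda-style count: batch view plus a real-time delta."""

    def __init__(self):
        self.master = []              # append-only log of all events
        self.batch_view = Counter()   # recomputed periodically from the log
        self.speed_view = Counter()   # events since the last batch run

    def ingest(self, key):
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch and reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        """Serve the merged result of both views."""
        return self.batch_view[key] + self.speed_view[key]

views = LambdaCounter()
for key in ["penicillin", "penicillin", "aspirin"]:
    views.ingest(key)
views.run_batch()                 # batch view now holds the first three events
views.ingest("penicillin")        # arrives after the batch run
print(views.query("penicillin"))  # 3
```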
22. Real-Time Use Cases
Lambda Architecture
- Medical: Patient Critical Care
Event Driven Architecture
- Marketing: Customer Engagement
23. Why Big Data?
Challenges in Medical Data
• Health data tends to be “wide”, not “deep”
• New data types are becoming more important
− Unstructured
− Real-time streaming
• Moving from retrospective “BI” viewing to event-based and predictive analytics usage is a general challenge
• Multiple layers
− Lots of events, data
• Complex
− Lots of different languages and data structures
• Difficult to maintain
− Lots of moving pieces/components/technologies
− Lots of changes in the business
24. Project
Optimize an existing Natural Language Processing pipeline
in support of critical Colorectal Surgery
(Move to tens of thousands of documents processed)
Replace an existing free-text search facility used by Clinical
Web Service for cancer
(Move search to milliseconds)
26. Operational Statistics
• Current Storm throughput up to 1.5 million documents per hour
• Average of 140,000 HL7 messages actually processed per day, with average latency of 60 milliseconds from ingest to persistence
• Average of 50,000 documents passed through annotators per day, versus 5,000 historically
• Actual annotations of documents up to 6 times faster than previously accomplished
• Free-text search use cases that took over 30 minutes on old infrastructure completing in milliseconds in ElasticSearch
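To put these figures on a common per-second scale, the arithmetic below converts the quoted hourly and daily numbers. It uses only the values on the slide; the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

peak_docs_per_sec = 1_500_000 / SECONDS_PER_HOUR  # peak Storm throughput
avg_hl7_per_sec = 140_000 / SECONDS_PER_DAY       # average HL7 messages processed
annotator_ratio = 50_000 / 5_000                  # daily annotator volume, new vs. historical

print(round(peak_docs_per_sec, 1))  # 416.7
print(round(avg_hl7_per_sec, 2))    # 1.62
print(annotator_ratio)              # 10.0
```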
27. Applications Deliver the Company’s Brand and Customer Experience
[Diagram: The Customer, with Social Media, Marketing Channels, Mobile Apps, and Devices & Form-factors]
• The entirety of applications combines to deliver the full customer experience
• Today they are mostly designed in a siloed manner
• Applications are not designed to solicit and extract customer experience data well
• At the core of application design should be the considerations for obtaining and delivering information about the customer experience
28. The Customer Experience Universe
[Diagram: customer event timeline with markers at Day 1, Day 3, Day 7, Day 17, Day 21, and Day 25, grouped into an IM Campaign Fragment, an Email Campaign Fragment, and a Customer Services Fragment. Events as listed: PaidSearch, LandingPage, CreateAccount, TXN, AttachedCC, EmailSent, EmailOpened, EmailLinkClicked, EmailClicked, AccountLogin, BannerAd1Impression, BannerAd2Impression, AddBank, EmailSent, EmailSent, TXN, AccountLogin, HelpCenter, EnterDispute, C.S.EmailSent, EmailOpened, EmailLinkClicked, HelpCenterHP, DisputePage, VirtualAgent, CallsIntoIVR, IVR:DisputeWorkflow, TransferredtoAgent, DisputeResolved, C.S.SurveyEmailed. Surrounding labels: Social Media, The Customer, Marketing Channels, Mobile Apps, Devices & Form-factors.]
A universe of customer experience data:
• Create threads
• Build graphs
• Identify patterns
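The three steps above can be sketched on a handful of records. The event names come from the slide; the customer IDs, the day values, and the record layout are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (customer_id, day, event) records.
events = [
    ("c1", 1, "PaidSearch"), ("c1", 1, "LandingPage"), ("c1", 1, "CreateAccount"),
    ("c2", 1, "PaidSearch"), ("c2", 1, "LandingPage"),
    ("c1", 3, "EmailSent"), ("c1", 3, "EmailOpened"),
    ("c2", 3, "EmailSent"),
]

# Create threads: one time-ordered event sequence per customer.
# sorted() is stable, so events on the same day keep their input order.
threads = defaultdict(list)
for customer, day, name in sorted(events, key=lambda e: (e[0], e[1])):
    threads[customer].append(name)

# Build a graph: count transitions between consecutive events in each thread.
edges = Counter()
for seq in threads.values():
    edges.update(zip(seq, seq[1:]))

# Identify patterns: the most common transition across all customers.
print(edges.most_common(1))  # [(('PaidSearch', 'LandingPage'), 2)]
```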
The actual HL7 processing volume is driven by “pull” requests from users, not by the available processing power
HL7 messages are large XML-based documents
Much larger than, say, JSON or other formats (roughly 800k-900k in size)
They contain significant data related to medical information
End goal
An architecturally-driven, internally-owned technology stack that blends:
An event-based processing fabric
A real-time processing framework
A multi-destination distillation hub
“Classic” BI delivery techniques
“Services-based” delivery techniques
A “serendipitous” discovery environment
Mutually supportive components that combine in delivering novel clinical solutions.
How the business looks to the customer
The customer experiences the company across the entirety of applications that the company has developed and deployed. More than anything else, applications represent the brand of the company.
Most applications are not designed to solicit and extract customer experience data well. There are 2 major ways data is obtained from applications:
Web-site tagging
Very detailed logging data for engineers, for application development and application operational performance
One is too aggregated and difficult to administer; the other is too engineering-oriented.
Furthermore, applications are designed in isolation and mostly are not designed with the experience across other applications and channels in mind. Stitching the customer experience across multiple applications is difficult.
The problem is big
7 sources by client
Ability to customize for the consumer
Ingestion: depending on the type of source, Teradata (TD) has IP. Basically there are 2 types of sources: streaming and batch. For streaming, TD Listener will be the advocated solution; for batch, Think Big (TB) has 2 pieces of IP for ingestion: Light-weight ingestion (LWI) and Buffer Server.
Light-weight ingestion (LWI) is for large 3rd-party files like Omniture. Instead of having to FTP the Omniture file to a landing server, LWI connects directly to the FTP source, pulls the file, and lands it in HDFS in time-partitions.
Buffer Server is a set of IP designed to ingest large numbers of small files, concatenate them into large files that are more Hadoop-friendly, and land them in HDFS time-partitions.
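The Buffer Server idea can be illustrated with a small sketch. This is not the actual IP: it uses the local filesystem as a stand-in for HDFS, and the hourly `YYYY/MM/DD/HH` partition layout, the `part-00000` file name, and both function names are assumptions.

```python
import os
import tempfile
from collections import defaultdict
from datetime import datetime, timezone

def partition_for(ts):
    """Hourly time partition for a UNIX timestamp (the layout is an assumption)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d/%H")

def buffer_and_land(small_files, out_root):
    """Concatenate (timestamp, path) small files into one file per time partition."""
    groups = defaultdict(list)
    for ts, path in small_files:
        groups[partition_for(ts)].append(path)
    landed = {}
    for part, paths in groups.items():
        out_dir = os.path.join(out_root, part)
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, "part-00000")
        with open(out_path, "wb") as out:
            for p in paths:
                with open(p, "rb") as src:
                    out.write(src.read())
        landed[part] = out_path
    return landed

# Demo: three small files, two in the first hour and one in the second.
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for i, ts in enumerate([0, 60, 3600]):
        path = os.path.join(tmp, f"small-{i}.log")
        with open(path, "w") as f:
            f.write(f"event {i}\n")
        inputs.append((ts, path))
    landed = buffer_and_land(inputs, os.path.join(tmp, "landing"))
    merged = {}
    for part, p in landed.items():
        with open(p) as f:
            merged[part] = f.read()

print(sorted(merged))  # ['1970/01/01/00', '1970/01/01/01']
```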
Event Processing & Repository
TB has designed (but not yet implemented) 2 pieces of IP in this area
Event Processing: built using M/R, it converts the incoming data sources into event objects in 3 processing steps: pre-pend an event header, pre-pend an event type header, and resolve the incoming ID (cookie, GUID, customer, email address, etc.) to a specific customer. It populates event records into HBase. The Event Processing Engine processes both streaming and batch sources
Event Repository is an HBase schema that is the central storage for all events
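The three processing steps can be sketched in plain Python. This is an illustration only, not the designed M/R job: the header fields, the `ID_TO_CUSTOMER` lookup, the row-key layout, and the in-memory dictionary standing in for the event repository are all assumptions.

```python
import time

# Hypothetical lookup from any known identifier to a customer key.
ID_TO_CUSTOMER = {
    "cookie:abc123": "cust-1",
    "email:joe@example.com": "cust-1",
}

def to_event(record, source, event_type, id_field, now=None):
    """Wrap a raw record as an event object in three steps."""
    ingested_at = now if now is not None else time.time()
    header = {"source": source, "ingested_at": ingested_at}  # 1. event header
    type_header = {"event_type": event_type}                 # 2. event type header
    raw_id = f"{id_field}:{record[id_field]}"
    customer = ID_TO_CUSTOMER.get(raw_id)                    # 3. resolve ID to a customer
    return {"header": header, "type": type_header, "customer": customer, "body": record}

repository = {}  # in-memory stand-in for the event repository table

def store(event):
    """Persist the event under a (customer, time) row key."""
    row_key = (event["customer"], event["header"]["ingested_at"])
    repository[row_key] = event
    return row_key

evt = to_event({"cookie": "abc123", "url": "/landing"},
               source="web", event_type="LandingPage", id_field="cookie", now=0.0)
print(store(evt))  # ('cust-1', 0.0)
```

Keying rows by customer first is one way to keep each customer's events adjacent, which suits the later step of stitching events into per-customer threads.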
Dashboard Engine
TB has built IP that allows quickly building KPIs from the Event Repository. Using a UI, a developer can quickly aggregate metrics into an HBase schema, on top of which tools like Tableau can run optimally
Guided, Metadata-driven Discovery Event Analytics