Artur Borycki - Beyond Lambda - how to get from logical to physical - code.talk.2015
1. Beyond Lambda - how to get from logical to physical
Artur Borycki
Director Technology & Innovations
2. Simplification & Efficiency
Teradata believes in the principles of self-service,
automation and on-demand resource allocation.
These enable faster, more efficient and more
effective data application development and operation.
3. What is the Lambda Architecture?
Background
• Reference architecture for Big Data systems
• Designed by Nathan Marz (Twitter)
• Defined as a system that runs arbitrary functions on
arbitrary data
• “query = function(all data)”
Design Principles
• Human fault tolerance, data immutability, recomputation
Lambda Layers
• Batch - Contains the immutable, constantly growing master dataset.
• Speed - Deals only with new data and compensates for the high latency updates of the
serving layer.
• Serving - Loads and exposes the combined view of the data so that it can be queried.
4. Active Executor Lambda Framework
• An immutable sequence of records is captured and fed into a batch system and a stream processing
system in parallel.
• You implement your transformation logic twice, once in the batch system and once in the stream
processing system.
• You stitch together the results from both systems at query time to produce a complete answer.
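To make the pattern concrete, here is a minimal Python sketch of the idea (the names and data shapes are invented for illustration; this is not any particular framework's API). The same counting logic is implemented twice, and the query stitches the two views together, echoing "query = function(all data)":

def batch_view(master_dataset):
    # Recomputed periodically over the full immutable dataset (high latency).
    counts = {}
    for event in master_dataset:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

def speed_view(recent_events):
    # The same logic, applied only to events the last batch run has not seen.
    counts = {}
    for event in recent_events:
        counts[event["key"]] = counts.get(event["key"], 0) + 1
    return counts

def query(key, batch, speed):
    # Stitch both views together at query time to produce a complete answer.
    return batch.get(key, 0) + speed.get(key, 0)

master = [{"key": "click"}, {"key": "click"}, {"key": "view"}]
recent = [{"key": "click"}]
print(query("click", batch_view(master), speed_view(recent)))  # -> 3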
6. Lambda alternative – Kappa? (Jay Kreps – LinkedIn)
Unlike the Lambda Architecture, in this approach you only do
reprocessing when your processing code changes.
1. Use Kafka or some other system that
will let you retain the full log of the data
you want to be able to reprocess and
that allows for multiple subscribers. For
example, if you want to reprocess up to
30 days of data, set your retention in
Kafka to 30 days.
2. When you want to do the reprocessing,
start a second instance of your stream
processing job that starts processing
from the beginning of the retained data,
but direct this output data to a new
output table.
3. When the second job has caught up,
switch the application to read from the
new table.
4. Stop the old version of the job, and
delete the old output table.
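A hedged Python sketch of this recipe, assuming the kafka-python client; the topic, table names, and helper functions are placeholders, not a prescribed implementation:

from kafka import KafkaConsumer  # pip install kafka-python

def transform(raw_bytes):
    # New version of the processing code (placeholder logic).
    return raw_bytes.decode("utf-8").upper()

def write(table, row):
    # Stand-in for writing to the job's output table.
    print(f"{table}: {row}")

def run_job(group_id, output_table):
    # A fresh consumer group with auto_offset_reset="earliest" starts this
    # job instance from the beginning of the retained log (step 2).
    consumer = KafkaConsumer(
        "events",                          # hypothetical topic name
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
    )
    for record in consumer:
        write(output_table, transform(record.value))

# The old job keeps feeding "results_v1" while the reprocessing instance
# fills "results_v2". Once it has caught up, switch readers to "results_v2"
# (step 3), then stop the old job and drop the old table (step 4).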
7. Real-time Maturity
Typical path for a customer
Customers typically go through four stages on their
path to real-time analysis.
The evolution typically starts with trying to visualize
results or reports more frequently. This leads to the
realization that the underlying data is not refreshed
often enough. The next stage of maturity is to capture
and ingest data more quickly. Once data is flowing
faster, customers then try to process the data as it
flows. The final stage is to remove any human
intervention.
8. Beyond Lambda – Omega ;) (Artur's vision)
[Diagram: events/interactions, all data streams, and other feeds flow to consumers of information and to discovery, advanced analytics, data binding, and reporting]
• We need events that require actions and interactions without much of the analytics
• We need events that require action but also need to be enhanced by the analytics in the ecosystem (based on other information sources)
• We need events that will be handled later or that support the cases above
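Since Omega is a vision rather than a shipped framework, the following Python sketch merely illustrates routing between the three event classes above; every name in it is hypothetical:

def act(event):
    print("acting on", event)

def enrich(event, analytics):
    # Pull supporting context from other information sources (placeholder).
    return {**event, "context": analytics.get(event["type"], {})}

def archive(event):
    print("deferred for later handling:", event)

def route(event, analytics):
    if event["needs_action"] and not event["needs_context"]:
        act(event)                      # class 1: act without analytics
    elif event["needs_action"]:
        act(enrich(event, analytics))   # class 2: enhance with analytics, then act
    else:
        archive(event)                  # class 3: handle later / supporting

route({"type": "click", "needs_action": True, "needs_context": False}, {})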
9. The Teradata UDA
UNIFIED DATA ARCHITECTURE
[Diagram: applications on top of the integrated data warehouse (Teradata Database), the integrated discovery platform (Teradata Aster Database), and the data platform (Teradata Portfolio for Hadoop), with real-time processing through a listening framework and an app framework, each exposed via RESTful APIs; security and workload management span the stack]
10. Key Services, Libraries & Templates
[Diagram: a sample app ("BEST APP EVER!!") surrounded by the services below]
Data Service APIs
Access Data on Teradata, Aster,
Hadoop via API calls
Logging
Push and store events about the app to
UDA logging services
Ingest / Streaming
Stream data into UDA and build
applications on near real-time data
Scheduling / Orchestration
Scheduling services allow devs to
build workflows and connect apps.
Search & Metadata
Expose search capabilities in your
app via UDA level search services.
WebKit
A toolbox of UI templates,
visualizations and JavaScript libraries
Package/Deploy & Publish
A simple package and deployment
application to launch your app in the
AppCenter ecosystem
Operate
Leverage monitoring & alerting
services to keep track of app health.
The UDA is a concept, but it also serves as a development platform.
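As an illustration of the Ingest/Streaming service above, a Python sketch of pushing one event into the UDA over REST; the endpoint, payload shape, and auth header are assumptions for illustration, not the documented Listener API:

import json
import urllib.request

event = {"app": "best-app-ever", "action": "login", "ts": "2015-10-01T12:00:00Z"}
req = urllib.request.Request(
    "https://listener.example.com/v1/messages",   # assumed ingest endpoint
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "token <source-secret>"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)   # a 2xx status means the event entered the stream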
11. Instead of a single monolithic database
Monolith
A monolithic application puts all of its functionality into a single process and scales by replicating the monolith on multiple servers.
Microservices
A microservices architecture puts each element of functionality into a separate service and scales by distributing these services across servers.
Decoupled Services
12. Scale by distributing services and replicating as needed
Think Microservice, not Monolithic
13. Teradata Data Services
Access and move data between systems through service APIs
[Diagram: an app makes a REST API call to the UDA data services layer, which sends the query to the Teradata infrastructure for execution and sends the response back]
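From the app side, the request/response flow above might look like the following Python sketch; the endpoint and JSON contract are hypothetical, shown only to make the pattern tangible:

import json
import urllib.request

payload = {"system": "teradata", "query": "SELECT COUNT(*) FROM sales"}
req = urllib.request.Request(
    "https://uda.example.com/dataservices/v1/query",  # assumed endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))   # e.g. {"rows": [[42]]}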
14. QueryGrid – Remote Execution & Data Movement
Foreign Table Select – Pass-Through
SELECT *
FROM FOREIGN TABLE (
  SELECT parse_url(refer, 'HOST') AS host,
         v.key AS key,
         ts AS session_ts,
         v.val,
         count(*) AS count
  FROM http_inline
  LATERAL VIEW explode(str_to_map(parse_url(refer, 'QUERY'), '&', '=')) v AS key, val
  WHERE parse_url(refer, 'QUERY') IS NOT NULL
  GROUP BY parse_url(refer, 'HOST'), v.key, v.val
)@hdp21 hdp_dpi
WHERE session_ts = current_date;
Pushes the foreign grammar to the remote system.
Hadoop: permits a Hive/Impala query for data reduction on non-partitioned columns.
Import
SELECT source, session
FROM clickstream@Hadoop_sysd
WHERE session_ts = '2013-01-01';
Can be used to:
– "Insert/select" and "create table as" to instantiate data locally.
– Joins are always possible with local tables.
Export
INSERT INTO cust_loc@Hadoop_sysd
SELECT cust_id, cust_zip
FROM cust_data
WHERE last_update = current_date;
Move data from Teradata to Hadoop, and/or other data stores.
15. The Data Lake – Customer slide
• This is not skating to where the puck is going to be; it's skating to the puck.
– Your CIO should be sitting you on the bench if you are not doing this already
Most Data Lakes Today
Passive, cheap storage
• Really only using HDFS
• Limited data governance
• Staging data
• Archiving data
• DW offload (cost drivers)
The Data Lakes We Should Be Building
Active, balanced nodes
• Using the full Hadoop stack+
• Good data governance
• Good information architecture
• Processing and enhancing data
• Data applications (flexibility drivers)
16. New Architecture
• Information architectures are distributed
– Focus on data and business questions, not integrating separate systems
• Application architectures are variable
– Don’t force applications into a single architecture
• Applications are Loosely Coupled
– DW is an application
– BI is an application (or many)
– Data applications are everywhere!
• But let’s be smart about it
– Still need strong information architecture and data management practices
– Still need to reduce complexity and make strategic choices on technology
20. Move data between systems & access through service APIs
[Diagram: apps connect through a service layer and data pipeline to the infrastructure (TD 6xxx, TD 1xxx, Aster, Hadoop 1, Hadoop 2, Listener), interconnected by QueryGrid]
21. Customer example – Integration Flow
• User starts a Workflow from the UI which has a single Pig Job.
• Azkaban Web requests that the Azkaban Executor start a new Pig Job.
• The Pig Job makes a REST call to the TemplateModule to render the Pig Template.
• The TemplateModule fetches config values from the ConfigModule if needed by the template. The ConfigModule in turn fetches config values either from the PCF Data Schema or from external systems.
• The TemplateModule renders the Pig Template and returns a complete Pig Script (a toy sketch of this step follows the diagram below).
• The Pig Job executes the Pig Script against the Hadoop cluster.
• During execution the Pig Job makes REST calls to the EventModule, informing it of its progress.
• As the Job progress is updated, Vert.x updates the Azkaban Web UI in real time.
• When the Pig Job has completed, it makes a REST call to the AuditModule to log its completion. The AuditModule in turn stores auditing information in the PCF Data Schema.
• Finally, the Pig Job returns its execution status back to the Azkaban Executor.
[Diagram: Azkaban Web and the Azkaban Executor (backed by MySQL) run the Pig Job against Hadoop; the Azkaban Bridge, Config, Template, Teradata, Event, and Audit services exchange JSON over REST and Kafka; Pig templates, Pig scripts, and audit data live in the PCF schema]
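A toy Python sketch of the template-rendering step in the flow above; the module names come from the slide, but the template, config values, and interfaces are invented:

from string import Template

PIG_TEMPLATE = Template("""
raw = LOAD '$input_path' USING PigStorage(',');
grp = GROUP raw ALL;
cnt = FOREACH grp GENERATE COUNT(raw);
STORE cnt INTO '$output_table';
""")

def config_module(keys):
    # Stand-in for the ConfigModule, which would consult the PCF Data
    # Schema or external systems; here the values are hard-coded.
    values = {"input_path": "/data/raw/events", "output_table": "daily_counts"}
    return {k: values[k] for k in keys}

def template_module():
    # The TemplateModule renders the template with the fetched config and
    # returns a complete Pig script for the Pig Job to execute.
    return PIG_TEMPLATE.substitute(config_module(["input_path", "output_table"]))

print(template_module())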
23. Customer – Docker services
[Diagram: Nginx fronting the Azkaban services, alongside LogStash, Tessera/Graphite, a Consul cluster, and Ambassadord]
Container | Third Party | Used For
Nginx | No | Front-end web server/proxy for all the other UIs.
Vert.x | No | Application server.
Azkaban | Yes | Workflow management for Hadoop, Teradata, etc.
Tessera/Graphite | No | Aggregating and displaying application and system level metrics.
LogStash | Yes | Aggregating and displaying application and system level logs.
Consul | Yes | Distributed key-value store used for service discovery.
Ambassadord | Yes | Makes it easier for Docker containers to access services hosted in other Docker containers.
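For the Consul row above, a container can register its service with the local Consul agent through Consul's HTTP API (PUT /v1/agent/service/register); the service details here are illustrative:

import json
import urllib.request

service = {
    "Name": "template-service",   # hypothetical service name
    "Port": 8080,
    "Check": {"HTTP": "http://localhost:8080/health", "Interval": "10s"},
}
req = urllib.request.Request(
    "http://localhost:8500/v1/agent/service/register",
    data=json.dumps(service).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)
# Other containers can then discover it, e.g. via Consul's DNS interface:
#   dig @localhost -p 8600 template-service.service.consul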
24. Tap into the power of the platform without duplicating effort
Easily Access UDA
[Diagram: your analytic app sits on a microservices framework exposing Aster, TD, and Hive data services, auth services, and more]
25. Extract, Load & Transform in the Layered Architecture
UDA & the LDA
[Diagram: the layered data architecture. Level 0: Staging (1:1 source systems); Level 1: Integration (integrated model at lowest granularity); Level 2: Aggregation (business unit specific rollups); Level 3: Calculation (key performance indicators). Data is extracted and loaded into staging, then transformed up the layers; Listener feeds the stack, and AppCenter apps such as Business Health, Workload Analytics, Member Segment Engine, Category Sales, and Daily Financials consume it]