A LITTLE BIT OF
HISTORY
Everything old is new again.
SQL Forever.
The story so far
Why hasn’t SQL died yet?
It’s 2016 and we’re still using it?!
Everything old is new again
Existing architecture keeps reappearing
It takes time to figure out what tools are right for what jobs
SQL is still the best tool for business analytics
A long long time ago…
Growing pains
Late 1990
Database problems
Database outage
Data integrity issues
Data latency
Late 1990
Master Slave
Late 1990
Transactions
Late 1990
Performance
Late 1990
By the time I graduated, SQL was on its last legs
2009
Cache all the things!
2009
Stop copying Twitter!
2009
SQL golden age ends, NoSQL takes off
2010
Column Graph
Key-Value Document
NoSQL
2010
Awesome things about NoSQL
No SQL, normal languages as APIs!
Non relational!
FAST!
2010
Remember ORMs?
~2000
Active Record
~2000
ORMs 👎
2011
Remember EAV(Entity Attribute Value)?
1968
Kind of looks like columns…
1968
Modern EAV
2010
Tedious to query
2010
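A minimal sketch of why EAV is tedious to query, using Python's built-in sqlite3 as a stand-in database. The table name `eav`, its columns, and the sample rows are illustrative assumptions, not taken from the slides. Getting entities back as ordinary rows takes one conditional aggregate (or self-join) per attribute:

```python
import sqlite3

# Hypothetical EAV table: one row per (entity, attribute, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        (1, "name", "Ada"),
        (1, "city", "London"),
        (2, "name", "Grace"),
        (2, "city", "New York"),
    ],
)

# Every attribute that should appear as a column needs its own CASE expression.
pivot_sql = """
    SELECT entity_id,
           MAX(CASE WHEN attribute = 'name' THEN value END) AS name,
           MAX(CASE WHEN attribute = 'city' THEN value END) AS city
    FROM eav
    GROUP BY entity_id
    ORDER BY entity_id
"""
rows = conn.execute(pivot_sql).fetchall()
print(rows)
```

Adding a third attribute means editing the query again, which is the tedium the slide refers to.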
Voila!
2010
No joins is a feature!
2010
NoSQL has some rough bumps
2010
NoSQL has A LOT of rough bumps…
2011
Throwback Thursday!
2011
Lock the doors
2011
MPP columnar DBs! Wait... SQL is back?!
2015
Hadoop on SQL
2016
A long long time ago…
What’s next?
~2020?
What’s next?
~2020?
“If you have an architecture where you’re trying to periodically
trying to dump from one system to the other and synchronize,
you can simplify your life quite a bit by just putting your data in
this storage system called Kudu.” – Todd Lipcon
SQL is far past hype
Fin
“If it ain’t broke, don’t fix it”
CUSTOMER STORY
Building an event analytics pipeline using Hadoop and Spark
Why Consider a Big Data Pipeline?
You are rapidly exceeding the limits of your existing database
Everything on your website can be analyzed.
Waiting until the next day isn’t for you
Data comes and goes to many places, and you want one process for it
Big DATA CULTURE
Summary data is not good enough
Company is mandating new technologies
You want to build a data-driven culture
Big SQL is the heart of a data-driven culture
CASE STUDY
A major healthcare provider wants to create a web event pipeline that:
Massive scaling: scales up during periods of healthcare registration and new coverage start, and can dial back the rest of the year
Large data volumes: handles 10-15M customers' worth of data and provides data for analysis in under 1 minute
AND utilizes existing in-house technologies (such as Cloudera Impala)
Pageloads
Registrations
Logins
Errors
All events processed
Solution: Build an event processing framework
Events
Event Collector Hadoop
?
High Level Process
Events
Event Collector
Message Processing
HDFS
Looker
To be designed
Why is Hadoop so hard?
Need to write in Java and Scala
We don’t have structure
Not easy to get data out into BI tools
Event collectors don’t tend to feed to HDFS out of the box
Typically follow a batch processing framework
Ingestion mechanism
Our ideal ingestion would have three key aspects:
Low latency
In-flight transformation and processing
Ability to populate multiple destinations
Spark vs Storm
VS
• Own master server
• Runs on HDFS
• Micro-batching
• Exactly-once delivery (eliminates vulnerability)
• Not native to Hadoop
• Less developed
• One at a time
• ETL in flight
• Sub-second latency
Two of the major players in data streaming/processing
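The difference between micro-batching and one-at-a-time processing in the comparison above can be sketched in plain Python. This is a toy illustration only, not Spark or Storm code; the function names, the `(timestamp_seconds, payload)` event tuples, and the 60-second interval are all assumptions.

```python
from itertools import groupby


def one_at_a_time(events):
    """Hand each event downstream on its own, as soon as it arrives."""
    return [[payload] for _, payload in events]


def micro_batches(events, interval_s):
    """Group time-sorted (timestamp, payload) events into fixed windows.

    Returns a list of (window_start_seconds, [payloads]) pairs.
    Assumes the events are already ordered by timestamp.
    """
    keyed = groupby(events, key=lambda e: int(e[0] // interval_s))
    return [
        (window * interval_s, [payload for _, payload in group])
        for window, group in keyed
    ]


# Hypothetical sample events: (seconds since start, payload).
events = [(0.5, "a"), (1.2, "b"), (59.9, "c"), (60.0, "d"), (125.0, "e")]
print(one_at_a_time(events))
print(micro_batches(events, 60))
```

With one-at-a-time handling, latency is bounded by per-event processing; with micro-batches, an event waits until its window closes, which is the trade-off behind the latency bullets.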
Flume
Source Interceptor Selector Channel Sinks
Managed by the Flume Agent
Web Server
Web Server
Web Server
Web Server Investor Channel
HDFS
No in-flight transformation, so this just needs to meet workload
KAFKA
Broker
Broker
Broker
Producer Broker Consumer
Producer
Producer
SparkStreaming
Other
ZooKeeper
Broker
Flume vs. Kafka
Use both: out-of-the-box with Flafka and native connectors
Flume
Kafka
Source
Spark
Custom
connector
Custom
connector
Flume KafkaSource Spark
Storing the output
Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into an HDFS cluster. We chose this because:
Data can be queried via Hive, Impala, or SparkSQL
Cloudera is our enterprise choice
We can process a subset in-stream with MLlib or other machine learning algorithms
Output summaries to other RDBMS systems
Final Result
Events
Event Collector
Kafka
Flume SparkSQL
Cloudera
Other storage
(RDBMS)
Other storage
(logs)
Pipeline Summary
Add data to any point of the pipeline
Kafka, Flume, Impala, Looker without many custom connectors
Pipeline includes additional sources like Teradata, Oracle
Add in-flight predictive model training and execution without significant additional processing time
Our pipeline provides several points for flexibility and meets our key priorities.
Priority # 1: Scale
Kafka is easy to scale. As more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool
By monitoring batch times in Looker on SparkSQL, we can alert when we need to scale up the cluster using Scheduled Looks
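The batch-time alert described above might look roughly like the following check. This is a plain-Python stand-in, not Looker or Spark code: the function names, the 60-second interval, the 80% headroom, and the "two slow batches" rule are all assumptions chosen for illustration.

```python
def slow_batches(batch_times_s, interval_s=60.0, headroom=0.8):
    """Return indexes of batches whose processing time exceeded
    headroom * interval_s, i.e. batches close to falling behind."""
    limit = interval_s * headroom
    return [i for i, t in enumerate(batch_times_s) if t > limit]


def should_scale_up(batch_times_s, interval_s=60.0, headroom=0.8, max_slow=2):
    """Flag the cluster for scaling once max_slow or more batches ran slow."""
    return len(slow_batches(batch_times_s, interval_s, headroom)) >= max_slow


# Hypothetical processing times (seconds) for five one-minute batches.
recent = [12.0, 30.5, 49.0, 55.2, 61.0]
print(slow_batches(recent))
print(should_scale_up(recent))
```

A batch that takes longer than the batch interval means the stream is falling behind, so alerting somewhat below that limit leaves time to add capacity.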
Priority #2: Flexibility
Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer)
Add more data at any point (Flume, Kafka producer, or directly to Spark)
Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
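As a rough sketch of parsing different events out to different topics, the routing step could be modeled as below. No Kafka client is used here; the topic names, the `type` field, and the fallback topic are hypothetical, and only the four event types come from the case study.

```python
from collections import defaultdict

# Hypothetical mapping from event type to topic name.
TOPIC_BY_EVENT_TYPE = {
    "pageload": "web.pageloads",
    "registration": "web.registrations",
    "login": "web.logins",
    "error": "web.errors",
}
DEFAULT_TOPIC = "web.other"  # assumed catch-all for unknown event types


def route(events):
    """Group events by destination topic so each topic can feed
    its own streaming application or other consumer."""
    topics = defaultdict(list)
    for event in events:
        topic = TOPIC_BY_EVENT_TYPE.get(event.get("type"), DEFAULT_TOPIC)
        topics[topic].append(event)
    return dict(topics)


sample = [
    {"type": "login", "user": "u1"},
    {"type": "error", "code": 500},
    {"type": "login", "user": "u2"},
    {"type": "heartbeat"},
]
routed = route(sample)
print(sorted(routed))
```

Keeping the mapping in one table means a new event type only needs a new entry and a consumer for its topic.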
Priority #3: Speed
Analyzing the stream
Events per hour
Identify missing batches
Volume and timing
Rightsizing hardware
Duplicate events
And missing information
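The checks listed above (events per hour, missing batches, duplicate events) are the kind of thing plain SQL handles well. The sketch below uses sqlite3 as a stand-in for Impala or SparkSQL; the `events` table, its columns, and the sample rows are assumptions, not the pipeline's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per event, tagged with its one-minute batch.
conn.execute("CREATE TABLE events (event_id TEXT, batch_minute INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("e1", 0), ("e2", 0), ("e2", 1), ("e3", 3)],
)

# Events per hour (integer division turns minutes into an hour bucket).
per_hour = conn.execute(
    "SELECT batch_minute / 60 AS hour, COUNT(*) FROM events "
    "GROUP BY hour ORDER BY hour"
).fetchall()

# Duplicate events: the same event_id delivered more than once.
duplicates = conn.execute(
    "SELECT event_id, COUNT(*) FROM events "
    "GROUP BY event_id HAVING COUNT(*) > 1"
).fetchall()

# Missing batches: minutes inside the observed range with no rows at all.
seen = {row[0] for row in conn.execute("SELECT DISTINCT batch_minute FROM events")}
missing = sorted(set(range(min(seen), max(seen) + 1)) - seen)

print(per_hour, duplicates, missing)
```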
Priority #4: In house Technologies
Provide access to Hadoop/Impala via
a centralized data hub:
A single place to access web-based reports, explores, BI tools and code libraries
Enable users to ask questions and
query web data without writing SQL or
knowing about the pipeline
Analyzing the stream
Looking for Lost data
=/=
Analyzing the stream
By connecting Looker to various points
in the stream we can verify complete
loads:
We also mask the location of
information, one dashboard may show
a variety of reliable sources.
• Impala SQL
• Source logs
• Summary reports
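One way to picture verifying complete loads across these points is to compare per-batch event counts at each stage against the earliest stage. The sketch below is an assumption-laden illustration: the stage names, batch labels, and counts are made up, and in practice each set of counts would come from a query at that point in the stream.

```python
def load_gaps(counts_by_stage):
    """Compare per-batch counts at every later stage against the first stage.

    counts_by_stage maps stage name -> {batch label: event count}.
    Returns {(stage, batch): number of events missing at that stage}.
    """
    stages = list(counts_by_stage)
    reference = counts_by_stage[stages[0]]
    gaps = {}
    for stage in stages[1:]:
        for batch, expected in reference.items():
            actual = counts_by_stage[stage].get(batch, 0)
            if actual != expected:
                gaps[(stage, batch)] = expected - actual
    return gaps


# Hypothetical counts gathered at three points in the stream.
counts = {
    "source_logs": {"12:00": 100, "12:01": 98},
    "impala": {"12:00": 100, "12:01": 90},
    "summary_reports": {"12:00": 100},
}
gaps = load_gaps(counts)
print(gaps)
```

An empty result means every stage saw the same number of events per batch as the source.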
Other uses and benefits
Match data in flight to
find bad user accounts
In flight alerts for
missing data
Analysis without
needing to know the
location in the stream
SQL on Hadoop BI
solution doesn’t
require new skillset
THANK YOU!

SQL on Hadoop for Enterprise Analytics

Editor's notes

  1. State thesis here
  2. Examples
  3. Examples
  4. 7 months
  5. 7 months
  6. This slide is not linear
  7. This slide is weird
  8. This is a solution when your source and endpoints have idiosyncrasies