Hear why SQL on Hadoop is the future of analytics. Mike Xu, Looker Data Architect and Evangelist, will share how recent updates to SQL query engines like Spark and Presto are finally allowing companies to harness Hadoop's processing power for analytics. Looker Data Analyst Eric Feinstein shares how a top 10 health insurance company built an in-cluster data platform using Looker to make all their data in Hadoop accessible to thousands of analysts and business users across the company every day.
In this webinar you will:
• Learn the fundamentals of modeling out of Hadoop, including Spark
• Hear best practices for navigating joins
• See case study demonstrations
SQL on Hadoop for Enterprise Analytics
1. A LITTLE BIT OF HISTORY
Everything old is new again.
SQL Forever.
2. The story so far
Why hasn’t SQL died yet?
It’s 2016 and we’re still using it?!
3. Everything old is new again
Existing architecture keeps reappearing
It takes time to figure out what tools are right for what jobs
SQL is still the best tool for business analytics
33. What’s next?
~2020?
“If you have an architecture where you’re periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu.” – Todd Lipcon
37. Why Consider a Big Data Pipeline?
You are rapidly exceeding the limits of your existing database
Everything on your website can be analyzed
Waiting until the next day isn't for you
Data comes and goes to many places, and you want one process for it
38. Big Data Culture
Summary data is not good enough
Company is mandating new technologies
You want to build a data-driven culture
Big SQL is the heart of a data-driven culture
39. CASE STUDY
A major healthcare provider wants to create a web event pipeline that:
Massive scaling: scales during periods of healthcare registration and new coverage start, and can dial back the rest of the year
Large data volumes: 10-15M customers' worth of data. Provides data for analysis in under 1 minute.
AND utilizes existing in-house technologies (such as Cloudera Impala)
Page loads
Registrations
Logins
Errors
All events processed
44. Spark vs Storm
Two of the major players in data streaming/processing
Spark:
• Own master server
• Runs on HDFS
• Micro-batching (illustrated below)
• Exactly-once delivery (eliminates vulnerability)
Storm:
• Not native to Hadoop
• Less developed
• One event at a time
• ETL in flight
• Sub-second latency
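To make the micro-batching point concrete, here is a minimal PySpark Streaming sketch (not from the deck; the socket host and port are placeholders): Spark collects events into fixed-interval batches, one RDD per interval, rather than handling each event as it arrives the way Storm does.

```python
# Minimal micro-batching sketch (hypothetical socket feed, not the production pipeline).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="microbatch-demo")
ssc = StreamingContext(sc, batchDuration=60)  # Spark groups events into 60-second batches

# A socket source stands in for the real event feed in this sketch.
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()  # prints the size of each micro-batch

ssc.start()
ssc.awaitTermination()
```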
45. Flume
Diagram: multiple web servers feed a Flume agent, which manages the Source → Interceptor → Selector → Channel → Sinks flow into HDFS.
No in-flight transformation, so this just needs to meet the workload.
47. Flume vs. Kafka
Use both: out of the box with Flafka and native connectors.
Diagram: one path wires Source → Flume → Kafka → Spark through custom connectors between the stages; the Flafka path connects Flume to Kafka with the native Kafka source, which Spark then consumes.
48. Storing the output
Data can be queried via Hive, Impala, or Spark SQL
Cloudera is our Enterprise choice
We can process a subset in-stream with MLlib or other machine learning algorithms
Output summaries to other RDBMS systems
Our streaming Spark cluster consumes messages from Kafka. We batch these every minute into an HDFS cluster. We chose this approach because of the priorities covered on the following slides; a minimal sketch of the consume-and-land step follows.
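The sketch below assumes a hypothetical web_events topic, broker list, and HDFS path (the deck does not show the actual job). It uses the 2016-era spark-streaming-kafka direct stream; each one-minute micro-batch lands as a new HDFS directory that Hive, Impala, or Spark SQL can then query.

```python
# Sketch only: topic name, brokers, and HDFS path are assumptions.
# Requires the external spark-streaming-kafka package on the classpath.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="web-event-pipeline")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["web_events"],
    kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"},
)

# Keep only the message value; each one-minute batch is written as a new HDFS directory.
events = stream.map(lambda kv: kv[1])
events.saveAsTextFiles("hdfs:///data/web_events/batch")

ssc.start()
ssc.awaitTermination()
```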
51. Priority #1: Scale
Kafka is easy to scale. As more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool.
By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks.
52. Priority #2: Flexibility
Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer); see the sketch after this list
Add more data at any point (Flume, a Kafka producer, or directly to Spark)
Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER
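A sketch of that flexibility, with hypothetical topic names and paths: each event type lives on its own Kafka topic, and a given Spark Streaming application subscribes only to the topics it handles, so new event types or consumers can be added without touching the rest of the pipeline.

```python
# Sketch only: topic names, broker list, and HDFS paths are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="registration-login-consumer")
ssc = StreamingContext(sc, batchDuration=60)
brokers = {"metadata.broker.list": "broker1:9092"}

# This application handles only registrations and logins; page loads and errors
# would be consumed by separate applications subscribed to their own topics.
registrations = KafkaUtils.createDirectStream(ssc, ["registrations"], brokers)
logins = KafkaUtils.createDirectStream(ssc, ["logins"], brokers)

registrations.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/registrations/batch")
logins.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/logins/batch")

ssc.start()
ssc.awaitTermination()
```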
53. Priority #3: Speed
Analyzing the stream
Events per hour: identify missing batches
Volume and timing: right-sizing hardware
Duplicate events and missing information (see the query sketch below)
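A hedged sketch of what those checks can look like when run as Spark SQL over the landed data. The table and column names (web_events, event_id, event_ts) are assumptions; the same queries could just as well run through Impala or Hive from Looker.

```python
# Sketch only: table and column names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stream-checks")
         .enableHiveSupport()
         .getOrCreate())

# Events per hour: missing or unusually small hours point at missing batches.
events_per_hour = spark.sql("""
    SELECT date_format(event_ts, 'yyyy-MM-dd HH') AS event_hour,
           COUNT(*) AS events
    FROM web_events
    GROUP BY date_format(event_ts, 'yyyy-MM-dd HH')
    ORDER BY event_hour
""")

# Duplicate events: the same event_id landing more than once.
duplicates = spark.sql("""
    SELECT event_id, COUNT(*) AS copies
    FROM web_events
    GROUP BY event_id
    HAVING COUNT(*) > 1
""")

events_per_hour.show()
duplicates.show()
```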
54. Priority #4: In-house Technologies
Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, Explores, BI tools, and code libraries
Enable users to ask questions and query web data without writing SQL or knowing about the pipeline
56. Analyzing the stream
By connecting Looker to various points in the stream we can verify complete loads:
• Impala SQL
• Source logs
• Summary reports
We also mask the location of information; one dashboard may show a variety of reliable sources.
57. Other uses and benefits
Match data in flight to find bad user accounts
In-flight alerts for missing data
Analysis without needing to know the location in the stream
SQL on Hadoop BI solution doesn't require a new skillset