Hadoop & Hive: How They Change the Data Warehousing Game Forever

Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com

2014 Hadoop Summit
San Jose, CA
June 3, 2014
The Truth about Data

"We think only 3% of the potentially useful data is tagged, and even less is analyzed."
Source: IDC Predictions 2013: Big Data, IDC

"90% of the data in the world today has been created in the last two years."
Source: IBM

In 2012, 2.5 quintillion bytes (2,500,000,000 GB) of data were generated every day.
Source: IBM

...and that was back in 2012.
The Broken Promise
What we wanted
What we got
Time for a new approach
           Relational DBs          Hadoop
Volume:    Write twice             Write once
Variety:   Structured              Semi-structured
Velocity:  Early transformation    Late transformation
[Architecture diagrams]
Traditional: Input Data → Hadoop → ETL → Data Marts → Query Engine → Visualizer
Simplified:  Input Data → Hadoop (Hive) → Visualizer
Case Study
Klout

15                  social networks processed daily
769                 TB of data storage
200,000             indexed users added daily
400,000,000         users indexed daily
12,000,000,000      social signals processed daily
50,000,000,000      API calls delivered monthly
10,080,000,000,000  rows of data in the data warehouse (trillions!)
Klout’s Data Architecture
A common question
How are users using our site?
Google Analytics
+ Great page-by-page analysis
- Sending user-identifiable data is against the EULA

MixPanel
+ Great event tracking
- Can't send user-specific identifiers; big limitations

Klout
+ All our data, telling us who these people actually are
- That's about it
Klout’s Event Tracker
{
"project": "plusK",
"event": "spend",
"ks_uid": 123456,
"type": "add_topic"
}
{
"project": "plusK",
"event": "spend",
"session_id": "0",
"ip": "50.68.47.158",
"kloutId": "123456",
"cookie_id": "123456",
"ref": "http://www.klout.com",
"type": "add_topic",
"time": "1338366015"
}
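The two payloads above show the capture design: the client emits only the minimal first JSON, and the collection tier enriches it with session, IP, referrer, and timestamp before it is logged as the second. A minimal Python sketch of that enrichment step (the enrich helper and its signature are hypothetical; field names follow the sample payloads):

```python
import json
import time

def enrich(raw_event, session_id, ip, ref):
    """Add server-side context to a minimal client event (hypothetical helper)."""
    event = dict(raw_event)                # copy the client-supplied fields
    event["session_id"] = session_id       # server-side session tracking
    event["ip"] = ip                       # request source address
    event["ref"] = ref                     # HTTP referrer
    event["time"] = str(int(time.time()))  # epoch seconds, as in the sample
    return event

client_event = {"project": "plusK", "event": "spend",
                "kloutId": "123456", "type": "add_topic"}
logged = enrich(client_event, session_id="0",
                ip="50.68.47.158", ref="http://www.klout.com")
print(json.dumps(logged))
```

The point of the split is that app developers only ever think about the first payload; everything else is appended uniformly downstream.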
EVENT_LOG
tstamp INT
project STRING
event STRING
session_id BIGINT
ks_uid BIGINT
ip STRING
attr_map MAP<STRING,STRING>
json_text STRING
dt STRING
hr STRING
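The table keeps the well-known fields as typed columns, puts everything else in attr_map, and retains the raw payload in json_text for late binding. A Python sketch of a mapper that splits one event into that shape (a hypothetical illustration of the column layout, not Klout's actual loader):

```python
import json

# Fields that become typed columns; everything else goes into attr_map.
FIXED = {"project", "event", "session_id", "ks_uid", "ip", "time"}

def to_row(event):
    """Split one JSON event into EVENT_LOG-style columns (hypothetical mapper)."""
    attrs = {k: str(v) for k, v in event.items() if k not in FIXED}
    return {
        "tstamp": int(event["time"]),
        "project": event["project"],
        "event": event["event"],
        "session_id": int(event.get("session_id", 0)),
        "ks_uid": int(event.get("ks_uid", 0)),
        "ip": event.get("ip", ""),
        "attr_map": attrs,               # MAP<STRING,STRING> analogue
        "json_text": json.dumps(event),  # raw payload kept for late binding
    }

sample = {"project": "plusK", "event": "spend", "session_id": "0",
          "ks_uid": 123456, "ip": "50.68.47.158", "time": "1338366015",
          "type": "add_topic"}
row = to_row(sample)
```

New event attributes land in attr_map and json_text automatically, so instrumentation changes never require a schema change.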
SELECT { [Measures].[Counter],
         [Measures].[PreviousPeriodCounter] }
ON COLUMNS,
NON EMPTY CROSSJOIN (
  exists([Date].[Date].[Date].allmembers,
         [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
  [Events].[Event].[Event].allmembers
) DIMENSION PROPERTIES MEMBER_CAPTION
ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[plusK]})
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.kloutId') as kloutId,
get_json_object(json_text,'$.v') as version,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND tstamp=20121027
AND event in ('api_error', 'api_timeout')
ORDER BY sid;
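get_json_object pulls individual fields out of the stored json_text string at query time, so new attributes need no schema change. A Python analogue of that extraction for the same kind of row (simple top-level key lookup only; Hive's version supports a fuller JSONPath syntax):

```python
import json

def get_json_object(json_text, path):
    """Top-level-only analogue of Hive's get_json_object('$.key')."""
    key = path.lstrip("$.")                  # '$.kloutId' -> 'kloutId'
    value = json.loads(json_text).get(key)
    return None if value is None else str(value)

row = '{"sid": "abc", "kloutId": "123456", "v": "2", "status": "timeout"}'
fields = {p: get_json_object(row, "$." + p)
          for p in ("sid", "kloutId", "v", "status")}
```

Each query pays the parsing cost at read time; that is the trade the late-transformation approach makes for never having to pre-model the payload.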
So, what’s wrong with this
picture?
Klout’s Data Architecture
Case Study
Online Gaming (MMO)
...
LogIn<TAB>1369155542<TAB>4533245<TAB>"loc":"23","rank":"Expert","client":"ios"<LF>
Buy<TAB>1369155556<TAB>4533446<TAB>"loc":"23","item":"212","ref":"ask.com","amt":"1.50"<LF>
...

Capture
Event Name | Timestamp | User ID | Attributes
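Each captured record is one tab-delimited line: event name, epoch timestamp, user ID, then a comma-separated attribute map. A small Python writer that renders this layout (an illustration of the format only; the function name is made up):

```python
def to_log_line(event, tstamp, user_id, attrs):
    """Render one event as the tab-delimited record Hive will read."""
    # Attributes as "key":"value" pairs, comma-separated (the collection delimiter).
    attr_field = ",".join('"%s":"%s"' % (k, v) for k, v in attrs.items())
    return "\t".join([event, str(tstamp), str(user_id), attr_field]) + "\n"

line = to_log_line("Buy", 1369155556, 4533446,
                   {"loc": "23", "item": "212", "ref": "ask.com", "amt": "1.50"})
```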
CREATE EXTERNAL TABLE event_log (
  event STRING,
  event_time TIMESTAMP,
  user_id INT,
  event_attributes MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING)  -- one partition per day; Hive cannot derive a
                            -- partition from day(event_time) in the DDL, so the
                            -- day value is assigned when a partition is added
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
LOCATION '/user/event_logs';
Event Name Timestamp User ID Attributes
Map
SELECT
  SUBSTR(FROM_UNIXTIME(event_time), 1, 7) AS MonthOfEvent,
  event_attributes["loc"] AS Location,
  count(*) AS EventCount
FROM event_log
WHERE year(FROM_UNIXTIME(event_time)) = 2014
GROUP BY SUBSTR(FROM_UNIXTIME(event_time), 1, 7), event_attributes["loc"]
Event Name Timestamp User ID Attributes
Transform and Query
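The HiveQL above does the transformation at read time: it derives the month from the raw timestamp and rolls counts up by month and location. The same late transformation, sketched in plain Python over (event_time, attributes) pairs, assuming epoch-second timestamps as in the capture format:

```python
from collections import Counter
from datetime import datetime, timezone

def monthly_location_counts(rows):
    """Group (event_time, attrs) rows by month and 'loc', like the Hive query."""
    counts = Counter()
    for event_time, attrs in rows:
        ts = datetime.fromtimestamp(event_time, tz=timezone.utc)
        if ts.year != 2014:                    # WHERE year(...) = 2014
            continue
        month = ts.strftime("%Y-%m")           # SUBSTR(FROM_UNIXTIME(...),1,7)
        counts[(month, attrs.get("loc"))] += 1
    return counts

rows = [(1396310400, {"loc": "23"}),  # 2014-04-01
        (1396396800, {"loc": "23"}),  # 2014-04-02
        (1369155556, {"loc": "23"})]  # 2013, filtered out
counts = monthly_location_counts(rows)
```

Nothing about the month or location was decided at load time; both are computed from the raw record when the question is asked.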
Hive for analytics
Why now?
New, Interactive Flavors

                                Shark        Impala      Stinger
Performance approach            Caching      Optimizer   Improve Hive
Theoretical limits (# of rows)  Billions     Trillions   Trillions
Supports UDFs, SerDes           Yes          Fall '14    Yes
Supports non-scalar data types  Yes          Fall '14    Yes
Preferred file format           Tachyon      Parquet     ORC
Sponsorship                     Databricks   Cloudera    Hortonworks
Hive is a cheap MPP database

TPC-H Query Run Times (Impala vs. HANA)
Line item table, 60 million rows. "Records" = rows returned; remaining columns are query times in seconds.

Select statement                                                      Records  HANA   Impala 1-node  Impala 3-node  Impala 1-node  Impala 3-node
                                                                               Small  Parquet        Parquet        Text           Text
select count(*) from lineitem                                               1      1              3              1             74             31
select count(*), sum(l_extendedprice) from lineitem                         1      4             12              3             73             29
select l_shipmode, count(*), sum(l_extendedprice) from lineitem
  group by l_shipmode                                                       7      8             23              5             74             28
select l_shipmode, count(*), sum(l_extendedprice) from lineitem
  where l_shipmode = 'AIR' group by l_shipmode                              1      1             20              4             73             28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice)
  from lineitem group by l_shipmode, l_linestatus                          14     10             32              7             74             28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice)
  from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F'
  group by l_shipmode, l_linestatus                                         1      1             27              5             72             29
select count(*) from lineitem where l_shipmode = 'AIR'
  and l_linestatus = 'F' and l_suppkey = 1                                 45      1             23              5             73             30
select l_shipmode, l_linestatus, l_extendedprice from lineitem
  where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1        45      1             29              5             73             31
select * from lineitem where l_shipmode = 'AIR'
  and l_linestatus = 'F' and l_suppkey = 1                                 45      1            104             21             73             30

Data size: HANA 1.9 GB (5 partitions); Parquet 3.2 GB (40 files x 80 MB); Text 7.2 GB (1 file, no compression)
Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium): $1022 / $175 / $350 / $175 / $350
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Real Customer Scenario: Impala (CDH5)
5 data node cluster, 16 GB of RAM each, 4 cores each, Parquet

Fact table sizes (# of rows): 520; 15,036; 55,676; 72,745,961; 121,964,466; 263,223,987; 378,706,328; 587,679,516; 995,761,863; 1,064,423,864; 1,174,737,467

Dimensions                                                               Execution time
1 Dimension (28 rows)                                                    0:00:00.836
2 Dimensions (28 rows, 28 rows)                                          0:00:00.767
2 Dimensions (28 rows, 2,926 rows)                                       0:00:00.660
3 Dimensions (28 rows, 28 rows, 2,926 rows)                              0:00:00.871
1 Dimension (2,926 rows)                                                 0:00:00.490
2 Dimensions (28 rows, 2,926 rows)                                       0:00:00.705
1 Dimension (28 rows)                                                    0:00:06.780
2 Dimensions (28 rows, 3,782 rows)                                       0:00:23.097
0 Dimensions                                                             0:00:52.074
1 Dimension (121,964,466 rows), count distinct                           0:00:55.861
1 Dimension (5,547,151 rows)                                             0:01:08.972
2 Dimensions (5 rows, 2,926 rows)                                        0:00:45.060
3 Dimensions (5 rows, 2,926 rows, 3,782 rows)                            0:01:39.119
0 Dimensions                                                             0:00:06.945
2 Dimensions (5 rows, 40,040 rows)                                       0:00:31.980
4 Dimensions (2 rows, 487,374 rows, 7,875,489 rows, 2,038,760 rows)      0:01:33.404
0 Dimensions                                                             0:00:12.854
1 Dimension (8,038 rows)                                                 0:00:24.083
1 Dimension (3,782 rows), 995,761,863-row fact table                     0:01:23.484
1 Dimension (28 rows)                                                    0:00:56.716
1 Dimension (5 rows)                                                     0:00:33.750
2 Dimensions (5 rows, 3,782 rows)                                        0:01:11.021
0 Dimensions                                                             0:00:32.854
2 Dimensions (3 rows, 371 rows)                                          0:00:54.329
Hive v Impala
TL;DW
(Too long; didn't watch)

DO                DO NOT
Capture Data      Aggregate Data
T (Transform)     ETL (Extract, Transform, Load)
Schema on Read    Schema on Load
Query in Place    Create Data Marts
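Schema on read, the core of the list above, fits in a few lines: land the raw JSON untouched and bind fields only when a question is asked. A minimal Python illustration (the sample records are invented for the demo):

```python
import json

# Raw landed data: no upfront transformation, no pre-built marts.
raw_log = [
    '{"project": "plusK", "event": "spend", "type": "add_topic"}',
    '{"project": "mobile-ios", "event": "api_error", "status": "500"}',
]

def query(raw, project):
    """Bind the schema at read time: parse and filter per question asked."""
    rows = (json.loads(line) for line in raw)
    return [r for r in rows if r.get("project") == project]

errors = query(raw_log, "mobile-ios")
```

Because nothing was aggregated or reshaped at load time, a new question (a new field, a new filter) never requires reloading the data.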
Thank you!
Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com

Editor's Notes

  1. Batman *Forever*
  2. Our ability to capture data has far exceeded our ability to analyze it. Traditional data warehousing tools have not kept pace with the growth of data. Hadoop allows us to capture and store data economically, but traditional BI tools and approaches don't work. IDC: "Currently a quarter of the information in the Digital Universe would be useful for big data if it were tagged and analyzed. We think only 3% of the potentially useful data is tagged, and even less is analyzed."
  3. Sad panda
  4. Happy panda!
  5. Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score..
  13. The Klout architecture is made up of open source tools. Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150 node Hadoop cluster with $3k commodity nodes to create a 1.5 petabyte warehouse. We used SQL Server Analysis Services, connected directly to our Hive data warehouse for providing an interactive query environment.
  14. Great page by page analysis, great reports, but couldn’t send user identifiable data
  15. Mixpanel has great support for real time events, but we couldn’t send all the necessary data to really draw interesting conclusions. Joining on data was still going to be a huge challenge.
  16. We had all our data, but of course, that was about it
  17. We couldn’t cross the streams. We wanted to discover really interesting patterns and make advanced recommendations based on who the user was.
  18. At Klout, we used web analytics tools like Google Analytics and Mixpanel to understand how our users interacted with our web site and mobile app. However, we could not join the usage data with our profile data. This made for an incomplete view of our users. We decided to build a flexible, event oriented architecture to capture all events for user activity. This is the architecture.
  19. First, we invented a simple, JSON-oriented event capture method. This allowed our web and app designers to add instrumentation without regard to how it would affect the downstream analytics applications or the Hive warehouse.
  20. Next, using Flume, we mapped the semi-structured data stream into time-partitioned files in Hadoop HDFS.
  21. We then created an EXTERNAL Hive table on top of this file structure. That allowed us to “query” the incoming files in HDFS.
  22. In order to provide an interactive query environment (OLAP), we connected SQL Server Analysis Services directly to the Hive warehouse and continuously updated a MOLAP cube with the data.
  23. We could then hook up internally developed applications (Event Tracker) to our data by having the applications generate MDX (a multidimensional query language) queries and run them against our cube.
  24. Or we could use the Hive CLI (command line interface) to execute queries using SQL directly against our Hive warehouse.
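An ad-hoc query from the Hive CLI against a table shaped like the EVENT_LOG slide might look like this sketch (the table and partition names are assumptions; the event and attribute names come from the Event Tracker example in the deck):

```sql
-- Count "spend" events by project and spend type over one week.
-- attr_map holds the free-form event attributes (schema on read).
SELECT project,
       attr_map['type'] AS spend_type,
       COUNT(*)         AS num_events
FROM event_log
WHERE event = 'spend'
  AND dt BETWEEN '2014-05-27' AND '2014-06-03'
GROUP BY project, attr_map['type'];
```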
  25. Thumbs up!
  28. By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format and perform transformations at query time (schema on read). This keeps our data modeling (pre-structuring) to a minimum and lets us add new data without affecting the schema. We capture all the data we want in the simplest terms (log files) and structure it later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
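For instance, when the front end starts emitting a new attribute, no ALTER TABLE is needed: the new key simply shows up inside the map column and is queryable immediately. A sketch, using the "ref" attribute from the Event Tracker JSON in the deck (table name assumed):

```sql
-- A newly instrumented attribute needs no schema change to query:
SELECT attr_map['ref'] AS referrer,
       COUNT(*)        AS num_events
FROM event_log
WHERE event = 'spend'
GROUP BY attr_map['ref'];
```

Rows logged before the attribute existed simply return NULL for the missing key, which is the schema-on-read trade-off: flexibility at write time, defensive handling at read time.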
  31. Apache Hive’s reliance on MapReduce as its core data processing engine makes it unsuitable for interactive queries, due to job startup times and MapReduce’s batch nature. Several emerging approaches address these deficiencies while remaining Hive catalog compatible. These developments are what make Hadoop/Hive viable as the world’s least expensive yet scalable data warehousing platform.
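As a sketch of what "Hive catalog compatible" means in practice: with Hive on Tez, one of those emerging engines, the same tables and the same SQL can be redirected away from MapReduce with a single session setting, nothing in the catalog changes:

```sql
-- Same tables, same SQL; only the execution engine changes.
SET hive.execution.engine=tez;   -- default in Hive at the time was 'mr'
SELECT project, COUNT(*) FROM event_log GROUP BY project;
```

(The event_log table name is an assumption carried over from the earlier examples; Impala and Shark take the same approach of reading the existing Hive metastore rather than requiring a reload.)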
  32. Aron MacDonald. Source: http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws. Cloudera Impala, which is essentially free, performs almost as well as an expensive alternative that relies on in-memory caching to deliver performance.
  33. Shark, Impala, etc. turn Hive into a real interactive SQL query environment. This is a huge advancement and was the missing piece that makes Hadoop into the world’s cheapest, most scalable database. Here’s a query that demonstrates Shark/Hive’s support for non-scalar data types:

  use aw_demo;
  describe factinternetsales;

  select a.year, s.stylename, a.num_orders
  from (
    select part_year as year,
           product_info["style"] as style,
           sum(orderquantity) as num_orders
    from factinternetsales
    where part_year < 2007
    group by product_info["style"], part_year
  ) a
  left outer join dimstyle s on a.style = s.stylekey
  order by year, num_orders desc;
  34. Too Long; Didn’t Watch