Confidential and Proprietary to Daugherty Business Solutions
05-02-2018
Hadoop Data Modeling
Agenda
• What’s the Big Data Innovation, Data Engineering, Analytics Group?
• Data Modelling in Hadoop
• Questions
It started with an article
And a name change
Which led to some questions
• What is the future of the Hadoop ecosystem?
• What is the dividing line between Spark and Hadoop?
• What are the big players doing?
• How does the push to cloud technologies affect Hadoop usage?
• How does Streaming come into play?
And then our answer
• Hadoop is here to stay, but it will make the most strides as a machine learning platform.
• Spark can perform many of the same tasks that elements of the Hadoop ecosystem can, but it is missing some existing features out of the box.
• Cloudera, Hortonworks, and MapR are positioning themselves as data processing platforms with roots in Hadoop, but other aspirations. For example, Cloudera is positioning itself as a machine learning platform.
• The push to cloud means that the distributed filesystem of HDFS may be less important to cloud-based deployments. But Hadoop ecosystem projects are adapting to be able to work with cloud sources.
• The Hadoop ecosystem projects have proven patterns for ingesting streaming data and turning it into information.
Introducing …
• We’re now going to be the St. Louis Big Data Innovation, Data Engineering, and Analytics Group
• Or more simply put: St. Louis Big Data IDEA
So what is the STL Big Data IDEA interested in?
• Local Companies
• Big Data
– Hadoop
– Cloud deployments
– Cloud-native technologies
– Spark
– Kafka
• Innovation
– New Big Data projects
– New Big Data services
– New Big Data applications
• Data Engineering
– Streaming data
– Batch data analysis
– Machine Learning Pipelines
– Data Governance
– ETL @ Scale
• Analytics
– Visualization
– Machine Learning
– Reporting
– Forecasting
Introducing our New Board Member
• Scott Shaw has been with Hortonworks for four years.
• He is the author of four books including Practical Hive and Internet of Things and Data Analytics Handbook.
• Scott will be helping our group find speakers in the open source community.
Please help me welcome Scott to the group in his new role.
Agenda
• The Schema-on-Read Promise
• File formats and Compression formats
• Schema Design – Data Layout
• Indexes, Partitioning and Bucketing
• Join Performance
• Hadoop SQL Boost – Tez, Cost Based Optimizations & LLAP
• Summary
Introducing our Speakers
Adam Doyle
• Co-Organizer, St. Louis Big Data IDEA
• Big Data Community Lead, Daugherty Business Solutions
Drew Marco
• Board Member & Secretary, TDWI
• Data and Analytics Line of Service Leader, Daugherty Business Solutions
Schema On Read
Schema on Write
• Schemas are typically purpose-built and hard to change
• Generally loses the raw/atomic data as a source
• Requires considerable modeling/implementation effort before being able to work with the data
• If a certain type of data can’t be confined in the schema, you can’t effectively store or use it (if you can store it at all)
Schema on Read
• Slower results
• Preserves the raw/atomic data as a source
• Flexibility to add, remove and modify columns
• Data may be riddled with missing or invalid values and duplicates
• Suited for data exploration; not recommended for repetitive querying and high performance
Real-world uses of Hadoop / Hive that require high-performing queries on large data sets require up-front planning and data modeling.
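As a minimal sketch of the two approaches in HiveQL: the first statement projects a schema onto raw CSV files at read time, and the second builds a typed, columnar copy for repeated queries. The column list follows the Employees table from the use case later in this deck; the table names `employees_raw` and `employees_orc`, the comma delimiter, and the HDFS path are assumptions made for illustration.

```sql
-- Schema on read: describe raw files that already sit in HDFS.
-- Nothing is validated or rewritten; the path below is hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS employees_raw (
  emp_no     INT,
  birth_date STRING,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/employees';

-- Up-front modeling: a typed ORC copy for repetitive, performance-sensitive queries.
-- Assumes the date strings are in yyyy-MM-dd form; other values become NULL.
CREATE TABLE employees_orc
STORED AS ORC
AS
SELECT emp_no,
       CAST(birth_date AS DATE) AS birth_date,
       first_name,
       last_name,
       gender,
       CAST(hire_date AS DATE) AS hire_date
FROM employees_raw;
```

Because `employees_raw` is external, dropping it removes only the table definition; the raw files stay in place as the atomic source.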
Schema Design – Data Layout
Normalization
• Pros
– Reduces data redundancy
– Decreases risk of inconsistent datasets
• Cons
– Requires re-organization of source data
– Less efficient storage
Denormalization
• Pros
– Minimizes disk seeks (i.e. those needed to navigate FK relations)
– Storage in large contiguous disk drive segments
• Cons
– Often requires reorganizing the data (slower writes)
– Data duplication
– Increased risk of inconsistent data
“The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data.”
Source: Programming Hive by Dean Wampler, Jason Rutherglen, Edward Capriolo, O’Reilly Media
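One way to picture the denormalization trade-off is a single wide table built from the use-case tables with a CREATE TABLE AS SELECT. This is a sketch only: the table name `employee_current_denorm` is hypothetical, and the filter assumes that salaries and titles mark the open-ended (current) row with '9999-01-01', the same convention the deck's queries use for dept_manager.

```sql
-- Denormalized copy: employee attributes are repeated next to salary and title,
-- so reads scan one table with no joins, at the cost of duplicated data.
CREATE TABLE employee_current_denorm
STORED AS ORC
AS
SELECT e.emp_no,
       e.first_name,
       e.last_name,
       e.hire_date,
       t.title,
       s.salary
FROM employees e
JOIN salaries s ON s.emp_no = e.emp_no
JOIN titles   t ON t.emp_no = e.emp_no
WHERE s.to_date = '9999-01-01'   -- assumed marker for the current salary row
  AND t.to_date = '9999-01-01';  -- assumed marker for the current title row
```

The copy has to be rebuilt or updated whenever the source tables change, which is where the risk of inconsistent data comes from.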
Introducing Our Use Case
• Departments: Dept_no, Name
• Dept_emp: Dept_no, Emp_no, From_date, To_date
• Employees: Emp_no, Birth_date, First_Name, Last_Name, Gender, Hire_date
• Dept_Manager: Dept_no, Emp_no, From_date, To_date
• Titles: Emp_no, Title, From_date, To_date
• Salaries: Emp_no, Salary, From_date, To_date
https://dev.mysql.com/doc/employee/en/
Data Storage Decisions
• HDFS is a file system: there is no standard data storage format in Hadoop
• Optimal storage of data is determined by how the data will be processed
• Typical input data is in JSON, XML or CSV
Major considerations: file formats and compression
Parquet
• Faster access to data
• Efficient columnar compression
• Effective for select queries
ORCFile
• High Performance: Splittable, columnar storage file
• Efficient Reads: Breaks data into large “stripes” for efficient reads
• Fast Filtering: Built-in index, min/max, metadata for fast filtering of blocks; bloom filters if desired
• Efficient Compression: Decomposes complex row types into primitives, giving massive compression and efficient comparisons for filtering
• Precomputation: Built-in aggregates per block (min, max, count, sum, etc.)
• Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive Warehouse
Avro
• JSON based schema
• Cross-language file format for Hadoop
• Schema evolution was primary goal – Good for Select * queries
• Schema segregated from data
• Row major format
Comparison of file formats
Query 1:
  select count(*) from employees e
  join salaries s on s.emp_no = e.emp_no
  join titles t on t.emp_no = e.emp_no;
Query 2:
  select d.name, count(1), d.first_name, d.last_name
  from (select d.dept_no, d.dept_name as name,
               m.first_name as first_name, m.last_name as last_name
        from departments d
        join dept_manager dm on dm.dept_no = d.dept_no
        join employees m on dm.emp_no = m.emp_no
        where dm.to_date='9999-01-01') d
  join dept_emp de on de.dept_no = d.dept_no
  join employees e on de.emp_no = e.emp_no
  group by d.name, d.first_name, d.last_name;
Results:
            Text      Avro      ORC       Parquet
  Query 1   42.696    48.934    25.846    26.081
  Query 2   59.536    63.08     27.954    26.073
  Size      124M      134M      16.7M     30.5M
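Copies of a table in each format, like those compared above, could be produced with CREATE TABLE AS SELECT. The sketch below uses the salaries table; the target table names are hypothetical, and it assumes a Hive release whose Avro and Parquet SerDes support the source column types.

```sql
-- One copy of the same data per storage format.
CREATE TABLE salaries_avro    STORED AS AVRO    AS SELECT * FROM salaries;
CREATE TABLE salaries_orc     STORED AS ORC     AS SELECT * FROM salaries;
CREATE TABLE salaries_parquet STORED AS PARQUET AS SELECT * FROM salaries;

-- Table Parameters in the output include totalSize when statistics
-- have been gathered, which allows an on-disk size comparison.
DESCRIBE FORMATTED salaries_orc;
```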
Compression
• Not just for storage (data-at-rest) but also critical for disk/network I/O (data-in-motion)
• Splittability of the compression codec is an important consideration
Snappy
• High speed with reasonable compression
• Not splittable on its own; use it inside a container format such as Avro
LZO
• Optimized for speed as opposed to size
• Splittable but requires additional indexing
• Not shipped with Hadoop
Gzip
• Optimized for size
• Write performance is half of Snappy
• Read performance as good as Snappy
• Smaller blocks = better performance
bzip2
• Optimized for size (9% better compared to Gzip)
• Splittable
• Slow; primary use is archival on Hadoop
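In Hive the codec is usually chosen per table for columnar formats and per session for job output. The following is a hedged sketch: the table names are hypothetical, and which codecs are available depends on the Hive release and on the native libraries installed on the cluster.

```sql
-- ORC: the codec is a table property (for example NONE, ZLIB or SNAPPY).
CREATE TABLE salaries_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM salaries;

-- Parquet: same idea with a different property name.
CREATE TABLE salaries_parquet_snappy
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY")
AS SELECT * FROM salaries;

-- Data in motion: compress intermediate data passed between job stages.
SET hive.exec.compress.intermediate=true;

-- Data at rest for text or sequence-file output: compress the final files.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```

ORC and Parquet compress blocks inside the file, so the files stay splittable even with a codec such as Snappy that is not splittable on plain text.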
Partitioning & Bucketing
• Partitioning is useful for chronological columns that don’t have a very high number of possible values
• Bucketing is most useful for tables that are “most often” joined together on the same key
• Skewed tables are useful when one or two column values dominate the table
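For the skew case, Hive lets a table declare its dominant values so they are stored separately from the rest. This is a sketch under assumptions: the table name `dept_emp_skewed` is hypothetical, and 'd005' merely stands in for whichever department value dominates the data.

```sql
-- Rows with the declared heavy value are written to their own directory;
-- all remaining values share the default location.
CREATE TABLE dept_emp_skewed (
  emp_no    INT,
  dept_no   STRING,
  from_date DATE,
  to_date   DATE
)
SKEWED BY (dept_no) ON ('d005')
STORED AS DIRECTORIES;

-- Optionally let joins handle heavily repeated keys in a separate pass.
SET hive.optimize.skewjoin=true;
```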
Partitioning
• Without partitioning, every query reads the entire table even when processing a subset of the data (full-table scan)
• Breaks up data horizontally by column value sets
• When partitioning, you use one or more “virtual” columns to break up the data
• Virtual columns cause directories to be created in HDFS
• Static partitioning versus dynamic partitioning
• Partitioning speeds up queries, particularly those that filter on the “virtual column”
• If queries use various columns, it may be hard to decide which columns to partition by
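The points above can be sketched with the salaries table partitioned by the year of from_date. The table name `salaries_part` and the partition column `from_year` are hypothetical, and the sketch assumes the source columns already have the types shown.

```sql
-- from_year is the "virtual" column: it is not stored in the data files,
-- it becomes a directory such as .../salaries_part/from_year=1999/
CREATE TABLE salaries_part (
  emp_no    INT,
  salary    INT,
  from_date DATE,
  to_date   DATE
)
PARTITIONED BY (from_year INT)
STORED AS ORC;

-- Static partitioning: the partition value is named in the statement.
INSERT OVERWRITE TABLE salaries_part PARTITION (from_year = 1999)
SELECT emp_no, salary, from_date, to_date
FROM salaries
WHERE year(from_date) = 1999;

-- Dynamic partitioning: Hive derives the partition from the last selected column.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE salaries_part PARTITION (from_year)
SELECT emp_no, salary, from_date, to_date, year(from_date) AS from_year
FROM salaries;

-- Filtering on the partition column lets Hive read one directory only.
SELECT avg(salary) FROM salaries_part WHERE from_year = 1999;
```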
Bucketing
• Used to keep file sizes balanced within a partition
• Breaks up data by the hash of the bucketing key
• When bucketing, you specify the number of buckets
• Works particularly well when a lot of queries contain joins, especially when the two data sets are bucketed on the join key
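A sketch of how the bucketed tables used in the comparison that follows (emp_buck and dept_emp_buck) might be defined. The bucket count of 8 and the column types are assumptions; the deck does not state them.

```sql
-- On Hive 1.x, run "SET hive.enforce.bucketing=true;" first so inserts honor
-- the bucket definition. Hive 2.0 and later always enforce it.

-- Both tables are bucketed on the join key with the same bucket count.
CREATE TABLE emp_buck (
  emp_no     INT,
  birth_date DATE,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  DATE
)
CLUSTERED BY (emp_no) INTO 8 BUCKETS
STORED AS ORC;

CREATE TABLE dept_emp_buck (
  emp_no    INT,
  dept_no   STRING,
  from_date DATE,
  to_date   DATE
)
CLUSTERED BY (emp_no) INTO 8 BUCKETS
STORED AS ORC;

-- Rows are assigned to a bucket by hashing emp_no.
INSERT OVERWRITE TABLE emp_buck
SELECT emp_no, birth_date, first_name, last_name, gender, hire_date
FROM employees;

INSERT OVERWRITE TABLE dept_emp_buck
SELECT emp_no, dept_no, from_date, to_date
FROM dept_emp;
```

Matching bucket counts on the join key is what allows Hive to join bucket against bucket rather than table against table.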
Comparison
Query:
  select d.name, count(1), d.first_name, d.last_name
  from (select d.dept_no, d.dept_name as name,
               m.first_name as first_name, m.last_name as last_name
        from departments d
        join dept_manager dm on dm.dept_no = d.dept_no
        join employees m on dm.emp_no = m.emp_no
        where dm.to_date='9999-01-01') d
  join dept_emp_buck de on de.dept_no = d.dept_no
  join emp_buck e on de.emp_no = e.emp_no
  group by d.name, d.first_name, d.last_name;
Results:
            Text      Partition   Bucketed
  Query     59.536    59.652      55.196
Join Performance
Map Side Joins
• Star schemas (e.g. dimension tables)
• Good when one table is small enough to fit in RAM
Reduce Side Joins
• Default Hive join
• Works with data of any size
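Hive can pick between the two join strategies on its own when the relevant settings are on. The sketch below uses real Hive properties, but the size threshold is only an illustrative value and the query is a simplified stand-in for the deck's examples.

```sql
-- Convert a reduce-side join to a map-side join automatically when the
-- small table is below the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- When both sides are bucketed on the join key, allow bucket-wise map joins.
SET hive.optimize.bucketmapjoin=true;

-- The plan shows a Map Join Operator when the conversion applies.
EXPLAIN
SELECT d.dept_no, count(1)
FROM departments d
JOIN dept_emp de ON de.dept_no = d.dept_no
GROUP BY d.dept_no;
```

Many Hive releases ignore the `/*+ MAPJOIN(...) */` hint by default (`hive.ignore.mapjoin.hint=true`), so the automatic conversion is often what actually decides the strategy.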
Comparison
Query:
  select /*+ MAPJOIN(d) */ d.name, count(1), d.first_name, d.last_name
  from (select d.dept_no, d.dept_name as name,
               m.first_name as first_name, m.last_name as last_name
        from departments d
        join dept_manager dm on dm.dept_no = d.dept_no
        join employees m on dm.emp_no = m.emp_no
        where dm.to_date='9999-01-01') d
  join dept_emp_buck de on de.dept_no = d.dept_no
  join emp_buck e on de.emp_no = e.emp_no
  group by d.name, d.first_name, d.last_name;
Results:
            Map-Side   Reduce
  Query     58.227     59.652
Considerations for SQL Performance
Tez
CBO – Cost Based Optimization
• Hive uses a Cost-Based Optimizer to optimize the cost of running a query.
• Calcite applies optimizations like query rewrite, join reordering, join elimination, and deriving implied predicates.
• Calcite will prune away inefficient plans in order to produce and select the cheapest query plans.
• Needs to be enabled:
Set hive.cbo.enable=true;
Set hive.stats.autogather=true;
CBO Process Overview
1. Parse and validate query
2. Generate possible execution plans
3. For each logically equivalent plan, assign a cost
4. Select the plan with the lowest cost
Optimization Factors
• Join optimization
• Table size
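The optimizer can only cost plans well if table and column statistics exist. A minimal sketch of gathering them for one of the use-case tables and then inspecting the chosen plan; the query is a simplified stand-in for the deck's examples.

```sql
-- Table-level statistics (row count, size) feed the table-size factor.
ANALYZE TABLE employees COMPUTE STATISTICS;

-- Column-level statistics (distinct values, min/max) feed join ordering.
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;

-- Let the planner use the gathered statistics.
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Inspect the plan the optimizer selected.
EXPLAIN
SELECT e.emp_no, count(1)
FROM employees e
JOIN salaries s ON s.emp_no = e.emp_no
GROUP BY e.emp_no;
```

With `hive.stats.autogather` on, basic table statistics are collected during inserts, but column statistics generally still need an explicit ANALYZE.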
LLAP
• Consists of a long-lived daemon and a tightly integrated DAG framework.
• Handles
– Pre-fetching
– Some Query Processing
– Fine-grained column-level Access Control
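On a cluster where LLAP daemons are already running, a session can be pointed at them with a few settings. This is a sketch that assumes Hive 2.x on Tez; distributions often preconfigure these values, so treat them as illustrative rather than required.

```sql
-- LLAP works together with the Tez execution engine.
SET hive.execution.engine=tez;

-- Run query fragments in the long-lived LLAP daemons instead of new containers.
SET hive.execution.mode=llap;

-- Which parts of a query go to LLAP; "all" sends everything that is eligible.
SET hive.llap.execution.mode=all;
```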
Daugherty Overview
Combining world-class capabilities with a local practice model
Long-term consultant employees with deep business acumen & leadership abilities
Providing more experienced consultants & leading methods/techniques/tools to:
• Accelerate results & productivity
• Provide greater team continuity
• Offer a more sustainable/cost-effective price point
BY THE NUMBERS
• Over 1000 employees, from Management Consultants to Developers
• 88% of our clients are long-term, repeat/referral relationships of 10+ years
• Demonstrated 31-year track record of delivering mission-critical initiatives enabled by emerging technologies
• Engagements with over 75 Fortune 500 industry leaders over the past five years
• 9 business units: Atlanta, Chicago, Dallas, Denver, Minneapolis, New York, Saint Louis (HQ), Development Center, Support & Hardware Center
COLLABORATIVE
Co-staffed teams, project services, resource pools, collaborative managed services
PRAGMATIC
Pragmatic, co-staffed approach well suited to building internal competency while getting key project initiatives completed
ALTERNATIVE
Strong alternative to the global consultancies
FLEXIBLE
Flexible engagement model
Data & Analytics - What we bring to the table
APPLICATION DEVELOPMENT
Methods / Tools / Techniques
• 12 Domain EIM Blueprint/Roadmap framework that manages technical complexity, accelerates initiatives and focuses on delivering greatest business analytics impact quickly.
• Highly accurate BI Dimensional estimator that provides predictability in investments and time to market.
• Analytic Strategy framework that aligns people, process and technology components to deliver business value
• Analytic Governance reference model that mitigates risk and provides guardrails for self-service adoption
• Business value models to calculate the value and ROI of investments in Data & Analytics initiatives
• Reference architecture for a modern data & analytic platform
• Dashboard Design best practices that transform complex business KPIs into a rich, immersive design
• Bi-Modal Data as a Service Operating Model that integrates Agile development with a service-oriented organization design
PROGRAM & PROJECT MANAGEMENT
• Program & Project Planning
• Program & Project Management
• Business Case Development
• PMO Optimization
• M&A Integration
Data & Analytics
• Over 40% of Daugherty’s 1,000 consultants are focused on Information Management Solutions.
• Bringing the latest thought leadership in Next Generation, Unified Architectures that integrate structured and unstructured data (“Big Data”) and applied advanced analytics into cohesive solutions.
• Strong capabilities across both existing and emerging technologies while maintaining a technology-neutral approach.
• Leveraging the latest visual design concepts to deliver interactive and user-friendly applications that drive adoption and satisfaction with solutions.
• Leader in the effective application of Agile techniques applied to Data Engineering development and business analytics. Full data life cycle methods & techniques from business definition through development and on-going support
• Building and supporting mission-critical platforms for many Fortune 500 companies in multi-year engagements, using a flexible support model including Collaborative Managed Services models.
DATA & ANALYTICS
• Data & Analytics Strategy & Roadmap
• Building Analytic Solutions
• Analytics Competency Development
• Big Data / Next Gen Architecture
• Business Analytics and Insights

Capstone Project on IBM Data Analytics Program
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 

Hadoop Data Modeling

  • 1. Confidential and Proprietary to Daugherty Business Solutions 05-02-2018 Hadoop Data Modeling
  • 2. Agenda • What’s the Big Data Innovation, Data Engineering, Analytics Group? • Data Modelling in Hadoop • Questions
  • 3. It started with an article
  • 4. And a name change
  • 5. Which led to some questions • What is the future of the Hadoop ecosystem? • What is the dividing line between Spark and Hadoop? • What are the big players doing? • How does the push to cloud technologies affect Hadoop usage? • How does Streaming come into play?
  • 6. And then our answer • Hadoop is here to stay, but it will make the most strides as a machine learning platform. • Spark can perform many of the same tasks that elements of the Hadoop ecosystem can, but it is missing some existing features out of the box. • Cloudera, Hortonworks, and MapR are positioning themselves as data processing platforms with roots in Hadoop, but other aspirations. For example, Cloudera is positioning itself as a machine learning platform. • The push to cloud means that the distributed filesystem of HDFS may be less important to cloud-based deployments. But Hadoop ecosystem projects are adapting to be able to work with cloud sources. • The Hadoop ecosystem projects have proven patterns for ingesting streaming data and turning it into information.
  • 7. Introducing … • We’re now going to be St. Louis Big Data Innovation, Data Engineering, and Analytics Group. Or more simply put: St. Louis Big Data IDEA
  • 8. So what is the STL Big Data IDEA interested in? • Local Companies • Big Data – Hadoop – Cloud deployments – Cloud-native technologies – Spark – Kafka • Innovation – New Big Data projects – New Big Data services – New Big Data applications • Data Engineering – Streaming data – Batch data analysis – Machine Learning Pipelines – Data Governance – ETL @ Scale • Analytics – Visualization – Machine Learning – Reporting – Forecasting
  • 9. Introducing our New Board Member • Scott Shaw has been with Hortonworks for four years. • He is the author of four books including Practical Hive and Internet of Things and Data Analytics Handbook. • Scott will be helping our group find speakers in the open source community. Please help me welcome Scott to the group in his new role.
  • 10. Agenda • The Schema-on-Read Promise • File formats and Compression formats • Schema Design – Data Layout • Indexes, Partitioning and Bucketing • Join Performance • Hadoop SQL Boost – Tez, Cost Based Optimizations & LLAP • Summary
  • 11. Introducing our Speakers • Adam Doyle – Co-Organizer, St. Louis Big Data IDEA – Big Data Community Lead, Daugherty Business Solutions • Drew Marco – Board Member & Secretary, TDWI – Data and Analytics Line of Service Leader, Daugherty Business Solutions
  • 12. Schema On Read
      Schema on Write
        • Schemas are typically purpose-built and hard to change
        • Generally loses the raw/atomic data as a source
        • Requires considerable modeling/implementation effort before being able to work with the data
        • If a certain type of data can’t be confined in the schema, you can’t effectively store or use it (if you can store it at all)
      Schema on Read
        • Slower results
        • Preserves the raw/atomic data as a source
        • Flexibility to add, remove and modify columns
        • Data may be riddled with missing or invalid values and duplicates
        • Suited for data exploration; not recommended for repetitive querying and high performance
      Real-world uses of Hadoop / Hive that need high-performing queries on large data sets require up-front planning and data modeling.
  • 13. Schema Design – Data Layout
      “The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data.” Source: Programming Hive by Dean Wampler, Jason Rutherglen, Edward Capriolo, O’Reilly Media
      Normalization
        • Pros
          – Reduces data redundancy
          – Decreases risk of inconsistent datasets
        • Cons
          – Requires re-organization of source data
          – Less efficient storage access (navigating foreign key relations requires more disk seeks)
      Denormalization
        • Pros
          – Minimizes disk seeks (i.e. FK relations)
          – Storage in large contiguous disk drive segments
        • Cons
          – Often requires reorganizing the data (slower writes)
          – Data duplication
          – Increased risk of inconsistent data
  • 14. Introducing Our Use Case
      Departments: Dept_no, Name
      Dept_emp: Dept_no, Emp_no, From_date, To_date
      Employees: Emp_no, Birth_date, First_Name, Last_Name, Gender, Hire_date
      Dept_Manager: Dept_no, Emp_no, From_date, To_date
      Titles: Emp_no, Title, From_date, To_date
      Salaries: Emp_no, Salary, From_date, To_date
      Source: https://dev.mysql.com/doc/employee/en/
  • 15. Data Storage Decisions
        • Hadoop is a file system; there is no standard data storage format in Hadoop
        • Optimal storage of data is determined by how the data will be processed
        • Typical input data is in JSON, XML or CSV
      Major considerations: file formats and compression
  • 16. Parquet • Faster access to data • Efficient columnar compression • Effective for select queries
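The columnar idea behind Parquet can be illustrated with a toy example. This is only a sketch of row-major versus column-major layout in plain Python, not Parquet's actual encoding; the sample rows and the `to_columns` helper are made up for illustration.

```python
# Toy comparison of row-major and column-major layouts (not Parquet's real format).
rows = [
    {"emp_no": 1, "first_name": "Ann", "salary": 60000},
    {"emp_no": 2, "first_name": "Bob", "salary": 72000},
    {"emp_no": 3, "first_name": "Cy", "salary": 65000},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A row store has to walk every record to total a single field.
row_total = sum(record["salary"] for record in rows)

# A column store reads only the one list the query asks for.
column_total = sum(columns["salary"])
```

Both totals agree; the difference is how much data has to be touched to compute them, which is why a query that selects a few columns tends to favor a columnar file.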
  • 17. ORCFile
        • High performance: splittable, columnar storage file
        • Efficient reads: data is broken into large “stripes” for efficient reads
        • Fast filtering: built-in index, min/max, and metadata for fast filtering of blocks; bloom filters if desired
        • Efficient compression: decomposes complex row types into primitives, giving massive compression and efficient comparisons for filtering
        • Precomputation: built-in aggregates per block (min, max, count, sum, etc.)
        • Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive warehouse
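The min/max filtering mentioned for ORC can be sketched as follows. This is a simplified model under stated assumptions: the `Stripe` class and `scan_equal` function are hypothetical names, and real ORC keeps these statistics in file and stripe metadata rather than computing them on the fly.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A block of column values plus the min/max statistics kept alongside it."""
    values: list

    def __post_init__(self):
        self.minimum = min(self.values)
        self.maximum = max(self.values)


def scan_equal(stripes, target):
    """Return the matching values and the number of stripes actually read."""
    hits, stripes_read = [], 0
    for stripe in stripes:
        if target < stripe.minimum or target > stripe.maximum:
            continue  # the statistics prove no row here can match, so skip it
        stripes_read += 1
        hits.extend(value for value in stripe.values if value == target)
    return hits, stripes_read


stripes = [
    Stripe([10001, 10002, 10003]),
    Stripe([10004, 10005, 10006]),
    Stripe([10007, 10008, 10009]),
]
hits, stripes_read = scan_equal(stripes, 10005)
```

In this sketch only one of the three stripes is read for the lookup. The benefit depends on the data being at least roughly ordered on the filtered column; if every stripe spans the full value range, nothing can be skipped.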
  • 18. Avro • JSON-based schema • Cross-language file format for Hadoop • Schema evolution was the primary goal – Good for Select * queries • Schema segregated from data • Row-major format
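Schema evolution in Avro rests on resolving data written with an older schema against a newer reader schema: fields missing from the data take the reader's default, and fields the reader does not know are dropped. The sketch below imitates that rule in plain Python without the Avro library; the `resolve` helper, the `Employee` schema, and the sample record are assumptions for illustration.

```python
import json

# Reader schema in Avro's JSON style; "gender" was added later and has a default.
reader_schema = json.loads("""
{"type": "record", "name": "Employee", "fields": [
  {"name": "emp_no", "type": "int"},
  {"name": "last_name", "type": "string"},
  {"name": "gender", "type": "string", "default": "U"}
]}
""")


def resolve(record, schema):
    """Project a record onto the reader schema, filling defaults for new fields."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Written with an older schema: no "gender", plus a field the reader has dropped.
old_record = {"emp_no": 10001, "last_name": "Doe", "birth_date": "1953-09-02"}
evolved = resolve(old_record, reader_schema)
```

A new field without a default makes old data unreadable under the new schema, which is why adding columns with defaults is the usual safe change.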
  • 19. Comparison of file formats
      Query 1: select count(*) from employees e join salaries s on s.emp_no = e.emp_no join titles t on t.emp_no = e.emp_no;
      Query 2: select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp de on de.dept_no = d.dept_no join employees e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

                  Text      Avro      ORC       Parquet
      Query 1     42.696    48.934    25.846    26.081
      Query 2     59.536    63.08     27.954    26.073
      Size        124M      134M      16.7M     30.5M
  • 20. Compression
        • Not just for storage (data-at-rest) but also critical for disk/network I/O (data-in-motion)
        • Splittability of the compression codec is an important consideration
      Snappy
        • High speed with reasonable compression
        • Not splittable on its own; use it inside a container format such as Avro or SequenceFiles
      LZO
        • Optimized for speed as opposed to size
        • Splittable but requires additional indexing
        • Not shipped with Hadoop
      Gzip
        • Optimized for size
        • Write performance is about half that of Snappy
        • Read performance is about as good as Snappy
        • Smaller blocks = better performance
      bzip2
        • Optimized for size (9% better compared to Gzip)
        • Splittable
        • Slow to read and write; primary use is archival on Hadoop
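The size side of the codec trade-off is easy to try out locally. The sketch below uses only Python's standard library, so it covers Gzip and bzip2; Snappy and LZO are left out because they need third-party packages. The payload is synthetic and highly repetitive, so the ratios it produces say nothing about real tables.

```python
import bz2
import gzip

# Synthetic CSV-like payload; real data will compress very differently.
payload = b"10001,1986-06-26,Senior Engineer,60117\n" * 20000


def measure(name, compress, decompress):
    """Compress the payload, check the round trip, and report the packed size."""
    packed = compress(payload)
    if decompress(packed) != payload:
        raise RuntimeError(f"{name} round trip failed")
    return name, len(packed)


results = dict([
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bzip2", bz2.compress, bz2.decompress),
])

for name, size in results.items():
    print(f"{name}: {size} bytes ({size / len(payload):.2%} of original)")
```

To compare speed as well, wrap the `compress` and `decompress` calls in `time.perf_counter()` measurements on a sample of your own data.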
  • 21. Partitioning & Bucketing • Partitioning is useful for chronological columns that don’t have a very high number of possible values • Bucketing is most useful for tables that are “most often” joined together on the same key • Skewed tables are useful when one or two column values dominate the table
  • 22. Partitioning
        • Without partitioning, every query reads the entire table even when it processes only a subset of the data (full-table scan)
        • Breaks up data horizontally by column value sets
        • When partitioning, you use one or more “virtual” columns to break up the data
        • Virtual columns cause directories to be created in HDFS
        • Static partitioning versus dynamic partitioning
        • Partitioning speeds up queries that can skip the partitions they do not need
        • Works particularly well when querying with the “virtual column”
        • If queries use various columns, it may be hard to decide which columns to partition by
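How a partition column turns into directories, and how a filter on that column avoids a full-table scan, can be sketched in a few lines. The `column=value` directory naming follows Hive's convention; the `/warehouse` root, the `from_year` column, and the helper functions are assumptions for illustration, and an in-memory dict stands in for HDFS.

```python
from collections import defaultdict


def partition_path(table, column, value):
    """Directory for one partition, following the column=value naming scheme."""
    return f"/warehouse/{table}/{column}={value}"


def write_partitioned(table, column, rows):
    """Group rows into one 'directory' per distinct value of the partition column."""
    files = defaultdict(list)
    for row in rows:
        files[partition_path(table, column, row[column])].append(row)
    return dict(files)


def read_with_pruning(files, table, column, value):
    """Read only the directory that matches the filter on the partition column."""
    wanted = partition_path(table, column, value)
    scanned = [wanted] if wanted in files else []
    matched = files.get(wanted, [])
    return matched, scanned


salaries = [
    {"emp_no": 1, "salary": 60000, "from_year": 1986},
    {"emp_no": 2, "salary": 65000, "from_year": 1987},
    {"emp_no": 3, "salary": 40000, "from_year": 1986},
]
files = write_partitioned("salaries", "from_year", salaries)
matched, scanned = read_with_pruning(files, "salaries", "from_year", 1986)
```

A filter on any other column gets no help from this layout and still has to visit every directory, which is the difficulty the last bullet on the slide points at.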
  • 23. Bucketing
        • Used to strike a balance between large files within a partition
        • Breaks up data by hashed key sets: each row is assigned to a bucket by hashing the bucketing column
        • When bucketing, you specify the number of buckets
        • Works particularly well when a lot of queries contain joins
        • Especially when the two data sets are bucketed on the join key
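Why bucketing both tables on the join key helps can be sketched as below. The assumption here is the common hash-then-modulo scheme, simplified to `emp_no % num_buckets` for integer keys; the function names and sample rows are hypothetical. Because matching keys always land in the same bucket number, the join only has to pair bucket i of one table with bucket i of the other.

```python
def bucket_for(emp_no, num_buckets):
    """Assign an integer key to a bucket (hash, then modulo, simplified)."""
    return emp_no % num_buckets


def bucketize(rows, key, num_buckets):
    """Split rows into num_buckets lists according to the bucketing key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i with bucket i only; other bucket pairs cannot match."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for row in right:
            lookup.setdefault(row[key], []).append(row)
        for row in left:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


employees = [{"emp_no": n, "last_name": f"name{n}"} for n in range(1, 7)]
dept_emp = [
    {"emp_no": 1, "dept_no": "d001"},
    {"emp_no": 2, "dept_no": "d002"},
    {"emp_no": 4, "dept_no": "d001"},
    {"emp_no": 4, "dept_no": "d003"},
]
NUM_BUCKETS = 4
joined = bucketed_join(
    bucketize(employees, "emp_no", NUM_BUCKETS),
    bucketize(dept_emp, "emp_no", NUM_BUCKETS),
    "emp_no",
)
```

This only works when both sides use the same key and the same bucket count, as in this sketch.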
  • 24. Comparison
      Query: select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

                  Text      Partition   Bucketed
      Query       59.536    59.652      55.196
  • 25. Join Performance: Map Side Joins • Star schemas (e.g. dimension tables) • Good when the table is small enough to fit in RAM
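A map-side (broadcast) join can be sketched as a hash lookup: the small table is loaded into memory once, and each row of the large table is matched as it streams past, with no shuffle. The function name and sample rows below are made up, and the sketch assumes the join key is unique in the small table, as it would be for a dimension table.

```python
def map_side_join(big_rows, small_rows, key):
    """Stream the big table past an in-memory hash of the small table."""
    # Assumes `key` is unique in small_rows (a typical dimension table).
    lookup = {row[key]: row for row in small_rows}
    for row in big_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}


departments = [
    {"dept_no": "d001", "dept_name": "Marketing"},
    {"dept_no": "d002", "dept_name": "Finance"},
]
dept_emp = [
    {"emp_no": 1, "dept_no": "d001"},
    {"emp_no": 2, "dept_no": "d002"},
    {"emp_no": 3, "dept_no": "d009"},  # no matching department, so it is dropped
]
joined = list(map_side_join(dept_emp, departments, "dept_no"))
```

The memory cost is the whole small table on every worker, which is why the technique is limited to tables that fit in RAM.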
  • 26. Reduce Side Joins • Default Hive join • Works with data of any size
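A reduce-side join can be sketched in the same toy style: rows from both tables are tagged with their source and grouped by join key (the shuffle), then each group is combined in the reduce step. The names below are hypothetical and an in-memory dict stands in for the shuffle, so this shows the data flow rather than how Hive executes it.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left_rows, right_rows, key):
    """Group both inputs by join key, then pair up the rows within each group."""
    # Map and shuffle: tag each row with its source and route it by join key.
    shuffled = defaultdict(lambda: {"left": [], "right": []})
    for row in left_rows:
        shuffled[row[key]]["left"].append(row)
    for row in right_rows:
        shuffled[row[key]]["right"].append(row)

    # Reduce: every left row meets every right row that shares its key.
    joined = []
    for join_key in sorted(shuffled):
        group = shuffled[join_key]
        for left, right in product(group["left"], group["right"]):
            joined.append({**left, **right})
    return joined


employees = [
    {"emp_no": 1, "last_name": "Doe"},
    {"emp_no": 2, "last_name": "Roe"},
]
titles = [
    {"emp_no": 1, "title": "Engineer"},
    {"emp_no": 1, "title": "Senior Engineer"},
    {"emp_no": 3, "title": "Staff"},
]
joined = reduce_side_join(employees, titles, "emp_no")
```

Neither input has to fit in memory as a whole, only one key's group at a time, which is why this style of join works with data of any size at the price of moving every row across the network.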
  • 27. Comparison
      Query: select /*+ MAPJOIN(d) */ d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

                  Map-Side   Reduce
      Query       58.227     59.652
  • 28. Considerations for SQL Performance: Tez
  • 29. CBO – Cost Based Optimization
        • Hive uses a Cost-Based Optimizer, built on Apache Calcite, to reduce the cost of running a query.
        • Calcite applies optimizations like query rewrite, join reordering, join elimination, and deriving implied predicates.
        • Calcite prunes away inefficient plans in order to produce and select the cheapest query plan.
        • Needs to be enabled: set hive.cbo.enable=true; set hive.stats.autogather=true;
      CBO Process Overview
        1. Parse and validate query
        2. Generate possible execution plans
        3. For each logically equivalent plan, assign a cost
        4. Select the plan with the lowest cost
      Optimization Factors
        • Join optimization
        • Table size
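Steps 2 to 4 of the CBO process can be sketched with a deliberately crude cost model. Everything here is an assumption for illustration: the row counts, the single fixed `SELECTIVITY`, and the rule that a plan costs the sum of its intermediate result sizes. Calcite's real cost model and plan search are far richer.

```python
from itertools import permutations

# Illustrative row counts and join selectivity; not measured statistics.
table_rows = {"departments": 9, "dept_manager": 24, "employees": 300000}
SELECTIVITY = 0.001


def plan_cost(order, rows, selectivity):
    """Cost of a left-deep join order: total size of its intermediate results."""
    cost = 0.0
    current = rows[order[0]]
    for table in order[1:]:
        current = current * rows[table] * selectivity
        cost += current
    return cost


# Generate every join order, cost each one, and keep the cheapest.
plans = {
    order: plan_cost(order, table_rows, SELECTIVITY)
    for order in permutations(table_rows)
}
best = min(plans, key=plans.get)

for order, cost in sorted(plans.items(), key=lambda item: item[1]):
    print(" -> ".join(order), round(cost, 3))
```

Under this model the cheapest plans join the two small tables first and bring in the large one last. The estimate is only as good as the table statistics behind it, which is why statistics gathering is enabled alongside the optimizer.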
  • 30. LLAP • Consists of a long-lived daemon and a tightly integrated DAG framework. • Handles – Pre-fetching – Some query processing – Fine-grained column-level access control
  • 32. Daugherty Overview
      Combining world-class capabilities with a local practice model
      Long-term consultant employees with deep business acumen & leadership abilities, providing more experienced consultants & leading methods/techniques/tools to:
        • Accelerate results & productivity
        • Provide greater team continuity
        • Offer a more sustainable/cost-effective price point
      By the numbers
        • Over 1000 employees from Management Consultants to Developers
        • 88% of our clients are long-term, repeat/referral relationships of 10+ years
        • Demonstrated 31 year track record of delivering mission critical initiatives enabled by emerging technologies
        • 1000 engagements with over 75 Fortune 500 industry leaders over the past five years
        • 9 business units
      Locations: Atlanta, Chicago, Dallas, Denver, Minneapolis, New York, Saint Louis (HQ); Development Center; Support & Hardware Center
      COLLABORATIVE: Co-staffed teams, project services, resource pools, collaborative managed services
      PRAGMATIC: Pragmatic, co-staffed approach well suited to building internal competency while getting key project initiatives completed
      ALTERNATIVE: Strong alternative to the global consultancies
      FLEXIBLE: Flexible engagement model
  • 33. Data & Analytics - What we bring to the table
      Data & Analytics
        • Over 40% of Daugherty’s 1,000 consultants are focused on Information Management Solutions.
        • Bringing the latest thought leadership in Next Generation, Unified Architectures that integrate structured, unstructured data (“Big Data”) and applied advanced analytics into cohesive solutions.
        • Strong capabilities across both existing and emerging technologies while maintaining a technology neutral approach.
        • Leveraging the latest visual design concepts to deliver interactive and user friendly applications that drive adoption and satisfaction with solutions.
        • Leader in the effective application of Agile techniques applied to Data Engineering development and business analytics.
        • Full Data life cycle methods & techniques from business definition through development and on-going support
        • Building and supporting mission-critical platforms for many Fortune 500 companies in multi-year engagements, using a flexible support model including Collaborative Managed Services models.
      Methods / Tools / Techniques
        • 12 Domain EIM Blueprint/Roadmap framework that manages technical complexity, accelerates initiatives and focuses on delivering greatest business analytics impact quickly.
        • Highly accurate BI Dimensional estimator that provides predictability in investments and time to market.
        • Analytic Strategy framework that aligns people, process and technology components to deliver business value
        • Analytic Governance reference model that mitigates risk and provides guardrails for self-service adoption
        • Business value models to calculate the value and ROI of investments in Data & Analytics initiatives
        • Reference architecture for a modern data & analytic platform
        • Dashboard Design best practices that transform complex business KPIs into a rich immersive design
        • Bi-Modal Data as a Service Operating Model that integrates Agile development with a Service oriented organization design
      APPLICATION DEVELOPMENT
      PROGRAM & PROJECT MANAGEMENT
        • Program & Project Planning
        • Program & Project Management
        • Business Case Development
        • PMO Optimization
        • M&A Integration
      DATA & ANALYTICS
        • Data & Analytics Strategy & Roadmap
        • Building Analytic Solutions
        • Analytics Competency Development
        • Big Data / Next Gen Architecture
        • Business Analytics and Insights

Editor's notes

  1. An updated version of ORC was released in HDP 2.6.3 with better support for vectorization.
  2. Although compression can greatly optimize processing performance, not all compression codecs supported on Hadoop are splittable. Since the MapReduce framework splits data for input to multiple tasks, a non-splittable compression codec is an impediment to efficient processing. If a file cannot be split, the entire file has to be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides. For this reason, splittability is a major consideration in choosing a compression codec, as well as a file format. We’ll discuss the various compression codecs available for Hadoop, and some considerations in choosing between them.

     Snappy: Snappy is a compression codec developed at Google for high compression speeds with reasonable compression. Although Snappy doesn’t offer the best compression sizes, it does provide a good trade-off between speed and size. Processing performance with Snappy can be significantly better than with other compression formats. An important thing to note is that Snappy is intended to be used with a container format like SequenceFiles or Avro, since it’s not inherently splittable.

     LZO: LZO is similar to Snappy in that it’s optimized for speed as opposed to size. Unlike Snappy, LZO compressed files are splittable, but this requires an additional indexing step. This makes LZO a good choice for things like plain text files that are not being stored as part of a container format. It should also be noted that LZO’s license prevents it from being distributed with Hadoop, so it requires a separate install, unlike Snappy, which can be distributed with Hadoop.

     Gzip: Gzip provides very good compression (on average, about 2.5 times the compression offered by Snappy), but its write speed is not as good as Snappy’s (on average, about half). Gzip usually performs almost as well as Snappy in terms of read performance. Gzip is also not splittable, so it should be used with a container format. One reason Gzip is sometimes slower than Snappy for processing is that Gzip compressed files take up fewer blocks, so fewer tasks are created to process the same data and there is less parallelism. For this reason, when using Gzip, using smaller blocks can lead to better performance.

     bzip2: bzip2 provides excellent compression, but can be significantly slower than other compression codecs such as Snappy in terms of processing performance. Unlike Snappy and Gzip, bzip2 is inherently splittable. In the examples we have seen, bzip2 will normally compress around 9% better than Gzip in terms of storage space. However, this extra compression comes with a significant read/write performance cost. This performance difference will vary with different machines, but in general bzip2 is about 10x slower than Gzip. For this reason, it’s not an ideal codec for Hadoop storage, unless the primary need is reducing the storage footprint. An example of such a use case is where Hadoop is being used mainly for active archival purposes.
  3. Multi-layer partitioning is possible but often not efficient
       – The number of partitions becomes too large and will overwhelm the Metastore
     • Limit the number of partitions. Fewer may be better
       – 1000 partitions will often perform better than 10000
     • Hadoop likes big files: avoid creating partitions with mostly small files
     • Only use when
       – Data is very large and there are lots of table scans
       – Data is queried against a particular column frequently
       – Column data has low cardinality
  4. The map-side join can only be achieved if it is possible to join the records by key while the input files are being read, that is, before the map phase. For this to work, the input files need to be sorted by the same join key, and both inputs need to have the same number of partitions. These strict constraints are commonly hard to meet. The most likely scenario for a map-side join is when both input tables were created by (different) MapReduce jobs that used the same number of reducers and the same (join) key.
     set hive.auto.convert.join = true
     • Hive then automatically uses a broadcast join, if possible
       – Small tables are held in memory by all nodes
     • Used for star-schema type joins common in data warehousing use cases
     • hive.auto.convert.join.noconditionaltask.size determines the data size for automatic conversion to a broadcast join:
       – Default of 10MB is too low (check your default)
       – Recommended: 256MB for a 4GB container