INTRODUCTION TO OLAP
 OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.
 OLAP allows users to analyze information from multiple database systems at one time.
 OLAP data is typically stored in multidimensional databases.
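DATA WAREHOUSING ARCHITECTURE
[Figure: data warehousing architecture. External sources and operational databases are extracted, transformed, loaded, and refreshed into the data warehouse, which has a metadata repository and monitoring & administration. The warehouse serves OLAP servers, which feed analysis, query/reporting, and data mining tools.]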
 A multidimensional cube can combine data from
disparate data sources and store the information
in a fashion that is logical for business users.
THE OLAP CUBE
 An OLAP cube is a data structure that allows fast analysis of data.
 Arranging data into cubes overcomes a limitation of relational databases, which are not well suited to near-instantaneous analysis of large amounts of data.
 An OLAP cube consists of numeric facts, called measures, which are categorized by dimensions.
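The relationship between measures and dimensions can be sketched in a few lines. This is only an illustration: the sales data, the three dimensions (product, region, quarter), and the helper names are all hypothetical and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales).
FACTS = [
    ("laptop", "east", "Q1", 100),
    ("laptop", "west", "Q1", 150),
    ("phone", "east", "Q1", 200),
    ("phone", "east", "Q2", 250),
]

DIMENSIONS = ("product", "region", "quarter")


def build_cube(facts):
    """Map each combination of dimension values (a cell) to its summed measure."""
    cube = defaultdict(int)
    for *coords, sales in facts:
        cube[tuple(coords)] += sales
    return dict(cube)


def slice_total(cube, **fixed):
    """Sum the measure over every cell matching the fixed dimension values."""
    total = 0
    for coords, sales in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[name] == value for name, value in fixed.items()):
            total += sales
    return total


cube = build_cube(FACTS)
print(slice_total(cube, region="east"))                  # all sales in the east
print(slice_total(cube, product="phone", quarter="Q2"))  # one product, one quarter
```

Fixing one or more dimension values and summing over the rest is the "view from a different point of view" that the cube is meant to make cheap.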
OLAP CUBE
TWO TYPES OF DATABASE ACTIVITY
 OLTP
◦ (Online-Transaction Processing)
 OLAP
◦ (Online-Analytical Processing)
OLTP vs. OLAP
 On-Line Transaction Processing (OLTP):
– technology used to perform updates on operational or transactional systems (e.g., point-of-sale systems)
 On-Line Analytical Processing (OLAP):
– technology used to perform complex analysis of the data in a data warehouse
OLAP is a category of software technology that enables analysts,
managers, and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of
information that has been transformed from raw data to reflect
the dimensionality of the enterprise as understood by the user.
[source: OLAP Council: www.olapcouncil.org]
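The contrast between the two kinds of activity can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the sales table and its contents are hypothetical, and in practice the analytical query would run against a data warehouse rather than the operational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, item TEXT, amount REAL)"
)

# OLTP-style work: short transactions that insert or update individual rows,
# as a point-of-sale system would.
with conn:
    conn.execute("INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)", ("north", "tea", 4.0))
    conn.execute("INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)", ("north", "coffee", 6.0))
    conn.execute("INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)", ("south", "tea", 5.0))
with conn:
    conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (4.5, 1))

# OLAP-style work: a read-only query that aggregates many rows by a dimension.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```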
OLTP vs. OLAP
TYPES OF OLAP
 Relational OLAP (ROLAP):
 Relational or specialized relational DBMS to store and manage warehouse data
 OLAP middleware to support the missing pieces
 Optimization for each DBMS backend
 Aggregation navigation logic
 Additional tools and services
 Example: MicroStrategy, MetaCube (Informix)
 Extended RDBMS that maps operations on multidimensional data to standard relational operations
 Multidimensional OLAP (MOLAP):
 Array-based storage structures
 Direct access to array data structures
 Operations implemented directly on multidimensional data
 Example: Essbase (Arbor)
 Hybrid OLAP (HOLAP):
 Aggregated totals are stored in a multidimensional database, while the detail data is stored in a relational database. This balances the data efficiency of the ROLAP model against the performance of the MOLAP model.
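The "array-based storage" and "direct access" points for MOLAP can be made concrete with a toy example. This is a sketch under stated assumptions: the two dimensions, their members, and the function names are hypothetical, and real MOLAP engines use far more compact (often sparse) array layouts.

```python
# Hypothetical dimension members; each member maps to an array index.
PRODUCTS = {"laptop": 0, "phone": 1}
REGIONS = {"east": 0, "west": 1}

# Dense 2-D array: cells[product_index][region_index] holds the sales measure.
cells = [[0] * len(REGIONS) for _ in PRODUCTS]


def load(product, region, sales):
    """Add a fact to the array cell addressed by its dimension members."""
    cells[PRODUCTS[product]][REGIONS[region]] += sales


def lookup(product, region):
    """Direct access: two index lookups, no scan over fact rows."""
    return cells[PRODUCTS[product]][REGIONS[region]]


def total_for_product(product):
    """Aggregate along the region axis for one product."""
    return sum(cells[PRODUCTS[product]])


for fact in [("laptop", "east", 100), ("laptop", "west", 150), ("phone", "east", 200)]:
    load(*fact)

print(lookup("laptop", "west"))
print(total_for_product("laptop"))
```

A ROLAP system would answer the same questions by running a GROUP BY over relational tables, which is more flexible but requires scanning or indexing the fact rows.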
ROLAP vs. MOLAP
Characteristics   ROLAP                                  MOLAP
Schema            Uses a star schema; additional         Uses data cubes; additional
                  dimensions can be added dynamically    dimensions require re-creating the cube
Database size     Medium to large                        Small to medium
Architecture      Client/server                          Client/server
Access            Supports ad-hoc requests               Limited to pre-defined dimensions
Resources         High                                   Very high
Flexibility       High                                   Low
Scalability       High                                   Low
Speed             Good with small data sets;             Faster for small to medium data sets;
                  average for medium to large data sets  average for large data sets
BENEFITS OF OLAP
 One main benefit of OLAP is consistency of information and calculations.
 "What if" scenarios are among the most popular uses of OLAP software, and multidimensional processing makes them far more practical.
 OLAP allows a manager to pull data from an OLAP database in broad or specific terms.
 OLAP creates a single platform for all information and business needs: planning, budgeting, forecasting, reporting, and analysis.
BENEFITS OF OLAP (contd.)
 Marketing and sales analysis
 Consumer goods industries
 Financial services industry (insurance, banks etc)
 Database Marketing
Apache Kylin – What?
● Open source
● Distributed Analytics Engine
● Provides SQL interface
● Multi-dimensional analysis (OLAP) on Hadoop
● Faster and more user-responsive than relational online
analytical processing (ROLAP)
The Fundamental Idea
● The idea behind Kylin is not brand new.
● The underlying techniques are to store pre-calculated results to serve analysis queries, to generate the cuboids at each level from all possible combinations of dimensions, and to calculate all metrics at each level.
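Generating a cuboid for every combination of dimensions can be sketched as follows. This is a simplified illustration, not Kylin's build algorithm: the fact rows, dimension names, and the single SUM measure are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "country", "category")

# Hypothetical fact rows: dimension values plus one measure (sales).
FACTS = [
    {"year": 2015, "country": "US", "category": "toys", "sales": 10},
    {"year": 2015, "country": "US", "category": "books", "sales": 20},
    {"year": 2016, "country": "DE", "category": "toys", "sales": 30},
]


def build_cuboids(facts, dimensions):
    """Pre-calculate SUM(sales) for every subset of the dimensions.

    With n dimensions there are 2**n cuboids, from the full n-dimensional
    base cuboid down to the 0-dimensional grand total.
    """
    cuboids = {}
    for size in range(len(dimensions), -1, -1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in facts:
                totals[tuple(row[d] for d in dims)] += row["sales"]
            cuboids[dims] = dict(totals)
    return cuboids


cuboids = build_cuboids(FACTS, DIMENSIONS)
print(len(cuboids))        # number of cuboids
print(cuboids[("year",)])  # sales by year
print(cuboids[()])         # grand total
```

The number of cuboids doubles with each added dimension, which is why the cost of pre-calculation grows quickly as dimensions are added.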
From Relational to Key-Value
● Pre-calculation avoids large table scans and long delays in getting an answer.
● It makes sense to calculate those values once and store them for later use.
● This process generates all of the dimension combinations and their measure values.
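One way to picture the move from relational rows to key-value pairs is sketched below. The key format (a cuboid bitmask followed by the kept dimension values) is a hypothetical simplification chosen for readability; Kylin's actual HBase row key layout and value encodings are more involved.

```python
DIMENSIONS = ("year", "country")

# Hypothetical pre-aggregated results as relational rows:
# (year, country, total_sales); None means "aggregated over this dimension".
AGGREGATES = [
    (2015, "US", 30),
    (2016, "DE", 30),
    (2015, None, 30),
    (2016, None, 30),
    (None, None, 60),
]


def make_key(values):
    """Build a key from a cuboid id and the values of the dimensions it keeps.

    The cuboid id is a bitmask with one bit per dimension, set when that
    dimension is present in the aggregate.
    """
    cuboid_id = 0
    kept = []
    for value in values:
        cuboid_id <<= 1
        if value is not None:
            cuboid_id |= 1
            kept.append(str(value))
    return "|".join([format(cuboid_id, "02b")] + kept)


# The key-value store: one entry per pre-aggregated row.
store = {make_key(row[:-1]): row[-1] for row in AGGREGATES}


def query(year=None, country=None):
    """Answer an aggregate query with a single key lookup, no table scan."""
    return store.get(make_key((year, country)))


print(sorted(store))
print(query(year=2015))
print(query())
```

A query for a combination that was never pre-calculated (here, country alone) returns nothing, which mirrors the trade-off of this approach: only the stored combinations can be answered by lookup.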
Github Page
How It Works
● Reads data from Hive (stored on HDFS)
● Runs MapReduce jobs to pre-calculate the cube
● Stores cube data in HBase
● Uses ZooKeeper for job coordination
Apache Foundation Blog December 2015
● Apache Kylin is the best OLAP engine on Big Data so far.
● While other OLAP engines struggle with the data volume, Kylin enables query responses in milliseconds.
● Kylin is starting to be used as a storage and analytics engine for near-real-time data streaming.
Advantages
● Kylin integrates well with BI tools such as Tableau and Excel.
● Kylin supports MOLAP cubes and performs very well for complex queries on data sets with billions of rows.
Limitations
● Real Time Support hasn’t yet been built.
● Kylin only supports the star schema. You are limited to a
single fact table for each cube.
Key Features of Druid
● Open source.
● Distributed architecture.
● Real-time ingestion.
● Column-oriented for speed.
● Fast filtering.
● Operational simplicity.
● Support for OLAP queries.
Druid Architecture
Types of Nodes:
Historical Nodes
➢ Backbone of the Druid cluster.
➢ Download segments and serve queries over them.
Broker Nodes
➢ Clients send queries to a broker node to get data from Druid.
➢ Scatter queries to the nodes that hold the relevant segments (brokers know where segments are located).
➢ Gather and merge the results.
Coordinator Nodes
➢ Manage segments on historical nodes.
➢ Load new segments, drop old segments, and move segments to balance load.
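The scatter/gather role of the broker can be sketched with a toy model. The classes below are hypothetical and only mimic the division of labor described above; they are not Druid's API, and the coordinator's segment management is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A chunk of data covering the half-open hour range [start, end)."""
    start: int
    end: int
    rows: list  # (hour, page, clicks) tuples


@dataclass
class HistoricalNode:
    """Holds downloaded segments and answers queries over them."""
    segments: list = field(default_factory=list)

    def query(self, start, end):
        # Partial result: clicks per page for rows inside the interval.
        partial = {}
        for seg in self.segments:
            for hour, page, clicks in seg.rows:
                if start <= hour < end:
                    partial[page] = partial.get(page, 0) + clicks
        return partial


class Broker:
    """Knows which nodes hold which segments; scatters queries, merges results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def query(self, start, end):
        merged = {}
        for node in self.nodes:
            # Scatter only to nodes with a segment overlapping the interval.
            if any(s.start < end and start < s.end for s in node.segments):
                for page, clicks in node.query(start, end).items():
                    merged[page] = merged.get(page, 0) + clicks
        return merged


node_a = HistoricalNode([Segment(0, 12, [(1, "home", 5), (9, "docs", 2)])])
node_b = HistoricalNode([Segment(12, 24, [(13, "home", 7)])])
broker = Broker([node_a, node_b])

print(broker.query(0, 24))   # touches both nodes
print(broker.query(12, 24))  # touches only node_b
```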
Ingestion Methods
● Streaming (real-time):
– Used when the dataset originates in a streaming system such as Kafka.
– Kafka lets you process streams of records as they occur.
– The Kafka cluster stores streams of records in categories called topics.
– Each record consists of a key, a value, and a timestamp.
● File-based (batch):
– Load data from HDFS, local files, etc. in batches.
Segments
● Druid stores its index in segment files, which are partitioned by time (timestamp).
● Data structure of a segment file
– Columnar: the data for each column is laid out in separate data structures.
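Time partitioning and columnar layout can be combined in a short sketch. The events, the column names, and the hourly granularity are assumptions made for illustration; this is not Druid's segment file format.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical events: (ISO-8601 timestamp, page, clicks).
EVENTS = [
    ("2017-01-01T00:15:00+00:00", "home", 3),
    ("2017-01-01T00:45:00+00:00", "docs", 1),
    ("2017-01-01T01:05:00+00:00", "home", 4),
]


def segment_start(timestamp):
    """Truncate a timestamp to the hour; this identifies the segment's time chunk."""
    moment = datetime.fromisoformat(timestamp).astimezone(timezone.utc)
    return moment.replace(minute=0, second=0, microsecond=0)


def build_segments(events):
    """Group events by hour, storing each column in its own list."""
    segments = defaultdict(lambda: {"timestamp": [], "page": [], "clicks": []})
    for timestamp, page, clicks in events:
        columns = segments[segment_start(timestamp)]
        columns["timestamp"].append(timestamp)
        columns["page"].append(page)
        columns["clicks"].append(clicks)
    return dict(segments)


segments = build_segments(EVENTS)
for start, columns in sorted(segments.items()):
    print(start.isoformat(), columns["page"], sum(columns["clicks"]))
```

Because each segment covers one time chunk, a query restricted to a time range only needs to open the segments whose chunks overlap it, and an aggregation only needs to read the columns it uses.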
● A segment consists of the timestamp column, dimension columns, and metric columns.
● The timestamp and metric columns are simple: each is an array of integer or floating-point values. Values in metric columns are pulled out to perform aggregations.
● Dimension columns are different because they support filter and group-by operations. Each one requires:
➢ A dictionary that encodes column values
{
"Justin Bieber": 0,
"Ke$ha": 1
}
➢ Column data
[0, 0, 1, 1]
➢ Bitmaps, one for each unique value of the column
value="Justin Bieber": [1,1,0,0]
value="Ke$ha": [0,0,1,1]
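The three structures for a dimension column can be derived from raw values as in the sketch below. It is a simplification: ids are assigned in order of first appearance, plain Python lists stand in for the compressed bitmaps a real system would use, and the `plays` metric column is hypothetical.

```python
def encode_dimension(values):
    """Return (dictionary, column data, bitmaps) for one dimension column."""
    dictionary = {}
    column = []
    for value in values:
        # Assign ids in order of first appearance.
        code = dictionary.setdefault(value, len(dictionary))
        column.append(code)
    bitmaps = {
        value: [1 if code == target else 0 for code in column]
        for value, target in dictionary.items()
    }
    return dictionary, column, bitmaps


def bitmap_or(left, right):
    """Rows matching either value (filter: a OR b)."""
    return [l | r for l, r in zip(left, right)]


def filtered_sum(bitmap, metric):
    """Aggregate a metric column over the rows selected by a bitmap."""
    return sum(m for bit, m in zip(bitmap, metric) if bit)


artists = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
plays = [10, 20, 30, 40]  # hypothetical metric column

dictionary, column, bitmaps = encode_dimension(artists)
print(dictionary)
print(column)
print(bitmaps["Ke$ha"])
print(filtered_sum(bitmaps["Ke$ha"], plays))
```

A filter on a dimension value becomes a bitmap lookup, and combining filters becomes a bitwise operation on bitmaps, so the metric columns are only read for rows that already match.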
Druid vs Apache Kylin
                         DRUID                          APACHE KYLIN
Query speed              Very fast                      Fast
Type of analysis         Real-time analysis             Focuses on OLAP cases; real-time
                                                        analysis under development
SQL support              Absent                         Present
Fault tolerance          All nodes                      Needs to be set up
BI tools integration     Under development              Present (Tableau or Excel)
Integration with Kafka   Present                        Absent
Complex queries          Bad for big data sets          Good performance
Storage type             Bitmap index                   OLAP cube
Underlying technology    Own computation and            Hadoop for cube build,
                         storage cluster                HBase for storage
Miscellaneous Points to Consider…
 Druid has limited support for table joins.
 Apache Kylin supports the star schema.
 Modern corporations increasingly look for near-real-time analytics and insights to make actionable decisions.
 Druid is working on integration with BI tools through Apache Hive at Hortonworks.
(https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/)
 Earlier versions of Druid were released under the GPL v2 license. The latest version of Druid is under the Apache License v2, as is Apache Kylin.
 Druid's GitHub project has 181 contributors, whereas Apache Kylin's has 60.
References - OLAP & OLTP
● http://en.wikipedia.org/wiki/Online_analytical_processing
● http://www.dmreview.com/issues/19971101/964-l.html
● http://en.wikipedia.org/wiki/Extract,_transform,_load
● http://www.olapreport.com/Applications.html
References-Apache Kylin
 https://mail-archives.apache.org/mod_mbox/kylin-dev/201503.mbox/%3CCAKmQrOY0fjZLUU0MGo5aajZ2uLb3T0qJknHQd+Wv1oxd5PKixQ@mail.gmail.com%3E
 https://dzone.com/articles/apache-kylin-for-olap-on-hadoop
 http://kylin.apache.org/docs16/
 https://github.com/apache/kylin
 https://resources.zaloni.com/blog/apache-kylin-for-olap-on-hadoop
 https://en.wikipedia.org/wiki/Apache_Kylin
 http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/
References-Druid
 http://druid.io/docs/latest/design/
 http://druid.io/docs/latest/tutorials/ingestion.html
 http://druid.io/docs/latest/design/segments.html
 https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
References-Druid vs Apache Kylin
 https://www.slideshare.net/freepsw/olap-for-big-data-druid-vs-apache-kylin-vs-apache-lens
 http://markmail.org/message/mf6gfzdwfqwtbtv6#query:+page:1+mid:sp7ek7x5pawjlxb6+state:results
 https://ko.hortonworks.com/blog/apache-hive-druid-part-1-3/
 https://github.com/druid-io/druid
 https://github.com/apache/kylin
THANK YOU!!

Kylin and Druid Presentation
