Local Secondary Indexes in Apache Phoenix
Rajeshbabu Chintaguntla
PhoenixCon 2017
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Local Indexes Introduction
Local indexes design and data model
Local index writes and reads
Performance Results
Helpful Tips or recommendations
Secondary indexes in Phoenix
• The primary key columns of a Phoenix table form the HBase row key, which acts as the primary index, so filtering on primary key columns becomes a point lookup or range scan on the table.
• Filtering on a non-primary-key column turns the query into a full table scan, which consumes a lot of time and resources.
• Secondary indexes create alternative access paths that convert such queries into point lookups or range scans.
• Phoenix supports two kinds of indexes: GLOBAL and LOCAL.
• Phoenix supports functional indexes as well.
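The access-path idea above can be sketched in a few lines of Python. This is a toy model, not Phoenix code: the table is a dict keyed by primary key, the secondary index is a sorted list of (indexed value, primary key) pairs, and the names `rows`, `full_scan`, and `index_range_scan` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy table: primary key -> row (hypothetical data).
rows = {
    100: {"customer_id": 11},
    101: {"customer_id": 23},
    102: {"customer_id": 11},
    103: {"customer_id": 34},
}

def full_scan(value):
    """Without an index, every row is inspected."""
    return sorted(pk for pk, row in rows.items() if row["customer_id"] == value)

# Secondary index: sorted (indexed value, primary key) pairs.
index = sorted((row["customer_id"], pk) for pk, row in rows.items())

def index_range_scan(value):
    """With an index, only the matching key range is read."""
    lo = bisect_left(index, (value,))
    hi = bisect_right(index, (value, float("inf")))
    return [pk for _, pk in index[lo:hi]]

print(full_scan(11), index_range_scan(11))
```

Both functions return the same primary keys; the difference is that the second one touches only the slice of the index that matches.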
Local Secondary Indexes - Introduction
• A local secondary index is LOCAL in the sense that each REGION of a table is treated as a unit, and an index of that region's data is created and maintained within it.
• The local index data is stored and maintained in shadow column families in the same table.
• So the index always resides on the same server that serves the actual data.
• Faster index building.
• Syntax:
Local Secondary Index - Introduction
CREATE TABLE IF NOT EXISTS ORDERS(
    ORDER_ID BIGINT NOT NULL PRIMARY KEY,
    CUSTOMER_ID BIGINT NOT NULL,
    ITEM_ID INTEGER NOT NULL,
    DATE DATE NOT NULL);
CREATE LOCAL INDEX IDX ON ORDERS(DATE)

Base table data (ORDER ID is the primary key):

Region [100, 104)
Order ID  Customer ID  Item ID  Date
100       11           1111     06/10/2017
101       23           1231     06/01/2017
102       11           1332     05/31/2017
103       34           3221     06/01/2017

Region [104, 107)
Order ID  Customer ID  Item ID  Date
104       55           1343     05/28/2017
105       11           2312     06/01/2017
106       29           1234     05/15/2017

Index row keys:

Index of Region [100, 104)
Region start key  IDX ID  Date        Order ID
100               1       05/31/2017  102
100               1       06/01/2017  101
100               1       06/01/2017  103
100               1       06/10/2017  100

Index of Region [104, 107)
Region start key  IDX ID  Date        Order ID
104               1       05/15/2017  106
104               1       05/28/2017  104
104               1       06/01/2017  105
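The ORDERS example can be reproduced with a short Python sketch. It is only a model of the idea: index row keys are plain tuples of (region start key, index id, DATE, ORDER_ID) sorted in tuple order, and the helper names `region_start_for` and `local_index_rows` are hypothetical.

```python
from datetime import date

# Rows of the ORDERS example: (order_id, customer_id, item_id, order_date).
orders = [
    (100, 11, 1111, date(2017, 6, 10)),
    (101, 23, 1231, date(2017, 6, 1)),
    (102, 11, 1332, date(2017, 5, 31)),
    (103, 34, 3221, date(2017, 6, 1)),
    (104, 55, 1343, date(2017, 5, 28)),
    (105, 11, 2312, date(2017, 6, 1)),
    (106, 29, 1234, date(2017, 5, 15)),
]
region_start_keys = [100, 104]  # Region [100, 104) and Region [104, 107)
INDEX_ID = 1                    # id of IDX, as shown in the example

def region_start_for(order_id):
    """Start key of the region that holds this ORDER_ID."""
    return max(k for k in region_start_keys if k <= order_id)

def local_index_rows(rows):
    """Index row key = (region start key, index id, DATE, ORDER_ID)."""
    return sorted(
        (region_start_for(oid), INDEX_ID, d, oid) for oid, _, _, d in rows
    )

for key in local_index_rows(orders):
    print(key)
```

Because the region start key is the leading field, the entries of each region sort together, and within a region they are ordered by DATE.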
Data Model
Shadow column families store the index data.

CREATE TABLE IF NOT EXISTS WEB_STAT (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN));

1) CREATE LOCAL INDEX IDX ON WEB_STAT(DATE)
2) CREATE LOCAL INDEX IDX2 ON WEB_STAT(STATS.ACTIVE_VISITOR) INCLUDE(DATE)
3) CREATE LOCAL INDEX IDX3 ON WEB_STAT(DATE) INCLUDE(STATS.ACTIVE_VISITOR)

[Diagram: the regions of the table (Region1, Region2) with the data column families 0 and STATS and the shadow column families L#0 and L#STATS.]
Data Model
Local index row key format:
REGION START KEY | SALT NUMBER (empty for non-salted table) | INDEX ID | TENANT_ID (empty for non-multi-tenant table) | INDEXED COLUMN VALUE[S] | PRIMARY KEY COLUMN VALUE[S]
• REGION START KEY: Start key of the data region. For the first region, it is an empty byte array with the same length as the region end key. This prefix keeps the index organized region by region.
• SALT NUMBER: A one-byte salt bucket number calculated for the index row key.
• INDEX ID: A short number that identifies the local index. It keeps the data of each index stored together.
• TENANT_ID: Tenant column value of the row key. It is empty if the table is not multi-tenant.
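The field order of the row key can be illustrated with a small Python sketch. It is deliberately simplified: the function name `local_index_row_key`, the zero-byte separator between values, and the two-byte big-endian index id are assumptions of this sketch and do not reproduce Phoenix's actual byte encoding.

```python
import struct

def local_index_row_key(region_start_key: bytes, index_id: int,
                        indexed_values: list, pk_values: list,
                        salt: bytes = b"", tenant_id: bytes = b"") -> bytes:
    """Concatenate the fields in the order of the row key format.

    Simplified: values are joined with a zero-byte separator, which is an
    assumption of this sketch and not Phoenix's actual type encoding.
    """
    sep = b"\x00"
    # Prefix: region start key, optional salt byte, index id, optional tenant id.
    prefix = [region_start_key, salt, struct.pack(">h", index_id), tenant_id]
    # Body: indexed column values followed by primary key column values.
    body = sep.join(indexed_values + pk_values)
    return b"".join(prefix) + body

key = local_index_row_key(b"100", 1, [b"2017-05-31"], [b"102"])
print(key)
```

With the region start key and index id in front, all entries of one index within one region share a common prefix and therefore sort next to each other.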
Write path
[Diagram: CLIENT, Region Server, and a Region with a data cf and an index cf, each with its own MemStore, plus the WAL.]
1. Write request
2. Batch call
3. Prepare index updates
4. Merge data and index updates
5. Write to MemStores
6. Write to WAL
Local index updates are 100% ATOMIC and CONSISTENT with data updates.
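The write steps can be modeled with a toy Python class. This is a sketch under stated assumptions, not the Phoenix or HBase implementation: `Region`, `upsert`, the dict-based stores, and the single lock are all hypothetical stand-ins that only show why data and index updates in the same region can be applied as one batch.

```python
import threading

class Region:
    """Toy region with one store per column family and a shared WAL."""

    def __init__(self):
        self.data_cf = {}    # data row key -> row
        self.index_cf = {}   # index row key -> data row key
        self.wal = []        # one entry per batch
        self._lock = threading.Lock()

    def upsert(self, row_key, row, indexed_col):
        # Prepare the index update for this data row (old index entries of
        # an overwritten row are not cleaned up in this sketch).
        index_update = ((row[indexed_col], row_key), row_key)
        data_update = (row_key, row)
        # Merge both updates into one batch and apply it under one lock,
        # so readers never see the data without its index entry.
        with self._lock:
            self.data_cf[data_update[0]] = data_update[1]
            self.index_cf[index_update[0]] = index_update[1]
            self.wal.append((data_update, index_update))

region = Region()
region.upsert(102, {"date": "2017-05-31"}, "date")
print(region.index_cf)
```

Since both column families live in the same region, no call to another server is needed to keep the index in step with the data.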
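Read Path
SELECT COUNT(*) FROM T WHERE INDEXED_COL = 'findme'
[Diagram: the client queries the regions on two region servers: Region ['', F), Region [F, L), Region [L, R), and Region [R, ''). Each region holds the column families 0 and L#0. The diagram is labeled with the values 2, 1, 0, and 5.]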
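A minimal Python sketch of this read path, assuming the values 2, 1, 0, and 5 in the diagram are the per-region counts that the client adds up. The region contents, `count_in_region`, and `count_all` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy regions: each maps indexed value -> list of data row keys.
# Region names follow the slide; the contents are made up so that the
# per-region counts come out as 2, 1, 0, and 5.
regions = {
    "['',F)": {"findme": ["a1", "c7"]},
    "[F,L)": {"findme": ["g3"]},
    "[L,R)": {"other": ["m9"]},
    "[R,'')": {"findme": ["r1", "s2", "t3", "u4", "z5"]},
}

def count_in_region(local_index, value):
    """Count matches using only this region's local index."""
    return len(local_index.get(value, []))

def count_all(value):
    """Client side: query every region in parallel and add up the counts."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda idx: count_in_region(idx, value), regions.values())
        return sum(partial)

print(count_all("findme"))
```

The sketch highlights the trade-off of a local index: the indexed value does not tell the client which region holds the matches, so every region has to be asked.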
Read Path
SELECT INDEX_COL, NON_INDEX_COL FROM T WHERE INDEX_COL = 'findme'
Joining back missing columns from the data table
[Diagram: CLIENT and a Region with an index cf and a data cf, each with its own MemStore.]
1. SCAN, L#0, FILTER
2. Apply filter on index col
3. Get non-index cols on matching rows
4. Merge with index cols
5. Return combined results to client
6. Results
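The join-back steps can be sketched in Python as follows. This is a toy model with hypothetical data and names (`data_cf`, `index_cf`, `scan_with_join_back`); it only mirrors the numbered steps and is not how the region server code is written.

```python
# Toy region: data cf keyed by data row key, index cf keyed by
# (indexed value, data row key). All names and rows are hypothetical.
data_cf = {
    "r1": {"INDEX_COL": "findme", "NON_INDEX_COL": "x"},
    "r2": {"INDEX_COL": "other", "NON_INDEX_COL": "y"},
    "r3": {"INDEX_COL": "findme", "NON_INDEX_COL": "z"},
}
index_cf = sorted((row["INDEX_COL"], key) for key, row in data_cf.items())

def scan_with_join_back(value, missing_cols):
    results = []
    for indexed_value, data_key in index_cf:      # 1. scan the index cf
        if indexed_value != value:                # 2. apply filter on index col
            continue
        data_row = data_cf[data_key]              # 3. get non-index cols
        merged = {"INDEX_COL": indexed_value}     # 4. merge with index cols
        merged.update({c: data_row[c] for c in missing_cols})
        results.append(merged)
    return results                                # 5. return combined results

print(scan_with_join_back("findme", ["NON_INDEX_COL"]))
```

Step 3 costs one lookup per matching row, and both lookups stay inside the same region because the index and the data share it.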
Region Splits and Merges
• Since the index is stored in the same table as the data, HBase takes care of splits and merges automatically.
• A special mechanism separates the index HFiles into the child regions after a split: it scans each key value, derives the data row key from it, and writes the key value to the corresponding child region.
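A rough Python sketch of the split idea, using the tuple row keys (region start key, index id, indexed value, data row key) as a stand-in for real key values. The function `split_index_entries` is hypothetical, and re-prefixing the upper child's entries with the new start key is an assumption of this sketch.

```python
def split_index_entries(index_entries, split_key):
    """Route each index entry to a child region by its data row key.

    index_entries: list of (index_row_key, data_row_key) pairs, where the
    index row key is (region start key, index id, indexed value, data key).
    """
    lower, upper = [], []
    for index_key, data_key in index_entries:
        _, index_id, indexed_value, _ = index_key
        if data_key < split_key:
            lower.append(index_key)
        else:
            # Re-prefix with the new child's start key (assumption of this sketch).
            upper.append((split_key, index_id, indexed_value, data_key))
    return lower, upper

# Index entries of a region [100, 104), split at data row key 102.
entries = [
    ((100, 1, "2017-05-31", 102), 102),
    ((100, 1, "2017-06-01", 101), 101),
    ((100, 1, "2017-06-01", 103), 103),
    ((100, 1, "2017-06-10", 100), 100),
]
lower, upper = split_index_entries(entries, 102)
print(lower)
print(upper)
```

The index row key alone cannot be compared with the split point, because it is ordered by indexed value; that is why the data row key has to be extracted from each entry.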
Performance Results
• 4-node cluster
• Tested with 5 local indexes on a base table of 25 columns and 10 regions.
• Ingested 50M rows.
• 3x faster upserts compared to global indexes
• 5x less network RX/TX utilization during writes compared to global indexes
• Similar read performance compared to global indexes for queries such as aggregations, group by, and limit.
Performance results
[Chart: Write performance]
[Charts on three slides: Network Tx/Rx during write]
Helpful Tips
• Mutable vs. immutable rows table?
– Writes to local indexes are much faster on a table with immutable rows than on a mutable one. So if rows are written once and never updated, it is better to create the table with the IMMUTABLE_ROWS property.
• Online vs. offline index population?
– When the table has pre-existing data, the index population time varies with the data size.
– Usually the index is populated on the server by reading the data table and writing the index to the same table, which normally works very fast. But if the data size is too big, it is better to use ASYNC population with IndexTool.
• Covered index vs. non-covered index?
– When a query accesses columns that are not in the index, Phoenix joins the missing columns from the data table itself by using get calls. If the number of matching rows is high, it is better to create a covered index to avoid the get calls.
Thank You
Q & A?
rajeshbabu@apache.org
@rajeshhcu32

Local Secondary Indexes in Apache Phoenix

  • 1. Local Secondary Indexes in Apache Phoenix Rajeshbabu Chintaguntla PhoenixCon 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Agenda Local Indexes Introduction Local indexes design and data model Local index writes and reads Performance Results Helpful Tips or recommendations
  • 3. 3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Secondary indexes in Phoenix  Primary Key columns in a phoenix table forms HBase row key which acts as a primary index so filtering by primary key columns become point or range scans to the table.  Filtering on non primary key column converts query into full table scans and consume lot time and resources.  With secondary indexes, we can create alternative access paths to convert queries into point lookups or range scans.  Phoenix supports two kinds of indexes GLOBAL and LOCAL.  Phoenix supports Functional indexes as well.
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Local Secondary Indexes - Introduction  Local secondary index is LOCAL in the sense that a REGION in a table is considered as a unit and create and maintain index of it’s data.  The local index data is stored and maintained in the shadow column family(ies) in the same table.  So the index is 100% co-reside in the same server serving the actual data.  Faster index building.  Syntax:
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Local Secondary Index - Introduction Order Id Customer ID Item ID Date 100 11 1111 06/10/2017 101 23 1231 06/01/2017 102 11 1332 05/31/2017 103 34 3221 06/01/2017 Region[100 ,104) Region[104 ,107) REGION START KEY IDX ID DATE Order ID 100 1 05/31/2017 102 100 1 06/01/2017 101 100 1 06/01/2017 103 100 1 06/10/2017 100 104 55 1343 05/28/2017 105 11 2312 06/01/2017 106 29 1234 05/15/2017 104 1 05/15/2017 106 104 1 05/28/2017 104 104 1 06/01/2017 105 CREATE TABLE IF NOT EXISTS ORDERS( ORDER_ID LONG NOT NULL PRIMARY KEY, CUSTOMER_ID LONG NOT NULL, ITEM_ID INTEGER NOT NULL, DATE DATE NOT NULL); CREATE LOCAL INDEX IDX ON ORDERS(DATE) Index of Region[100, 104) Index of Region[104,107) BASE TABLE DATA – ORDER ID IS PRIMARY KEY INDEX ROW KEY
  • 6. Data Model: shadow column families to store the index data
      CREATE TABLE IF NOT EXISTS WEB_STAT (
          HOST CHAR(2) NOT NULL,
          DOMAIN VARCHAR NOT NULL,
          FEATURE VARCHAR NOT NULL,
          DATE DATE NOT NULL,
          STATS.ACTIVE_VISITOR INTEGER
          CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN));
      1) CREATE LOCAL INDEX IDX ON WEB_STAT(DATE)
      2) CREATE LOCAL INDEX IDX2 ON WEB_STAT(STATS.ACTIVE_VISITOR) INCLUDE(DATE)
      3) CREATE LOCAL INDEX IDX3 ON WEB_STAT(DATE) INCLUDE(STATS.ACTIVE_VISITOR)
      [Diagram: for each statement, the column families of Region1 and Region2 of the table. The data column families are 0 and STATS; the shadow column families that hold the index data are L#0 and L#STATS.]
  • 7. Data Model: local index row key format
      REGION START KEY | SALT NUMBER (empty for a non-salted table) | INDEX ID | TENANT_ID (empty for a non-multi-tenant table) | INDEXED COLUMN VALUE[S] | PRIMARY KEY COLUMN VALUE[S]
      ▪ REGION START KEY: the start key of the data region. For the first region, it is an empty byte array with the length of the region end key. This allows data to be indexed region by region.
      ▪ SALT NUMBER: a byte value representing the salt bucket number calculated for the index row key.
      ▪ INDEX ID: a short number that identifies the local index. It keeps the data of each index stored together.
      ▪ TENANT_ID: the tenant column value of the row key. It is empty if the table is not multi-tenant.
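To make the field order concrete, here is a rough sketch of how such a key could be assembled from bytes. It is illustrative only: the 2-byte big-endian index id, the zero-byte separators, and reading "empty byte array" as a zero-filled one are assumptions of this sketch, not Phoenix's exact encoding, and both function names are hypothetical.

```python
import struct


def first_region_prefix(region_end_key):
    """First region: a zero-filled byte array as long as the region end key (assumed reading)."""
    return bytes(len(region_end_key))


def local_index_row_key(region_start, index_id, indexed_values, pk_values,
                        salt=None, tenant_id=b""):
    """Concatenate the row key fields in the order listed on the slide."""
    parts = [region_start]
    if salt is not None:                        # empty for a non-salted table
        parts.append(bytes([salt]))
    parts.append(struct.pack(">h", index_id))   # short index id (assumed 2 bytes)
    parts.append(tenant_id)                     # empty for a non-multi-tenant table
    parts.append(b"\x00".join(indexed_values))  # indexed column value(s)
    parts.append(b"\x00")                       # separator (assumption)
    parts.append(b"\x00".join(pk_values))       # primary key column value(s)
    return b"".join(parts)


keys = [
    local_index_row_key(b"104", 1, [b"2017-06-01"], [b"105"]),
    local_index_row_key(b"100", 2, [b"11"], [b"100"]),
    local_index_row_key(b"100", 1, [b"2017-06-10"], [b"100"]),
    local_index_row_key(b"100", 1, [b"2017-05-31"], [b"102"]),
]
sorted_keys = sorted(keys)
for key in sorted_keys:
    print(key)
```

Sorting the byte strings groups the keys by region first and by index id second, which is the grouping the REGION START KEY and INDEX ID fields are described as providing.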
  • 8. Write path
      [Diagram: a CLIENT writes to a Region on a Region Server. The region has a data column family and an index column family, each with its own MemStore, plus the WAL.]
      1. Write request
      2. Batch call
      3. Prepare index updates
      4. Merge data and index updates
      5. Write to MemStores
      6. Write to WAL
      Local index updates are 100% ATOMIC and CONSISTENT with the data updates.
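The idea behind this write path can be sketched as a toy model. This is a simplified assumption-laden illustration, not HBase or Phoenix code: the `Region` class, its dictionaries standing in for MemStores, and the list standing in for the WAL are all hypothetical. It also assumes rows are written once, so no stale index entries need cleaning up.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    start_key: int
    data_cf: dict = field(default_factory=dict)   # MemStore of the data column family
    index_cf: dict = field(default_factory=dict)  # MemStore of the shadow (index) column family
    wal: list = field(default_factory=list)       # write-ahead log entries

    def prepare_index_updates(self, batch, indexed_col, index_id=1):
        """Derive one index entry per data row in the batch."""
        return {
            (self.start_key, index_id, row[indexed_col], pk): b""
            for pk, row in batch.items()
        }

    def batch_write(self, batch, indexed_col):
        """Apply data and index updates together, as a single unit."""
        index_updates = self.prepare_index_updates(batch, indexed_col)
        merged = {"data": batch, "index": index_updates}  # merge both kinds of update
        self.data_cf.update(merged["data"])               # write to the MemStores
        self.index_cf.update(merged["index"])
        self.wal.append(merged)                           # one WAL entry covers both


region = Region(start_key=100)
region.batch_write({100: {"date": "2017-06-10"}, 102: {"date": "2017-05-31"}}, "date")
print(sorted(region.index_cf))
print(len(region.wal))
```

Since the data rows and their index entries live in the same region, they travel in one batch and one log entry, so no cross-server coordination is needed to keep them consistent.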
  • 9. Read Path
      SELECT COUNT(*) FROM T WHERE INDEXED_COL='findme'
      [Diagram: the Client sends the query to the regions ['',F), [F,L), [L,R) and [R,''), hosted on two region servers. Each region holds the data column family 0 and the shadow column family L#0. The diagram also shows the numbers 2, 1, 0 and 5.]
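A rough way to picture this read path: every region is asked, each answers from its own local index, and the client combines the partial results. The sketch below assumes the numbers 2, 1, 0 and 5 on the slide are per-region partial counts, in region order. The row keys and function names are invented for illustration.

```python
# Each region keeps its own local index: indexed value -> data row keys in that region.
# Row keys are made up so that the per-region matches are 2, 1, 0 and 5.
REGION_INDEXES = {
    ("", "F"): {"findme": ["A1", "C7"], "other": ["B2"]},
    ("F", "L"): {"findme": ["G3"]},
    ("L", "R"): {"other": ["M4", "N5"]},
    ("R", ""): {"findme": ["R1", "S2", "T3", "U4", "V5"]},
}


def count_in_region(local_index, value):
    """Server side: count matches using only this region's local index."""
    return len(local_index.get(value, []))


def count_matches(value):
    """Client side: ask every region, then add up the partial counts."""
    partials = [count_in_region(idx, value) for idx in REGION_INDEXES.values()]
    return partials, sum(partials)


partials, total = count_matches("findme")
print(partials, total)
```

The point of the model is that a local index query has to visit all regions, because the client cannot know in advance which regions hold matching index entries.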
  • 10. Read Path: joining back missing columns from the data table
      SELECT INDEX_COL, NON_INDEX_COL FROM T WHERE INDEX_COL='findme'
      [Diagram: a CLIENT and a Region with an index column family and a data column family, each with its own MemStore.]
      1. SCAN, L#0, FILTER
      2. Apply the filter on the index column
      3. Get the non-index columns for the matching rows
      4. Merge with the index columns
      5. Return the combined results to the client
      6. Results
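The join-back steps can be sketched as follows. This is a minimal model under stated assumptions, not Phoenix's server-side implementation: the index column family is modelled as a set of (indexed value, primary key) pairs, the data column family as a dictionary keyed by primary key, and `scan_with_join_back` is a hypothetical name.

```python
def scan_with_join_back(index_cf, data_cf, index_col, wanted, non_index_cols):
    """Scan the index entries of one region and fetch missing columns locally."""
    results = []
    for value, pk in sorted(index_cf):   # 1. scan the index column family
        if value != wanted:              # 2. apply the filter on the index column
            continue
        data_row = data_cf[pk]           # 3. get the non-index columns for the match
        merged = {index_col: value}
        merged.update({col: data_row[col] for col in non_index_cols})  # 4. merge
        results.append(merged)
    return results                       # 5. combined rows go back to the client


index_cf = {("findme", "r1"), ("other", "r2"), ("findme", "r3")}
data_cf = {
    "r1": {"NON_INDEX_COL": "a"},
    "r2": {"NON_INDEX_COL": "b"},
    "r3": {"NON_INDEX_COL": "c"},
}
rows = scan_with_join_back(index_cf, data_cf, "INDEX_COL", "findme", ["NON_INDEX_COL"])
print(rows)
```

Each matching index entry costs one extra lookup in the data column family of the same region, which is why the number of matching rows matters when choosing between a covered and a non-covered index.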
  • 11. Region Splits and Merges
      ▪ Since the indexes are stored in the same table, splits and merges are taken care of by HBase automatically.
      ▪ A special mechanism separates each HFile into the child regions after a split: it scans through each key value, finds the data row key in it, and writes the key value to the corresponding child region.
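The split handling described here can be illustrated with a toy model. It assumes index keys are tuples (region start, index id, indexed value, data row key) and that each child region's entries carry that child's own start key as prefix. The prefix rewrite and the name `split_local_index` are assumptions of this sketch, not confirmed details of the Phoenix implementation.

```python
def split_local_index(index_keys, parent_start, split_point):
    """Route each index entry to the child region that owns its data row key."""
    lower, upper = [], []
    for _region_start, index_id, value, pk in index_keys:
        if pk < split_point:
            # The data row stays in the lower child [parent_start, split_point).
            lower.append((parent_start, index_id, value, pk))
        else:
            # The data row moves to the upper child, which starts at split_point.
            upper.append((split_point, index_id, value, pk))
    return sorted(lower), sorted(upper)


# Index of a single parent region [100,107), sorted by date.
parent_index = [
    (100, 1, "2017-05-15", 106),
    (100, 1, "2017-05-28", 104),
    (100, 1, "2017-05-31", 102),
    (100, 1, "2017-06-01", 101),
    (100, 1, "2017-06-01", 103),
    (100, 1, "2017-06-01", 105),
    (100, 1, "2017-06-10", 100),
]
lower_child, upper_child = split_local_index(parent_index, 100, 104)
print(lower_child)
print(upper_child)
```

A plain split at a byte offset would not work for index entries, because they are sorted by indexed value rather than by data row key. That is why every key value has to be inspected individually.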
  • 12. Performance Results
      ▪ 4-node cluster.
      ▪ Tested with 5 local indexes on a base table of 25 columns with 10 regions.
      ▪ Ingested 50M rows.
      ▪ 3x faster upsert time compared to global indexes.
      ▪ 5x less network RX/TX utilization during writes compared to global indexes.
      ▪ Similar read performance compared to global indexes for queries such as aggregations, GROUP BY, LIMIT, etc.
  • 13. Performance results: Write performance
  • 14. Performance results: Network Tx/Rx during write
  • 15. Performance results: Network Tx/Rx during write
  • 16. Performance results: Network Tx/Rx during write
  • 17. Helpful Tips
      ▪ Mutable vs. immutable rows table?
          – Writes are much faster with local indexes on an immutable rows table than on a mutable one. If rows are written once and never updated, it is better to create the table with the IMMUTABLE_ROWS property.
      ▪ Online vs. offline index population?
          – When a table has pre-existing data, the index population time varies with the data size.
          – Index population usually happens on the server, by reading the data table and writing the index to the same table, and is normally very fast. If the data size is too big, it is better to use ASYNC population with IndexTool.
      ▪ Covered index vs. non-covered index?
          – When a query accesses columns that are not in the index, Phoenix joins the missing columns from the data table itself using get calls. If the number of matching rows is high, it is better to create a covered index to avoid the get calls.
  • 18. Thank You. Q & A? rajeshbabu@apache.org @rajeshhcu32