
Tuning Apache Phoenix/HBase


A presentation on how we tuned Apache Phoenix/HBase at TrueCar, covering data modeling, EC2 instance types, architecture, and cluster settings.

Published in: Software

Tuning Apache Phoenix/HBase

  1. Anil Gupta and Omkar Nalawade, 06/18/2018
  2. Assumptions: • Our audience has basic knowledge of HBase/Phoenix • Actual performance improvement varies with your workload • Due to time constraints, we cover only the most important tuning tips
  3. Agenda: • Data Architecture at TRUECar • Use Cases for Apache HBase/Phoenix • Performance Optimization Techniques:  Cluster Settings  Table Settings  Data Modeling  Instance Type
  4. Data Architecture at TRUECar
  5. [Diagram: Storage Cluster vs. Compute Cluster] Isolate the compute and storage clusters to: • Reduce interference between compute and storage jobs • Use different EC2 instance types for HBase and YARN • Gain better consistency and debugging capability
  6. Use Cases for Apache HBase/Phoenix • Data store for historical data • Data store for highly unstructured data (primarily HBase) • Data store for semi-structured data (Phoenix dynamic columns; see the sketch below) • In-memory cache for small datasets • We try to denormalize data to avoid joins in HBase/Phoenix
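     A minimal sketch of Phoenix dynamic columns for semi-structured data; the table and column names (EVENT_DATA, TEMP) are hypothetical. Dynamic columns are declared at write and read time rather than in the schema:

        -- Hypothetical table; only ID and PAYLOAD are in the schema
        CREATE TABLE EVENT_DATA (ID VARCHAR PRIMARY KEY, PAYLOAD VARCHAR);
        -- Write a dynamic column (TEMP) by declaring it in the UPSERT
        UPSERT INTO EVENT_DATA (ID, PAYLOAD, TEMP VARCHAR) VALUES ('e1', 'p1', '72F');
        -- Read it back by declaring it again in the SELECT
        SELECT ID, TEMP FROM EVENT_DATA (TEMP VARCHAR) WHERE ID = 'e1';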
  7. Cluster Settings • UPDATE_CACHE_FREQUENCY • Default value is “Always”: SYSTEM.CATALOG is queried on every instantiation of a Statement/PreparedStatement, causing hotspots in SYSTEM.CATALOG • “phoenix.default.update.cache.frequency”: 120000 • Can also be set per table (see the sketch below) • We saw a 5x performance improvement in some jobs
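     A minimal sketch of both ways to set the cache frequency; MY_TABLE is a hypothetical name. The cluster-wide default goes in the client-side hbase-site.xml, and the table-level property can be set in DDL:

        -- Cluster-wide default (client-side hbase-site.xml):
        --   phoenix.default.update.cache.frequency = 120000
        -- Per-table override, in milliseconds:
        ALTER TABLE MY_TABLE SET UPDATE_CACHE_FREQUENCY = 120000;
        -- Or at creation time:
        CREATE TABLE MY_TABLE (ID VARCHAR PRIMARY KEY, V1 VARCHAR)
            UPDATE_CACHE_FREQUENCY = 120000;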
  8. Table Settings • Pre-splitting the table • Pre-splitting the secondary index • Bloom Filter • Hints:  SMALL  NO_CACHE  IN_MEMORY
  9. Pre-split! Pre-split! Pre-split! • Without pre-splitting, Phoenix tables are seeded with a single region • Pre-splitting avoids hotspots when writing data to new tables • Leads to better distribution of table data across the cluster • Significant performance improvement (a few X) during the initial data load of a table (see the sketch below)
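     A minimal sketch of pre-splitting at creation time; the table name and split points are hypothetical and should match your row-key distribution:

        -- Pre-split into 4 regions at the given row-key boundaries
        CREATE TABLE SALES (
            VIN VARCHAR PRIMARY KEY,
            PRICE DECIMAL
        ) SPLIT ON ('4', '8', 'C');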
  10. Pre-splitting Global Secondary Index • Global secondary index data is stored in a separate Phoenix table • Without pre-splitting, the index table can lead to:  Hotspots in the index table  Slow writes to the primary table (even though it is pre-split)
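     Phoenix's CREATE INDEX statement accepts the same SPLIT ON clause; the index name and split points below are hypothetical and should match the index key's distribution:

        -- Pre-split the global secondary index on its own key (PRICE)
        CREATE INDEX SALES_PRICE_IDX ON SALES (PRICE)
            SPLIT ON (10000, 20000, 30000);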
  11. Bloom Filter • A lightweight in-memory structure that reduces the number of negative reads • It can be enabled per column family:  ROW (default): if the table doesn't have a lot of dynamic columns  ROWCOL: if the table has lots of dynamic columns • We saw a 2x read performance improvement on a table that had close to 40,000 dynamic columns
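     A minimal sketch of enabling a ROWCOL bloom filter; BLOOMFILTER is an HBase column-family property that Phoenix passes through to the underlying table, and the table name is hypothetical:

        CREATE TABLE DYN_EVENTS (ID VARCHAR PRIMARY KEY, V1 VARCHAR)
            BLOOMFILTER = 'ROWCOL';
        -- An existing table can be changed from the HBase shell, e.g.:
        --   alter 'DYN_EVENTS', {NAME => '0', BLOOMFILTER => 'ROWCOL'}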
  12. Hints
  13. NO_CACHE • Prevents query results from populating the HBase block cache • Use it for ad-hoc/nightly data exports • Reduces unnecessary churn in the LRU block cache
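     For illustration, a nightly-export-style query with the hint; the table and predicate are hypothetical:

        -- Scan results bypass the HBase block cache
        SELECT /*+ NO_CACHE */ * FROM SALES WHERE PRICE > 10000;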
  14. SMALL HINT  Data set:  Main table consists of 50 columns  2 million rows  Case 1: Secondary index without hint  Secondary index on the main table to retrieve 2 columns  CREATE INDEX TEST_IDX ON TEST_TABLE(COLUMN_1)  Query: SELECT * FROM TEST_IDX WHERE COLUMN_1=100  Performance: 10.44 ms/query
  15. SMALL HINT  Case 2: Covered index (no hint)  Covered index to retrieve 2 columns  CREATE INDEX TEST_IDX ON TEST_TABLE(COLUMN_1) INCLUDE (COLUMN_2, COLUMN_3)  Query: SELECT COLUMN_2, COLUMN_3 FROM TEST_IDX WHERE COLUMN_1=100  Performance: ~1.8 ms/query
  16. SMALL HINT  Case 3: Covered index with SMALL hint  Covered index with the SMALL hint to retrieve 2 columns  Query: SELECT /*+ SMALL */ COLUMN_2, COLUMN_3 FROM TEST_IDX WHERE COLUMN_1=100  Performance: ~1.2 ms/query
  17. SMALL Hint: Performance (chart)
  18. IN_MEMORY Option • Use the in-memory option to cache small datasets • Fast reads (single-digit milliseconds) • We try to restrict the in-memory option to data < 1 GB • Don't forget to pre-split the table (see the sketch below)
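     IN_MEMORY is likewise an HBase column-family property that Phoenix passes through; a minimal sketch with hypothetical names and split points:

        CREATE TABLE LOOKUP_CODES (
            CODE VARCHAR PRIMARY KEY,
            LABEL VARCHAR
        ) IN_MEMORY = true SPLIT ON ('D', 'N', 'T');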
  19. Data Modeling: Incremental Key • Rows in Phoenix are sorted lexicographically by row key • Sequential keys lead to hotspotting due to a non-uniform read/write pattern • Common example: sequence IDs from an RDBMS
  20. Data Modeling: Incremental Key • Reversing the key • Reversing the primary key randomizes the row keys • Reversing works only if all queries are point lookups • Range scans are not feasible with reversed keys (see the sketch below)
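     A minimal sketch using Phoenix's built-in REVERSE() string function; the table and values are hypothetical. The id is reversed on write (shown here as a pre-reversed literal) and reversed the same way at read time for the point lookup:

        CREATE TABLE EVENTS (RKEY VARCHAR PRIMARY KEY, PAYLOAD VARCHAR);
        -- '372001' is '100273' reversed; reverse ids client-side on write
        UPSERT INTO EVENTS (RKEY, PAYLOAD) VALUES ('372001', 'data');
        -- Point lookup: reverse the id the same way on read
        SELECT PAYLOAD FROM EVENTS WHERE RKEY = REVERSE('100273');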
  21. Why Reverse the Key Rather than Salting? • The number of buckets must be specified at table-creation time • The number of salt buckets stays the same even as the data size grows • Range scans are not feasible with salting either (see the sketch below)
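     For contrast, salting is enabled with the SALT_BUCKETS table property at creation time (hypothetical table below); the bucket count cannot be changed afterwards:

        CREATE TABLE EVENTS_SALTED (ID VARCHAR PRIMARY KEY, PAYLOAD VARCHAR)
            SALT_BUCKETS = 16;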
  22. Data Modeling: Read Most Recent Data • Sample problem:  We want to store sales transactions of vehicles  Applications want to read the latest sale data per vehicle (VIN)  We can still do range scans on the primary-key prefix, i.e. VIN • Primary key: <(String)VIN><(long)(epoch millis at Jan-01-2100 00:00 minus SaleDate)> • Phoenix query to read the latest: SELECT * FROM vin_sales WHERE vin='x' LIMIT 1;
  23. Data Modeling: Read Most Recent Data
      Rowkey: VIN, SALE_DATE (query will need to ORDER BY SALE_DATE):
        VIN                  SALE_DATE
        19UDE2F30HA000958    20170924
        19UDE2F30HA000958    20180402
      Rowkey: VIN, MILLIS_UNTIL_EPOCH (query: SELECT * FROM vin_sales WHERE vin='19UDE2F30HA000958' LIMIT 1):
        VIN                  MILLIS_UNTIL_EPOCH    SALE_DATE
        19UDE2F30HA000958    2609193660000         20180402
        19UDE2F30HA000958    2609280060000         20170924
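     A minimal DDL sketch of the design from slide 22; the names mirror the slides but are hypothetical. MILLIS_UNTIL_EPOCH is the epoch millis at Jan-01-2100 00:00 minus the sale date in epoch millis, so the most recent sale sorts first within each VIN:

        CREATE TABLE VIN_SALES (
            VIN VARCHAR NOT NULL,
            MILLIS_UNTIL_EPOCH BIGINT NOT NULL,
            SALE_DATE VARCHAR,
            CONSTRAINT PK PRIMARY KEY (VIN, MILLIS_UNTIL_EPOCH)
        );
        -- Latest sale per VIN, no ORDER BY needed
        SELECT * FROM VIN_SALES WHERE VIN = '19UDE2F30HA000958' LIMIT 1;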
  24. EC2 Instance Types
                                 d2.xlarge               i3.2xlarge
      Memory                     30.5 GB                 61 GB
      vCPUs                      4                       8
      Instance Storage           6 TB (spinning disk)    1.9 TB NVMe SSD (fastest disk)
      Network Performance        Moderate                Up to 10 Gigabit
      Cost (On-Demand)           $0.69/hr                $0.62/hr
      Cost (Reserved)            $0.40/hr                $0.43/hr
  25. EC2 Instance Types: i3.2xlarge instances provided a 25-120% performance improvement in our jobs, mainly due to the faster disks, without a significant increase in cost
  26. Thanks & Questions (P.S. We are hiring!)
