
Dynamic Columns of Phoenix for SQL on Sparse (NoSQL) Data

How TrueCar leveraged the Dynamic Columns feature of Phoenix to store a sparse dataset in HBase


  1. Dynamic Columns for SQL on NoSQL Data (Incentives in HBase). June 13, 2017. Anil Gupta, Amey Hegde
  2. Agenda • Overview of Incentives • Components • Why Phoenix? • HBase Data Model • Performance Tuning • Learnings
  3. About TrueCar • TrueCar is an online marketplace for buying and selling cars. • We are dedicated to being the most transparent brand in the automotive industry. • We show consumers what others paid for the car they want, so they can recognize a fair price.
  4. Example of Incentives
  5. List of various Incentives
  6. Overview of Incentives
     IncentiveId  TrimId  Amount  New/Used | Postal Code: 94017  94121  90401
     30076        30610   1000    N        | 1      1      1
     29653        28565   779     U        | 1 (under one postal-code column)
     16455        24981   1200    N        | 1 (under one postal-code column)
     An Incentive can be active in anywhere from 1 to 40,000 postal codes, so dynamic columns are a good fit (see the sketch below).
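  To make the dynamic-column idea concrete, here is a minimal sketch of writing one sparse postal-code flag through Phoenix. This is not TrueCar's actual code: the table layout, column family D, the VALUE_TYPE literal, and the jdbc:phoenix:zk-host URL are assumptions for illustration.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;

      public class DynamicColumnUpsert {
          public static void main(String[] args) throws Exception {
              // Phoenix JDBC URL is "jdbc:phoenix:<zookeeper quorum>"; host is a placeholder.
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host")) {
                  // D.P94017 is declared inline with its type: it is not part of the
                  // table DDL and consumes storage only in rows that actually set it.
                  String sql = "UPSERT INTO INCENTIVES (TRIM_ID, SNAPSHOT_START_DATE, "
                          + "VALUE_TYPE, INCENTIVE_ID, D.P94017 INTEGER) VALUES (?, ?, ?, ?, ?)";
                  try (PreparedStatement ps = conn.prepareStatement(sql)) {
                      ps.setLong(1, 30610L);            // TrimId from the sample row above
                      ps.setLong(2, 1497386726L);       // snapshot timestamp (illustrative)
                      ps.setString(3, "CUSTOMERCASH");  // value type (illustrative)
                      ps.setLong(4, 30076L);            // IncentiveId from the sample row above
                      ps.setInt(5, 1);                  // 1 = incentive active in postal code 94017
                      ps.executeUpdate();
                  }
                  conn.commit(); // Phoenix buffers mutations until commit
              }
          }
      }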
  7. Overview of Incentives ● Historical Incentives data: snapshot/history of incentives over the last 18 months; used in internal analytics jobs ● Current Incentives data: latest OEM incentives for customers; published to the website
  8. Old Pipeline Dataflow Overview: [Database] data from all sources → [Sqoop] dumps data from SQL Server to HDFS → [Pig] joins multiple datasets → [Mapper] emits highly nested Avro/JSON data → HDFS / Elasticsearch (ES)
  9. Shortcomings of Old Pipeline ● Backend jobs interfered with live traffic ● Scalability limits with Elasticsearch ● Reads were complex ● The nested dataset increased post-processing time
  10. Incentive Components ● HBase: datastore for Historical Incentives ● Phoenix: SQL layer to operate on HBase ● Elasticsearch: stores current Incentive data for the front end ● MapReduce: computation engine for Incentives ● Avro: serialization library for storing data on HDFS
  11. Why Phoenix? ● Easy to use across all disciplines (multiple teams/roles) ● Standard SQL API and JDBC connection (see the JDBC sketch below) ● Dynamic Column feature ● Fully integrated with the Hadoop ecosystem
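  Because Phoenix is exposed as a standard JDBC driver, any team that can write SQL can read the data with the usual java.sql API. A minimal connection sketch, assuming a placeholder ZooKeeper quorum and table name:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class PhoenixJdbcExample {
          public static void main(String[] args) throws Exception {
              // URL format: jdbc:phoenix:<zookeeper quorum>[:port]
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                   Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM INCENTIVES")) {
                  if (rs.next()) {
                      System.out.println("row count: " + rs.getLong(1));
                  }
              }
          }
      }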
  12. New Pipeline Dataflow Overview: [Database] data from all sources → [Sqoop] dumps data from SQL Server to HDFS → [Pig] joins multiple datasets → [Mapper] de-normalizes, then parses and validates each record → HDFS / HBase
  13. Ingestion Logic in Mapper [Flowchart: for each incoming record, check whether it already exists; if no, insert it; if yes, update it; then finish] (a sketch of this flow follows)
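  A hedged sketch of that check/insert/update flow, not the actual TrueCar mapper: the method, table, and column names are illustrative. Note that Phoenix's UPSERT is already insert-or-update, so the explicit existence check only matters when an update must first be reconciled with the stored row.

      import java.sql.Connection;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;

      class IncentiveIngestor {
          private final Connection conn;
          IncentiveIngestor(Connection conn) { this.conn = conn; }

          /** Insert the record if its row key is new, otherwise update it. */
          void ingest(long trimId, long snapshot, String valueType,
                      long incentiveId, int amount) throws SQLException {
              // Check: does a row with this key already exist?
              boolean exists;
              try (PreparedStatement check = conn.prepareStatement(
                      "SELECT 1 FROM INCENTIVES WHERE TRIM_ID = ? AND "
                      + "SNAPSHOT_START_DATE = ? AND VALUE_TYPE = ? AND INCENTIVE_ID = ?")) {
                  check.setLong(1, trimId);
                  check.setLong(2, snapshot);
                  check.setString(3, valueType);
                  check.setLong(4, incentiveId);
                  try (ResultSet rs = check.executeQuery()) {
                      exists = rs.next();
                  }
              }
              if (exists) {
                  // Update path: reconcile/validate against the stored row here.
              }
              // Both branches end in an UPSERT of the (possibly merged) record.
              try (PreparedStatement up = conn.prepareStatement(
                      "UPSERT INTO INCENTIVES (TRIM_ID, SNAPSHOT_START_DATE, VALUE_TYPE, "
                      + "INCENTIVE_ID, S.AMOUNT) VALUES (?, ?, ?, ?, ?)")) {
                  up.setLong(1, trimId);
                  up.setLong(2, snapshot);
                  up.setString(3, valueType);
                  up.setLong(4, incentiveId);
                  up.setInt(5, amount);
                  up.executeUpdate();
              }
          }
      }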
  14. Initial Data Model. Table: INCENTIVES. Row Key: <TRIM_ID><SNAPSHOT_START_DATE><VALUE_TYPE><INCENTIVE_ID>. Column families: S stores all static columns (1 version); D stores dynamic columns for postal codes (1 version). (A hedged DDL sketch follows.)
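  A DDL sketch matching this model. The static column names and types are assumptions, as is D.ZIP_INIT, a placeholder column whose only purpose is to materialize family D so that postal-code columns can later be added dynamically under it.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class CreateIncentivesTable {
          public static void main(String[] args) throws Exception {
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
                   Statement stmt = conn.createStatement()) {
                  stmt.execute("CREATE TABLE IF NOT EXISTS INCENTIVES ("
                          + "  TRIM_ID BIGINT NOT NULL,"
                          + "  SNAPSHOT_START_DATE BIGINT NOT NULL,"
                          + "  VALUE_TYPE VARCHAR NOT NULL,"
                          + "  INCENTIVE_ID BIGINT NOT NULL,"
                          + "  S.AMOUNT INTEGER,"     // static columns live in family S
                          + "  S.NEW_USED CHAR(1),"
                          + "  D.ZIP_INIT INTEGER,"   // placeholder: creates family D for dynamic columns
                          + "  CONSTRAINT PK PRIMARY KEY "
                          + "  (TRIM_ID, SNAPSHOT_START_DATE, VALUE_TYPE, INCENTIVE_ID)"
                          + ") VERSIONS=1");          // 1 version per cell, as on the slide
              }
          }
      }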
  15. Initial Performance [Bar chart: run time in minutes (0-60) for processing transaction records; New Pipeline (HBase/Phoenix) vs. Old Pipeline (Elasticsearch)]
  16. HBase Data Model. Table: INCENTIVES. Row Key: <TRIM_ID><SNAPSHOT_START_DATE><VALUE_TYPE><INCENTIVE_ID>. Column families: S stores all static columns (1 version); E stores dynamic columns for the even postal codes (1 version); O stores dynamic columns for the odd postal codes (1 version).
  17. Sample Select Query:
      SELECT * FROM HIST_INCENTIVES (O.P90401 INTEGER)
      WHERE TRIM_ID = 30070
        AND SNAPSHOT_START_DATE = 1497386726
        AND VALUE_TYPE = 'CUSTOMERCASH'
        AND P90401 = 1
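  Issuing that query over JDBC looks like the following sketch (the connection URL and the INCENTIVE_ID column read back are assumptions). The dynamic column O.P90401 is declared with its type in the FROM clause and then read from the ResultSet like any other column:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;

      public class DynamicColumnQuery {
          public static void main(String[] args) throws Exception {
              String sql = "SELECT * FROM HIST_INCENTIVES (O.P90401 INTEGER) "
                      + "WHERE TRIM_ID = ? AND SNAPSHOT_START_DATE = ? "
                      + "AND VALUE_TYPE = ? AND P90401 = ?";
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
                   PreparedStatement ps = conn.prepareStatement(sql)) {
                  ps.setLong(1, 30070L);
                  ps.setLong(2, 1497386726L);
                  ps.setString(3, "CUSTOMERCASH");
                  ps.setInt(4, 1);
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          System.out.println(rs.getLong("INCENTIVE_ID")
                                  + " active in 90401: " + rs.getInt("P90401"));
                      }
                  }
              }
          }
      }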
  18. HBase Tuning ● Split postal-code data into two column families (even/odd) ● Added a row+column (ROWCOL) bloom filter ● Region splitting ● Evenly distributed data across region servers ● Time to live (TTL) = 540 days ● Region size 8-10 GB (a DDL sketch mapping these knobs to Phoenix options follows)
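  As a sketch of how several of these knobs can be expressed directly in Phoenix DDL. The split points, placeholder columns, and exact values are illustrative, not TrueCar's configuration; 46,656,000 seconds is 540 days:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class CreateTunedTable {
          public static void main(String[] args) throws Exception {
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
                   Statement stmt = conn.createStatement()) {
                  stmt.execute("CREATE TABLE IF NOT EXISTS HIST_INCENTIVES ("
                          + "  TRIM_ID BIGINT NOT NULL,"
                          + "  SNAPSHOT_START_DATE BIGINT NOT NULL,"
                          + "  VALUE_TYPE VARCHAR NOT NULL,"
                          + "  INCENTIVE_ID BIGINT NOT NULL,"
                          + "  S.AMOUNT INTEGER,"
                          + "  E.EVEN_INIT INTEGER,"  // family for even postal codes (dynamic)
                          + "  O.ODD_INIT INTEGER,"   // family for odd postal codes (dynamic)
                          + "  CONSTRAINT PK PRIMARY KEY "
                          + "  (TRIM_ID, SNAPSHOT_START_DATE, VALUE_TYPE, INCENTIVE_ID)"
                          + ") BLOOMFILTER='ROWCOL', TTL=46656000, VERSIONS=1 "
                          + "SPLIT ON (10000, 20000, 30000)"); // presplit regions on TRIM_ID
              }
          }
      }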
  19. Performance after tuning [Bar chart: run time in minutes (0-60); New Pipeline (HBase/Phoenix) vs. Old Pipeline (Elasticsearch)] 2.6x performance gain
  20. Performance Testing Results
      Data Ingestion: the Old Pipeline wrote to Elasticsearch with a MapReduce job over a normalized dataset in 32-35 min; the New Pipeline writes to HBase through Phoenix with a map-only job over a de-normalized dataset in 48-50 min.
      Data Retrieval: the Old Pipeline read from Elasticsearch in 48-50 min, querying all five providers sequentially; the New Pipeline reads from HBase through Phoenix in 18-20 min, querying all providers in parallel.
  21. Random Facts ● Each run touches approximately 2.536 billion cells ● Data retrieval performance improved by 80% ● Data duplication was eliminated from the pipeline ● Post-processing after data retrieval is negligible thanks to the de-normalized data
  22. Unit Testing ● Create an HBase minicluster ● Establish a Phoenix connection ● Create an HBase table ● Create various test suites to validate all the use cases (a setup sketch follows)
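  A hedged sketch of that setup using HBaseTestingUtility from hbase-testing-util; the table name and schema are placeholders, and a real test would wrap this in JUnit fixtures rather than main:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;
      import org.apache.hadoop.hbase.HBaseTestingUtility;

      public class PhoenixMiniClusterSetup {
          public static void main(String[] args) throws Exception {
              HBaseTestingUtility hbase = new HBaseTestingUtility();
              hbase.startMiniCluster(); // in-process HBase + ZooKeeper
              int zkPort = hbase.getZkCluster().getClientPort();
              try (Connection conn = DriverManager.getConnection(
                          "jdbc:phoenix:localhost:" + zkPort);
                   Statement stmt = conn.createStatement()) {
                  stmt.execute("CREATE TABLE T (ID BIGINT PRIMARY KEY, S.V INTEGER)");
                  // ... run the test suites against T here ...
              } finally {
                  hbase.shutdownMiniCluster();
              }
          }
      }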
  23. Summary ● Improved performance of analytical jobs that use Historical Incentives ● Higher scalability with the new Historical Incentives architecture ● Eliminated interference of offline jobs with live traffic
  24. Thanks! Questions?
