Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Mutable Data in Hive's Immutable World

Hadoop Summit 2015

  • Als Erste(r) kommentieren

Mutable Data in Hive's Immutable World

  1. 1. Page1 Mutable Data in Hive’s Immutable World Lester Martin – Hortonworks 2015 Hadoop Summit
  2. 2. Page2 Connection before Content Lester Martin – Hortonworks Professional Services lmartin@hortonworks.com || lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc)
  3. 3. Page3 “Traditional” Hadoop Data Time-Series Immutable (TSI) Data – Hive’s sweet spot Going beyond web logs to more exotic data such as: Vehicle sensors (ground, air, above/below water – space!) Patient data (to include the atmosphere around them) Smart phone/watch (TONS of info) Clickstream Web & Social Geolocation Sensor & Machine Server Logs Unstructured SOURCES
  4. 4. Page4 Good TSI Solutions Exist Hive partitions •Store as much as you want •Only read the files you need Hive Streaming Data Ingest from Flume or Storm Sqoop’s –-incremental mode of append •Use appropriate –-check-column •“Saved Job” remembering –last-value
  5. 5. Page5 Use Case for an Active Archive Evolving Domain Data – Hive likes immutable data Need exact copy of mutating tables refreshed periodically •Structural replica of multiple RDBMS tables •The data in these tables are being updated •Don’t need every change; just “as of” content Existing Systems ERP CRM SCM SOURCES eComm
  6. 6. Page6 Start With a Full Refresh Strategy The epitome of the KISS principle •Ingest & load new data •Drop the existing table •Rename the newly created table Surely not elegant, but solves the problem until the reload takes longer than the refresh period
  7. 7. Page7 Then Evolve to a Merge & Replace Strategy Typically, deltas are… •Small % of existing data •Plus, some totally new records In practice, differences in sizes of circles is often much more pronounced
  8. 8. Page8 Requirements for Merge & Replace An immutable unique key •To determine if an addition or a change •The source table’s (natural or surrogate) PK is perfect A last-updated timestamp to find the deltas Leverage Sqoop’s –-incremental mode of lastmodified to identify the deltas •Use appropriate –-check-column •“Saved Job” remembering –last-value
  9. 9. Page9 Processing Steps for Merge & Replace See blog at http://hortonworks.com/blog/four-step- strategy-incremental-updates-hive/, but note that merge can be done in multiple technologies, not just Hive Ingest – bring over the incremental data Reconcile – perform the merge Compact – replace the existing data with the newly merged content Purge – cleanup & prepare to repeat
  10. 10. Page10 Full Merge & Replace Will NOT Scale The “elephant” eventually gets too big and merging it with the “mouse” takes too long! Example: A Hive structure with 100 billion rows, but only 100,000 delta records
  11. 11. Page11 What Will? The Classic Hadoop Strategy!
  12. 12. Page12 But… One Size Does NOT Fit All… Not everything is “big” – in fact, most operational apps’ tables are NOT too big for a simple Full Refresh Divide & Conquer requires additional per-table research to ensure the best partitioning strategy is decided upon
  13. 13. Page13 Criteria for Active Archive Partition Values Non-nullable & immutable Ensures sliding scale growth with new records generally creating new partitions Supports delta records being skewed such that the percentage of partitions needing merge & replace operations is relatively small Classic value is (still) “Date Created”
  14. 14. Page14 Work on (FEW!) Partitions in Parallel
  15. 15. Page15 Partition-Level Merge & Replace Steps Generate the delta file Create list of affected partitions Perform merge & replace operations for affected partitions 1. Filter the delta file for the current partition 2. Load the Hive table’s current partition 3. Merge the two datasets 4. Delete the existing partition 5. Recreate the partition with the merged content
  16. 16. Page16 What Does This Approach Look Like? A Lightning-Fast Review of an Indicative Hybrid Pig-Hive Example
  17. 17. Page17 One-Time: Create the Table CREATE TABLE bogus_info( bogus_id int, field_one string, field_two string, field_three string) PARTITIONED BY (date_created STRING) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
  18. 18. Page18 One-Time: Get Content from the Source 11,2014-09-17,base,base,base 12,2014-09-17,base,base,base 13,2014-09-17,base,base,base 14,2014-09-18,base,base,base 15,2014-09-18,base,base,base 16,2014-09-18,base,base,base 17,2014-09-19,base,base,base 18,2014-09-19,base,base,base 19,2014-09-19,base,base,base
  19. 19. Page19 One-Time: Read Content from HDFS as_recd = LOAD '/user/fred/original.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray );
  20. 20. Page20 One-Time: Sort and Insert into Hive Table sorted_as_recd = ORDER as_recd BY date_created, bogus_id; STORE sorted_as_recd INTO 'bogus_info' USING org.apache.hcatalog.pig.HCatStorer();
  21. 21. Page21 One-Time: Verify Data are Present hive> select * from bogus_info; 11 base base base 2014-09-17 12 base base base 2014-09-17 13 base base base 2014-09-17 14 base base base 2014-09-18 15 base base base 2014-09-18 16 base base base 2014-09-18 17 base base base 2014-09-19 18 base base base 2014-09-19 19 base base base 2014-09-19
  22. 22. Page22 One-Time: Verify Partitions are Present hdfs dfs -ls /apps/hive/warehouse/bogus_info Found 3 items … /apps/hive/warehouse/bogus_info/date_created =2014-09-17 … /apps/hive/warehouse/bogus_info/date_created =2014-09-18 … /apps/hive/warehouse/bogus_info/date_created =2014-09-19
  23. 23. Page23 Generate the Delta File 20,2014-09-20,base,base,base 21,2014-09-20,base,base,base 22,2014-09-20,base,base,base 12,2014-09-17,base,CHANGED,base 14,2014-09-18,base,CHANGED,base 16,2014-09-18,base,CHANGED,base
  24. 24. Page24 Read Delta File from HDFS delta_recd = LOAD '/user/fred/delta1.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray );
  25. 25. Page25 Create List of Affected Partitions by_grp = GROUP delta_recd BY date_created; part_names = FOREACH by_grp GENERATE group; srtd_part_names = ORDER part_names BY group; STORE srtd_part_names INTO '/user/fred/affected_partitions’;
  26. 26. Page26 Loop/Multithread Through Affected Partitions Pig doesn’t really help you with this problem This indicative example could be implemented as: •A simple script that loops through the partitions •A Java program that multi-threads the partition-aligned processing Multiple “Control Structures” options exist as described at http://pig.apache.org/docs/r0.14.0/cont.html
  27. 27. Page27 Loop Step: Filter on the Current Partition delta_recd = LOAD '/user/fred/delta1.txt' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray ); deltaP = FILTER delta_recd BY date_created == '$partition_key’;
  28. 28. Page28 Loop Step: Retrieve Hive’s Current Partition all_bogus_info = LOAD 'bogus_info' USING org.apache.hcatalog.pig.HCatLoader(); tblP = FILTER all_bogus_info BY date_created == '$partition_key';
  29. 29. Page29 Loop Step: Merge the Datasets partJ = JOIN tblP BY bogus_id FULL OUTER, deltaP BY bogus_id; combined_part = FOREACH partJ GENERATE ((deltaP::bogus_id is not null) ? deltaP::bogus_id: tblP::bogus_id) as bogus_id, /* do for all fields and end with “;” */
  30. 30. Page30 Loop Step: Sort and Save the Merged Data s_combined_part = ORDER combined_part BY date_created, bogus_id; STORE s_combined_part INTO '/user/fred/ temp_$partition_key’ USING PigStorage(','); hdfs dfs –cat temp_2014-09-17/part-r-00000 11,2014-09-17,base,base,base 12,2014-09-17,base,CHANGED,base 13,2014-09-17,base,base,base
  31. 31. Page31 Loop Step: Delete the Partition ALTER TABLE bogus_info DROP IF EXISTS PARTITION (date_created='2014-09-17’);
  32. 32. Page32 Loop Step: Recreate the Partition 2load = LOAD '/user/fred/ temp_$partition_key' USING PigStorage(',') AS ( bogus_id:int, date_created:chararray, field_one:chararray, field_two:chararray, field_three:chararray ); STORE 2load INTO 'bogus_info' using org.apache.hcatalog.pig.HCatStorer();
  33. 33. Page33 Verify the Loop Step Updates select * from bogus_info where date_created = '2014-09-17’; 11 base base base 2014-09-17 12 base CHANGED base 2014-09-17 13 base base base 2014-09-17
  34. 34. Page34 My Head Hurts, Too! As Promised, We Flew Through That – Take Another Look Later
  35. 35. Page35 What Does Merge & Replace Miss? If critical, you have options •Create a delete table sourced by a trigger •At some wide frequency, start all over with a Full Refresh Fortunately, ~most~ enterprises don’t delete anything Marking items “inactive” is popular
  36. 36. Page36 Hybrid: Partition-Level Refresh If most of the partition is modified, just replace it entirely Especially if the changes are only recent (or highly skewed) Use a configured number of partitions to refresh and assume the rest of the data is static
  37. 37. Page37 Active Archive Strategy Review strategy # of rows % of chg chg skew handles deletes complexity Full Refresh <= millions any any yes simple Full Merge & Replace <= millions any any no moderate Partition-Level Merge & Replace billions + < 5% < 5% no complex Partition-Level Refresh billions + < 5% < 5% yes complex
  38. 38. Page38 Isn’t There Anything Easier? HIVE-5317 brought us Insert, Update & Delete •Alan Gates presented Monday •More tightly-coupled w/o the same “hazard windows” •“Driver” logic shifts to be delta-only & row-focused Thoughts & attempts at true DB replication •Some COTS solutions have been tried •Ideally, an open-source alternative is best such as enhancing the Streaming Data Ingest framework
  39. 39. Page39 Considerations for HIVE-5317 On performance & scalability; your mileage may vary Does NOT make Hive a RDBMS Available in Hive .14 onwards DDL requirements •Must utilize partitioning & bucketing •Initially, only supports ORC
  40. 40. Page40 Recommendations Take another look at this topic once back at “your desk” As with all things Hadoop… •Know your data & workloads •Try several approaches & evaluate results in earnest •Stick with the KISS principle whenever possible Share your findings via blogs and local user groups Expect (even more!) great things from Hive
  41. 41. Page41 Questions? Lester Martin – Hortonworks Professional Services lmartin@hortonworks.com || lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc) THANKS FOR YOUR TIME!!

    Als Erste(r) kommentieren

    Loggen Sie sich ein, um Kommentare anzuzeigen.

  • whome00

    Jul. 14, 2015
  • HeQi

    Jan. 6, 2017
  • KrishnakumarMenon

    Apr. 19, 2018
  • MichaelLuo28

    May. 2, 2018
  • GiannisDimitriou2

    Mar. 25, 2020
  • SaryeHaddadi

    Jun. 13, 2020

Hadoop Summit 2015


Aufrufe insgesamt


Auf Slideshare


Aus Einbettungen


Anzahl der Einbettungen