HBase schema design
             Amandeep Khurana | Solutions Architect
             Big Data TechCon, Boston, April 2013




Friday, April 12, 13
About me
         • Solutions Architect, Cloudera Inc
         • Amazon Web Services
         • Interested in large scale distributed systems
         • Co-author, HBase in Action (Nick Dimiduk and Amandeep Khurana, Manning)
         • Twitter: amansk




About the talk
         • Data model recap
         • Data modeling thought process
         • Tools and techniques




HBase is ...
         • Column family oriented database
                • Column family oriented
                • Tables consisting of rows and columns
         • Persisted Map
                • Sparse
                • Multi dimensional
                • Sorted
                • Indexed by rowkey, column and timestamp
         • Key Value store
                • [rowkey, col family, col qualifier, timestamp] -> cell value

HBase is not ...
         • A relational database
                • No SQL query language
                • No joins
                • No secondary indexing
                • No transactions




Data Model recap
              It's not a relational database system




Important terms
         • Table
                • Consists of rows and columns
         • Row
                • Has a bunch of columns.
                • Identified by a rowkey (primary key)
         • Column Qualifier
                • Dynamic column name
         • Column Family
                • Column groups - logical and physical (Similar access pattern)
         • Cell
                • The actual element that contains the data for a row-column intersection
         • Version
                • Every cell has multiple versions.


Data coordinates
         • Row is addressed using rowkey
         • Cell is addressed using [rowkey + family + qualifier]




Tabular representation

                                  (2) Column family: info
         (1) Rowkey        (3) name                  (3) email                  (3) password
         GrandpaD          Fyodor Dostoyevsky        fyodor@brothers.net        abc123
         HMS_Surprise      Patrick O'Brien           aubrey@sea.com             abc123
         SirDoyle          Sir Arthur Conan Doyle    art@TheQueensMen.co.uk     abc123
         TheRealMT         Mark Twain                samuel@clemens.org         (4) Langhorne (ts1=1329088321289), abc123 (ts2=1329088818321)

         The table is lexicographically sorted on the rowkeys.

         The coordinates used to identify data in an HBase table are:
         (1) rowkey, (2) column family, (3) column qualifier, (4) version.

         Each cell has multiple versions, typically represented by the timestamp
         of when they were inserted into the table (ts2 > ts1).




Key-Value store

         Keys                                              Values
         [TheRealMT, info, password, 1329088818321]        abc123
         [TheRealMT, info, password, 1329088321289]        Langhorne

         Each line above is a single KeyValue instance




Key-Value store

         Keys  ->  Values

         [TheRealMT, info, password, 1329088818321]  ->  abc123
         1  Start with coordinates of full precision

         [TheRealMT, info, password]  ->
             {
                 1329088818321 : "abc123",
                 1329088321289 : "Langhorne"
             }
         2  Drop version and you're left with a map of version to values

         [TheRealMT, info]  ->
             {
                 "email" : {
                     1329088321289 : "samuel@clemens.org"
                 },
                 "name" : {
                     1329088321289 : "Mark Twain"
                 },
                 "password" : {
                     1329088818321 : "abc123",
                     1329088321289 : "Langhorne"
                 }
             }
         3  Omit qualifier and you have a map of qualifiers to the previous maps

         [TheRealMT]  ->
             {
                 "info" : {
                     "email" : {
                         1329088321289 : "samuel@clemens.org"
                     },
                     "name" : {
                         1329088321289 : "Mark Twain"
                     },
                     "password" : {
                         1329088818321 : "abc123",
                         1329088321289 : "Langhorne"
                     }
                 }
             }
         4  Finally, drop the column family and you have a row, a map of maps




Sorted map of maps

         {
             "TheRealMT" : {                                   <- Rowkey
                 "info" : {                                    <- Column family
                     "email" : {                               <- Column qualifier
                         1329088321289 : "samuel@clemens.org"  <- Version : Value
                     },
                     "name" : {
                         1329088321289 : "Mark Twain"
                     },
                     "password" : {
                         1329088818321 : "abc123",
                         1329088321289 : "Langhorne"
                     }
                 }
             }
         }
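
The nesting above can be sketched with plain Java collections. This is only an in-memory illustration, not the HBase client API: the class name `SortedMapOfMaps` and its methods are hypothetical, and `String` ordering is used in place of HBase's byte-wise ordering (the two agree for ASCII keys such as the ones on the slide).

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative in-memory model only; this is not the HBase client API.
final class SortedMapOfMaps {
    // rowkey -> column family -> column qualifier -> version (timestamp) -> value
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows = new TreeMap<>();

    void put(String rowkey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // Versions are kept newest first, so the latest value is the first entry.
            .computeIfAbsent(qualifier, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowkey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(rowkey);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) return null;
        NavigableMap<Long, String> versions = qualifiers.get(qualifier);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    // Rows are sorted on the rowkey, so this is the lexicographically smallest one.
    String firstRowkey() {
        return rows.firstKey();
    }

    public static void main(String[] args) {
        SortedMapOfMaps users = new SortedMapOfMaps();
        users.put("TheRealMT", "info", "password", 1329088321289L, "Langhorne");
        users.put("TheRealMT", "info", "password", 1329088818321L, "abc123");
        System.out.println(users.getLatest("TheRealMT", "info", "password")); // abc123
    }
}
```

Dropping coordinates from the right (version, then qualifier, then family) corresponds to returning one of the inner maps instead of a single value.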




HFiles and physical data model
         • HFiles are
                • Immutable
                • Sorted on rowkey + qualifier + timestamp
                • In the context of a column family per region

                "TheRealMT", "info", "email",    1329088321289, "samuel@clemens.org"
                "TheRealMT", "info", "name",     1329088321289, "Mark Twain"
                "TheRealMT", "info", "password", 1329088818321, "abc123"
                "TheRealMT", "info", "password", 1329088321289, "Langhorne"

                HFile for the info column family in the users table
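
The sort order shown in the HFile listing can be expressed as a comparator: rowkey ascending, then qualifier ascending, then timestamp descending, which is why the newer password version appears first. The sketch below is a simplified illustration. `KeyOrderSketch` and its `Key` class are hypothetical stand-ins rather than HBase classes, the column family is left out because an HFile belongs to a single family, and `String` comparison stands in for byte-wise comparison.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a hypothetical stand-in for the key part of a KeyValue.
final class KeyOrderSketch {
    static final class Key {
        final String row;
        final String qualifier;
        final long timestamp;

        Key(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + ", " + qualifier + ", " + timestamp;
        }
    }

    // Rowkey ascending, then qualifier ascending, then timestamp descending (newest first).
    static final Comparator<Key> HFILE_ORDER = Comparator
            .comparing((Key k) -> k.row)
            .thenComparing(k -> k.qualifier)
            .thenComparing(Comparator.comparingLong((Key k) -> k.timestamp).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(HFILE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("TheRealMT", "password", 1329088321289L));
        keys.add(new Key("TheRealMT", "name", 1329088321289L));
        keys.add(new Key("TheRealMT", "password", 1329088818321L));
        keys.add(new Key("TheRealMT", "email", 1329088321289L));
        for (Key k : sorted(keys)) {
            System.out.println(k);
        }
    }
}
```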




Thinking through the design
              ... it's a database after all




But isn't HBase schema-less?
         • Number of tables
         • Rowkey design
         • Number of column families per table. What goes into what column family
         • Column qualifier names
         • What goes into the cells
         • Number of versions



Rowkeys
         • Rowkey design is the single most important aspect of HBase table designs
         • The only way to address rows in HBase




TwitBase relationships
         • Users follow users
         • Relationships need to be persisted for usage later on
         • Model tables for the expected access patterns
         • Read pattern
                • Who does A follow?
                • Who follows A?
                • Does A follow B?
         • Write pattern
                • A follows B
                • A unfollows B

Start simple
         • Adjacency list

                row key: userid
                Column Family: follows
                column qualifier: followed user number
                cell value: followed userid

                Rowkey        follows (Col Qualifier:Cell value)
                TheFakeMT     1:TheRealMT    2:MTFanBoy    3:Olivia    4:HRogers
                TheRealMT     1:HRogers      2:Olivia




Optimizing the adjacency list
         • We need a count
                • Where does a new followed user go?

                Rowkey        follows
                TheFakeMT     1:TheRealMT    2:MTFanBoy    3:Olivia    4:HRogers    count:4
                TheRealMT     1:HRogers      2:Olivia      count:2




Adding a new user

                Row that needs to be updated: TheFakeMT

                Rowkey        follows
                TheFakeMT     1:TheRealMT    2:MTFanBoy    3:Olivia    4:HRogers    count:4
                TheRealMT     1:HRogers      2:Olivia      count:2

                Client code:
                Step 1: Get current count                  TheFakeMT : follows: {count -> 4}
                Step 2: Update count (increment count)     TheFakeMT : follows: {count -> 5}
                Step 3: Add new entry                      TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}
                Step 4: Write the new data to HBase

                Rowkey        follows
                TheFakeMT     1:TheRealMT    2:MTFanBoy    3:Olivia    4:HRogers    5:MTFanBoy2    count:5
                TheRealMT     1:HRogers      2:Olivia      count:2
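
The four client-side steps can be sketched as follows. This is a minimal in-memory illustration, assuming a hypothetical `CounterAdjacencyList` class in which a `TreeMap` stands in for one row of the follows column family. No HBase calls are made.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: one row of the "follows" family, held in memory.
final class CounterAdjacencyList {
    // column qualifier -> cell value
    private final Map<String, String> followsRow = new TreeMap<>();

    CounterAdjacencyList() {
        followsRow.put("count", "0");
    }

    void follow(String followedUser) {
        int count = Integer.parseInt(followsRow.get("count")); // Step 1: get current count
        count++;                                               // Step 2: update count
        followsRow.put(Integer.toString(count), followedUser); // Step 3: add new entry
        followsRow.put("count", Integer.toString(count));      // Step 4: write the new data
    }

    String cell(String qualifier) {
        return followsRow.get(qualifier);
    }

    public static void main(String[] args) {
        CounterAdjacencyList theFakeMT = new CounterAdjacencyList();
        for (String user : new String[] {"TheRealMT", "MTFanBoy", "Olivia", "HRogers", "MTFanBoy2"}) {
            theFakeMT.follow(user);
        }
        System.out.println(theFakeMT.cell("count")); // 5
        System.out.println(theFakeMT.cell("5"));     // MTFanBoy2
    }
}
```

Nothing in this sequence makes steps 1 to 4 atomic. If two clients both read count 4 before either writes, both would write qualifier 5 and one followed user would be lost.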




Transactions == not good
         • HBase doesn't have native support (think scale)
         • Don't want to complicate client side logic
         • Only solution -> simplify schema

                Rowkey        follows
                TheFakeMT     TheRealMT:1    MTFanBoy:1    Olivia:1    HRogers:1
                TheRealMT     HRogers:1      Olivia:1
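
With the followed userid as the column qualifier, each write becomes a single operation on a single cell and no counter has to be read first. A minimal in-memory sketch, assuming a hypothetical `FollowsRow` class that models one row of the follows family:

```java
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: one row of the simplified "follows" family, held in memory.
final class FollowsRow {
    // column qualifier = followed userid, cell value = "1" (a placeholder)
    private final NavigableMap<String, String> cells = new TreeMap<>();

    void follow(String followedUser) {
        cells.put(followedUser, "1"); // one write, no read needed beforehand
    }

    void unfollow(String followedUser) {
        cells.remove(followedUser); // one delete
    }

    boolean follows(String followedUser) {
        return cells.containsKey(followedUser); // one lookup by qualifier
    }

    Set<String> followed() {
        return cells.keySet();
    }

    public static void main(String[] args) {
        FollowsRow theRealMT = new FollowsRow();
        theRealMT.follow("HRogers");
        theRealMT.follow("Olivia");
        theRealMT.unfollow("HRogers");
        System.out.println(theRealMT.followed()); // [Olivia]
    }
}
```

Writing the same relationship twice simply overwrites the same cell, so concurrent writers cannot corrupt a shared count.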




Revisit the questions
         • Read pattern
                • Who all does A follow?
                • Who all follows A?
                • Does A follow B?
         • Write pattern
                • A follows B
                • A unfollows B



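Denormalization
         • Second table for reverse relationship
         • Otherwise, finding who follows A requires a scan across the entire table, which hurts read performance

           [Figure: quadrant chart with write performance on the vertical axis and read performance on the horizontal axis.
            Top left: Normalization. Top right: Dreamland. Bottom left: Poor design. Bottom right: Denormalization.]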




More optimizations
         • Convert into tall-narrow table
         • Leverage rowkey indexing better
         • Gets -> short Scans

                row key: follower + followed
                CF: f
                CQ: followed user's name
                cell value: 1

                Keeping the column family and column qualifier names short reduces the data
                transferred over the network back to the client. The KeyValue objects become smaller.

                The + in the row key refers to concatenating the two values. You could delimit
                them using any character you like, e.g. A-B or A,B.
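
In the tall-narrow layout, "Does A follow B?" is a lookup on the full rowkey and "Who does A follow?" is a short scan over the rows that share the prefix "A+". The sketch below illustrates this with a sorted in-memory map. `TallNarrowFollows` is a hypothetical class, the cell layout is simplified to rowkey -> followed user's name, and it assumes userids never contain the delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: a sorted in-memory map standing in for the tall-narrow table.
final class TallNarrowFollows {
    // Assumption: userids never contain this character.
    private static final char DELIMITER = '+';

    // rowkey "follower+followed" -> followed user's name
    private final NavigableMap<String, String> table = new TreeMap<>();

    void follow(String follower, String followed, String followedName) {
        table.put(follower + DELIMITER + followed, followedName);
    }

    // Equivalent of a Get on the full rowkey.
    boolean doesFollow(String follower, String followed) {
        return table.containsKey(follower + DELIMITER + followed);
    }

    // Equivalent of a short Scan: start at "follower+" and stop just past the prefix.
    List<String> followedBy(String follower) {
        String start = follower + DELIMITER;
        String stop = follower + (char) (DELIMITER + 1);
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        TallNarrowFollows f = new TallNarrowFollows();
        f.follow("TheFakeMT", "TheRealMT", "Mark Twain");
        f.follow("TheFakeMT", "Olivia", "Olivia Clemens");
        f.follow("TheRealMT", "Olivia", "Olivia Clemens");
        System.out.println(f.followedBy("TheFakeMT")); // [Olivia Clemens, Mark Twain]
    }
}
```

The stop key is the prefix with its last character incremented, so the scan touches only the rows of one follower.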




Tall-narrow table example
         • Denormalization is the way to go

                Rowkey                   f
                TheFakeMT+TheRealMT      Mark Twain:1
                TheFakeMT+MTFanBoy       Amandeep Khurana:1
                TheFakeMT+Olivia         Olivia Clemens:1
                TheFakeMT+HRogers        Henry Rogers:1
                TheRealMT+Olivia         Olivia Clemens:1
                TheRealMT+HRogers        Henry Rogers:1

                Putting the user name in the column qualifier saves you from looking up the
                users table for the name of the user given an id. You can simply list out names
                or ids while looking at relationships just from this table. The downside of this
                is that you need to update the name in all the cells if the user updates their
                name in their profile. This is classic Denormalization.




Uniform rowkey length
         • MD5 the userids -> 16 bytes + 16 bytes rowkeys
         • Better distribution of load

                row key: md5(follower)md5(followed)
                CF: f
                CQ: followed userid
                cell value: followed user's name

                Using MD5 of the user ids gives you fixed lengths instead of variable length
                user ids. You don't need concatenation logic anymore.
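
A fixed-length rowkey of this shape can be built with the JDK's `MessageDigest`. The sketch below is one possible way to do it. The class name `Md5Rowkey` and the choice of UTF-8 for encoding the userids are assumptions, not something the slides specify.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: builds md5(follower) followed by md5(followed), 32 bytes in total.
final class Md5Rowkey {
    static byte[] md5(String userId) {
        try {
            // An MD5 digest is always 16 bytes, whatever the input length.
            return MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    static byte[] rowkey(String follower, String followed) {
        byte[] key = new byte[32];
        System.arraycopy(md5(follower), 0, key, 0, 16);
        System.arraycopy(md5(followed), 0, key, 16, 16);
        return key;
    }

    public static void main(String[] args) {
        System.out.println(rowkey("TheFakeMT", "TheRealMT").length); // 32
    }
}
```

Because the first 16 bytes depend only on the follower, all rows for one follower still share a common prefix and remain adjacent in the sorted table.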




Uniform rowkey length (continued)

                Rowkey                               f
                MD5(TheFakeMT) MD5(TheRealMT)        TheRealMT:Mark Twain
                MD5(TheFakeMT) MD5(MTFanBoy)         MTFanBoy:Amandeep Khurana
                MD5(TheFakeMT) MD5(Olivia)           Olivia:Olivia Clemens
                MD5(TheFakeMT) MD5(HRogers)          HRogers:Henry Rogers
                MD5(TheRealMT) MD5(Olivia)           Olivia:Olivia Clemens
                MD5(TheRealMT) MD5(HRogers)          HRogers:Henry Rogers




Tall v/s Wide tables storage footprint

         Logical representation of an HBase table.
         We'll look at what it means to Get() row r5 from this table.

                   CF1                  CF2
         r1        c1:v1                c1:v9    c6:v2
         r2        c1:v2    c3:v6
         r3        c2:v3                c5:v6
         r4        c2:v4
         r5        c1:v1    c3:v5       c7:v8

         Actual physical storage of the table

         HFile for CF1            HFile for CF2
         r1:CF1:c1:t1:v1          r1:CF2:c1:t1:v9
         r2:CF1:c1:t2:v2          r1:CF2:c6:t4:v2
         r2:CF1:c3:t3:v6          r3:CF2:c5:t4:v6
         r3:CF1:c2:t1:v3          r5:CF2:c7:t3:v8
         r4:CF1:c2:t1:v4
         r5:CF1:c1:t2:v1
         r5:CF1:c3:t3:v5

         Result object returned for a Get() on row r5 (KeyValue objects)
         r5:CF1:c1:t2:v1
         r5:CF1:c3:t3:v5
         r5:CF2:c7:t3:v8

         Structure of a KeyValue object
         Key:    Row Key | Col Fam | Col Qual | Time Stamp
         Value:  Cell Value




Rowkey design
         • Single most important aspect of designing tables
         • Depends on expected access patterns
         • HFiles are sorted on Key part of KeyValue objects

                "TheRealMT", "info", "email",    1329088321289, "samuel@clemens.org"
                "TheRealMT", "info", "name",     1329088321289, "Mark Twain"
                "TheRealMT", "info", "password", 1329088818321, "abc123"
                "TheRealMT", "info", "password", 1329088321289, "Langhorne"

                HFile for the info column family in the users table




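Write optimized
         • Distribute writes across the cluster
                • Issue most pronounced with time series data
         • Hashing
           hash("TheRealMT") -> evenly distributed (random-looking) byte[]

         • Salting
           import org.apache.hadoop.hbase.util.Bytes;

           // numRegionServers: the number of region servers in the cluster.
           // floorMod keeps the salt in the range [0, numRegionServers).
           int salt = Math.floorMod(Long.hashCode(timestamp), numRegionServers);
           // Bytes.add takes the arrays as separate arguments.
           byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"),
                                     Bytes.toBytes(timestamp));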




Read optimized
         • Data to be accessed together should be stored together
                • eg: twit streams - last 10 twits by the users I follow

                The same rowkeys sorted with the user first (left) and with the number first (right):

                Olivia1            1Olivia
                Olivia2            1TheRealMT
                Olivia5            2Olivia
                Olivia7            2TheFakeMT
                Olivia9            2TheRealMT
                TheFakeMT2         3TheFakeMT
                TheFakeMT3         4TheFakeMT
                TheFakeMT4         5Olivia
                TheFakeMT5         5TheFakeMT
                TheFakeMT6         5TheRealMT
                TheRealMT1         6TheFakeMT
                TheRealMT2         7Olivia
                TheRealMT5         8TheRealMT
                TheRealMT8         9Olivia




Relational to Non relational
         • Relational concepts
                • Entities
                • Attributes
                • Relationships
         • Entities
                • Table is a table. Not much going on there
                • Users table contains... users. Those are entities
                       • Good place to start


Relational to Non relational
         • Attributes
                • Identifying
                       • Primary keys. Compound keys
                       • Maps to rowkeys

                • Non-identifying
                       • Other columns
                       • Maps to column qualifiers and cells

         • Relationships
                • Foreign keys, junction tables, joins.
                • Non-existent in HBase. Instead try to denormalize

Nested Entities
         • Column Qualifiers can contain data instead of just a column name

                hbase table
                  row key
                    column family
                      fixed qualifier → timestamp → value
                      repeating entity (nested entities)
                        variable qualifier → timestamp → value




Schema design summary
         • Schema can make or break the performance you get
         • Rowkey is the single most important thing
                • Use tricks like hashing and salting
         • Denormalize to your advantage
                • There are no joins
         • Isolate access patterns
                • Separate CFs or even separate tables
         • Shorter names -> lower storage footprint
         • Column qualifiers can be used to store data and not just column names
                • Big difference as compared to RDBMS


IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Empfohlen

Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Empfohlen (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

HBase schema design Big Data TechCon Boston

  • 1. HBase schema design
      Amandeep Khurana | Solutions Architect
      Big Data TechCon, Boston, April 2013
      Friday, April 12, 13
  • 2. About me
      • Solutions Architect, Cloudera Inc
      • Amazon Web Services
      • Interested in large scale distributed systems
      • Co-author, HBase In Action (Nick Dimiduk, Amandeep Khurana; Manning)
      • Twitter: amansk
  • 3. About the talk
      • Data model recap
      • Data modeling thought process
      • Tools and techniques
  • 4. HBase is ...
      • Column family oriented database
          • Column family oriented
          • Tables consisting of rows and columns
      • Persisted Map
          • Sparse
          • Multi dimensional
          • Sorted
          • Indexed by rowkey, column and timestamp
      • Key Value store
          • [rowkey, col family, col qualifier, timestamp] -> cell value
  • 5. HBase is not ...
      • A relational database
          • No SQL query language
          • No joins
          • No secondary indexing
          • No transactions
  • 6. Data Model recap
      It's not a relational database system
  • 7. Important terms
      • Table
          • Consists of rows and columns
      • Row
          • Has a bunch of columns
          • Identified by a rowkey (primary key)
      • Column Qualifier
          • Dynamic column name
      • Column Family
          • Column groups, logical and physical (similar access pattern)
      • Cell
          • The actual element that contains the data for a row-column intersection
      • Version
          • Every cell has multiple versions
  • 8. Data coordinates
      • Row is addressed using rowkey
      • Cell is addressed using [rowkey + family + qualifier]
  • 9. Tabular representation
      Column Family - Info
      Rowkey        name                    email                   password
      GrandpaD      Fyodor Dostoyevsky      fyodor@brothers.net     abc123
      HMS_Surprise  Patrick O'Brien         aubrey@sea.com          abc123
      SirDoyle      Sir Arthur Conan Doyle  art@TheQueensMen.co.uk  abc123
      TheRealMT     Mark Twain              samuel@clemens.org      Langhorne (ts1=1329088321289), abc123 (ts2=1329088818321)
      • The table is lexicographically sorted on the rowkeys
      • Each cell has multiple versions, typically represented by the timestamp of when they were inserted into the table (ts2 > ts1)
      • The coordinates used to identify data in an HBase table are: (1) rowkey, (2) column family, (3) column qualifier, (4) version
  • 10. Key-Value store
      Keys                                         Values
      [TheRealMT, info, password, 1329088818321]   abc123
      [TheRealMT, info, password, 1329088321289]   Langhorne
      (each line is a single KeyValue instance)
  • 11. Key-Value store
      1. Start with coordinates of full precision
         [TheRealMT, info, password, 1329088818321] -> abc123
      2. Drop version and you're left with a map of version to values
         [TheRealMT, info, password] ->
         { 1329088818321 : "abc123",
           1329088321289 : "Langhorne" }
      3. Omit qualifier and you have a map of qualifiers to the previous maps
         [TheRealMT, info] ->
         { "email"    : { 1329088321289 : "samuel@clemens.org" },
           "name"     : { 1329088321289 : "Mark Twain" },
           "password" : { 1329088818321 : "abc123",
                          1329088321289 : "Langhorne" } }
      4. Finally, drop the column family and you have a row, a map of maps
         [TheRealMT] ->
         { "info" : {
             "email"    : { 1329088321289 : "samuel@clemens.org" },
             "name"     : { 1329088321289 : "Mark Twain" },
             "password" : { 1329088818321 : "abc123",
                            1329088321289 : "Langhorne" } } }
  • 12. Sorted map of maps
      { "TheRealMT" : {                                          <- Rowkey
          "info" : {                                             <- Column family
            "email"    : { 1329088321289 : "samuel@clemens.org" },   <- Column qualifiers
            "name"     : { 1329088321289 : "Mark Twain" },
            "password" : { 1329088818321 : "abc123",             <- Versions : Values
                           1329088321289 : "Langhorne" } } } }
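The sorted map of maps on slide 12 can be mimicked in plain Java with nested `TreeMap`s. This is only a sketch for building intuition: the class name `SortedMapOfMaps` and its methods are hypothetical, strings stand in for HBase's byte arrays, and nothing here touches the HBase client API.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model: rowkey -> family -> qualifier -> timestamp -> value. */
final class SortedMapOfMaps {
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    /** Stores one version of a cell; the version map keeps the newest timestamp first. */
    void put(String rowkey, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(ts, value);
    }

    /** All versions of a cell (possibly empty), newest first. */
    NavigableMap<Long, String> versions(String rowkey, String family, String qualifier) {
        return rows.getOrDefault(rowkey, new TreeMap<>())
                   .getOrDefault(family, new TreeMap<>())
                   .getOrDefault(qualifier, new TreeMap<>());
    }

    /** Latest version of a cell, or null if the cell does not exist. */
    String get(String rowkey, String family, String qualifier) {
        NavigableMap<Long, String> v = versions(rowkey, family, qualifier);
        return v.isEmpty() ? null : v.firstEntry().getValue();
    }
}
```

Dropping one coordinate at a time, as on slide 11, corresponds to stopping one level earlier in the nested lookup.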
  • 13. HFiles and physical data model
      • HFiles are
          • Immutable
          • Sorted on rowkey + qualifier + timestamp
          • In the context of a column family per region
      HFile for the info column family in the users table:
      "TheRealMT" , "info" , "email"    , 1329088321289 , "samuel@clemens.org"
      "TheRealMT" , "info" , "name"     , 1329088321289 , "Mark Twain"
      "TheRealMT" , "info" , "password" , 1329088818321 , "abc123"
      "TheRealMT" , "info" , "password" , 1329088321289 , "Langhorne"
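The HFile ordering on slide 13 can be illustrated with a comparator. This is a sketch under assumptions: `HFileOrder` and `Cell` are hypothetical names, strings replace byte arrays, and the newest-first timestamp order is read off the sample rows above (abc123 precedes Langhorne) rather than taken from HBase's own comparator code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the sort order inside one column family's HFile. */
final class HFileOrder {
    /** Simplified cell: rowkey, qualifier, timestamp and value within a single family. */
    static final class Cell {
        final String rowkey;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String rowkey, String qualifier, long timestamp, String value) {
            this.rowkey = rowkey;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Rowkey ascending, then qualifier ascending, then timestamp descending. */
    static final Comparator<Cell> ORDER =
            Comparator.comparing((Cell c) -> c.rowkey)
                      .thenComparing((Cell c) -> c.qualifier)
                      .thenComparing(Comparator.comparingLong((Cell c) -> c.timestamp).reversed());

    /** Returns a sorted copy, leaving the input untouched (HFiles are immutable once written). */
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(ORDER);
        return copy;
    }
}
```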
  • 14. Thinking through the design
      ... it's a database after all
  • 15. But isn't HBase schema-less?
      • Number of tables
      • Rowkey design
      • Number of column families per table. What goes into what column family
      • Column qualifier names
      • What goes into the cells
      • Number of versions
  • 16. Rowkeys
      • Rowkey design is the single most important aspect of HBase table designs
      • The only way to address rows in HBase
  • 17. TwitBase relationships
      • Users follow users
      • Relationships need to be persisted for usage later on
      • Model tables for the expected access patterns
      • Read pattern
          • Who does A follow?
          • Who follows A?
          • Does A follow B?
      • Write pattern
          • A follows B
          • A unfollows B
  • 18. Start simple
      • Adjacency list
      Column Family: follows
      row key: userid
      column qualifier: followed user number
      cell value: followed userid
      follows
      TheFakeMT   1:TheRealMT  2:MTFanBoy  3:Olivia  4:HRogers
      TheRealMT   1:HRogers    2:Olivia
      (each entry is Col Qualifier:Cell value)
  • 19. Optimizing the adjacency list
      • We need a count
      • Where does a new followed user go?
      follows
      TheFakeMT   1:TheRealMT  2:MTFanBoy  3:Olivia  4:HRogers  count:4
      TheRealMT   1:HRogers    2:Olivia    count:2
  • 20. Adding a new user
      Row that needs to be updated: TheFakeMT
      follows
      TheFakeMT   1:TheRealMT  2:MTFanBoy  3:Olivia  4:HRogers  count:4
      TheRealMT   1:HRogers    2:Olivia    count:2
      Client code:
      Step 1: Get current count            TheFakeMT : follows: {count -> 4}
      Step 2: Update (increment) count     TheFakeMT : follows: {count -> 5}
      Step 3: Add new entry                TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}
      Step 4: Write the new data to HBase
      follows
      TheFakeMT   1:TheRealMT  2:MTFanBoy  3:Olivia  4:HRogers  5:MTFanBoy2  count:5
      TheRealMT   1:HRogers    2:Olivia    count:2
  • 21. Transactions == not good
      • HBase doesn't have native support (think scale)
      • Don't want to complicate client side logic
      • Only solution -> simplify schema
      follows
      TheFakeMT   TheRealMT:1  MTFanBoy:1  Olivia:1  HRogers:1
      TheRealMT   HRogers:1    Olivia:1
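To see why the simplified schema on slide 21 avoids the read-then-write sequence of slide 20, here is a small in-memory sketch. `FollowsTable` and its methods are hypothetical and only model a single `follows` column family with the followed user id as the column qualifier; it is not HBase client code.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical model of the slide 21 layout: row key = follower, qualifier = followed user id. */
final class FollowsTable {
    // row key -> (column qualifier -> cell value)
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** "A follows B": one blind write, no counter to read first, safe to repeat. */
    void follow(String follower, String followed) {
        rows.computeIfAbsent(follower, k -> new TreeMap<>()).put(followed, "1");
    }

    /** "A unfollows B": delete a single column. */
    void unfollow(String follower, String followed) {
        NavigableMap<String, String> row = rows.get(follower);
        if (row != null) {
            row.remove(followed);
        }
    }

    /** "Does A follow B?": look up one column in one row. */
    boolean doesFollow(String follower, String followed) {
        NavigableMap<String, String> row = rows.get(follower);
        return row != null && row.containsKey(followed);
    }

    /** "Who does A follow?": read the qualifiers of one row. */
    Set<String> followedBy(String follower) {
        NavigableMap<String, String> row = rows.get(follower);
        return row == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(row.keySet());
    }
}
```

Because the qualifier is the followed user id, writing the same relationship twice overwrites the same cell, so no count column has to be kept consistent by the client.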
  • 22. Revisit the questions
      • Read pattern
          • Who all does A follow?
          • Who all follows A?
          • Does A follow B?
      • Write pattern
          • A follows B
          • A unfollows B
  • 23. Revisit the questions
  • 24. Denormalization
      • Second table for reverse relationship
      • Otherwise scan across entire table and affect read performance
      (Diagram labels: Normalization, Denormalization, Write performance, Read performance, Dreamland, Poor design)
  • 25. More optimizations
      • Convert into tall-narrow table
      • Leverage rowkey indexing better
      • Gets -> short Scans
      CF: f
      row key: follower + followed
      CQ: followed user's name
      cell value: 1
      Keeping the column family and column qualifier names short reduces the data transferred over the network back to the client. The KeyValue objects become smaller.
      The + in the row key refers to concatenating the two values. You could delimit them using any character you like, e.g. A-B or A,B
  • 26. Tall-narrow table example
      • Denormalization is the way to go
      f
      TheFakeMT+TheRealMT   Mark Twain:1
      TheFakeMT+MTFanBoy    Amandeep Khurana:1
      TheFakeMT+Olivia      Olivia Clemens:1
      TheFakeMT+HRogers     Henry Rogers:1
      TheRealMT+Olivia      Olivia Clemens:1
      TheRealMT+HRogers     Henry Rogers:1
      Putting the user name in the column qualifier saves you from looking up the users table for the name of the user given an id. You can simply list out names or ids while looking at relationships just from this table. The downside of this is that you need to update the name in all the cells if the user updates their name in their profile. This is classic denormalization.
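The "Gets -> short Scans" idea for the tall-narrow table can be sketched with a sorted map whose keys are `follower+followed` strings. The class `TallNarrowFollows` is hypothetical, and the `+` delimiter is simply the one used on the slides; a real table would scan by rowkey prefix through the HBase client instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical model of the tall-narrow layout: one row per (follower, followed) pair. */
final class TallNarrowFollows {
    private static final String DELIMITER = "+";

    // row key "follower+followed" -> followed user's name (the qualifier on slide 26)
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void follow(String follower, String followed, String followedName) {
        rows.put(follower + DELIMITER + followed, followedName);
    }

    /** "Does A follow B?" is a point lookup on the full row key. */
    boolean doesFollow(String follower, String followed) {
        return rows.containsKey(follower + DELIMITER + followed);
    }

    /** "Who does A follow?" is a short scan over the contiguous keys sharing A's prefix. */
    List<String> followedBy(String follower) {
        String prefix = follower + DELIMITER;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // keys are sorted, so the first mismatch ends the scan
            }
            result.add(e.getKey().substring(prefix.length()));
        }
        return result;
    }
}
```

Including the delimiter in the scan prefix keeps a follower named `TheFakeMT` from also matching a hypothetical `TheFakeMT2`.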
  • 27. Uniform rowkey length
      • MD5 the userids -> 16 bytes + 16 bytes rowkeys
      • Better distribution of load
      CF: f
      row key: md5(follower)md5(followed)
      CQ: followed userid
      cell value: followed user's name
      Using MD5 of the user ids gives you fixed lengths instead of variable length user ids. You don't need concatenation logic anymore.
  • 28. Uniform rowkey length (continued)
      f
      MD5(TheFakeMT) MD5(TheRealMT)   TheRealMT:Mark Twain
      MD5(TheFakeMT) MD5(MTFanBoy)    MTFanBoy:Amandeep Khurana
      MD5(TheFakeMT) MD5(Olivia)      Olivia:Olivia Clemens
      MD5(TheFakeMT) MD5(HRogers)     HRogers:Henry Rogers
      MD5(TheRealMT) MD5(Olivia)      Olivia:Olivia Clemens
      MD5(TheRealMT) MD5(HRogers)     HRogers:Henry Rogers
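A fixed-length rowkey of the form md5(follower)md5(followed) can be built with the JDK alone. The sketch below assumes user ids are UTF-8 strings; the class `Md5RowKey` and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for 32-byte rowkeys: md5(follower) followed by md5(followed). */
final class Md5RowKey {
    static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    /** MD5 digest of a user id, assuming UTF-8 encoding. */
    static byte[] md5(String userId) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** 16 bytes + 16 bytes: no delimiter is needed because both halves have fixed length. */
    static byte[] rowkey(String follower, String followed) {
        byte[] key = new byte[2 * MD5_LENGTH];
        System.arraycopy(md5(follower), 0, key, 0, MD5_LENGTH);
        System.arraycopy(md5(followed), 0, key, MD5_LENGTH, MD5_LENGTH);
        return key;
    }

    /** All relationships of one follower share this 16-byte prefix. */
    static byte[] scanPrefix(String follower) {
        return md5(follower);
    }
}
```

The trade-off visible here is that the original user id cannot be read back from the rowkey, which is presumably why slide 27 moves the followed userid into the column qualifier.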
  • 29. Tall v/s Wide tables storage footprint
      Logical representation of an HBase table. We'll look at what it means to Get() row r5 from this table.
            CF1              CF2
      r1    c1:v1            c1:v9  c6:v2
      r2    c1:v2  c3:v6
      r3    c2:v3            c5:v6
      r4    c2:v4
      r5    c1:v1  c3:v5     c7:v8
      Actual physical storage of the table
      HFile for CF1         HFile for CF2
      r1:CF1:c1:t1:v1       r1:CF2:c1:t1:v9
      r2:CF1:c1:t2:v2       r1:CF2:c6:t4:v2
      r2:CF1:c3:t3:v6       r3:CF2:c5:t4:v6
      r3:CF1:c2:t1:v3       r5:CF2:c7:t3:v8
      r4:CF1:c2:t1:v4
      r5:CF1:c1:t2:v1
      r5:CF1:c3:t3:v5
      Result object returned for a Get() on row r5 (KeyValue objects)
      r5:CF1:c1:t2:v1
      r5:CF1:c3:t3:v5
      r5:CF2:c7:t3:v8
      Structure of a KeyValue object
      Key: Row Key, Col Fam, Col Qual, Time Stamp
      Value: Cell Value
  • 30. Rowkey design
      • Single most important aspect of designing tables
      • Depends on expected access patterns
      • HFiles are sorted on Key part of KeyValue objects
      HFile for the info column family in the users table:
      "TheRealMT" , "info" , "email"    , 1329088321289 , "samuel@clemens.org"
      "TheRealMT" , "info" , "name"     , 1329088321289 , "Mark Twain"
      "TheRealMT" , "info" , "password" , 1329088818321 , "abc123"
      "TheRealMT" , "info" , "password" , 1329088321289 , "Langhorne"
  • 31. Write optimized
      • Distribute writes across the cluster
      • Issue most pronounced with time series data
      • Hashing
          hash("TheRealMT") -> random byte[]
      • Salting
          int salt = new Integer(new Long(timestamp).hashCode()).shortValue() % <number of region servers>;
          byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));
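The salting step on slide 31 can be tried without the HBase `Bytes` utility. The sketch below is an assumption-laden variant, not the slide's exact code: `SaltedKey` is a hypothetical class, `buckets` stands in for the number of region servers, and `Math.floorMod` is swapped in so the salt can never be negative.

```java
import java.nio.ByteBuffer;

/** Hypothetical salted rowkey: 4-byte salt, a '|' delimiter, then the 8-byte timestamp. */
final class SaltedKey {
    /** Builds salt|timestamp, where the salt is derived from the timestamp itself. */
    static byte[] rowkey(long timestamp, int buckets) {
        // floorMod keeps the salt in [0, buckets) even when the hash is negative.
        int salt = Math.floorMod(Long.hashCode(timestamp), buckets);
        return ByteBuffer.allocate(Integer.BYTES + 1 + Long.BYTES)
                .putInt(salt)
                .put((byte) '|')
                .putLong(timestamp)
                .array();
    }

    /** Reads the salt prefix back from a rowkey produced by rowkey(). */
    static int saltOf(byte[] rowkey) {
        return ByteBuffer.wrap(rowkey).getInt();
    }
}
```

Because the salt is a deterministic function of the timestamp, a reader that knows the timestamp can recompute the full key; a time-range read, however, has to issue one scan per salt value.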
  • 32. Read optimized
      • Data to be accessed together should be stored together
      • e.g. twit streams: last 10 twits by the users I follow
      Rowkeys with the user id first:
      Olivia1, Olivia2, Olivia5, Olivia7, Olivia9, TheFakeMT2, TheFakeMT3, TheFakeMT4, TheFakeMT5, TheFakeMT6, TheRealMT1, TheRealMT2, TheRealMT5, TheRealMT8
      Rowkeys with the number first:
      1Olivia, 1TheRealMT, 2Olivia, 2TheFakeMT, 2TheRealMT, 3TheFakeMT, 4TheFakeMT, 5Olivia, 5TheFakeMT, 5TheRealMT, 6TheFakeMT, 7Olivia, 8TheRealMT, 9Olivia
  • 33. Relational to Non relational
      • Relational concepts
          • Entities
          • Attributes
          • Relationships
      • Entities
          • Table is a table. Not much going on there
          • Users table contains... users. Those are entities
          • Good place to start
  • 34. Relational to Non relational
      • Attributes
          • Identifying
              • Primary keys. Compound keys
              • Maps to rowkeys
          • Non-identifying
              • Other columns
              • Maps to column qualifiers and cells
      • Relationships
          • Foreign keys, junction tables, joins
          • Non-existent in HBase. Instead try to denormalize
  • 35. Nested Entities
      • Column Qualifiers can contain data instead of just a column name
      (Diagram: hbase table -> row key -> column family -> fixed qualifier -> timestamp -> value; nested entities: repeating entity -> variable qualifier -> timestamp -> value)
  • 36. Schema design summary
      • Schema can make or break the performance you get
      • Rowkey is the single most important thing
          • Use tricks like hashing and salting
      • Denormalize to your advantage
          • There are no joins
      • Isolate access patterns
          • Separate CFs or even separate tables
      • Shorter names -> lower storage footprint
      • Column qualifiers can be used to store data and not just column names
          • Big difference as compared to RDBMS