Building applications with HBase
        Amandeep Khurana | Solutions Architect
        Data Days Texas, Austin, March 2013

Monday, April 1, 13

About me
     • Solutions Architect, Cloudera Inc
     • Amazon Web Services
     • Interested in large scale distributed systems
     • Co-author, HBase in Action
     • Twitter: amansk

I’m a wanna-be (author)
        [Book cover: Nick Dimiduk, Amandeep Khurana, MANNING]
        www.hbaseinaction.com

About the workshop
     • Install HBase
     • Basic interactions with the shell
     • Use of HBase Java API
     • Data model
     • HBase and Hadoop integration

Get HBase and tutorial instructions
     • Get it here -> http://hbase.apache.org/
     • Downloads -> Choose a mirror -> 0.94.1
     • Download the file hbase-0.94.1.tar.gz
     • Get the tutorial handout -> http://people.apache.org/~ak/ddtx13

What’s Big Data?
         ... except another marketing term

Data management

Data has changed
        [Figure: data growth from 1980 to 2012, driven by end-user applications, the Internet, mobile devices and sophisticated machines. Unstructured data: 90%. Structured data: 10%.]

“Big Data”

Hadoop - Big Data poster child
        [Figure: the Hadoop core: HDFS, MapReduce]

Hadoop - Big Data poster child
        [Figure: the Hadoop ecosystem around the HDFS, MapReduce core]
      • File System Mount: FUSE-DFS
      • UI Framework: HUE
      • Machine Learning: Apache Mahout
      • Workflow: Apache Oozie
      • Scheduling: Apache Oozie
      • Metadata: Apache Hive
      • Languages / Compilers: Apache Pig, Apache Hive, Apache Mahout
      • Data Integration: Apache Flume, Apache Sqoop
      • Fast Read/Write Access: Apache HBase
      • Coordination: Apache ZooKeeper

What is HBase?
      • Scalable, distributed data store
      • Sorted map of maps
      • Open source avatar of Google’s Bigtable
      • Sparse
      • Multi dimensional
      • Tightly integrated with Hadoop
      • Not an RDBMS

So, what’s the big deal?
      • Give up system complexity and sophisticated features for ease of scale and manageability
      • Simple system != easy to build applications on top
      • Goal: predictable read and write performance at scale
            • Effective API usage
            • Optimal schema design
            • Understand internals and leverage at scale

Use cases
      • Canonical use case: storing crawl data and indices for search

        [Figure: Web Search powered by Bigtable. Components shown: Internets, Crawlers, Bigtable, MapReduce, Search Indexes, Web Search, You. The numbers below refer to the arrows in the figure.]

        Indexing the Internet
        1 Crawlers constantly scour the Internet for new pages. Those pages are stored as individual records in Bigtable.
        2 A MapReduce job runs over the entire table, generating search indexes for the Web Search application.

        Searching the Internet
        3 The user initiates a Web Search request.
        4 The Web Search application queries the Search Indexes and retrieves matching documents directly from Bigtable.
        5 Search results are presented to the user.

Use cases (continued)
      • Capturing incremental data
      • Content serving
      • Information exchange

Ok, enough of high level stuff
      • Big data is a buzzword but there is more to it
          • New products, applications, types of data
      • Hadoop ecosystem is the poster child of big data architectures
          • Consists of several components
      • HBase is a part of the ecosystem
          • Scalable database
          • Lots of production deployments

HBase - Getting Started
         Installation and basic interaction

Installing HBase
      • Get it here -> http://hbase.apache.org/
      • Downloads -> Choose a mirror -> 0.94.1
      • Download the file hbase-0.94.1.tar.gz
      • Untar it
          • mkdir ~/ddtx
          • cp hbase-0.94.1.tar.gz ~/ddtx
          • cd ~/ddtx
          • tar xvf hbase-0.94.1.tar.gz

Basics
      • Modes of operation
           • Standalone
           • Pseudo-distributed
           • Fully distributed
      • Start up in standalone mode
           • cd hbase-0.94.1
           • bin/start-hbase.sh
           • Logs are in ./logs/
      • Check out http://localhost:60010 in your browser

Shell
      • Basic interaction tool
      • Written in JRuby
      • Uses the Java client library
      • Good place to start poking around

Shell (continued)
      • bin/hbase shell
      • Create table
           • create 'users', 'info'
      • List tables
           • list
      • Describe table
           • describe 'users'

Shell (continued)
      • Put a row
           • put 'users', 'u1', 'info:name', 'Mr. Rogers'
      • Get a row
           • get 'users', 'u1'
      • Put more
           • put 'users', 'u1', 'info:email', 'mail@rogers.com'
           • put 'users', 'u1', 'info:password', 'foobar'
      • Get a row
           • get 'users', 'u1'
      • Scan table
           • scan 'users'

Java API
         ... can’t really build an application just with the shell

Important terms
      • Table
           • Consists of rows and columns
      • Row
           • Has a bunch of columns.
           • Identified by a rowkey (primary key)
      • Column Qualifier
           • Dynamic column name
      • Column Family
           • Column groups - logical and physical (similar access pattern)
      • Cell
           • The actual element that contains the data for a row-column intersection
      • Version
           • Every cell has multiple versions.

Introducing TwitBase
      • Twitter clone built on HBase
      • Designed to scale
      • It’s an example, not a production app
      • Start with users and twits for now
      • Users
           • Have profiles
           • Post twits
      • Twits
           • Have text posted by users
      • Get sample code at: https://github.com/hbaseinaction/twitbase

Connecting to HBase
      •   Using default confs:
          HTableInterface usersTable = new HTable("users");

      •   Passing custom conf:
          Configuration myConf = HBaseConfiguration.create();
          myConf.set("parameter_name", "parameter_value");
          HTableInterface usersTable = new HTable(myConf, "users");

      •   Connection pooling
          HTablePool pool = new HTablePool();
          HTableInterface usersTable = pool.getTable("users");
          // work with the table
          usersTable.close();   // hands the table back to the pool

Inserting data
      •   Put call on HTable
          Put p = new Put(Bytes.toBytes("TheRealMT"));
          p.add(Bytes.toBytes("info"),
            Bytes.toBytes("name"),
            Bytes.toBytes("Mark Twain"));
          p.add(Bytes.toBytes("info"),
            Bytes.toBytes("email"),
            Bytes.toBytes("samuel@clemens.org"));
          p.add(Bytes.toBytes("info"),
            Bytes.toBytes("password"),
            Bytes.toBytes("Langhorne"));
          usersTable.put(p);

Data coordinates
      • Row is addressed using rowkey
      • Cell is addressed using [rowkey + family + qualifier]

Updating data
      • Update == Write
          Put p = new Put(Bytes.toBytes("TheRealMT"));
          p.add(Bytes.toBytes("info"),
            Bytes.toBytes("password"),
            Bytes.toBytes("abc123"));
          usersTable.put(p);

Write path
        [Figure: the Client writes to the WAL and the Memstore; existing HFiles sit on disk]
        Write goes to the write-ahead log (WAL) as well as the in-memory write buffer called the Memstore.
        Clients don't interact directly with the underlying HFiles during writes.

        [Figure: the Memstore becomes a New HFile next to the existing HFiles]
        Memstore gets flushed to disk to form a new HFile when filled up with writes.
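The write path above can be mimicked in a few lines of plain Java. This is only a toy model, not HBase code: the class name `WritePathSketch`, the entry-count flush threshold and the string keys are assumptions made for illustration (the real Memstore flushes on size in bytes and the WAL lives on HDFS).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: WAL append, Memstore insert, flush to an immutable HFile. */
final class WritePathSketch {
    private final int flushThreshold;                    // hypothetical: counted in entries, not bytes
    final List<String> wal = new ArrayList<>();          // stands in for the write-ahead log
    TreeMap<String, String> memstore = new TreeMap<>();  // sorted in-memory write buffer
    final List<TreeMap<String, String>> hfiles = new ArrayList<>(); // flushed files, never modified again

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. record the edit in the log first
        memstore.put(key, value);     // 2. buffer it in sorted order
        if (memstore.size() >= flushThreshold) {
            flush();                  // 3. a full Memstore becomes a new HFile
        }
    }

    private void flush() {
        hfiles.add(memstore);         // the flushed map is treated as immutable from here on
        memstore = new TreeMap<>();
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch(2);
        region.put("TheRealMT/info:name", "Mark Twain");
        region.put("TheRealMT/info:email", "samuel@clemens.org");
        region.put("TheRealMT/info:password", "Langhorne");
        System.out.println("WAL entries: " + region.wal.size());
        System.out.println("HFiles:      " + region.hfiles.size());
        System.out.println("Memstore:    " + region.memstore);
    }
}
```

Note that the client only ever calls `put`; the HFile list changes as a side effect of a flush, which matches the point that clients do not touch HFiles during writes.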

Reading data
      •   Get the entire row
          Get g = new Get(Bytes.toBytes("TheRealMT"));
          Result r = usersTable.get(g);

      •   Get selected columns
          Get g = new Get(Bytes.toBytes("TheRealMT"));
          g.addColumn(Bytes.toBytes("info"),
            Bytes.toBytes("password"));
          Result r = usersTable.get(g);

      •   Extract the value out of the Result object
          byte[] b = r.getValue(Bytes.toBytes("info"),
            Bytes.toBytes("email"));
          String email = Bytes.toString(b);

Reading data (continued)
      •   Scan
          Scan s = new Scan(startRow, stopRow);   // stopRow is exclusive
          ResultScanner rs = twits.getScanner(s);
          for (Result r : rs) {
              // iterate over scanner results
          }
          rs.close();   // release the scanner when done

      •   Single-threaded iteration over a defined set of rows. Could be the entire table too.

What happens when you read data?
        [Figure: a Client read is served from the BlockCache, the Memstore and the HFiles]

Deleting data
      • Another simple API call
      • Delete entire row
          Delete d = new Delete(Bytes.toBytes("TheRealMT"));
          usersTable.delete(d);

      • Delete selected columns
          Delete d = new Delete(Bytes.toBytes("TheRealMT"));
          d.deleteColumns(Bytes.toBytes("info"),
            Bytes.toBytes("email"));
          usersTable.delete(d);

Delete internals
      • Tombstone markers
      • Deletion only happens during major compactions
      • Compactions
          • Minor
          • Major
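A rough way to picture tombstones is an append-only list in which a delete adds a marker rather than removing anything. The sketch below is a simplification under stated assumptions: `TombstoneSketch` is a made-up class, a `null` value is used as the tombstone encoding, and only one version per key survives a compaction. It is not how HBase stores data internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: deletes append tombstones; data is physically dropped only by majorCompact(). */
final class TombstoneSketch {
    /** One immutable entry; value == null marks a tombstone (hypothetical encoding). */
    static final class Entry {
        final String key;
        final String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
        boolean isTombstone() { return value == null; }
    }

    final List<Entry> entries = new ArrayList<>();   // append-only, oldest first

    void put(String key, String value) { entries.add(new Entry(key, value)); }

    /** A delete does not remove anything; it appends a marker that masks older entries. */
    void delete(String key) { entries.add(new Entry(key, null)); }

    /** Reads look at the newest entry for the key; a tombstone reads as "not found". */
    String get(String key) {
        for (int i = entries.size() - 1; i >= 0; i--) {
            Entry e = entries.get(i);
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    /** Rewrites the data, keeping the newest entry per key and dropping tombstoned keys. */
    void majorCompact() {
        Map<String, Entry> newest = new LinkedHashMap<>();
        for (Entry e : entries) {
            newest.put(e.key, e);      // later entries replace earlier ones
        }
        entries.clear();
        for (Entry e : newest.values()) {
            if (!e.isTombstone()) {
                entries.add(e);
            }
        }
    }

    public static void main(String[] args) {
        TombstoneSketch store = new TombstoneSketch();
        store.put("TheRealMT/info:email", "samuel@clemens.org");
        store.delete("TheRealMT/info:email");
        System.out.println("visible value:  " + store.get("TheRealMT/info:email"));
        System.out.println("stored entries: " + store.entries.size());
        store.majorCompact();
        System.out.println("after major compaction: " + store.entries.size());
    }
}
```

Between the delete and the compaction the old value is invisible to readers but still occupies space, which is the behaviour the slide describes.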

Compactions
        [Figure: Memstore and several HFiles; some of the HFiles are merged into one Compacted HFile]
        Two or more HFiles get combined to form a single HFile during a compaction.
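The merge step can be illustrated with sorted maps standing in for HFiles. This is a minimal sketch assuming one value per key and files passed oldest first; `CompactionSketch` and `compact` are illustrative names, not HBase classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of a compaction: several sorted files are merged into one sorted file. */
final class CompactionSketch {
    /** Merges files given oldest first; when a key appears more than once, the newer file wins. */
    static TreeMap<String, String> compact(List<TreeMap<String, String>> hfilesOldestFirst) {
        TreeMap<String, String> compacted = new TreeMap<>();
        for (TreeMap<String, String> hfile : hfilesOldestFirst) {
            compacted.putAll(hfile);   // later (newer) files overwrite earlier values
        }
        return compacted;
    }

    public static void main(String[] args) {
        TreeMap<String, String> older = new TreeMap<>();
        older.put("TheRealMT/info:name", "Mark Twain");
        older.put("TheRealMT/info:password", "Langhorne");

        TreeMap<String, String> newer = new TreeMap<>();
        newer.put("TheRealMT/info:email", "samuel@clemens.org");
        newer.put("TheRealMT/info:password", "abc123");

        List<TreeMap<String, String>> hfiles = new ArrayList<>();
        hfiles.add(older);
        hfiles.add(newer);
        System.out.println(compact(hfiles));
    }
}
```

The inputs are left untouched and a new map is returned, mirroring the idea that HFiles are immutable and a compaction writes a fresh file.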

Hands-on: Java client to HBase
      • Create a table from the HBase shell
           • create 'users', 'info'
      • Write a program to do the following
           • Add 10 users to the 'users' table.
                • userid as rowkey
                • user’s name in the column 'info:name'
                • userid could just be user + serial number. e.g. user1, user2, user3
           • Get a user from the 'users' table and display on screen
           • Scan the entire 'users' table and display on screen
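One thing worth noticing when scanning the table in this exercise is the order in which the rows come back. HBase sorts rowkeys lexicographically as bytes, so user10 sorts before user2. The sketch below reproduces that ordering with a plain `TreeMap` (for ASCII keys, Java string order matches byte order). `RowkeyOrderSketch` and the zero-padding option are illustrative assumptions, not part of the workshop material.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Shows the scan order of rowkeys built as prefix + serial number. */
final class RowkeyOrderSketch {
    /** Returns the rowkeys in the order a full scan of a sorted table would yield them. */
    static List<String> scanOrder(String prefix, int count, boolean zeroPad) {
        TreeMap<String, String> table = new TreeMap<>();   // stands in for the sorted 'users' table
        for (int i = 1; i <= count; i++) {
            // Zero padding is one common way to make numeric suffixes sort naturally.
            String rowkey = zeroPad ? String.format("%s%02d", prefix, i) : prefix + i;
            table.put(rowkey, "name-" + i);
        }
        return new ArrayList<>(table.keySet());
    }

    public static void main(String[] args) {
        System.out.println(scanOrder("user", 10, false));
        // [user1, user10, user2, user3, user4, user5, user6, user7, user8, user9]
        System.out.println(scanOrder("user", 10, true));
        // [user01, user02, user03, user04, user05, user06, user07, user08, user09, user10]
    }
}
```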

DAO
      • Like any other application
      • Makes data access management easy
           • Cleaner code
           • Table info at a single place
      • Optimizations like connection pooling etc
      • Easier management of database configs

Admin
     HBaseAdmin
     • Administrative API to the cluster.
     • Provides an interface to manage HBase database tables metadata and general administrative functions.

     Table Operations
     • createTable()
     • deleteTable()
     • addColumn()
     • modifyColumn()
     • deleteColumn()
     • listTables()


Data Model
         It’s not a relational database system

HBase is ...
      • Column family oriented database
          • Column family oriented
          • Tables consisting of rows and columns
      • Persisted Map
          • Sparse
          • Multi dimensional
          • Sorted
          • Indexed by rowkey, column and timestamp
      • Key Value store
          • [rowkey, col family, col qualifier, timestamp] -> cell value

HBase is not ...
      • A relational database
           • No SQL query language
           • No joins
           • No secondary indexing
           • No transactions

Tabular representation

        Column Family - Info (2)
        Rowkey (1)     | name (3)               | email (3)               | password (3)
        GrandpaD       | Mark Twain             | samuel@clemens.org      | abc123
        HMS_Surprise   | Patrick O'Brien        | aubrey@sea.com          | abc123
        SirDoyle       | Fyodor Dostoyevsky     | fyodor@brothers.net     | abc123
        TheRealMT      | Sir Arthur Conan Doyle | art@TheQueensMen.co.uk  | Langhorne (ts1=1329088321289), abc123 (ts2=1329088818321) (4)

        The table is lexicographically sorted on the rowkeys.
        Each cell has multiple versions, typically represented by the timestamp of when they were inserted into the table (ts2>ts1).
        The coordinates used to identify data in an HBase table are: (1) rowkey, (2) column family, (3) column qualifier, (4) version

Key-Value store

        Keys                                            Values
        [TheRealMT, info, password, 1329088818321]  ->  abc123
        [TheRealMT, info, password, 1329088321289]  ->  Langhorne

        [Callout on one key and its value: a single KeyValue instance]

Key-Value store

        1 Start with coordinates of full precision

          [TheRealMT, info, password, 1329088818321] -> abc123

        2 Drop version and you're left with a map of version to values

          [TheRealMT, info, password] ->
          {
              1329088818321 : "abc123",
              1329088321289 : "Langhorne"
          }

        3 Omit qualifier and you have a map of qualifiers to the previous maps

          [TheRealMT, info] ->
          {
              "email" : {
                 1329088321289 : "samuel@clemens.org"
              },
              "name" : {
                 1329088321289 : "Mark Twain"
              },
              "password" : {
                 1329088818321 : "abc123",
                 1329088321289 : "Langhorne"
              }
          }

        4 Finally, drop the column family and you have a row, a map of maps

          [TheRealMT] ->
          {
              "info" : {
                 "email" : {
                    1329088321289 : "samuel@clemens.org"
                 },
                 "name" : {
                    1329088321289 : "Mark Twain"
                 },
                 "password" : {
                    1329088818321 : "abc123",
                    1329088321289 : "Langhorne"
                 }
              }
          }

Sorted map of maps

        {
            "TheRealMT" : {                                   <- Rowkey
              "info" : {                                      <- Column family
                 "email" : {                                  <- Column qualifier
                    1329088321289 : "samuel@clemens.org"      <- Version : Value
                 },
                 "name" : {
                    1329088321289 : "Mark Twain"
                 },
                 "password" : {
                    1329088818321 : "abc123",
                    1329088321289 : "Langhorne"
                 }
              }
            }
        }
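The same structure can be written down directly as nested sorted maps in Java. This is a sketch for building intuition only, assuming string keys instead of byte arrays; `SortedMapOfMapsSketch` and its method names are made up for this example and are not the HBase client API.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** rowkey -> column family -> column qualifier -> timestamp (newest first) -> value */
final class SortedMapOfMapsSketch {
    final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String rowkey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // Versions are kept in descending timestamp order so the newest comes first.
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Drops the version coordinate: returns the map of timestamp to value for one cell. */
    NavigableMap<Long, String> versions(String rowkey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = rows.get(rowkey);
        if (families == null) {
            return new TreeMap<>();
        }
        NavigableMap<String, NavigableMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) {
            return new TreeMap<>();
        }
        NavigableMap<Long, String> cell = qualifiers.get(qualifier);
        return cell == null ? new TreeMap<>() : cell;
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String latest(String rowkey, String family, String qualifier) {
        NavigableMap<Long, String> cell = versions(rowkey, family, qualifier);
        return cell.isEmpty() ? null : cell.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SortedMapOfMapsSketch table = new SortedMapOfMapsSketch();
        table.put("TheRealMT", "info", "password", 1329088321289L, "Langhorne");
        table.put("TheRealMT", "info", "password", 1329088818321L, "abc123");
        table.put("TheRealMT", "info", "name", 1329088321289L, "Mark Twain");
        System.out.println(table.latest("TheRealMT", "info", "password"));
        System.out.println(table.versions("TheRealMT", "info", "password"));
        System.out.println(table.rows.get("TheRealMT"));
    }
}
```

Each lookup method corresponds to dropping one coordinate: `versions` omits the timestamp, and `rows.get(rowkey)` omits everything but the rowkey, returning the row as a map of maps.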

HFiles and physical data model
      • HFiles are
           • Immutable
           • Sorted on rowkey + qualifier + timestamp
           • In the context of a column family per region

        "TheRealMT", "info", "email",    1329088321289, "samuel@clemens.org"
        "TheRealMT", "info", "name",     1329088321289, "Mark Twain"
        "TheRealMT", "info", "password", 1329088818321, "abc123"
        "TheRealMT", "info", "password", 1329088321289, "Langhorne"

        HFile for the info column family in the users table
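The ordering in the sample HFile above (rowkey, then qualifier, then newest timestamp first) can be expressed as a comparator. The sketch below is an assumption-laden illustration: the `KeyValue` class here is a simplified stand-in defined locally, not HBase's own class, and it compares strings rather than raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sorts simplified key-values the way the sample HFile lists them. */
final class KeyValueOrderSketch {
    /** Simplified stand-in for one stored cell version (hypothetical, not the HBase class). */
    static final class KeyValue {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        KeyValue(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + ", " + family + ", " + qualifier + ", " + timestamp + ", " + value;
        }
    }

    /** Row, family and qualifier ascending; timestamp descending so the newest version comes first. */
    static final Comparator<KeyValue> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.qualifier.compareTo(b.qualifier);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp);
        return c;
    };

    static List<KeyValue> sorted(List<KeyValue> input) {
        List<KeyValue> copy = new ArrayList<>(input);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<KeyValue> kvs = new ArrayList<>();
        kvs.add(new KeyValue("TheRealMT", "info", "password", 1329088321289L, "Langhorne"));
        kvs.add(new KeyValue("TheRealMT", "info", "name", 1329088321289L, "Mark Twain"));
        kvs.add(new KeyValue("TheRealMT", "info", "password", 1329088818321L, "abc123"));
        kvs.add(new KeyValue("TheRealMT", "info", "email", 1329088321289L, "samuel@clemens.org"));
        for (KeyValue kv : sorted(kvs)) {
            System.out.println(kv);
        }
    }
}
```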

HBase in distributed mode
         ... what really goes on under the hood

Distributed mode
      • Table is split into Regions

        Full table T1
        00001   John    415-111-1234
        00002   Paul    408-432-9922
        00003   Ron     415-993-2124
        00004   Bob     818-243-9988
        00005   Carly   206-221-9123
        00006   Scott   818-231-2566
        00007   Simon   425-112-9877
        00008   Lucas   415-992-4432
        00009   Steve   530-288-9832
        00010   Kelly   916-992-1234
        00011   Betty   650-241-1192
        00012   Anne    206-294-1298

        Table T1 split into 3 regions - R1, R2 and R3
        T1R1: rows 00001 (John), 00002 (Paul), 00003 (Ron), 00004 (Bob)
        T1R2: rows 00005 (Carly), 00006 (Scott), 00007 (Simon), 00008 (Lucas)
        T1R3: rows 00009 (Steve), 00010 (Kelly), 00011 (Betty), 00012 (Anne)
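Because regions cover contiguous, sorted rowkey ranges, finding the region for a rowkey amounts to finding the greatest region start key that is less than or equal to it. The sketch below shows that idea with a `TreeMap`; the class `RegionLookupSketch` and the start keys ("" for the first region, then "00005" and "00009") are assumptions chosen to match the split shown above, not something read from a real cluster.

```java
import java.util.Map;
import java.util.TreeMap;

/** Maps a rowkey to the region whose key range contains it. */
final class RegionLookupSketch {
    // region start key -> region name; each region runs up to the next start key
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** Returns the region with the greatest start key that is <= rowkey, or null if none. */
    String regionFor(String rowkey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowkey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch t1 = new RegionLookupSketch();
        t1.addRegion("", "T1R1");        // first region starts at the empty key
        t1.addRegion("00005", "T1R2");
        t1.addRegion("00009", "T1R3");
        System.out.println(t1.regionFor("00003"));   // T1R1
        System.out.println(t1.regionFor("00005"));   // T1R2
        System.out.println(t1.regionFor("00012"));   // T1R3
    }
}
```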

Distributed mode (continued)
      • Regions are served by RegionServers
      • RegionServers are Java processes, typically collocated with the DataNodes

        [Figure: three hosts (Host1, Host2, Host3), each running a DataNode and a RegionServer. The regions T1R1 (rows 00001 to 00004), T1R2 (rows 00005 to 00008) and T1R3 (rows 00009 to 00012) are spread across the RegionServers.]

Monday, April 1, 13
Finding your region
      • Two special tables: -ROOT- and .META.

      [Figure: three-level lookup hierarchy]
        -ROOT- table: consists of only 1 region, hosted on region server RS1
        .META. table: consists of 3 regions, M1 (on RS2), M2 (on RS3) and M3 (on RS1)
        User tables T1 and T2: consist of 3 and 4 regions respectively, distributed amongst RS1, RS2 and RS3

        Region   T1R1   T1R2   T2R1   T1R3   T2R2   T2R3   T2R4
        Server   RS3    RS1    RS2    RS3    RS2    RS1    RS1

Regions distributed across the cluster
      [Figure: three hosts, each running a DataNode and a RegionServer]

      Host1: RegionServer 1 (RS1), hosting region R1 of the user table T1 and M2 of .META.
        T1-R1
          00001   John    415-111-1234
          00002   Paul    408-432-9922
          00003   Ron     415-993-2124
          00004   Bob     818-243-9988
        .META. - M2
          T1:00009-T1:00012   R3   RS2

      Host2: RegionServer 2 (RS2), hosting regions R2 and R3 of the user table and M1 of .META.
        T1-R2
          00005   Carly   206-221-9123
          00006   Scott   818-231-2566
          00007   Simon   425-112-9877
          00008   Lucas   415-992-4432
        T1-R3
          00009   Steve   530-288-9832
          00010   Kelly   916-992-1234
          00011   Betty   650-241-1192
          00012   Anne    206-294-1298
        .META. - M1
          T1:00001-T1:00004   R1   RS1
          T1:00005-T1:00008   R2   RS2

      Host3: RegionServer 3 (RS3), hosting just -ROOT-
        -ROOT-
          M:T100001-M:T100009   M1   RS2
          M:T100010-M:T100012   M2   RS1

What the client does
      [Figure: to find row TheRealMT in MyTable, the client goes ZooKeeper -> -ROOT- -> .META. -> MyTable]
        -ROOT-: one row per .META. region
        .META.: one row per user table region

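The lookup chain can be sketched in plain Java. This is a toy model, not the HBase client API: the class name, the "region@server" strings and the sorted maps are assumptions made for illustration, and the region boundaries are taken from the .META. rows on the "Regions distributed across the cluster" slide. Each level is a sorted map, and the client picks the entry with the greatest start key that is not larger than the key it is looking for. End keys are ignored to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the -ROOT- -> .META. -> user region lookup. Not the HBase client API. */
final class RegionLookupSketch {
    // -ROOT-: first key covered by each .META. region -> name of that .META. region
    static final TreeMap<String, String> ROOT = new TreeMap<>();
    // .META. regions: first key of each user region -> "region@server" (hypothetical encoding)
    static final Map<String, TreeMap<String, String>> META = new HashMap<>();

    static {
        ROOT.put("T1:00001", "M1");
        ROOT.put("T1:00009", "M2");

        TreeMap<String, String> m1 = new TreeMap<>();
        m1.put("T1:00001", "R1@RS1");
        m1.put("T1:00005", "R2@RS2");
        META.put("M1", m1);

        TreeMap<String, String> m2 = new TreeMap<>();
        m2.put("T1:00009", "R3@RS2");
        META.put("M2", m2);
    }

    /** Returns "region@server" for the row, or null if the key sorts before every region. */
    static String locate(String table, String rowKey) {
        String key = table + ":" + rowKey;
        // Step 1: -ROOT- tells us which .META. region covers the key
        Map.Entry<String, String> metaRegion = ROOT.floorEntry(key);
        if (metaRegion == null) {
            return null;
        }
        // Step 2: that .META. region tells us which user region covers the key
        Map.Entry<String, String> region = META.get(metaRegion.getValue()).floorEntry(key);
        return region == null ? null : region.getValue();
    }

    public static void main(String[] args) {
        System.out.println(locate("T1", "00007"));
        System.out.println(locate("T1", "00010"));
    }
}
```

A real client would also cache the answers so that most requests go straight to the right RegionServer.
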
HBase and Hadoop
         ... they play well together

HBase and Hadoop
      • HBase has tight integration with Hadoop
      • Two separate concerns
          • Access via MapReduce
          • Storage on HDFS

Parallel processing

MapReduce

HBase and MapReduce
      • HBase is tightly integrated with the Hadoop stack
          • MapReduce
          • HDFS
      • Source for MapReduce jobs
          • Read from an HBase table with a MapReduce job (Export, Row Count, ...)
      • Sink for MapReduce jobs
          • Write to an HBase table from a MapReduce job
      • Lookups during MapReduce jobs
          • Join HBase data with HBase and/or another source

HBase as MR source
      • One mapper per region

      [Figure: map tasks above a row of Region Servers, each serving regions]
        Map tasks: one per region. Each task takes the key range of its region as its input split and scans over it.

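A minimal sketch of "one mapper per region", in plain Java rather than the HBase input format classes. The class and method names are hypothetical. It assumes the regions are described by their sorted start keys and that an empty end key means "up to the end of the table". Each resulting [start, end) pair is the key range one map task would scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model: derive one input split per region from the sorted region start keys. */
final class RegionSplitsSketch {
    /** Returns one {startKey, endKey} pair per region; "" as end key means end of table. */
    static List<String[]> splits(List<String> regionStartKeys) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String start = regionStartKeys.get(i);
            // The end of one region is the start of the next one
            String end = i + 1 < regionStartKeys.size() ? regionStartKeys.get(i + 1) : "";
            out.add(new String[] {start, end});
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] split : splits(Arrays.asList("00001", "00005", "00009"))) {
            System.out.println("map task scans [" + split[0] + ", " + split[1] + ")");
        }
    }
}
```
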
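HBase as MR source (continued)
      • Mapper
          // Map is the job's own mapper class; it extends
          // TableMapper<ImmutableBytesWritable, Result>
          @Override
          protected void map(ImmutableBytesWritable rowkey, Result result,
                Context context) throws IOException, InterruptedException {
            // mapper logic goes here
          }

      • Job setup
          // Classes come from org.apache.hadoop.hbase.client, .io, .mapreduce and .util
          // Scan only the twits:twit column of the "twits" table
          Scan scan = new Scan();
          scan.addColumn(Bytes.toBytes("twits"),
            Bytes.toBytes("twit"));
          TableMapReduceUtil.initTableMapperJob("twits", scan,
            Map.class, ImmutableBytesWritable.class, Result.class,
            job);
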
HBase as MR sink
      • Write to regions via MR job

      [Figure: reduce tasks above a row of Region Servers, each serving regions]
        Reduce tasks write to HBase regions. They don't necessarily write to a region on the same physical host. They write to whichever region contains the key range that they are writing into. This could potentially mean that all reduce tasks talk to all regions on the cluster.

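HBase as MR sink (continued)
      • Reducer
          // Reducer is the job's own reducer class; it extends TableReducer
          // with ImmutableBytesWritable keys and Put values as input
          @Override
          protected void reduce(ImmutableBytesWritable rowkey,
            Iterable<Put> values, Context context)
              throws IOException, InterruptedException {
            // reduce logic goes here
          }

      • Job setup
          // Send the reducer's output to the "users" table
          TableMapReduceUtil.initTableReducerJob("users",
            Reducer.class, job);
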
HBase as a shared resource
      • Lookups from MR jobs
      • Map-side joins
      • Simple Get/Scan calls

      [Figure: HBase Region Servers, each serving regions, above a row of HDFS Datanodes]
        Map tasks read files from HDFS as their source and do random gets or short scans from HBase.

Integration with HDFS
      • HDFS is a filesystem
      • Underlying storage fabric for HBase
      • HBase is tightly integrated with HDFS and leverages its semantics
          • LSM trees fit well into a write-once filesystem
      • Availability (fault tolerance) and reliability (durability) at scale

Availability and fault tolerance
      [Figure, before: four physical hosts, each running a Region Server and a Datanode on HDFS. The first Region Server is serving region R1. Copies of HFile1 sit on three of the four Datanodes.]
        Disaster strikes on the host which was serving region R1.

      [Figure, after: the Region Server on the third host is now serving region R1.]
        Another server picks up the region and starts serving it from the persisted HFile on HDFS. The lost copy of the HFile will get replicated to another Datanode.

Durability
      • WAL keeps writes intact even if RS fails
      • HDFS is reliable and doesn't lose data if individual nodes fail

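The idea behind the WAL (write-ahead log) can be sketched as follows. This is a toy model with hypothetical names, not HBase code: the list stands in for a log persisted on HDFS, the sorted map stands in for the RegionServer's in-memory state, and recovery here happens in the same object, whereas in a real cluster another RegionServer would replay the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy write-ahead log: every write is logged before it is applied in memory. */
final class WalSketch {
    // Stands in for the durable log; assumed to survive a crash
    private final List<String[]> wal = new ArrayList<>();
    // Stands in for in-memory state; lost on a crash
    private TreeMap<String, String> memstore = new TreeMap<>();

    void put(String rowKey, String value) {
        wal.add(new String[] {rowKey, value}); // 1. append to the log first
        memstore.put(rowKey, value);           // 2. then apply in memory
    }

    /** Simulates losing the process: only the log is left. */
    void crash() {
        memstore = new TreeMap<>();
    }

    /** Replays the log in order to rebuild the in-memory state. */
    void recover() {
        for (String[] entry : wal) {
            memstore.put(entry[0], entry[1]);
        }
    }

    String get(String rowKey) {
        return memstore.get(rowKey);
    }

    public static void main(String[] args) {
        WalSketch rs = new WalSketch();
        rs.put("00001", "John");
        rs.crash();
        rs.recover();
        System.out.println(rs.get("00001"));
    }
}
```
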
Summary
      • Big data = new products, applications
      • Hadoop ecosystem - poster child
      • HBase serves many use cases
      • Lets you solve aspects of big data problems
      • Simple Java API
      • Tight Hadoop integration
      • Low latency access to large amounts of data
      • Large scale computation using MR

Schema design
         ... it's a database after all

But isn't HBase schema-less?
      • Number of tables
      • Rowkey design
      • Number of column families per table. What goes into what column family
      • Column qualifier names
      • What goes into the cells
      • Number of versions

Rowkeys
      • Rowkey design is the single most important aspect of HBase table design
      • The rowkey is the only way to address rows in HBase

TwitBase relationships
      • Users follow users
      • Relationships need to be persisted for usage later on
      • Model tables for the expected access patterns
      • Read pattern
          • Who does A follow?
          • Who follows A?
          • Does A follow B?
      • Write pattern
          • A follows B
          • A unfollows B

Start simple
      • Adjacency list

        Row key: userid
        Column family: follows
        Column qualifier: followed user number
        Cell value: followed userid

        Row key      follows (column qualifier:cell value)
        TheFakeMT    1:TheRealMT   2:MTFanBoy   3:Olivia   4:HRogers
        TheRealMT    1:HRogers     2:Olivia

Optimizing the adjacency list
      • We need a count
          • Where does a new followed user go?

        Row key      follows
        TheFakeMT    1:TheRealMT   2:MTFanBoy   3:Olivia   4:HRogers   count:4
        TheRealMT    1:HRogers     2:Olivia     count:2

Adding a new user
      Row that needs to be updated: TheFakeMT

        Row key      follows
        TheFakeMT    1:TheRealMT   2:MTFanBoy   3:Olivia   4:HRogers   count:4
        TheRealMT    1:HRogers     2:Olivia     count:2

      Client code:
        Step 1: Get current count             TheFakeMT : follows: {count -> 4}
        Step 2: Update (increment) count      TheFakeMT : follows: {count -> 5}
        Step 3: Add new entry                 TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}
        Step 4: Write the new data to HBase

      Table after the write:

        Row key      follows
        TheFakeMT    1:TheRealMT   2:MTFanBoy   3:Olivia   4:HRogers   5:MTFanBoy2   count:5
        TheRealMT    1:HRogers     2:Olivia     count:2

Transactions == not good
      • HBase doesn't have native support (think scale)
      • Don't want to complicate client side logic
      • Only solution -> simplify schema

        Row key      follows (column qualifier:cell value)
        TheFakeMT    TheRealMT:1   MTFanBoy:1   Olivia:1   HRogers:1
        TheRealMT    HRogers:1     Olivia:1

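Why the simplified schema removes the need for a transaction can be shown with a small in-memory model. The names below are hypothetical and no HBase API is used: a table is modelled as a sorted map of rows, each row being a sorted map from column qualifier to cell value. With the followed userid as the qualifier, "A follows B" becomes a single write that needs no prior read of a counter, and repeating it changes nothing.

```java
import java.util.TreeMap;

/** Toy model of the simplified follows table: qualifier = followed userid, value = 1. */
final class FollowsSketch {
    // row key (follower) -> (column qualifier (followed userid) -> cell value)
    private final TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();

    /** A follows B: one write, no read-modify-write of a count column. */
    void follow(String follower, String followed) {
        table.computeIfAbsent(follower, k -> new TreeMap<>()).put(followed, "1");
    }

    /** A unfollows B: one delete of a single column. */
    void unfollow(String follower, String followed) {
        TreeMap<String, String> row = table.get(follower);
        if (row != null) {
            row.remove(followed);
        }
    }

    /** Does A follow B? A lookup of one column in one row. */
    boolean follows(String follower, String followed) {
        TreeMap<String, String> row = table.get(follower);
        return row != null && row.containsKey(followed);
    }

    /** Number of users A follows: the number of columns in A's row. */
    int followedCount(String follower) {
        TreeMap<String, String> row = table.get(follower);
        return row == null ? 0 : row.size();
    }

    public static void main(String[] args) {
        FollowsSketch t = new FollowsSketch();
        t.follow("TheFakeMT", "TheRealMT");
        t.follow("TheFakeMT", "TheRealMT"); // repeating the write is harmless
        System.out.println(t.followedCount("TheFakeMT"));
    }
}
```
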
Revisit the questions
      • Read pattern
          • Who all does A follow?
          • Who all follows A?
          • Does A follow B?
      • Write pattern
          • A follows B
          • A unfollows B

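Denormalization
      • Second table for reverse relationship
      • Otherwise you have to scan across the entire table, which hurts read performance

      [Figure: quadrant chart, write performance (vertical axis) against read performance (horizontal axis)]
        Good writes, poor reads: Normalization
        Good writes, good reads: Dreamland
        Poor writes, poor reads: Poor design
        Poor writes, good reads: Denormalization
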
More optimizations
      • Convert into tall-narrow table
      • Leverage rowkey indexing better
      • Gets -> short Scans

        Row key: follower + followed
        Column family (CF): f
        Column qualifier (CQ): followed user's name
        Cell value: 1

        Keeping the column family and column qualifier names short reduces the data transferred over the network back to the client. The KeyValue objects become smaller.

        The + in the row key refers to concatenating the two values. You could delimit using any character you like, e.g. A-B or A,B.

Tall-narrow table example
      • Denormalization is the way to go

        Row key                f (column qualifier:cell value)
        TheFakeMT+TheRealMT    Mark Twain:1
        TheFakeMT+MTFanBoy     Amandeep Khurana:1
        TheFakeMT+Olivia       Olivia Clemens:1
        TheFakeMT+HRogers      Henry Rogers:1
        TheRealMT+Olivia       Olivia Clemens:1
        TheRealMT+HRogers      Henry Rogers:1

        Putting the user name in the column qualifier saves you from looking up the users table for the name of the user given an id. You can simply list out names or ids while looking at relationships just from this table. The downside of this is that you need to update the name in all the cells if the user updates their name in their profile. This is classic denormalization.

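How a tall-narrow table turns "who does A follow?" into a short scan can be sketched with a sorted map. The class below is hypothetical and uses no HBase API. It assumes "+" as the delimiter and, to stay short, stores the followed user's name as the map value instead of modelling qualifier and cell separately. Because rows are sorted by key, all rows for one follower sit next to each other and can be read as one contiguous range.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of a tall-narrow follows table keyed by follower + "+" + followed. */
final class TallNarrowSketch {
    // Sorted by row key, like rows in an HBase table
    private final TreeMap<String, String> table = new TreeMap<>();

    void follow(String follower, String followed, String followedName) {
        table.put(follower + "+" + followed, followedName);
    }

    /** Does A follow B? A single point lookup on the full row key. */
    boolean follows(String follower, String followed) {
        return table.containsKey(follower + "+" + followed);
    }

    /** Who does A follow? A short scan over every row key starting with "A+". */
    SortedMap<String, String> followedBy(String follower) {
        String prefix = follower + "+";
        // Everything from the prefix up to the largest key that still starts with it
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TallNarrowSketch t = new TallNarrowSketch();
        t.follow("TheFakeMT", "TheRealMT", "Mark Twain");
        t.follow("TheFakeMT", "Olivia", "Olivia Clemens");
        t.follow("TheRealMT", "Olivia", "Olivia Clemens");
        System.out.println(t.followedBy("TheFakeMT").values());
    }
}
```

Including the delimiter in the scan prefix keeps a user called TheFakeMT2 out of TheFakeMT's results.
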
Uniform rowkey length
      • MD5 the userids -> 16 bytes + 16 bytes rowkeys
      • Better distribution of load

        Row key: md5(follower)md5(followed)
        Column family (CF): f
        Column qualifier (CQ): followed userid
        Cell value: followed user's name

        Using MD5 of the user ids gives you fixed lengths instead of variable length user ids. You don't need delimiter logic when concatenating anymore.

Uniform rowkey length (continued)

        Row key                           f (column qualifier:cell value)
        MD5(TheFakeMT) MD5(TheRealMT)     TheRealMT:Mark Twain
        MD5(TheFakeMT) MD5(MTFanBoy)      MTFanBoy:Amandeep Khurana
        MD5(TheFakeMT) MD5(Olivia)        Olivia:Olivia Clemens
        MD5(TheFakeMT) MD5(HRogers)       HRogers:Henry Rogers
        MD5(TheRealMT) MD5(Olivia)        Olivia:Olivia Clemens
        MD5(TheRealMT) MD5(HRogers)       HRogers:Henry Rogers

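A minimal sketch of building such a rowkey with the JDK's MessageDigest. The class and method names are hypothetical, and UTF-8 encoding of the userids is an assumption. An MD5 digest is always 16 bytes, so the rowkey is always 32 bytes and the first 16 bytes can serve as the scan prefix for one follower.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Builds fixed-length rowkeys of the form md5(follower) + md5(followed). */
final class Md5RowKeySketch {
    /** Returns the 16-byte MD5 digest of the UTF-8 bytes of s. */
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns a 32-byte rowkey: 16 bytes for the follower, then 16 bytes for the followed user. */
    static byte[] rowKey(String follower, String followed) {
        byte[] key = new byte[32];
        System.arraycopy(md5(follower), 0, key, 0, 16);
        System.arraycopy(md5(followed), 0, key, 16, 16);
        return key;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("TheFakeMT", "TheRealMT").length);
    }
}
```
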
Tall v/s Wide tables storage footprint
      Logical representation of an HBase table. We'll look at what it means to Get() row r5 from this table.

        Row   CF1             CF2
        r1    c1:v1           c1:v9  c6:v2
        r2    c1:v2  c3:v6
        r3    c2:v3           c5:v6
        r4    c2:v4
        r5    c1:v1  c3:v5    c7:v8

      Actual physical storage of the table

        HFile for CF1         HFile for CF2
        r1:CF1:c1:t1:v1       r1:CF2:c1:t1:v9
        r2:CF1:c1:t2:v2       r1:CF2:c6:t4:v2
        r2:CF1:c3:t3:v6       r3:CF2:c5:t4:v6
        r3:CF1:c2:t1:v3       r5:CF2:c7:t3:v8
        r4:CF1:c2:t1:v4
        r5:CF1:c1:t2:v1
        r5:CF1:c3:t3:v5

      Result object returned for a Get() on row r5 (KeyValue objects)

        r5:CF1:c1:t2:v1
        r5:CF1:c3:t3:v5
        r5:CF2:c7:t3:v8

      Structure of a KeyValue object

        Key:   Row Key, Col Fam, Col Qual, Time Stamp
        Value: Cell Value

Rowkey design
      • Single most important aspect of designing tables
      • Depends on expected access patterns
      • HFiles are sorted on the Key part of KeyValue objects

        HFile for the info column family in the users table:

          "TheRealMT", "info", "email",    1329088321289, "samuel@clemens.org"
          "TheRealMT", "info", "name",     1329088321289, "Mark Twain"
          "TheRealMT", "info", "password", 1329088818321, "abc123"
          "TheRealMT", "info", "password", 1329088321289, "Langhorne"

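The sort order visible in the HFile listing above (row key, then column family, then column qualifier, then newest timestamp first) can be sketched as a comparison function. The Kv class and its fields are hypothetical stand-ins, and strings are compared here for brevity, whereas HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the ordering of KeyValue objects inside an HFile. */
final class KeyValueOrderSketch {
    /** Hypothetical stand-in for a KeyValue: four key parts plus the cell value. */
    static final class Kv {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        Kv(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Row, family and qualifier ascending; timestamp descending so the newest version comes first. */
    static int compareKeys(Kv a, Kv b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    }

    /** Returns a sorted copy of the input. */
    static List<Kv> sorted(List<Kv> in) {
        List<Kv> out = new ArrayList<>(in);
        out.sort(KeyValueOrderSketch::compareKeys);
        return out;
    }

    public static void main(String[] args) {
        List<Kv> kvs = sorted(Arrays.asList(
            new Kv("TheRealMT", "info", "password", 1329088321289L, "Langhorne"),
            new Kv("TheRealMT", "info", "password", 1329088818321L, "abc123"),
            new Kv("TheRealMT", "info", "email", 1329088321289L, "samuel@clemens.org")));
        for (Kv kv : kvs) {
            System.out.println(kv.qualifier + " " + kv.timestamp + " " + kv.value);
        }
    }
}
```
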
Write optimized
      • Distribute writes across the cluster
          • Issue most pronounced with time series data
      • Hashing
          hash("TheRealMT") -> evenly distributed (random-looking) byte[]

      • Salting
          // salt: a small bucket number derived from the timestamp
          int salt = new Integer(new Long(timestamp).hashCode()).shortValue() % <number of region servers>;
          // Bytes.add takes the arrays as separate arguments; byte[] cannot be joined with +
          byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"),
            Bytes.toBytes(timestamp));

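A self-contained variant of the salting idea, using only the JDK. The class name, the 4-byte salt and the 13-byte key layout are assumptions made for this sketch and are not HBase's Bytes helpers. Math.floorMod is used so that the salt is never negative, whatever the hash value is.

```java
import java.nio.ByteBuffer;

/** Toy salted rowkey: <4-byte salt> '|' <8-byte timestamp>. */
final class SaltedKeySketch {
    /** Maps a timestamp to a bucket in [0, buckets). */
    static int salt(long timestamp, int buckets) {
        return Math.floorMod(Long.hashCode(timestamp), buckets);
    }

    /** Builds the 13-byte rowkey; the same timestamp always yields the same key. */
    static byte[] rowKey(long timestamp, int buckets) {
        return ByteBuffer.allocate(4 + 1 + 8)
            .putInt(salt(timestamp, buckets))
            .put((byte) '|')
            .putLong(timestamp)
            .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("bucket " + salt(now, 3) + ", key length " + rowKey(now, 3).length);
    }
}
```

The trade-off is on the read side: a scan over a time range has to be issued once per bucket, because rows for consecutive timestamps no longer sit next to each other.
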
Read optimized
      • Data to be accessed together should be stored together
          • e.g. twit streams - last 10 twits by the users I follow

        [Figure: the same rowkeys sorted two ways, user first (left) and number first (right)]

                                       Olivia1                         1Olivia
                                       Olivia2                         1TheRealMT
                                       Olivia5                         2Olivia
                                       Olivia7                         2TheFakeMT
                                       Olivia9                         2TheRealMT
                                       TheFakeMT2                      3TheFakeMT
                                       TheFakeMT3                      4TheFakeMT
                                       TheFakeMT4                      5Olivia
                                       TheFakeMT5                      5TheFakeMT
                                       TheFakeMT6                      5TheRealMT
                                       TheRealMT1                      6TheFakeMT
                                       TheRealMT2                      7Olivia
                                       TheRealMT5                      8TheRealMT
                                       TheRealMT8                      9Olivia

Rela7onal	
  to	
  Non	
  rela7onal
      • Rela7onal	
  concepts
          • En77es
          • AGributes
          • Rela7onships
      • En77es
          • Table	
  is	
  a	
  table.	
  Not	
  much	
  going	
  on	
  there
          • Users	
  table	
  contains...	
  users.	
  Those	
  are	
  en77es
                  •   Good	
  place	
  to	
  start.	
  Denormaliza7on	
  bends	
  that	
  rule	
  a	
  bit

 86

Monday, April 1, 13
Relational to Non relational (continued)
      • Attributes
            • Identifying
                  • Primary keys. Compound keys
                  • Maps to rowkeys
            • Non-identifying
                  • Other columns
                  • Maps to column qualifiers and cells
      • Relationships
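One way to picture "compound keys map to rowkeys" is to concatenate the identifying attributes into a single byte array. The sketch below is an assumption for illustration, not the TwitBase implementation: it joins a user id with an 8-byte big-endian timestamp, and the class and method names are hypothetical. Variable-length user ids can make prefixes ambiguous, so a real schema would more likely use a fixed-width or hashed user part.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustration only: packs a compound key (user id, timestamp) into one rowkey.
final class CompoundRowkey {

    // Rowkey = UTF-8 bytes of the user id followed by an 8-byte big-endian timestamp.
    static byte[] rowkey(String userId, long timestampMillis) {
        byte[] user = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(user.length + Long.BYTES)
                .put(user)
                .putLong(timestampMillis)
                .array();
    }

    // The timestamp is always the last 8 bytes of the rowkey.
    static long timestampOf(byte[] rowkey) {
        return ByteBuffer.wrap(rowkey, rowkey.length - Long.BYTES, Long.BYTES).getLong();
    }

    // Everything before the last 8 bytes is the user id.
    static String userOf(byte[] rowkey) {
        return new String(rowkey, 0, rowkey.length - Long.BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = rowkey("TheRealMT", 1329088818321L);
        System.out.println(userOf(key) + " @ " + timestampOf(key) + " (" + key.length + " bytes)");
    }
}
```

Non-identifying attributes (name, email and so on) would then live in column qualifiers and cells under that rowkey.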

Nested Entities
      • Column Qualifiers can contain data instead of just a column name

                                                hbase table
                                      row key

                                       column family
                                        fixed qualifier → timestamp → value
                                                                               Nested entities
                                                  repeating entity

                                       variable qualifier → timestamp → value
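A small model may help here. The sketch below represents one row's column family as a sorted map of qualifier to value (timestamps are left out), with one fixed qualifier and several variable qualifiers whose names carry data. The `user:` prefix, the sample values and all class and method names are assumptions made up for this example.

```java
import java.util.Set;
import java.util.TreeMap;

// Illustration only: one column family of one row, as qualifier -> value.
final class NestedEntities {

    static TreeMap<String, String> followsFamily() {
        TreeMap<String, String> family = new TreeMap<>();
        // Fixed qualifier: the name is known up front, the value holds the data.
        family.put("count", "2");
        // Variable qualifiers: the qualifier itself holds data (the followed user's id),
        // so each followed user becomes a nested entity inside the follower's row.
        family.put("user:Olivia", "followed since 2012");      // hypothetical value
        family.put("user:TheFakeMT", "followed since 2013");   // hypothetical value
        return family;
    }

    // Nested entities are found by ranging over qualifiers that share a prefix.
    // ';' is the character right after ':', so "user;" is an exclusive upper bound.
    static Set<String> followedUsers(TreeMap<String, String> family) {
        return family.subMap("user:", true, "user;", false).keySet();
    }

    public static void main(String[] args) {
        System.out.println(followedUsers(followsFamily()));
    }
}
```

In an RDBMS the followed users would normally be rows in a separate table; here they are columns of the follower's row.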




Schema design summary
      • Schema can make or break the performance you get
      • Rowkey is the single most important thing
             • Use tricks like hashing and salting
      • Denormalize to your advantage
             • There are no joins
      • Isolate access patterns
             • Separate CFs or even separate tables
      • Shorter names -> lower storage footprint
      • Column qualifiers can be used to store data and not just column names
             • Big difference as compared to RDBMS
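The "hashing and salting" tricks can be sketched with the JDK alone. This is a minimal illustration under assumptions, not a recommended production scheme: the bucket count, the `NN-key` salt format and the class and method names are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: two common ways to spread rowkeys across regions.
final class RowkeyTricks {

    // Hashing: use the MD5 digest of the natural key as the rowkey.
    // The result is always 16 bytes and is spread evenly over the key space.
    static byte[] hashed(String naturalKey) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Salting: prefix the natural key with a bucket number derived from the key,
    // so the same key always lands in the same bucket.
    static String salted(String naturalKey, int buckets) {
        int bucket = Math.floorMod(naturalKey.hashCode(), buckets);
        return String.format("%02d-%s", bucket, naturalKey);
    }

    public static void main(String[] args) {
        System.out.println(hashed("TheRealMT").length + " byte hashed key");
        System.out.println(salted("TheRealMT", 8));
    }
}
```

Both tricks trade scan locality for write distribution: a hashed key can no longer be range-scanned in natural order, and reading a salted key range means scanning every bucket.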

Deployments and Operations
         ... for some real stuff, beyond your laptop




Components
      • HDFS (DataNodes)
          • Storage
      • ZooKeeper
          • Membership management
      • RegionServers
          • Serve the regions
      • HBase Masters
          • Janitorial work

Hardware
      • DataNodes and RegionServers collocated
            • Plenty of RAM and Disk
            • 8-12 cores
            • 48 GB RAM
            • 12 x 1 TB disks




Hardware (continued)
      • HBase Master and ZooKeeper collocated
            • Don’t necessarily need to be though
            • Keep odd number of ZKs
                    • ~3 is good enough for up to 50-70 node cluster typically
            • Give ZK independent spindle for log writing
                    • I/O latency sensitive
            • 4-6 cores
            • 24 GB RAM
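The "keep odd number of ZKs" advice follows from majority-quorum arithmetic: ZooKeeper stays available only while a strict majority of the ensemble is up. The helper below just works through that arithmetic; the class and method names are hypothetical.

```java
// Illustration only: majority-quorum arithmetic for a ZooKeeper ensemble.
final class ZkQuorum {

    // Smallest number of servers that forms a strict majority.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // How many servers can be lost while a majority is still up.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " ZK servers: majority " + majority(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

An ensemble of 4 tolerates the same single failure as an ensemble of 3, so the even-numbered server adds cost without adding fault tolerance.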

Types of deployments
      • Standalone for dev purpose
      • Prototype
              • < 10 nodes
              • No real SLAs
              • 1 ZK + Master, rest are RS
      • Small production cluster
              • 10-20 nodes
              • 1 ZK + Master, maybe NNHA, rest are RS
      • Medium production cluster
              • 20-50 nodes
              • 3 ZK + Master, NNHA, rest are RS
      • Large production cluster
              • 50+ nodes
              • 3 or 5 ZK + Master, NNHA, rest are RS
      • Hardware choices remain the same mostly


Deploying software and configuration
      • Automated software bits deployment highly recommended
          • Chef
          • Puppet
          • Cloudera Manager
      • Centralized configuration management highly recommended
          • Consistency
          • Ease of management



Configurations
      • $HBASE_HOME/conf
            • hbase-env.sh
                   • Environment configurations
            • hbase-site.xml
                   • HBase system configurations
      • Linux configurations
           • Number of open files
           • swappiness
      • HDFS configurations
           • $HADOOP_HOME/hdfs-site.xml

Monitoring
      • Metrics context from Hadoop
          • Ganglia
          • File context
      • JMX
          • Cacti
      • Several metrics available
          • Write path specific
          • Read path specific
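To give a feel for what reading a metric over JMX looks like, the sketch below queries an MBean from inside the current JVM. It is a stand-in under assumptions: it reads the JVM's own standard `java.lang:type=Memory` bean rather than any HBase-specific bean (those names are not shown on the slides), and the class and method names are hypothetical. A monitoring tool would typically do the same kind of lookup against a remote RegionServer or Master JVM.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Illustration only: read one attribute of a standard JVM MBean via JMX.
final class JmxProbe {

    // Returns the "used" field of the HeapMemoryUsage attribute, in bytes.
    static long heapUsedBytes() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            CompositeData usage = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the checked JMX exceptions.
            throw new IllegalStateException("could not read JVM memory MBean", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
    }
}
```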

Monitoring (continued)




Performance tuning
      • Performance testing
           • Use real (or synthetic) workloads
           • YCSB
      • Tuning
           • Write optimized
                     • Memstore
           • Random-read optimized
                     • Block cache
                     • Smaller block size
           • Sequential-read optimized
                     • Block cache
                     • Higher block size
           • GC tuning
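The block size advice can be motivated with back-of-the-envelope arithmetic, assuming an uncached read fetches a whole HFile block to return one row and that the block index holds one entry per block. The numbers below (1 KB rows, a 1 GB file, 16 KB versus 64 KB blocks) and the class and method names are assumptions chosen for illustration.

```java
// Illustration only: rough cost model for choosing an HFile block size.
final class BlockSizeMath {

    // Bytes read per byte wanted when one row is fetched from an uncached block.
    static double readAmplification(int blockSizeBytes, int rowSizeBytes) {
        return (double) blockSizeBytes / rowSizeBytes;
    }

    // Number of blocks (and so block index entries) needed to cover a file.
    static long indexEntries(long fileSizeBytes, int blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        int row = 1024;
        long file = 1L << 30;
        for (int block : new int[] {16 * 1024, 64 * 1024}) {
            System.out.println(block / 1024 + " KB blocks: amplification "
                    + readAmplification(block, row) + "x, index entries "
                    + indexEntries(file, block));
        }
    }
}
```

Smaller blocks waste less I/O per random read but need a larger index; larger blocks suit sequential scans, where most of each block gets used anyway.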


Building apps with HBase - Data Days Texas March 2013

  • 1. Building  applica2ons  with  HBase Headline  Goes  Here Amandeep  Khurana  |  Solu7ons  Architect Speaker  Name  or  Subhead  Goes  Here Data  Days  Texas,  Aus2n,  March  2013 1 Monday, April 1, 13
  • 2. About  me • Solu7ons  Architect,  Cloudera  Inc • Amazon  Web  Services • Interested  in  large  scale  distributed  systems • Co-­‐author,  HBase  In  Ac7on • TwiGer:  amansk 2 Monday, April 1, 13
  • 3. I’m  a  wanna-­‐be  (author) www.hbaseinac7on.com Nick Dimiduk Amandeep Khurana MANNING 3 Monday, April 1, 13
  • 4. About  the  workshop • Install  HBase • Basic  interac7ons  with  the  shell • Use  of  HBase  Java  API • Data  model • HBase  and  Hadoop  integra7on 4 Monday, April 1, 13
  • 5. Get  HBase  and  tutorial  instruc7ons • Get  it  here  -­‐>  hGp://hbase.apache.org/ • Downloads  -­‐>  Choose  a  mirror  -­‐>  0.94.1 • Download  the  file  hbase-­‐0.94.1.tar.gz • Get  the  tutorial  handout  -­‐>   hGp://people.apache.org/~ak/ddtx13 5 Monday, April 1, 13
  • 6. What’s  Big  Data? ...  except  another  marke7ng  term 6 Monday, April 1, 13
  • 8. Data  has  changed END-USER APPLICATIONS THE INTERNET DATA GROWTH MOBILE DEVICES SOPHISTICATED MACHINES UNSTRUCTURED DATA – 90% STRUCTURED DATA – 10% 1980 2012 8 Monday, April 1, 13
  • 10. Hadoop  -­‐  Big  Data  poster  child HDFS, MAPREDUCE 10 Monday, April 1, 13
  • 11. Hadoop  -­‐  Big  Data  poster  child File System Mount UI Framework Machine Learning FUSE-DFS HUE APACHE MAHOUT Workflow Scheduling Metadata APACHE OOZIE APACHE OOZIE APACHE HIVE Languages / Compilers APACHE PIG, APACHE HIVE, APACHE MAHOUT Fast Data Integration Read/Write Access APACHE FLUME, APACHE SQOOP APACHE HBASE HDFS, MAPREDUCE APACHE ZOOKEEPER Coordination 10 Monday, April 1, 13
  • 12. What  is  HBase? • Scalable,  distributed  data  store • Sorted  map  of  maps • Open  source  avatar  of  Google’s  Bigtable • Sparse • Mul7  dimensional • Tightly  integrated  with  Hadoop • Not  a  RDBMS 11 Monday, April 1, 13
  • 13. So,  what’s  the  big  deal? • Give  up  system  complexity  and  sophis7cated  features  for  ease   of  scale  and  manageability • Simple  system  !=  easy  to  build  applica7ons  on  top • Goal:  predictable  read  and  write  performance  at  scale • Effec7ve  API  usage • Op7mal  schema  design • Understand  internals  and  leverage  at  scale 12 Monday, April 1, 13
  • 14. Use  cases • Canonical  use  case:  storing  crawl  data  and  indices  for  search Web Search powered by Bigtable Crawlers Internets Crawlers Bigtable Crawlers Crawlers 1 2 MapReduce Indexing the Internet 1 Crawlers constantly scour the Internet for new pages. Those pages are stored as individual records in Bigtable. 4 3 2 A MapReduce job runs over the entire table, generating Search search indexes for the Web Search application. Search Index Search Web Search Index Index 5 You Searching the Internet 3 The user initiates a Web Search request. 4 The Web Search application queries the Search Indexes and retries matching documents directly from Bigtable. 5 Search results are presented to the user. 13 Monday, April 1, 13
  • 15. Use  cases  (con7nued) • Capturing  incremental  data • Content  serving • Informa7on  exchange 14 Monday, April 1, 13
  • 16. Ok,  enough  of  high  level  stuff • Big  data  is  a  buzzword  but  there  is  more  to  it • New  products,  applica7ons,  types  of  data • Hadoop  ecosystem  is  the  poster  child  of  big  data  architectures • Consists  of  several  components • HBase  is  a  part  of  the  ecosystem • Scalable  database • Lots  of  produc7on  deployments 15 Monday, April 1, 13
  • 17. HBase  -­‐  GeEng  Started Installa7on  and  basic  interac7on 16 Monday, April 1, 13
  • 18. Installing  HBase • Get  it  here  -­‐>  hGp://hbase.apache.org/ • Downloads  -­‐>  Choose  a  mirror  -­‐>  0.94.1 • Download  the  file  hbase-­‐0.94.1.tar.gz • Untar  it • mkdir  ~/ddtx • cp  hbase-­‐0.94.1.tar.gz  ~/ddtx • cd  ~/ddtx • tar  xvf  hbase-­‐0.94.1.tar.gz 17 Monday, April 1, 13
  • 19. Basics • Modes  of  opera2on • Standalone • Pseudo-­‐distributed • Fully  distributed • Start  up  in  standalone  mode • cd  hbase-­‐0.94.1 • bin/start-­‐hbase.sh • Logs  are  in  ./logs/ • Check  out  hSp://localhost:60010  in  your  browser 18 Monday, April 1, 13
  • 20. Shell • Basic  interac7on  tool • WriGen  in  JRuby • Uses  the  Java  client  library • Good  place  to  start  poking  around 19 Monday, April 1, 13
  • 21. Shell  (con7nued) • bin/hbase  shell • Create  table • create  ‘users’  ,  ‘info’ • List  tables • list • Describe  table • describe  ‘users’ 20 Monday, April 1, 13
  • 22. Shell  (con7nued) • Put  a  row •put  ‘users’  ,  ‘u1’,  ‘info:name’  ,  ‘Mr.  Rogers’ • Get  a  row • get  ‘users’  ,  ‘u1’ • Put  more • put  ‘users’  ,  ‘u1’  ,  ‘info:email’  ,  ‘mail@rogers.com’ • put  ‘users’  ,  ‘u1’  ,  ‘info:password’  ,  ‘foobar’ • Get  a  row •get  ‘users’  ,  ‘u1’ • Scan  table • scan  ‘users’ 21 Monday, April 1, 13
  • 23. Java  API ...  can’t  really  build  an  applica7on  just  with  the  shell 22 Monday, April 1, 13
  • 24. Important  terms • Table • Consists  of  rows  and  columns • Row • Has  a  bunch  of  columns. • Iden2fied  by  a  rowkey  (primary  key) • Column  Qualifier • Dynamic  column  name • Column  Family • Column  groups  -­‐  logical  and  physical  (Similar  access  paSern) • Cell • The  actual  element  that  contains  the  data  for  a  row-­‐column  intersec2on • Version • Every  cell  has  mul2ple  versions. 23 Monday, April 1, 13
  • 25. Introducing  TwitBase • TwiSer  clone  built  on  HBase • Designed  to  scale • It’s  an  example,  not  a  produc2on  app • Start  with  users  and  twits  for  now • Users • Have  profiles • Post  twits • Twits • Have  text  posted  by  users • Get  sample  code  at:  hSps://github.com/hbaseinac2on/twitbase 24 Monday, April 1, 13
  • 26. Connec7ng  to  HBase • Using default confs: HTableInterface usersTable = new HTable("users"); • Passing custom conf: Configuration myConf = HBaseConfiguration.create(); myConf.set("parameter_name", "parameter_value"); HTableInterface usersTable = new HTable(myConf, "users"); • Connection pooling HTablePool pool = new HTablePool(); HTableInterface usersTable = pool.getTable("users"); // work with the table usersTable.close(); 25 Monday, April 1, 13
  • 27. Inser7ng  data • Put  call  on  HTable Put p = new Put(Bytes.toBytes("TheRealMT")); p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Mark Twain")); p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("samuel@clemens.org")); p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("Langhorne")); usersTable.put(p); 26 Monday, April 1, 13
  • 28. Data  coordinates • Row  is  addressed  using  rowkey • Cell  is  addressed  using              [rowkey  +  family  +  qualifier] 27 Monday, April 1, 13
  • 29. Upda7ng  data • Update  ==  Write Put p = new Put(Bytes.toBytes("TheRealMT")); p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("abc123")); usersTable.put(p); 28 Monday, April 1, 13
  • 30. Write  path WAL Write goes to the write-ahead log (WAL) Memstore as well as the in memory write buffer called the Memstore. Client HFile Client's don't interact directly with the underlying HFiles during writes. HFile Memstore Memstore gets flushed to disk to New HFile form a new HFile when filled up with writes. HFile HFile 29 Monday, April 1, 13
  • 31. Reading  data • Get  the  en2re  row Get g = new Get(Bytes.toBytes("TheRealMT")); Result r = usersTable.get(g); • Get  selected  columns Get g = new Get(Bytes.toBytes("TheRealMT")); g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("password")); Result r = usersTable.get(g); • Extract  the  value  out  of  the  Result  object byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")); String email = Bytes.toString(b); 30 Monday, April 1, 13
  • 32. Reading  data  (con7nued) • Scan Scan s = new Scan(startRow, stopRow); ResultsScanner rs = twits.getScanner(s); for(Result r : rs) { // iterate over scanner results } • Single threaded iteration over defined set of rows. Could be the entire table too. 31 Monday, April 1, 13
  • 33. What  happens  when  you  read  data? BlockCache Memstore Client HFile HFile 32 Monday, April 1, 13
  • 34. Dele7ng  data • Another  simple  API  call • Delete  en7re  row Delete d = new Delete(Bytes.toBytes("TheRealMT")); usersTable.delete(d); • Delete  selected  columns Delete d = new Delete(Bytes.toBytes("TheRealMT")); d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email")); usersTable.delete(d); 33 Monday, April 1, 13
  • 35. Delete  internals • Tombstone  markers • Dele7on  only  happens  during  major  compac7ons • Compac7ons • Minor • Major 34 Monday, April 1, 13
  • 36. Compac7ons Memstore Memstore HFile Two or more HFiles get combined Compacted to form a single HFile during HFile a compaction. HFile HFile HFile 35 Monday, April 1, 13
  • 37. Hands-­‐on:  Java  client  to  HBase • Create  a  table  from  the  HBase  shell • create  ‘users’  ,  ‘info’ • Write  a  program  to  do  the  following • Add  10  users  to  the  ‘users’  table. • userid  as  rowkey • user’s  name  in  the  column  ‘info:name’ • userid  could  just  be  user  +  serial  number.  eg:  user1,  user2,  user3 • Get  a  user  from  the  ‘users’  table  and  display  on  screen • Scan  the  en2re  ‘users’  table  and  display  on  screen 36 Monday, April 1, 13
  • 38. DAO • Like  any  other  applica7on • Makes  data  access  management  easy • Cleaner  code • Table  info  at  a  single  place • Op7miza7ons  like  connec7on  pooling  etc • Easier  management  of  database  configs 37 Monday, April 1, 13
  • 39. Admin HBaseAdmin Table  Opera7ons • Administra7ve  API  to  the  cluster. • createTable() • Provides  an  interface  to  manage   • deleteTable() HBase  database  tables  metadata   • addColumn() and  general  administra7ve   func7ons. • modifyColumn() • deleteColumn() • listTables() Monday, April 1, 13
  • 40. Data  Model It’s  not  a  rela7onal  database  system 39 Monday, April 1, 13
  • 41. HBase  is  ... • Column  family  oriented  database • Column  family  oriented • Tables  consis2ng  of  rows  and  columns • Persisted  Map • Sparse • Mul2  dimensional • Sorted • Indexed  by  rowkey,  column  and  2mestamp • Key  Value  store • [rowkey,  col  family,  col  qualifier,  2mestamp]  -­‐>  cell  value 40 Monday, April 1, 13
  • 42. HBase  is  not  ... • A  rela7onal  database • No  SQL  query  language • No  joins • No  secondary  indexing • No  transac7ons 41 Monday, April 1, 13
  • 43. Tabular  representa7on 2 Column Family - Info 1 3 Rowkey name email password GrandpaD Mark Twain samuel@clemens.org abc123 HMS_Surprise Patrick O'Brien aubrey@sea.com abc123 The table is lexicographically sorted on the rowkeys SirDoyle Fyodor Dostoyevsky fyodor@brothers.net abc123 TheRealMT Sir Arthur Conan Doyle art@TheQueensMen.co.uk Langhorne abc123 4 Cells ts1=1329088321289 ts2=1329088818321 Each cell has multiple The coordinates used to identify data in an HBase table are: versions, (1) rowkey, (2) column family, (3) column qualifier, (4) version typically represented by the timestamp of when they were inserted into the table (ts2>ts1) 42 Monday, April 1, 13
  • 44. Key-­‐Value  store Keys Values [TheRealMT, info, password, 1329088818321] abc123 [TheRealMT, info, password, 1329088321289] Langhorne A single KeyValue instance 43 Monday, April 1, 13
  • 45. Key-­‐Value  store [TheRealMT, info, password, 1329088818321] abc123 1 Start with coordinates of full precision { 1329088818321 : "abc123", [TheRealMT, info, password] 1329088321289 : "Langhorne" } 2 Drop version and you're left with a map of version to values Keys { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { [TheRealMT, info] 1329088321289 : "Mark Twain" }, Values "password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } 3 Omit qualifier and you have a map of qualifiers to the previous maps { "info" : { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { 1329088321289 : "Mark Twain" [TheRealMT] }, "password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } } 4 Finally, drop the column family and you have a row, a map of maps 44 Monday, April 1, 13
  • 46. Sorted  map  of  maps Rowkey { "TheRealMT" : { Column family "info" : { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { 1329088321289 : "Mark Twain" Column qualifiers }, "password" : { Values 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } } Versions } 45 Monday, April 1, 13
  • 47. HFiles  and  physical  data  model • HFiles  are • Immutable • Sorted  on  rowkey  +  qualifier  +  7mestamp • In  the  context  of  a  column  family  per  region "TheRealMT" , "info" , "email" , 1329088321289, "samuel@clemens.org" "TheRealMT" , "info" , "name" , 1329088321289 , "Mark Twain" "TheRealMT" , "info" , "password" , 1329088818321 , "abc123", "TheRealMT" , "info" , "password" , 1329088321289 , "Langhorne" HFile for the info column family in the users table 46 Monday, April 1, 13
  • 48. HBase  in  distributed  mode ...  what  really  goes  on  under  the  hood 47 Monday, April 1, 13
  • 49. Distributed  mode • Table  is  split  into  Regions T1R1 00001 John 415-111-1234 00002 Paul 408-432-9922 00001 John 415-111-1234 00003 Ron 415-993-2124 00001 John 415-111-1234 00002 Paul 408-432-9922 00004 Bob 818-243-9988 00002 Paul 408-432-9922 00003 Ron 415-993-2124 00003 Ron 415-993-2124 00004 Bob 818-243-9988 00004 Bob 818-243-9988 T1R2 00005 Carly 206-221-9123 00005 Carly 206-221-9123 00005 Carly 206-221-9123 00006 Scott 818-231-2566 00006 Scott 818-231-2566 00006 Scott 818-231-2566 00007 Simon 425-112-9877 00007 Simon 425-112-9877 00007 Simon 425-112-9877 00008 Lucas 415-992-4432 00008 Lucas 415-992-4432 00008 Lucas 415-992-4432 00009 Steve 530-288-9832 00010 Kelly 916-992-1234 00009 Steve 530-288-9832 T1R3 00011 Betty 650-241-1192 00010 Kelly 916-992-1234 00012 Anne 206-294-1298 00011 Betty 650-241-1192 00009 Steve 530-288-9832 00012 Anne 206-294-1298 00010 Kelly 916-992-1234 00011 Betty 650-241-1192 00012 Anne 206-294-1298 Full table T1 Table T1 split into 3 regions - R1, R2 and R3 48 Monday, April 1, 13
  • 50. Distributed  mode  (con7nued) • Regions  are  served  by   DataNode RegionServer T1R1 00001 00002 John Paul 415-111-1234 408-432-9922 RegionServers 00003 00004 Ron Bob 415-993-2124 818-243-9988 • RegionServers  are  Java   Host1 T1R2 processes,  typically   00005 Carly 206-221-9123 00006 Scott 818-231-2566 DataNode RegionServer 00007 Simon 425-112-9877 00008 Lucas 415-992-4432 collocated  with  the   T1R3 00009 00010 Steve Kelly 530-288-9832 916-992-1234 DataNodes Host2 00011 00012 Betty Anne 650-241-1192 206-294-1298 DataNode RegionServer Host3 49 Monday, April 1, 13
  • 51. Finding  your  region • Two  special  tables:  -­‐ROOT-­‐  and  .META. -ROOT- table, consisting of only 1 region -ROOT- hosted on region server RS1 RS1 .META. table, consisting of 3 regions, M1 M2 M3 hosted on RS1, RS2 and RS3 RS2 RS3 RS1 User tables T1 and T2, consisting of T1R1 T1R2 T2R1 T1R3 T2R2 T2R3 T2R4 4 and 3 regions respectively, distributed amongst RS1, RS2 and RS3 RS3 RS1 RS2 RS3 RS2 RS1 RS1 50 Monday, April 1, 13
  • 52. Regions  distributed  across  the  cluster T1-R1 00001 John 415-111-1234 DataNode RegionServer 00002 Paul 408-432-9922 00003 Ron 415-993-2124 RegionServer 1 (RS1), hosting regions R1 of 00004 Bob 818-243-9988 the user table T1 and M2 of .META. .META. - M2 T1:00009-T1:00012 R3 RS2 Host1 T1-R2 00005 Carly 206-221-9123 00006 Scott 818-231-2566 00007 Simon 425-112-9877 DataNode RegionServer 00008 Lucas 415-992-4432 T1-R3 RegionServer 2 (RS2), hosting regions R2, 00009 Steve 530-288-9832 R3 of the user table and M1 of .META. 00010 Kelly 916-992-1234 00011 Betty 650-241-1192 00012 Anne 206-294-1298 Host2 .META. - M1 T1:00001-T1:00004 R1 RS1 T1:00005-T1:00008 R2 RS2 DataNode RegionServer -ROOT- M:T100001-M:T100009 M1 RS2 RegionServer 3 (RS3), hosting just -ROOT- M:T100010-M:T100012 M2 RS1 Host3 51 Monday, April 1, 13
  • 53. What  the  client  does MyTable .META. TheRealMT -ROOT- ZooKeeper 52 Monday, April 1, 13
  • 54. What  the  client  does MyTable .META. TheRealMT -ROOT- ZooKeeper 52 Monday, April 1, 13
  • 55. What  the  client  does MyTable .META. TheRealMT -ROOT- ZooKeeper Row per META region 52 Monday, April 1, 13
  • 56. What  the  client  does MyTable .META. TheRealMT -ROOT- ZooKeeper Row per META region Row per table region 52 Monday, April 1, 13
  • 57. What  the  client  does MyTable .META. TheRealMT -ROOT- ZooKeeper Row per META region Row per table region 52 Monday, April 1, 13
  • 58. HBase  and  Hadoop ...  they  play  well  together 53 Monday, April 1, 13
  • 59. HBase and Hadoop • HBase has tight integration with Hadoop • Two separate concerns • Access via MapReduce • Storage on HDFS
  • 62. HBase and MapReduce • HBase is tightly integrated with the Hadoop stack • MapReduce • HDFS • Source for MapReduce jobs • Read from an HBase table with a MapReduce job (Export, Row Count, ...) • Sink for MapReduce jobs • Writing to an HBase table from a MapReduce job • Lookups during MapReduce jobs • Join between HBase and/or another source
  • 63. HBase as MR source • One mapper per region [Figure: map tasks, one per region, above the regions being served by the RegionServers. The tasks take the key range of the region as their input split and scan over it]
  • 64. HBase as MR source (continued) • Mapper: protected void map(ImmutableBytesWritable rowkey, Result result, Context context) { /* mapper logic goes here */ } • Job setup: Scan scan = new Scan(); scan.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit")); TableMapReduceUtil.initTableMapperJob("twits", scan, Map.class, ImmutableBytesWritable.class, Result.class, job);
  • 65. HBase as MR sink • Write to regions via MR job [Figure: reduce tasks writing to HBase regions being served by the RegionServers. Reduce tasks don't necessarily write to a region on the same physical host. They write to whichever region contains the key range they are writing into, which could mean that all reduce tasks talk to all regions on the cluster]
  • 66. HBase as MR sink (continued) • Reducer: protected void reduce(ImmutableBytesWritable rowkey, Iterable<Put> values, Context context) { /* reduce logic goes here */ } • Job setup: TableMapReduceUtil.initTableReducerJob("users", Reducer.class, job);
  • 67. HBase as a shared resource • Lookups from MR jobs • Map-side joins • Simple Get/Scan calls [Figure: map tasks reading files from HDFS DataNodes as their source and doing random gets or short scans against HBase regions being served by the RegionServers]
  • 68. Integration with HDFS • HDFS is a filesystem • Underlying storage fabric for HBase • HBase is tightly integrated with HDFS and leverages its semantics • LSM trees fit well into a write-once filesystem • Availability (fault tolerance) and reliability (durability) at scale
  • 69. Availability and fault tolerance [Figure, before: four physical hosts, each running a RegionServer and a DataNode on HDFS; one RegionServer serves region R1, and three copies of HFile1 are spread across the DataNodes. Disaster strikes on the host that was serving region R1. Figure, after: another server picks up the region and starts serving it from the HFile persisted in HDFS. The lost copy of the HFile gets replicated to another DataNode]
  • 70. Durability • WAL keeps writes intact even if an RS fails • HDFS is reliable and doesn't lose data if individual nodes fail
  • 71. Summary • Big data = new products, applications • Hadoop ecosystem: the poster child • HBase serves many use cases • Lets you solve aspects of big data problems • Simple Java API • Tight Hadoop integration • Low latency access to large amounts of data • Large scale computation using MR
  • 73. Schema design ... it's a database after all
  • 74. But isn't HBase schema-less? • Number of tables • Rowkey design • Number of column families per table, and what goes into which column family • Column qualifier names • What goes into the cells • Number of versions
  • 75. Rowkeys • Rowkey design is the single most important aspect of HBase table design • The rowkey is the only way to address rows in HBase
  • 76. TwitBase relationships • Users follow users • Relationships need to be persisted for later use • Model tables for the expected access patterns • Read pattern • Who does A follow? • Who follows A? • Does A follow B? • Write pattern • A follows B • A unfollows B
  • 77. Start simple • Adjacency list • Column family: follows • Row key: userid • Column qualifier: followed user number • Cell value: followed userid [Example table, column family follows: TheFakeMT -> 1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers; TheRealMT -> 1:HRogers, 2:Olivia]
  • 78. Optimizing the adjacency list • We need a count • Where does a new followed user go? [Example table, column family follows: TheFakeMT -> 1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, count:4; TheRealMT -> 1:HRogers, 2:Olivia, count:2]
  • 79. Adding a new user [Figure: the row that needs to be updated is TheFakeMT, in the table follows: TheFakeMT -> 1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, count:4; TheRealMT -> 1:HRogers, 2:Olivia, count:2] • Client code • Step 1: Get current count: TheFakeMT : follows: {count -> 4} • Step 2: Update (increment) count: TheFakeMT : follows: {count -> 5} • Step 3: Add new entry: TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5} • Step 4: Write the new data to HBase [Result: TheFakeMT -> 1:TheRealMT, 2:MTFanBoy, 3:Olivia, 4:HRogers, 5:MTFanBoy2, count:5; TheRealMT -> 1:HRogers, 2:Olivia, count:2]
  • 80. Transactions == not good • HBase doesn't have native support (think scale) • Don't want to complicate client side logic • Only solution -> simplify schema [Example table, column family follows, with the followed userid as the column qualifier: TheFakeMT -> TheRealMT:1, MTFanBoy:1, Olivia:1, HRogers:1; TheRealMT -> HRogers:1, Olivia:1]
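To make the difference between the two layouts concrete, here is a minimal in-memory sketch, assuming a row is just a sorted map of column qualifier to cell value. `FollowsSchemaSketch` and its methods are hypothetical names, and nothing here calls HBase; the point is only that the counter layout needs a read before the write, while the simplified layout is a single blind write that is safe to repeat.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for the "follows" column family. Illustration only, not HBase code. */
final class FollowsSchemaSketch {

    // row key (follower) -> column qualifier -> cell value
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    private TreeMap<String, String> row(String follower) {
        return rows.computeIfAbsent(follower, k -> new TreeMap<>());
    }

    /** Counter layout: four client-side steps, with a read before the write. */
    void followWithCounter(String follower, String followed) {
        TreeMap<String, String> row = row(follower);
        int count = Integer.parseInt(row.getOrDefault("count", "0")); // step 1: get current count
        count++;                                                      // step 2: update count
        row.put(Integer.toString(count), followed);                   // step 3: add new entry
        row.put("count", Integer.toString(count));                    // step 4: write back
    }

    /** Simplified layout: the followed userid is the qualifier, so one write is enough. */
    void followSimple(String follower, String followed) {
        row(follower).put(followed, "1");
    }

    /** "Does A follow B?" in the simplified layout is a single-cell lookup. */
    boolean follows(String follower, String followed) {
        return row(follower).containsKey(followed);
    }

    /** Number of cells currently stored in the follower's row. */
    int cellCount(String follower) {
        return row(follower).size();
    }

    public static void main(String[] args) {
        FollowsSchemaSketch table = new FollowsSchemaSketch();
        table.followSimple("TheFakeMT", "MTFanBoy2");
        table.followSimple("TheFakeMT", "MTFanBoy2"); // repeating the write changes nothing
        System.out.println("cells in row: " + table.cellCount("TheFakeMT"));
    }
}
```

In the counter layout, two clients that interleave steps 1 to 4 can overwrite each other, which is why that design would need transaction-like coordination.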
  • 81. Revisit the questions • Read pattern • Who all does A follow? • Who all follows A? • Does A follow B? • Write pattern • A follows B • A unfollows B
  • 83. Denormalization • Second table for the reverse relationship • Otherwise you scan across the entire table, which hurts read performance [Figure: chart relating normalization and denormalization to write performance and read performance, with areas labeled Dreamland and Poor design]
  • 84. More optimizations • Convert into tall-narrow table • Leverage rowkey indexing better • Gets -> short Scans • CF: f • Row key: follower + followed • CQ: followed user's name • Cell value: 1 • Keeping the column family and column qualifier names short reduces the data transferred over the network back to the client. The KeyValue objects become smaller. • The + in the row key refers to concatenating the two values. You could delimit them using any character you like, e.g. A-B or A,B
  • 85. Tall-narrow table example • Denormalization is the way to go [Example table, column family f: TheFakeMT+TheRealMT -> Mark Twain:1; TheFakeMT+MTFanBoy -> Amandeep Khurana:1; TheFakeMT+Olivia -> Olivia Clemens:1; TheFakeMT+HRogers -> Henry Rogers:1; TheRealMT+Olivia -> Olivia Clemens:1; TheRealMT+HRogers -> Henry Rogers:1] • Putting the user name in the column qualifier saves you from looking up the users table for the name of the user given an id. You can simply list out names or ids while looking at relationships just from this table. • The downside is that you need to update the name in all the cells if the user updates their name in their profile. • This is classic denormalization.
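The access patterns this layout serves can be sketched with a sorted map standing in for the table. This is an illustration under assumptions, not HBase code: `TallNarrowSketch` and its methods are invented names, the map value plays the role of the followed user's name, and "+" is used as the delimiter as on the slide. "Does A follow B?" becomes a point lookup on the composite key, and "Who does A follow?" becomes a short scan over keys sharing the prefix "A+".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sorted-map stand-in for the tall-narrow follows table. Illustration only. */
final class TallNarrowSketch {

    // Sorted by row key, the way rows are ordered in an HBase table.
    private final TreeMap<String, String> table = new TreeMap<>();

    /** Composite row key: follower + delimiter + followed. */
    static String rowKey(String follower, String followed) {
        return follower + "+" + followed;
    }

    /** "A follows B": one write of one narrow row. */
    void follow(String follower, String followed, String followedName) {
        table.put(rowKey(follower, followed), followedName);
    }

    /** "A unfollows B": one delete of one narrow row. */
    void unfollow(String follower, String followed) {
        table.remove(rowKey(follower, followed));
    }

    /** "Does A follow B?": a point lookup on the full row key. */
    boolean doesFollow(String follower, String followed) {
        return table.containsKey(rowKey(follower, followed));
    }

    /** "Who does A follow?": a short scan over the rows whose key starts with "A+". */
    List<String> followedBy(String follower) {
        String prefix = follower + "+";
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the follower's key range, stop scanning
            }
            names.add(e.getValue());
        }
        return names;
    }

    public static void main(String[] args) {
        TallNarrowSketch t = new TallNarrowSketch();
        t.follow("TheFakeMT", "TheRealMT", "Mark Twain");
        t.follow("TheFakeMT", "Olivia", "Olivia Clemens");
        t.follow("TheRealMT", "HRogers", "Henry Rogers");
        System.out.println(t.followedBy("TheFakeMT"));
    }
}
```

"Who follows A?" is not served by this key order, which is where the second, reversed table from the denormalization slide comes in.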
  • 86. Uniform rowkey length • MD5 the userids -> 16 bytes + 16 bytes rowkeys • Better distribution of load • CF: f • Row key: md5(follower)md5(followed) • CQ: followed userid • Cell value: followed user's name • Using MD5 of the user ids gives you fixed lengths instead of variable length user ids. You don't need concatenation logic anymore.
  • 87. Uniform rowkey length (continued) [Example table, column family f: MD5(TheFakeMT) MD5(TheRealMT) -> TheRealMT:Mark Twain; MD5(TheFakeMT) MD5(MTFanBoy) -> MTFanBoy:Amandeep Khurana; MD5(TheFakeMT) MD5(Olivia) -> Olivia:Olivia Clemens; MD5(TheFakeMT) MD5(HRogers) -> HRogers:Henry Rogers; MD5(TheRealMT) MD5(Olivia) -> Olivia:Olivia Clemens; MD5(TheRealMT) MD5(HRogers) -> HRogers:Henry Rogers]
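A fixed-length rowkey of this shape can be built with the JDK alone. The sketch below is one possible way to do it, assuming user ids are UTF-8 strings; `Md5RowKeySketch`, `rowKey` and `startsWith` are illustrative names rather than anything from the deck or from HBase.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Builds 32-byte rowkeys of the form md5(follower) followed by md5(followed). */
final class Md5RowKeySketch {

    private static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    static byte[] md5(String userId) {
        try {
            return MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available in this JVM", e);
        }
    }

    /** Both halves have a fixed length, so no delimiter is needed between them. */
    static byte[] rowKey(String follower, String followed) {
        byte[] key = new byte[2 * MD5_LENGTH];
        System.arraycopy(md5(follower), 0, key, 0, MD5_LENGTH);
        System.arraycopy(md5(followed), 0, key, MD5_LENGTH, MD5_LENGTH);
        return key;
    }

    /** True if the rowkey begins with the prefix, e.g. md5(follower) for a prefix scan. */
    static boolean startsWith(byte[] rowKey, byte[] prefix) {
        return rowKey.length >= prefix.length
                && Arrays.equals(Arrays.copyOfRange(rowKey, 0, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] key = rowKey("TheFakeMT", "Olivia");
        System.out.println("rowkey length: " + key.length);
    }
}
```

All rows for one follower still share the 16-byte prefix md5(follower), so "Who does A follow?" remains a short prefix scan; the hash only changes where that follower's rows land in the key space.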
  • 88. Tall v/s Wide tables storage footprint • We'll look at what it means to Get() row r5 from this table [Logical representation of an HBase table, column families CF1 and CF2: r1 -> CF1 c1:v1; CF2 c1:v9, c6:v2. r2 -> CF1 c1:v2, c3:v6. r3 -> CF1 c2:v3; CF2 c5:v6. r4 -> CF1 c2:v4. r5 -> CF1 c1:v1, c3:v5; CF2 c7:v8] [Actual physical storage of the table. HFile for CF1: r1:CF1:c1:t1:v1, r2:CF1:c1:t2:v2, r2:CF1:c3:t3:v6, r3:CF1:c2:t1:v3, r4:CF1:c2:t1:v4, r5:CF1:c1:t2:v1, r5:CF1:c3:t3:v5. HFile for CF2: r1:CF2:c1:t1:v9, r1:CF2:c6:t4:v2, r3:CF2:c5:t4:v6, r5:CF2:c7:t3:v8] [Result object returned for a Get() on row r5, as KeyValue objects: r5:CF1:c1:t2:v1, r5:CF1:c3:t3:v5, r5:CF2:c7:t3:v8] [Structure of a KeyValue object: Key = Row Key, Col Fam, Col Qual, Time Stamp; Value = Cell Value]
  • 89. Rowkey design • Single most important aspect of designing tables • Depends on expected access patterns • HFiles are sorted on the Key part of KeyValue objects [HFile for the info column family in the users table: "TheRealMT", "info", "email", 1329088321289, "samuel@clemens.org"; "TheRealMT", "info", "name", 1329088321289, "Mark Twain"; "TheRealMT", "info", "password", 1329088818321, "abc123"; "TheRealMT", "info", "password", 1329088321289, "Langhorne"]
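The ordering in that HFile listing can be reproduced with a small comparator: row key, then column family, then column qualifier in ascending order, then timestamp in descending order so the newest version comes first. This is a simplified sketch with invented names (`KeyValueOrderSketch`, `Cell`); it compares strings, whereas HBase compares raw bytes, which gives the same order for plain ASCII keys like these.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sorts cells the way the Key part of a KeyValue is ordered. Illustration only. */
final class KeyValueOrderSketch {

    /** Hypothetical, simplified cell: the four key parts plus the value. */
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Row, family and qualifier ascending; timestamp descending (newest first). */
    static final Comparator<Cell> KEY_ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    /** Returns a new list with the cells in key order, leaving the input untouched. */
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(KEY_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("TheRealMT", "info", "password", 1329088321289L, "Langhorne"));
        cells.add(new Cell("TheRealMT", "info", "password", 1329088818321L, "abc123"));
        cells.add(new Cell("TheRealMT", "info", "email", 1329088321289L, "samuel@clemens.org"));
        for (Cell cell : sorted(cells)) {
            System.out.println(cell.qualifier + " @ " + cell.timestamp + " = " + cell.value);
        }
    }
}
```

Because newer timestamps sort first, a read that wants only the latest version of a cell can stop at the first matching entry.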
  • 90. Write optimized • Distribute writes across the cluster • Issue most pronounced with time series data • Hashing: hash("TheRealMT") -> random byte[] • Salting: int salt = Math.abs(new Integer(new Long(timestamp).hashCode()).shortValue() % <number of region servers>); byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));
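The salting idea can also be tried out without HBase on the classpath. The following is a minimal sketch, assuming the rowkey layout is a 4-byte salt, a "|" delimiter and the 8-byte timestamp; `SaltedRowKeySketch` and the `buckets` parameter (standing in for the number of region servers) are illustrative names, and `ByteBuffer` replaces the `Bytes` helper.

```java
import java.nio.ByteBuffer;

/** Builds salted rowkeys for time series writes. Illustration only, no HBase dependency. */
final class SaltedRowKeySketch {

    /** Maps a timestamp to a bucket in [0, buckets); floorMod keeps the result non-negative. */
    static int salt(long timestamp, int buckets) {
        return Math.floorMod(Long.hashCode(timestamp), buckets);
    }

    /** Layout: 4-byte salt, then '|', then the 8-byte timestamp. */
    static byte[] rowKey(long timestamp, int buckets) {
        return ByteBuffer.allocate(Integer.BYTES + 1 + Long.BYTES)
                .putInt(salt(timestamp, buckets))
                .put((byte) '|')
                .putLong(timestamp)
                .array();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("bucket for now: " + salt(now, 8));
        System.out.println("rowkey length: " + rowKey(now, 8).length);
    }
}
```

The trade-off is on the read side: rows for a time range are spread over every bucket, so a range read has to issue one scan per salt value and merge the results.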
  • 91. Read optimized • Data to be accessed together should be stored together • e.g. twit streams: last 10 twits by the users I follow [Figure: two sorted rowkey layouts. With the user id first: Olivia1, Olivia2, Olivia5, Olivia7, Olivia9, TheFakeMT2, TheFakeMT3, TheFakeMT4, TheFakeMT5, TheFakeMT6, TheRealMT1, TheRealMT2, TheRealMT5, TheRealMT8. With the number first: 1Olivia, 1TheRealMT, 2Olivia, 2TheFakeMT, 2TheRealMT, 3TheFakeMT, 4TheFakeMT, 5Olivia, 5TheFakeMT, 5TheRealMT, 6TheFakeMT, 7Olivia, 8TheRealMT, 9Olivia]
  • 92. Relational to Non relational • Relational concepts • Entities • Attributes • Relationships • Entities • Table is a table. Not much going on there • Users table contains... users. Those are entities • Good place to start. Denormalization bends that rule a bit
  • 93. Relational to Non relational (continued) • Attributes • Identifying • Primary keys. Compound keys • Maps to rowkeys • Non-identifying • Other columns • Maps to column qualifiers and cells • Relationships
  • 94. Nested Entities • Column qualifiers can contain data instead of just a column name [Figure: hbase table -> row key -> column family -> fixed qualifier → timestamp → value; for nested entities, a repeating entity is stored as variable qualifier → timestamp → value]
  • 95. Schema design summary • Schema can make or break the performance you get • Rowkey is the single most important thing • Use tricks like hashing and salting • Denormalize to your advantage • There are no joins • Isolate access patterns • Separate CFs or even separate tables • Shorter names -> lower storage footprint • Column qualifiers can be used to store data and not just column names • Big difference as compared to RDBMS
  • 96. Deployments and Operations ... for some real stuff, beyond your laptop
  • 97. Components • HDFS (DataNodes) • Storage • ZooKeeper • Membership management • RegionServers • Serve the regions • HBase Masters • Janitorial work
  • 98. Hardware • DataNodes and RegionServers collocated • Plenty of RAM and disk • 8-12 cores • 48 GB RAM • 12 x 1 TB disks
  • 99. Hardware (continued) • HBase Master and ZooKeeper collocated • They don't necessarily need to be, though • Keep an odd number of ZKs • ~3 is typically good enough for clusters of up to 50-70 nodes • Give ZK an independent spindle for log writing • I/O latency sensitive • 4-6 cores • 24 GB RAM
  • 100. Types of deployments • Standalone for dev purposes • Prototype • < 10 nodes • No real SLAs • 1 ZK + Master, rest are RS • Small production cluster • 10-20 nodes • 1 ZK + Master, maybe NNHA, rest are RS • Medium production cluster • 20-50 nodes • 3 ZK + Master, NNHA, rest are RS • Large production cluster • 50+ nodes • 3 or 5 ZK + Master, NNHA, rest are RS • Hardware choices mostly remain the same
  • 101. Deploying software and configuration • Automated deployment of the software bits highly recommended • Chef • Puppet • Cloudera Manager • Centralized configuration management highly recommended • Consistency • Ease of management
  • 102. Configurations • $HBASE_HOME/conf • hbase-env.sh • Environment configurations • hbase-site.xml • HBase system configurations • Linux configurations • Number of open files • swappiness • HDFS configurations • $HADOOP_HOME/hdfs-site.xml
  • 103. Monitoring • Metrics context from Hadoop • Ganglia • File context • JMX • Cacti • Several metrics available • Write path specific • Read path specific
  • 105. Performance tuning • Performance testing • Use real (or synthetic) workloads • YCSB • Tuning • Write optimized • Memstore • Random-read optimized • Block cache • Smaller block size • Sequential-read optimized • Block cache • Higher block size • GC tuning