SlideShare ist ein Scribd-Unternehmen logo
1 von 50
Downloaden Sie, um offline zu lesen
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Todd	
  Lipcon	
  (Kudu	
  team	
  lead)	
  –	
  todd@cloudera.com	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
   	
   	
   	
   	
   	
   	
  	
  @tlipcon	
  
	
  
Tweet	
  about	
  this	
  talk:	
  @apachekudu	
  or	
  #kudu	
  
	
  
*	
  IncubaEng	
  at	
  the	
  Apache	
  SoGware	
  FoundaEon	
  
	
  
Apache	
  Kudu*:	
  Fast	
  AnalyEcs	
  on	
  
Fast	
  Data	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  Kudu	
  
Storage	
  for	
  Fast	
  AnalyEcs	
  on	
  Fast	
  Data	
  
• New	
  updatable	
  column	
  
store	
  for	
  Hadoop	
  
	
  
• Apache-­‐licensed	
  open	
  
source	
  
• Beta	
  now	
  available	
  
Columnar	
  Store	
  
Kudu	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  Kudu?	
  
3	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Current	
  Storage	
  Landscape	
  in	
  Hadoop	
  Ecosystem	
  
HDFS	
  (GFS)	
  excels	
  at:	
  
•  Batch	
  ingest	
  only	
  (eg	
  hourly)	
  
•  Efficiently	
  scanning	
  large	
  amounts	
  
of	
  data	
  (analyEcs)	
  
HBase	
  (BigTable)	
  excels	
  at:	
  
•  Efficiently	
  finding	
  and	
  wriEng	
  
individual	
  rows	
  
•  Making	
  data	
  mutable	
  
	
  
Gaps	
  exist	
  when	
  these	
  properEes	
  
are	
  needed	
  simultaneously	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  High	
  throughput	
  for	
  big	
  scans	
  
Goal:	
  Within	
  2x	
  of	
  Parquet	
  
	
  
•  Low-­‐latency	
  for	
  short	
  accesses	
  	
  
	
  	
  	
  Goal:	
  1ms	
  read/write	
  on	
  SSD	
  
	
  
•  Database-­‐like	
  semanEcs	
  (iniEally	
  single-­‐row	
  
ACID)	
  
	
  
•  Rela>onal	
  data	
  model	
  
•  SQL	
  queries	
  are	
  easy	
  
•  “NoSQL”	
  style	
  scan/insert/update	
  (Java/C++	
  client)	
  
Kudu	
  Design	
  Goals	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Changing	
  Hardware	
  landscape	
  
•  Spinning	
  disk	
  -­‐>	
  solid	
  state	
  storage	
  
• NAND	
  flash:	
  Up	
  to	
  450k	
  read	
  250k	
  write	
  iops,	
  about	
  2GB/sec	
  read	
  and	
  1.5GB/
sec	
  write	
  throughput,	
  at	
  a	
  price	
  of	
  less	
  than	
  $3/GB	
  and	
  dropping	
  
• 3D	
  XPoint	
  memory	
  (1000x	
  faster	
  than	
  NAND,	
  cheaper	
  than	
  RAM)	
  
•  RAM	
  is	
  cheaper	
  and	
  more	
  abundant:	
  
• 64-­‐>128-­‐>256GB	
  over	
  last	
  few	
  years	
  
•  Takeaway:	
  The	
  next	
  boIleneck	
  is	
  CPU,	
  and	
  current	
  storage	
  systems	
  weren’t	
  
designed	
  with	
  CPU	
  efficiency	
  in	
  mind.	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  Kudu?	
  
7	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Scalable	
  and	
  fast	
  tabular	
  storage	
  
•  Scalable	
  
• Tested	
  up	
  to	
  275	
  nodes	
  (~3PB	
  cluster)	
  
• Designed	
  to	
  scale	
  to	
  1000s	
  of	
  nodes,	
  tens	
  of	
  PBs	
  
•  Fast	
  
• Millions	
  of	
  read/write	
  operaEons	
  per	
  second	
  across	
  cluster	
  
• Mul>ple	
  GB/second	
  read	
  throughput	
  per	
  node	
  
•  Tabular	
  
• SQL-­‐like	
  schema:	
  finite	
  number	
  of	
  typed	
  columns	
  (unlike	
  HBase/Cassandra)	
  
• Fast	
  ALTER	
  TABLE	
  
• “NoSQL”	
  APIs:	
  Java/C++/Python	
  	
  or	
  SQL	
  (Impala/Spark/etc)	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Use	
  cases	
  and	
  architectures	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  Use	
  Cases	
  
Kudu	
  is	
  best	
  for	
  use	
  cases	
  requiring	
  a	
  simultaneous	
  combina>on	
  of	
  
sequen>al	
  and	
  random	
  reads	
  and	
  writes	
  
	
  
● Time	
  Series	
  
○  Examples:	
  Stream	
  market	
  data;	
  fraud	
  detecEon	
  &	
  prevenEon;	
  network	
  monitoring	
  
○  Workload:	
  Insert,	
  updates,	
  scans,	
  lookups	
  
● Online	
  Repor>ng	
  
○  Examples:	
  ODS	
  
○  Workload:	
  Inserts,	
  updates,	
  scans,	
  lookups	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real-­‐Time	
  AnalyEcs	
  in	
  Hadoop	
  Today	
  
Fraud	
  DetecEon	
  in	
  the	
  Real	
  World	
  =	
  Storage	
  Complexity	
  
Considera>ons:	
  
●  How	
  do	
  I	
  handle	
  failure	
  
during	
  this	
  process?	
  
	
  
●  How	
  oGen	
  do	
  I	
  reorganize	
  
data	
  streaming	
  in	
  into	
  a	
  
format	
  appropriate	
  for	
  
reporEng?	
  
	
  
●  When	
  reporEng,	
  how	
  do	
  I	
  see	
  
data	
  that	
  has	
  not	
  yet	
  been	
  
reorganized?	
  
	
  
●  How	
  do	
  I	
  ensure	
  that	
  
important	
  jobs	
  aren’t	
  
interrupted	
  by	
  maintenance?	
  
New	
  ParEEon	
  
Most	
  Recent	
  ParEEon	
  
Historic	
  Data	
  
HBase	
  
Parquet	
  
File	
  
Have	
  we	
  
accumulated	
  
enough	
  data?	
  
Reorganize	
  
HBase	
  file	
  
into	
  Parquet	
  
•  Wait	
  for	
  running	
  operaEons	
  to	
  complete	
  	
  
•  Define	
  new	
  Impala	
  parEEon	
  referencing	
  
the	
  newly	
  wriyen	
  Parquet	
  file	
  
Incoming	
  Data	
  
(Messaging	
  
System)	
  
ReporEng	
  
Request	
  
Impala	
  on	
  HDFS	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real-­‐Time	
  AnalyEcs	
  in	
  Hadoop	
  with	
  Kudu	
  
Improvements:	
  
●  One	
  system	
  to	
  operate	
  
●  No	
  cron	
  jobs	
  or	
  background	
  
processes	
  
●  Handle	
  late	
  arrivals	
  or	
  data	
  
correc>ons	
  with	
  ease	
  
●  New	
  data	
  available	
  
immediately	
  for	
  analy>cs	
  or	
  
opera>ons	
  	
  
Historical	
  and	
  Real-­‐Eme	
  
Data	
  
Incoming	
  Data	
  
(Messaging	
  
System)	
  
ReporEng	
  
Request	
  
Storage	
  in	
  Kudu	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  use	
  case	
  
•  World’s	
  4th	
  largest	
  smart-­‐phone	
  maker	
  (most	
  popular	
  in	
  China)	
  
•  Gather	
  important	
  RPC	
  tracing	
  events	
  from	
  mobile	
  app	
  and	
  backend	
  service.	
  	
  
•  Service	
  monitoring	
  &	
  troubleshooEng	
  tool.	
  
u  High	
  write	
  throughput	
  
•  >20	
  Billion	
  records/day	
  and	
  growing	
  
u  Query	
  latest	
  data	
  and	
  quick	
  response	
  
•  IdenEfy	
  and	
  resolve	
  issues	
  quickly	
  
u  Can	
  search	
  for	
  individual	
  records	
  
•  Easy	
  for	
  troubleshooEng	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  Big	
  Data	
  Analy>cs	
  Pipeline	
  
Before	
  Kudu
•  Long	
  pipeline	
  
high	
  latency(1	
  hour	
  ~	
  1	
  day),	
  data	
  conversion	
  pains	
  
•  No	
  ordering	
  
Log	
  arrival(storage)	
  order	
  not	
  exactly	
  logical	
  order	
  
e.g.	
  read	
  2-­‐3	
  days	
  of	
  log	
  for	
  data	
  in	
  1	
  day
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Xiaomi	
  Big	
  Data	
  Analysis	
  Pipeline	
  
Simplified	
  With	
  Kudu
•  ETL	
  Pipeline(0~10s	
  latency)	
  
Apps	
  that	
  need	
  to	
  prevent	
  backpressure	
  or	
  require	
  ETL	
  	
  
•  Direct	
  Pipeline(no	
  latency)	
  
Apps	
  that	
  don’t	
  require	
  ETL	
  and	
  no	
  backpressure	
  issues	
  
	
  
OLAP	
  scan	
  
Side	
  table	
  lookup	
  
Result	
  store	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  it	
  works	
  
ReplicaEon	
  and	
  fault	
  tolerance	
  
16	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Tables,	
  Tablets,	
  and	
  Tablet	
  Servers	
  
•  Table	
  is	
  horizontally	
  par>>oned	
  into	
  tablets	
  
• Range	
  or	
  hash	
  parEEoning	
  
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
•  bucketNumber = hashCode(row[‘timestamp’]) % 100
•  Each	
  tablet	
  has	
  N	
  replicas	
  (3	
  or	
  5),	
  with	
  Ra]	
  consensus	
  
• AutomaEc	
  fault	
  tolerance	
  
• MTTR:	
  ~5	
  seconds	
  
•  Tablet	
  servers	
  host	
  tablets	
  on	
  local	
  disk	
  drives	
  
17	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Metadata	
  and	
  the	
  Master	
  
•  Replicated	
  master	
  
• Acts	
  as	
  a	
  tablet	
  directory	
  
• Acts	
  as	
  a	
  catalog	
  (which	
  tables	
  exist,	
  etc)	
  
• Acts	
  as	
  a	
  load	
  balancer	
  (tracks	
  TS	
  liveness,	
  re-­‐replicates	
  under-­‐replicated	
  
tablets)	
  
•  Not	
  a	
  boIleneck	
  
• super	
  fast	
  in-­‐memory	
  lookups	
  
18	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Client	
  
Hey	
  Master!	
  Where	
  is	
  the	
  row	
  for	
  
‘tlipcon’	
  in	
  table	
  “T”?	
  
It’s	
  part	
  of	
  tablet	
  2,	
  which	
  is	
  on	
  servers	
  {Z,Y,X}.	
  
BTW,	
  here’s	
  info	
  on	
  other	
  tablets	
  you	
  might	
  
care	
  about:	
  T1,	
  T2,	
  T3,	
  …	
  
UPDATE	
  tlipcon	
  
SET	
  col=foo	
  
	
  
Meta	
  Cache	
  
T1:	
  …	
  
T2:	
  …	
  
T3:	
  …	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  it	
  works	
  
Columnar	
  storage	
  
20	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  storage	
  
{25059873,	
  
22309487,	
  
23059861,	
  
23010982}	
  
Tweet_id	
  
{newsycbot,	
  
RideImpala,	
  
fastly,	
  
llvmorg}	
  
User_name	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
{Visual	
  exp…,	
  
Introducing	
  ..,	
  
Missing	
  July…,	
  
LLVM	
  3.7….}	
  
text	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  storage	
  
{25059873,	
  
22309487,	
  
23059861,	
  
23010982}	
  
Tweet_id	
  
{newsycbot,	
  
RideImpala,	
  
fastly,	
  
llvmorg}	
  
User_name	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
{Visual	
  exp…,	
  
Introducing	
  ..,	
  
Missing	
  July…,	
  
LLVM	
  3.7….}	
  
text	
  
SELECT	
  COUNT(*)	
  FROM	
  tweets	
  WHERE	
  user_name	
  =	
  ‘newsycbot’;	
  
Only	
  read	
  1	
  column	
  	
  
1GB	
   2GB	
   1GB	
   200GB	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  compression	
  
{1442865158,	
  
1442828307,	
  
1442865156,	
  
1442865155}	
  
Created_at	
  
Created_at	
   Diff(created_at)	
  
1442865158	
   n/a	
  
1442828307	
   -­‐36851	
  
1442865156	
   36849	
  
1442865155	
   -­‐1	
  
64	
  bits	
  each	
   17	
  bits	
  each	
  
•  Many	
  columns	
  can	
  compress	
  to	
  
a	
  few	
  bits	
  per	
  row!	
  
•  Especially:	
  
• Timestamps	
  
• Time	
  series	
  values	
  
• Low-­‐cardinality	
  strings	
  
•  Massive	
  space	
  savings	
  and	
  
throughput	
  increase!	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
IntegraEons	
  
25	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  DataSource	
  integraEon	
  (WIP)	
  
sqlContext.load("org.kududb.spark",
Map("kudu.table" -> “foo”,
"kudu.master" -> “master.example.com”))
.registerTempTable(“mytable”)
df = sqlContext.sql(
“select col_a, col_b from mytable “ +
“where col_c = 123”)
Available	
  in	
  Kudu	
  0.7.0,	
  but	
  sEll	
  being	
  improved	
  
26	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  integraEon	
  
• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16
BUCKETS AS SELECT … FROM …
• INSERT/UPDATE/DELETE
• Not an Impala user? Community working on other integrations (Hive,
Drill, Presto, Phoenix)
27	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
MapReduce	
  integraEon	
  
• MulE-­‐framework	
  cluster	
  (MR	
  +	
  HDFS	
  +	
  Kudu	
  on	
  the	
  same	
  disks)	
  
• KuduTableInputFormat	
  /	
  KuduTableOutputFormat	
  
• Support	
  for	
  pushing	
  predicates,	
  column	
  projecEons,	
  etc	
  
28	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Performance	
  
28	
  
29	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
TPC-­‐H	
  (AnalyEcs	
  benchmark)	
  
•  75	
  server	
  cluster	
  
• 12	
  (spinning)	
  disk	
  each,	
  enough	
  RAM	
  to	
  fit	
  dataset	
  
• TPC-­‐H	
  Scale	
  Factor	
  100	
  (100GB)	
  
•  Example	
  query:	
  
•  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY
n_name ORDER BY revenue desc;
29	
  
30	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
-­‐	
  Kudu	
  outperforms	
  Parquet	
  by	
  31%	
  (geometric	
  mean)	
  for	
  RAM-­‐resident	
  data	
  
31	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Versus	
  other	
  NoSQL	
  storage	
  
•  Phoenix:	
  SQL	
  layer	
  on	
  HBase	
  
•  10	
  node	
  cluster	
  (9	
  worker,	
  1	
  master)	
  
•  TPC-­‐H	
  LINEITEM	
  table	
  only	
  (6B	
  rows)	
  
31	
  
2152	
  
219	
  
76	
  
131	
  
0.04	
  
1918	
  
13.2	
  
1.7	
  
0.7	
  
0.15	
  
155	
  
9.3	
  
1.4	
   1.5	
   1.37	
  
0.01	
  
0.1	
  
1	
  
10	
  
100	
  
1000	
  
10000	
  
Load	
   TPCH	
  Q1	
   COUNT(*)	
  
COUNT(*)	
  
WHERE…	
  
single-­‐row	
  
lookup	
  
Time	
  (sec)	
  
Phoenix	
  
Kudu	
  
Parquet	
  
32	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What	
  about	
  NoSQL-­‐style	
  random	
  access?	
  (YCSB)	
  
•  YCSB	
  0.5.0-­‐snapshot	
  
•  10	
  node	
  cluster	
  
(9	
  worker,	
  1	
  master)	
  
•  100M	
  row	
  data	
  set	
  
•  10M	
  operaEons	
  each	
  
workload	
  
32	
  
33	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  
33	
  
34	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Project	
  status	
  
•  Open	
  source	
  beta	
  released	
  in	
  September	
  
•  Latest	
  release	
  0.7.1	
  hot	
  off	
  the	
  presses	
  
• Usable	
  for	
  many	
  applicaEons	
  (Xiaomi	
  in	
  producEon)	
  
• Have	
  not	
  experienced	
  unrecoverable	
  data	
  loss,	
  reasonably	
  stable	
  (almost	
  no	
  
crashes	
  reported).	
  Users	
  tesEng	
  up	
  to	
  200	
  nodes	
  so	
  far.	
  
• SEll	
  requires	
  some	
  expert	
  assistance,	
  and	
  you’ll	
  probably	
  find	
  some	
  bugs	
  
•  Part	
  of	
  the	
  Apache	
  So]ware	
  Founda>on	
  Incubator	
  
• Community-­‐driven	
  open	
  source	
  process	
  
	
  
35	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  Kudu	
  Community	
  
36	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  as	
  a	
  user	
  
•  hyp://getkudu.io	
  
•  user@kudu.incubator.apache.org	
  
•  hyp://getkudu-­‐slack.herokuapp.com/	
  
•  Quickstart	
  VM	
  
• Easiest	
  way	
  to	
  get	
  started	
  
• Impala	
  and	
  Kudu	
  in	
  an	
  easy-­‐to-­‐install	
  VM	
  
•  CSD	
  and	
  Parcels	
  
• For	
  installaEon	
  on	
  a	
  Cloudera	
  Manager-­‐managed	
  cluster	
  
36	
  
37	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ge…ng	
  started	
  as	
  a	
  developer	
  
•  hyp://github.com/apache/incubator-­‐kudu	
  
•  Code	
  reviews:	
  hyp://gerrit.cloudera.org	
  
•  Public	
  JIRA:	
  hyp://issues.apache.org/jira/browse/KUDU	
  
• Includes	
  bugs	
  going	
  back	
  to	
  2013.	
  Come	
  see	
  our	
  dirty	
  laundry!	
  
•  Mailing	
  list:	
  dev@kudu.incubator.apache.org	
  
•  Apache	
  2.0	
  license	
  open	
  source	
  
•  ContribuEons	
  are	
  welcome	
  and	
  encouraged!	
  
37	
  
38	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
hyp://getkudu.io/	
  
@getkudu	
  
39	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Backup	
  slides	
  
40	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  it	
  works	
  
Write	
  and	
  read	
  paths	
  
40	
  
41	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  Inserts	
  and	
  Flushes	
  
41	
  
MemRowSet	
  
INSERT	
  (“todd”,	
  
“$1000”,”engineer”)	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
flush	
  
42	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  Inserts	
  and	
  Flushes	
  
42	
  
MemRowSet	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
name	
   pay	
   role	
  
DiskRowSet	
  2	
  
INSERT	
  (“doug”,	
  “$1B”,	
  “Hadoop	
  man”)	
  
flush	
  
43	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  -­‐	
  Updates	
  
43	
  
MemRowSet	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
name	
   pay	
   role	
  
DiskRowSet	
  2	
  
Delta	
  MS	
  
Delta	
  MS	
  
Each	
  DiskRowSet	
  has	
  its	
  own	
  
DeltaMemStore	
  to	
  
accumulate	
  updates	
  
base	
  data	
  
base	
  data	
  
44	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  -­‐	
  Updates	
  
44	
  
MemRowSet	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
name	
   pay	
   role	
  
DiskRowSet	
  2	
  
Delta	
  MS	
  
Delta	
  MS	
  
UPDATE	
  set	
  pay=“$1M”	
  
WHERE	
  name=“todd”	
  
Is	
  the	
  row	
  in	
  DiskRowSet	
  2?	
  
(check	
  bloom	
  filters)	
  
Is	
  the	
  row	
  in	
  DiskRowSet	
  1?	
  
(check	
  bloom	
  filters)	
  
Bloom	
  says:	
  no!	
  
Bloom	
  says:	
  maybe!	
  
Search	
  key	
  column	
  to	
  find	
  
offset:	
  rowid	
  =	
  150	
  
150:	
  pay=$1M	
  
@	
  Eme	
  T	
  
	
  
base	
  data	
  
45	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  Read	
  path	
  
45	
  
MemRowSet	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
name	
   pay	
   role	
  
DiskRowSet	
  2	
  
Delta	
  MS	
  
Delta	
  MS	
  
150:	
  pay=$1M	
  
@	
  Eme	
  T	
  
Read	
  rows	
  in	
  DiskRowSet	
  2	
  
Then,	
  read	
  rows	
  in	
  
DiskRowSet	
  1	
  
Updates	
  are	
  applied	
  based	
  on	
  ordinal	
  
offset	
  within	
  DRS:	
  array	
  indexing	
  =	
  fast	
  
base	
  data	
  
base	
  data	
  
46	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  Delta	
  flushes	
  
46	
  
MemRowSet	
  
name	
   pay	
   role	
  
DiskRowSet	
  1	
  
name	
   pay	
   role	
  
DiskRowSet	
  2	
  
Delta	
  MS	
  
Delta	
  MS	
  
150:	
  pay=$1M	
  
@	
  Eme	
  T	
  
REDO	
  DeltaFile	
  
Flush	
  
A	
  REDO	
  delta	
  indicates	
  how	
  to	
  
transform	
  between	
  the	
  ‘base	
  
data’	
  (columnar)	
  and	
  a	
  later	
  version	
  
base	
  data	
  
base	
  data	
  
47	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  Major	
  delta	
  compacEon	
  
47	
  
name	
   pay	
   role	
  
DiskRowSet(pre-­‐compacEon)	
  
Delta	
  MS	
  
REDO	
  DeltaFile	
   REDO	
  DeltaFile	
   REDO	
  DeltaFile	
  
Many	
  deltas	
  accumulate:	
  lots	
  of	
  delta	
  applicaEon	
  
work	
  on	
  reads	
  
Merge	
  updates	
  for	
  columns	
  with	
  high	
  update	
  
percentage	
  
name	
   pay	
   role	
  
DiskRowSet(post-­‐compacEon)	
  
Delta	
  MS	
  
Unmerged	
  REDO	
  
deltas	
  UNDO	
  deltas	
  
base	
  data	
  
If	
  a	
  column	
  has	
  few	
  updates,	
  doesn’t	
  need	
  to	
  be	
  re-­‐
wriyen:	
  those	
  deltas	
  maintained	
  in	
  new	
  DeltaFile	
  
48	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Kudu	
  storage	
  –	
  RowSet	
  CompacEons	
  
DRS	
  1	
  (32MB)	
  
[PK=alice],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=joe],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=linda],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=zach]	
  
DRS	
  2	
  (32MB)	
  
	
  	
  [PK=bob],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=jon],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=mary]	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=zeke]	
  
DRS	
  3	
  (32MB)	
  
	
  	
  	
  	
  [PK=carl],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=julie],	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  [PK=omar]	
  	
  	
  	
  	
  	
  	
  	
  [PK=zoe]	
  
DRS	
  4	
  (32MB)	
   DRS	
  5	
  (32MB)	
   DRS	
  6	
  (32MB)	
  
[alice,	
  bob,	
  carl,	
  
joe]	
  
[jon,	
  julie,	
  linda,	
  
mary]	
  
[omar,	
  zach,	
  
zeke,	
  zoe]	
  
Reorganize	
  rows	
  to	
  avoid	
  rowsets	
  
with	
  overlapping	
  key	
  ranges	
  
49	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
RaG	
  consensus	
  
49	
  
TS	
  A	
  
	
  
	
  
	
  
	
  
Tablet	
  1	
  
(LEADER)	
  
Client	
  
TS	
  B	
  
	
  
	
  
	
  
	
  
Tablet	
  1	
  
(FOLLOWER)	
  
TS	
  C	
  
	
  
	
  
	
  
	
  
Tablet	
  1	
  
(FOLLOWER)	
  
WAL	
  
WAL	
  WAL	
  
2b.	
  Leader	
  writes	
  local	
  WAL	
  
1a.	
  Client-­‐>Leader:	
  Write()	
  RPC	
  
2a.	
  Leader-­‐>Followers:	
  
UpdateConsensus()	
  RPC	
  
3.	
  Follower:	
  write	
  WAL	
  
4.	
  Follower-­‐>Leader:	
  success	
  
3.	
  Follower:	
  write	
  WAL	
  
5.	
  Leader	
  has	
  achieved	
  majority	
  
6.	
  Leader-­‐>Client:	
  Success!	
  
50	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Handling	
  inserts	
  and	
  updates	
  
•  Inserts	
  go	
  to	
  an	
  in-­‐memory	
  row	
  store	
  (MemRowSet)	
  
• Durable	
  due	
  to	
  write-­‐ahead	
  logging	
  
• Later	
  flush	
  to	
  columnar	
  format	
  on	
  disk	
  
•  Updates	
  go	
  to	
  in-­‐memory	
  “delta	
  store”	
  
• Later	
  flush	
  to	
  “delta	
  files”	
  on	
  disk	
  
• Eventually	
  “compact”	
  into	
  the	
  previously-­‐wriyen	
  columnar	
  data	
  files	
  
•  Details	
  elided	
  here	
  due	
  to	
  Eme	
  constraints	
  
• available	
  in	
  other	
  slide	
  decks	
  online,	
  or	
  our	
  academic-­‐style	
  paper	
  

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Redis Labs
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQLKonstantin Gredeskoul
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase Cloudera, Inc.
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...Yahoo Developer Network
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.Cloudera, Inc.
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...DataStax
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 

Was ist angesagt? (20)

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL12-Step Program for Scaling Web Applications on PostgreSQL
12-Step Program for Scaling Web Applications on PostgreSQL
 
25 snowflake
25 snowflake25 snowflake
25 snowflake
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Enterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on HadoopEnterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on Hadoop
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...
Getting Started with Apache Cassandra and Apache Zeppelin (DuyHai DOAN, DataS...
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 

Andere mochten auch

Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Apache Phoenix Query Server PhoenixCon2016
Apache Phoenix Query Server PhoenixCon2016Apache Phoenix Query Server PhoenixCon2016
Apache Phoenix Query Server PhoenixCon2016Josh Elser
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseJosh Elser
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceHakka Labs
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesHakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityHakka Labs
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleHakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartHakka Labs
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Hakka Labs
 
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)Amazon Web Services Korea
 
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)Amazon Web Services Korea
 

Andere mochten auch (20)

Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Apache Phoenix Query Server PhoenixCon2016
Apache Phoenix Query Server PhoenixCon2016Apache Phoenix Query Server PhoenixCon2016
Apache Phoenix Query Server PhoenixCon2016
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
Bases de datos en la nube con AWS
Bases de datos en la nube con AWSBases de datos en la nube con AWS
Bases de datos en la nube con AWS
 
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)
AWS CLOUD 2017 - AWS 코어팀과 함께하는 고객 성공 전략 (황인철 상무 & 박성훈 테크니컬 어카운트 매니저 & 김소희 컨설턴트)
 
AWS Services Overview
AWS Services OverviewAWS Services Overview
AWS Services Overview
 
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)
Amazon 인공 지능(AI) 서비스 및 AWS 기반 딥러닝 활용 방법 - 윤석찬 (AWS, 테크에반젤리스트)
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
 
Hive: A Cloud Story
Hive: A Cloud StoryHive: A Cloud Story
Hive: A Cloud Story
 

Ähnlich wie DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data

Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxManish Maheshwari
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaManish Maheshwari
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...Dr. Wilfred Lin (Ph.D.)
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!DataWorks Summit
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechCloudera Japan
 
Exadata 12c New Features RMOUG
Exadata 12c New Features RMOUGExadata 12c New Features RMOUG
Exadata 12c New Features RMOUGFuad Arshad
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...Imperva Incapsula
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
 

Ähnlich wie DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data (20)

Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Timesten Architecture
Timesten ArchitectureTimesten Architecture
Timesten Architecture
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
A5 oracle exadata-the game changer for online transaction processing data w...
A5   oracle exadata-the game changer for online transaction processing data w...A5   oracle exadata-the game changer for online transaction processing data w...
A5 oracle exadata-the game changer for online transaction processing data w...
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
 
Exadata 12c New Features RMOUG
Exadata 12c New Features RMOUGExadata 12c New Features RMOUG
Exadata 12c New Features RMOUG
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 

Mehr von Hakka Labs

DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataHakka Labs
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...Hakka Labs
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringHakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsHakka Labs
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineHakka Labs
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastHakka Labs
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleHakka Labs
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...Hakka Labs
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedHakka Labs
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...Hakka Labs
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 

Mehr von Hakka Labs (18)

DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris Wiggins
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation Engine
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeed
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 

Kürzlich hochgeladen

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Kürzlich hochgeladen (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Todd  Lipcon  (Kudu  team  lead)  –  todd@cloudera.com                                                        @tlipcon     Tweet  about  this  talk:  @apachekudu  or  #kudu     *  IncubaEng  at  the  Apache  SoGware  FoundaEon     Apache  Kudu*:  Fast  AnalyEcs  on   Fast  Data  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  Kudu   Storage  for  Fast  AnalyEcs  on  Fast  Data   • New  updatable  column   store  for  Hadoop     • Apache-­‐licensed  open   source   • Beta  now  available   Columnar  Store   Kudu  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Why  Kudu?   3  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   Current  Storage  Landscape  in  Hadoop  Ecosystem   HDFS  (GFS)  excels  at:   •  Batch  ingest  only  (eg  hourly)   •  Efficiently  scanning  large  amounts   of  data  (analyEcs)   HBase  (BigTable)  excels  at:   •  Efficiently  finding  and  wriEng   individual  rows   •  Making  data  mutable     Gaps  exist  when  these  properEes   are  needed  simultaneously  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   •  High  throughput  for  big  scans   Goal:  Within  2x  of  Parquet     •  Low-­‐latency  for  short  accesses          Goal:  1ms  read/write  on  SSD     •  Database-­‐like  semanEcs  (iniEally  single-­‐row   ACID)     •  Rela>onal  data  model   •  SQL  queries  are  easy   •  “NoSQL”  style  scan/insert/update  (Java/C++  client)   Kudu  Design  Goals  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   Changing  Hardware  landscape   •  Spinning  disk  -­‐>  solid  state  storage   • NAND  flash:  Up  to  450k  read  250k  write  iops,  about  2GB/sec  read  and  1.5GB/ sec  write  throughput,  at  a  price  of  less  than  $3/GB  and  dropping   • 3D  XPoint  memory  (1000x  faster  than  NAND,  cheaper  than  RAM)   •  RAM  is  cheaper  and  more  abundant:   • 64-­‐>128-­‐>256GB  over  last  few  years   •  Takeaway:  The  next  boIleneck  is  CPU,  and  current  storage  systems  weren’t   designed  with  CPU  efficiency  in  mind.  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  Kudu?   7  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Scalable  and  fast  tabular  storage   •  Scalable   • Tested  up  to  275  nodes  (~3PB  cluster)   • Designed  to  scale  to  1000s  of  nodes,  tens  of  PBs   •  Fast   • Millions  of  read/write  operaEons  per  second  across  cluster   • Mul>ple  GB/second  read  throughput  per  node   •  Tabular   • SQL-­‐like  schema:  finite  number  of  typed  columns  (unlike  HBase/Cassandra)   • Fast  ALTER  TABLE   • “NoSQL”  APIs:  Java/C++/Python    or  SQL  (Impala/Spark/etc)  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Use  cases  and  architectures  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  Use  Cases   Kudu  is  best  for  use  cases  requiring  a  simultaneous  combina>on  of   sequen>al  and  random  reads  and  writes     ● Time  Series   ○  Examples:  Stream  market  data;  fraud  detecEon  &  prevenEon;  network  monitoring   ○  Workload:  Insert,  updates,  scans,  lookups   ● Online  Repor>ng   ○  Examples:  ODS   ○  Workload:  Inserts,  updates,  scans,  lookups  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Real-­‐Time  AnalyEcs  in  Hadoop  Today   Fraud  DetecEon  in  the  Real  World  =  Storage  Complexity   Considera>ons:   ●  How  do  I  handle  failure   during  this  process?     ●  How  oGen  do  I  reorganize   data  streaming  in  into  a   format  appropriate  for   reporEng?     ●  When  reporEng,  how  do  I  see   data  that  has  not  yet  been   reorganized?     ●  How  do  I  ensure  that   important  jobs  aren’t   interrupted  by  maintenance?   New  ParEEon   Most  Recent  ParEEon   Historic  Data   HBase   Parquet   File   Have  we   accumulated   enough  data?   Reorganize   HBase  file   into  Parquet   •  Wait  for  running  operaEons  to  complete     •  Define  new  Impala  parEEon  referencing   the  newly  wriyen  Parquet  file   Incoming  Data   (Messaging   System)   ReporEng   Request   Impala  on  HDFS  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Real-­‐Time  AnalyEcs  in  Hadoop  with  Kudu   Improvements:   ●  One  system  to  operate   ●  No  cron  jobs  or  background   processes   ●  Handle  late  arrivals  or  data   correc>ons  with  ease   ●  New  data  available   immediately  for  analy>cs  or   opera>ons     Historical  and  Real-­‐Eme   Data   Incoming  Data   (Messaging   System)   ReporEng   Request   Storage  in  Kudu  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  use  case   •  World’s  4th  largest  smart-­‐phone  maker  (most  popular  in  China)   •  Gather  important  RPC  tracing  events  from  mobile  app  and  backend  service.     •  Service  monitoring  &  troubleshooEng  tool.   u  High  write  throughput   •  >20  Billion  records/day  and  growing   u  Query  latest  data  and  quick  response   •  IdenEfy  and  resolve  issues  quickly   u  Can  search  for  individual  records   •  Easy  for  troubleshooEng  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  Big  Data  Analy>cs  Pipeline   Before  Kudu •  Long  pipeline   high  latency(1  hour  ~  1  day),  data  conversion  pains   •  No  ordering   Log  arrival(storage)  order  not  exactly  logical  order   e.g.  read  2-­‐3  days  of  log  for  data  in  1  day
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   Xiaomi  Big  Data  Analysis  Pipeline   Simplified  With  Kudu •  ETL  Pipeline(0~10s  latency)   Apps  that  need  to  prevent  backpressure  or  require  ETL     •  Direct  Pipeline(no  latency)   Apps  that  don’t  require  ETL  and  no  backpressure  issues     OLAP  scan   Side  table  lookup   Result  store  
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   How  it  works   ReplicaEon  and  fault  tolerance   16  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Tables,  Tablets,  and  Tablet  Servers   •  Table  is  horizontally  par>>oned  into  tablets   • Range  or  hash  parEEoning   • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS •  bucketNumber = hashCode(row[‘timestamp’]) % 100 •  Each  tablet  has  N  replicas  (3  or  5),  with  Ra]  consensus   • AutomaEc  fault  tolerance   • MTTR:  ~5  seconds   •  Tablet  servers  host  tablets  on  local  disk  drives   17  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Metadata  and  the  Master   •  Replicated  master   • Acts  as  a  tablet  directory   • Acts  as  a  catalog  (which  tables  exist,  etc)   • Acts  as  a  load  balancer  (tracks  TS  liveness,  re-­‐replicates  under-­‐replicated   tablets)   •  Not  a  boIleneck   • super  fast  in-­‐memory  lookups   18  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Client   Hey  Master!  Where  is  the  row  for   ‘tlipcon’  in  table  “T”?   It’s  part  of  tablet  2,  which  is  on  servers  {Z,Y,X}.   BTW,  here’s  info  on  other  tablets  you  might   care  about:  T1,  T2,  T3,  …   UPDATE  tlipcon   SET  col=foo     Meta  Cache   T1:  …   T2:  …   T3:  …  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   How  it  works   Columnar  storage   20  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  storage   {25059873,   22309487,   23059861,   23010982}   Tweet_id   {newsycbot,   RideImpala,   fastly,   llvmorg}   User_name   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   {Visual  exp…,   Introducing  ..,   Missing  July…,   LLVM  3.7….}   text  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  storage   {25059873,   22309487,   23059861,   23010982}   Tweet_id   {newsycbot,   RideImpala,   fastly,   llvmorg}   User_name   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   {Visual  exp…,   Introducing  ..,   Missing  July…,   LLVM  3.7….}   text   SELECT  COUNT(*)  FROM  tweets  WHERE  user_name  =  ‘newsycbot’;   Only  read  1  column     1GB   2GB   1GB   200GB  
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  compression   {1442865158,   1442828307,   1442865156,   1442865155}   Created_at   Created_at   Diff(created_at)   1442865158   n/a   1442828307   -­‐36851   1442865156   36849   1442865155   -­‐1   64  bits  each   17  bits  each   •  Many  columns  can  compress  to   a  few  bits  per  row!   •  Especially:   • Timestamps   • Time  series  values   • Low-­‐cardinality  strings   •  Massive  space  savings  and   throughput  increase!  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   IntegraEons  
  • 25. 25  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  DataSource  integraEon  (WIP)   sqlContext.load("org.kududb.spark", Map("kudu.table" -> “foo”, "kudu.master" -> “master.example.com”)) .registerTempTable(“mytable”) df = sqlContext.sql( “select col_a, col_b from mytable “ + “where col_c = 123”) Available  in  Kudu  0.7.0,  but  sEll  being  improved  
  • 26. 26  ©  Cloudera,  Inc.  All  rights  reserved.   Impala  integraEon   • CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM … • INSERT/UPDATE/DELETE • Not an Impala user? Community working on other integrations (Hive, Drill, Presto, Phoenix)
  • 27. 27  ©  Cloudera,  Inc.  All  rights  reserved.   MapReduce  integraEon   • MulE-­‐framework  cluster  (MR  +  HDFS  +  Kudu  on  the  same  disks)   • KuduTableInputFormat  /  KuduTableOutputFormat   • Support  for  pushing  predicates,  column  projecEons,  etc  
  • 28. 28  ©  Cloudera,  Inc.  All  rights  reserved.   Performance   28  
  • 29. 29  ©  Cloudera,  Inc.  All  rights  reserved.   TPC-­‐H  (AnalyEcs  benchmark)   •  75  server  cluster   • 12  (spinning)  disk  each,  enough  RAM  to  fit  dataset   • TPC-­‐H  Scale  Factor  100  (100GB)   •  Example  query:   •  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc; 29  
  • 30. 30  ©  Cloudera,  Inc.  All  rights  reserved.   -­‐  Kudu  outperforms  Parquet  by  31%  (geometric  mean)  for  RAM-­‐resident  data  
  • 31. 31  ©  Cloudera,  Inc.  All  rights  reserved.   Versus  other  NoSQL  storage   •  Phoenix:  SQL  layer  on  HBase   •  10  node  cluster  (9  worker,  1  master)   •  TPC-­‐H  LINEITEM  table  only  (6B  rows)   31   2152   219   76   131   0.04   1918   13.2   1.7   0.7   0.15   155   9.3   1.4   1.5   1.37   0.01   0.1   1   10   100   1000   10000   Load   TPCH  Q1   COUNT(*)   COUNT(*)   WHERE…   single-­‐row   lookup   Time  (sec)   Phoenix   Kudu   Parquet  
  • 32. 32  ©  Cloudera,  Inc.  All  rights  reserved.   What  about  NoSQL-­‐style  random  access?  (YCSB)   •  YCSB  0.5.0-­‐snapshot   •  10  node  cluster   (9  worker,  1  master)   •  100M  row  data  set   •  10M  operaEons  each   workload   32  
  • 33. 33  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started   33  
  • 34. 34  ©  Cloudera,  Inc.  All  rights  reserved.   Project  status   •  Open  source  beta  released  in  September   •  Latest  release  0.7.1  hot  off  the  presses   • Usable  for  many  applicaEons  (Xiaomi  in  producEon)   • Have  not  experienced  unrecoverable  data  loss,  reasonably  stable  (almost  no   crashes  reported).  Users  tesEng  up  to  200  nodes  so  far.   • SEll  requires  some  expert  assistance,  and  you’ll  probably  find  some  bugs   •  Part  of  the  Apache  So]ware  Founda>on  Incubator   • Community-­‐driven  open  source  process    
  • 35. 35  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  Kudu  Community  
  • 36. 36  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started  as  a  user   •  hyp://getkudu.io   •  user@kudu.incubator.apache.org   •  hyp://getkudu-­‐slack.herokuapp.com/   •  Quickstart  VM   • Easiest  way  to  get  started   • Impala  and  Kudu  in  an  easy-­‐to-­‐install  VM   •  CSD  and  Parcels   • For  installaEon  on  a  Cloudera  Manager-­‐managed  cluster   36  
  • 37. 37  ©  Cloudera,  Inc.  All  rights  reserved.   Ge…ng  started  as  a  developer   •  hyp://github.com/apache/incubator-­‐kudu   •  Code  reviews:  hyp://gerrit.cloudera.org   •  Public  JIRA:  hyp://issues.apache.org/jira/browse/KUDU   • Includes  bugs  going  back  to  2013.  Come  see  our  dirty  laundry!   •  Mailing  list:  dev@kudu.incubator.apache.org   •  Apache  2.0  license  open  source   •  ContribuEons  are  welcome  and  encouraged!   37  
  • 38. 38  ©  Cloudera,  Inc.  All  rights  reserved.   hyp://getkudu.io/   @getkudu  
  • 39. 39  ©  Cloudera,  Inc.  All  rights  reserved.   Backup  slides  
  • 40. 40  ©  Cloudera,  Inc.  All  rights  reserved.   How  it  works   Write  and  read  paths   40  
  • 41. 41  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  Inserts  and  Flushes   41   MemRowSet   INSERT  (“todd”,   “$1000”,”engineer”)   name   pay   role   DiskRowSet  1   flush  
  • 42. 42  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  Inserts  and  Flushes   42   MemRowSet   name   pay   role   DiskRowSet  1   name   pay   role   DiskRowSet  2   INSERT  (“doug”,  “$1B”,  “Hadoop  man”)   flush  
  • 43. 43  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  -­‐  Updates   43   MemRowSet   name   pay   role   DiskRowSet  1   name   pay   role   DiskRowSet  2   Delta  MS   Delta  MS   Each  DiskRowSet  has  its  own   DeltaMemStore  to   accumulate  updates   base  data   base  data  
  • 44. 44  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  -­‐  Updates   44   MemRowSet   name   pay   role   DiskRowSet  1   name   pay   role   DiskRowSet  2   Delta  MS   Delta  MS   UPDATE  set  pay=“$1M”   WHERE  name=“todd”   Is  the  row  in  DiskRowSet  2?   (check  bloom  filters)   Is  the  row  in  DiskRowSet  1?   (check  bloom  filters)   Bloom  says:  no!   Bloom  says:  maybe!   Search  key  column  to  find   offset:  rowid  =  150   150:  pay=$1M   @  Eme  T     base  data  
  • 45. 45  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  Read  path   45   MemRowSet   name   pay   role   DiskRowSet  1   name   pay   role   DiskRowSet  2   Delta  MS   Delta  MS   150:  pay=$1M   @  Eme  T   Read  rows  in  DiskRowSet  2   Then,  read  rows  in   DiskRowSet  1   Updates  are  applied  based  on  ordinal   offset  within  DRS:  array  indexing  =  fast   base  data   base  data  
  • 46. 46  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  Delta  flushes   46   MemRowSet   name   pay   role   DiskRowSet  1   name   pay   role   DiskRowSet  2   Delta  MS   Delta  MS   150:  pay=$1M   @  Eme  T   REDO  DeltaFile   Flush   A  REDO  delta  indicates  how  to   transform  between  the  ‘base   data’  (columnar)  and  a  later  version   base  data   base  data  
  • 47. 47  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  Major  delta  compacEon   47   name   pay   role   DiskRowSet(pre-­‐compacEon)   Delta  MS   REDO  DeltaFile   REDO  DeltaFile   REDO  DeltaFile   Many  deltas  accumulate:  lots  of  delta  applicaEon   work  on  reads   Merge  updates  for  columns  with  high  update   percentage   name   pay   role   DiskRowSet(post-­‐compacEon)   Delta  MS   Unmerged  REDO   deltas  UNDO  deltas   base  data   If  a  column  has  few  updates,  doesn’t  need  to  be  re-­‐ wriyen:  those  deltas  maintained  in  new  DeltaFile  
  • 48. 48  ©  Cloudera,  Inc.  All  rights  reserved.   Kudu  storage  –  RowSet  CompacEons   DRS  1  (32MB)   [PK=alice],                              [PK=joe],                            [PK=linda],                      [PK=zach]   DRS  2  (32MB)      [PK=bob],                              [PK=jon],                            [PK=mary]                  [PK=zeke]   DRS  3  (32MB)          [PK=carl],                              [PK=julie],                        [PK=omar]                [PK=zoe]   DRS  4  (32MB)   DRS  5  (32MB)   DRS  6  (32MB)   [alice,  bob,  carl,   joe]   [jon,  julie,  linda,   mary]   [omar,  zach,   zeke,  zoe]   Reorganize  rows  to  avoid  rowsets   with  overlapping  key  ranges  
  • 49. 49  ©  Cloudera,  Inc.  All  rights  reserved.   RaG  consensus   49   TS  A           Tablet  1   (LEADER)   Client   TS  B           Tablet  1   (FOLLOWER)   TS  C           Tablet  1   (FOLLOWER)   WAL   WAL  WAL   2b.  Leader  writes  local  WAL   1a.  Client-­‐>Leader:  Write()  RPC   2a.  Leader-­‐>Followers:   UpdateConsensus()  RPC   3.  Follower:  write  WAL   4.  Follower-­‐>Leader:  success   3.  Follower:  write  WAL   5.  Leader  has  achieved  majority   6.  Leader-­‐>Client:  Success!  
  • 50. 50  ©  Cloudera,  Inc.  All  rights  reserved.   Handling  inserts  and  updates   •  Inserts  go  to  an  in-­‐memory  row  store  (MemRowSet)   • Durable  due  to  write-­‐ahead  logging   • Later  flush  to  columnar  format  on  disk   •  Updates  go  to  in-­‐memory  “delta  store”   • Later  flush  to  “delta  files”  on  disk   • Eventually  “compact”  into  the  previously-­‐wriyen  columnar  data  files   •  Details  elided  here  due  to  Eme  constraints   • available  in  other  slide  decks  online,  or  our  academic-­‐style  paper