Data Evolution on HBase with Kiji
Who am I?
How do we store data in HBase?
• HBase provides us with a single value type: byte[]
• In an application it's necessary to store various data types in a cell: e.g. Java primitives, Java objects…
• The description of the data we store in an HBase cell is the schema.
• Write application code or a library to convert our data to/from byte[].
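A minimal sketch of such a hand-written conversion, in Python for brevity. The cell layout (a 2-byte name length, the UTF-8 name, then a 4-byte big-endian age) and the function names are assumptions made up for illustration; HBase itself prescribes none of this.

```python
import struct

def pet_to_bytes(name: str, age: int) -> bytes:
    """Encode (name, age) as: 2-byte name length, name bytes, 4-byte age."""
    raw_name = name.encode("utf-8")
    return struct.pack(">H", len(raw_name)) + raw_name + struct.pack(">i", age)

def pet_from_bytes(cell: bytes) -> tuple[str, int]:
    """Decode a cell written by pet_to_bytes; the layout lives only in this code."""
    (name_len,) = struct.unpack_from(">H", cell, 0)
    name = cell[2:2 + name_len].decode("utf-8")
    (age,) = struct.unpack_from(">i", cell, 2 + name_len)
    return name, age

cell = pet_to_bytes("Ms. Kitty", 3)
print(pet_from_bytes(cell))
```

Nothing in the resulting bytes records this layout, so the "schema" exists only in the code that wrote the cell.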
What about when we want to get our data back from HBase?
• HBase is unaware of what we put in a cell.
• What's in the byte[]?

x0010312985x00
column=B:B:D, timestamp=1381493621000,
value=x0Bx00xB0x9AxA4x9BxB3Px02x80xF6xC4xD5xE6'x00x00<<b>

• Already wrote a library to serialize/deserialize this data, so everything's great, right?
Sometime soon…
• The data structure has changed
• Ok, so we update our library with the changes.
• What happens when we try to read back old data?
• Raise an exception?
• Write a bunch of if (…) else if (…) code to determine the correct format?
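The if/else approach might look like the sketch below (Python). The two layouts and the leading version byte are hypothetical; the point is only that every format change adds a branch that has to be kept around for as long as old cells exist.

```python
import struct

def decode_pet(cell: bytes) -> dict:
    """Decode a cell whose first byte is a hand-assigned format version."""
    version = cell[0]
    if version == 1:
        # v1: the rest of the cell is the UTF-8 name.
        return {"name": cell[1:].decode("utf-8")}
    elif version == 2:
        # v2: 4-byte big-endian age, then the UTF-8 name.
        (age,) = struct.unpack_from(">i", cell, 1)
        return {"name": cell[5:].decode("utf-8"), "age": age}
    else:
        # Written by a newer library than this reader knows about.
        raise ValueError(f"unknown cell format version {version}")

old_cell = b"\x01" + b"Ms. Kitty"
new_cell = b"\x02" + struct.pack(">i", 3) + b"Ms. Kitty"
print(decode_pet(old_cell), decode_pet(new_cell))
```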
Instead, use a serialization library with evolvable records
• Examples:
  • Avro
  • Thrift
  • Protocol Buffers
• They have some notion of compatible changes, which helps us avoid common pitfalls.
A little bit about Avro
• Datum structure defined by schema
• Rules for compatible and incompatible schema changes.
• Backward-compatibility, Forward-compatibility
• Assumes a linear evolution of schema
• Reality: Schema evolution is more complicated.
Ideal vs. Reality
[Figure: in the ideal case schemas evolve in a single line (Schema v1, v2, v3, v4); in reality they branch (Schema v1, v2a, v2b, v3a, v4aa, v4ab).]
A little (more) about Avro
• Schemas can be defined in JSON or IDL format.
• Specific & Generic API
• Avro Schema Example (IDL):

  record Pet {
    string name;
    int age;
    string owner_name;
  }
(even more) about Avro
• You don't have to use compiled record classes.
• Can use the Generic API (GenericDatumReader in the Java library) to deserialize records with a specified schema.
• Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)
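To illustrate what schema-driven ("generic") decoding means, here is a toy decoder in Python that reads a record using only a schema supplied at run time, with no generated class. It follows Avro's binary encoding for int, long and string (zig-zag variable-length integers, length-prefixed UTF-8) but it is not the Avro library; the list-of-pairs schema representation and all function names are simplifications for illustration.

```python
def write_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit integer, then emit 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next position) for a zig-zag varint starting at pos."""
    z = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (z >> 1) ^ -(z & 1), pos

def encode(schema: list, record: dict) -> bytes:
    """Concatenate the fields in schema order; no names or tags are written."""
    out = b""
    for name, type_ in schema:
        if type_ in ("int", "long"):
            out += write_long(record[name])
        elif type_ == "string":
            raw = record[name].encode("utf-8")
            out += write_long(len(raw)) + raw
        else:
            raise TypeError(f"unsupported type {type_}")
    return out

def decode(schema: list, buf: bytes) -> dict:
    """Walk the schema to interpret the bytes; the schema drives everything."""
    record, pos = {}, 0
    for name, type_ in schema:
        if type_ in ("int", "long"):
            record[name], pos = read_long(buf, pos)
        elif type_ == "string":
            length, pos = read_long(buf, pos)
            record[name] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise TypeError(f"unsupported type {type_}")
    return record

PET = [("name", "string"), ("age", "int"), ("owner_name", "string")]
data = encode(PET, {"name": "Ms. Kitty", "age": 3, "owner_name": "Alice"})
print(decode(PET, data))
```

Because the bytes carry no field names, the same cell is unreadable without the schema it was written with, which is why the writer's schema has to be kept somewhere.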
Adding New Fields
• Old Schema:

  record Pet {
    string name; // Ms. Kitty
  }

• New Schema:

  record Pet {
    string name;
    string kind = "animal"; // kitten
  }
Remove Fields
• Old Schema:

  record Pet {
    string name;
    string kind;
  }

• New Schema:

  record Pet {
    string name;
    // string kind; (removed)
  }

What happens when an old reader reads this new record? It can't find kind.
Remove Fields
• Old Schema:

  record Pet {
    string name = "";
    string kind = "animal";
  }

• New Schema:

  record Pet {
    string name = "";
    // string kind = "animal"; (removed)
  }

When an 'old' reader encounters a new record, the default value will be used.

Protip: Always provide default field values so you don't kill your kittens.
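The effect of defaults can be sketched with a toy version of schema resolution (Python). This is a simplification of the rules in the Avro specification and works on already-decoded dictionaries rather than bytes. A reader schema is modelled as a mapping from field name to default value, with the hypothetical sentinel REQUIRED standing for "no default".

```python
REQUIRED = object()  # sentinel: the field has no default value

def resolve(written: dict, reader_schema: dict) -> dict:
    """Project a record as written onto the reader's schema."""
    result = {}
    for field, default in reader_schema.items():
        if field in written:
            result[field] = written[field]
        elif default is not REQUIRED:
            result[field] = default  # the reader's default fills the gap
        else:
            raise LookupError(f"Can't find {field}")
    return result  # fields unknown to the reader are simply dropped

# A record written with the new schema, after 'kind' was removed.
new_record = {"name": "Ms. Kitty"}

old_reader_no_default = {"name": REQUIRED, "kind": REQUIRED}
old_reader_with_default = {"name": "", "kind": "animal"}

print(resolve(new_record, old_reader_with_default))
try:
    resolve(new_record, old_reader_no_default)
except LookupError as err:
    print(err)
```

The same function covers added fields: a record written with an extra field resolves cleanly against an older reader schema because the unknown field is ignored.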
Type Promotion
• Old Schema:

  record PetOwner {
    int ownerId;
    string name;
  }

• New Schema:

  record PetOwner {
    long ownerId;
    string name;
  }

Detailed specification:
http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
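A small Python sketch of the direction in which promotion works. The table lists only the numeric promotions from the Avro specification's schema resolution rules; the function name is made up for illustration, and the specification remains the authority on the complete rule set.

```python
# Writer type -> reader types that may read it via promotion.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}

def reader_can_read(writer_type: str, reader_type: str) -> bool:
    """True if data written as writer_type can be read as reader_type."""
    if writer_type == reader_type:
        return True
    return reader_type in PROMOTIONS.get(writer_type, set())

# Old data written as int is readable by the new long reader...
print(reader_can_read("int", "long"))
# ...but an old int reader cannot read data written as long.
print(reader_can_read("long", "int"))
```

Promotion is one-way: widening ownerId from int to long is safe for new readers of old data, while readers still using the old schema cannot consume newly written values.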
Enter KijiSchema
• Human-friendly table layouts
• Uses Avro for serialization
• Supports primitive types and complex records
• Schema is stored as part of the datum
• Provides schema validation & audit trail
KijiTables >> Plain HBase tables
• Layout defined using JSON or DDL
• Formatted row keys
• Schemas stored in metadata table
• Schema Validation on read & write
• Basically, an enhanced HBase table
Example Table Definition
CREATE TABLE foo WITH DESCRIPTION 'some data'
ROW KEY FORMAT (pet_id LONG)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  COMPRESSED WITH SNAPPY,
  FAMILY info WITH DESCRIPTION 'basic information' (
    metadata CLASS com.mycompany.avro.Pet
  )
);

• Visit www.kiji.org for more details
Beyond Client-side Validation
• Server-side Schema validation
• Ensure your reader is able to read stored records.
• Ensure you make compatible schema changes
• Ensure you don't accidentally introduce a new schema
Schema Validation
• Three available modes:
  • Developer
  • Strict
  • None (Not recommended in most cases, but still possible.)
Developer Mode
• Don't need an ALTER statement to write with a new schema
• New schemas are automatically registered on write
• Incompatible writers are rejected at run-time, so still safe
• Convenience when developing
Strict Mode
• New schemas must be registered with an ALTER statement.
• Incompatible readers and writers are rejected at registration time.
• Production-safe
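The difference between the modes can be modelled with a toy column object in Python. This is not Kiji's API: the class, its method names, and the compatibility check (primitive types with numeric promotion only) are assumptions used to illustrate the behaviour described on the slides.

```python
PROMOTIONS = {"int": {"long", "float", "double"},
              "long": {"float", "double"},
              "float": {"double"}}

def compatible(writer: str, reader: str) -> bool:
    """Can data written as `writer` be read as `reader`?"""
    return writer == reader or reader in PROMOTIONS.get(writer, set())

class Column:
    """Hypothetical model of one column's registered schemas."""

    def __init__(self, mode: str, readers=()):
        self.mode = mode              # "DEVELOPER", "STRICT" or "NONE"
        self.readers = set(readers)   # registered reader schemas
        self.writers = set()          # registered writer schemas

    def register_writer(self, schema: str) -> None:
        """Explicit registration, the role an ALTER statement plays."""
        if self.mode != "NONE":
            for reader in self.readers:
                if not compatible(schema, reader):
                    raise ValueError(
                        f'reader "{reader}" cannot read writer "{schema}"')
        self.writers.add(schema)

    def write(self, schema: str, value):
        if schema not in self.writers:
            if self.mode == "STRICT":
                raise PermissionError(f'writer "{schema}" is not registered')
            # Developer mode: register on first write, still checked.
            self.register_writer(schema)
        return (schema, value)

dev = Column("DEVELOPER", readers={"long"})
print(dev.write("int", 7))        # registered automatically

strict = Column("STRICT", readers={"long"})
strict.register_writer("int")     # must happen before the first write
print(strict.write("int", 7))
```

In both modes an incompatible writer is refused; strict mode moves the failure from the first write to the explicit registration step.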
ALTER Examples
• ALTER TABLE t SET VALIDATION = STRICT;
• ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;

In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".
Avoid Common Pitfalls w/ KijiSchema Validation
1. Record has string field with default value
2. Field removed (compatible)
3. New field with same name added but different type (compatible from perspective of 2)
4. Incompatible between 1 and 3!
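The four steps above can be reproduced with a toy check in Python. A schema is modelled as a mapping from field name to type, every field is assumed to have a default, and the choice of int as the "different type" is an assumption for illustration. The point is that validating a new schema only against its immediate predecessor misses the conflict; it has to be checked against every schema that data was ever written with.

```python
def compatible(a: dict, b: dict) -> bool:
    """Fields on only one side are covered by defaults;
    fields present on both sides must agree on type."""
    return all(a[f] == b[f] for f in a.keys() & b.keys())

def conflicts(history: list, candidate: dict) -> list:
    """Indices of all previously registered schemas the candidate clashes with."""
    return [i for i, old in enumerate(history) if not compatible(old, candidate)]

v1 = {"name": "string", "kind": "string"}  # 1. string field with a default
v2 = {"name": "string"}                    # 2. field removed
v3 = {"name": "string", "kind": "int"}     # 3. same name, different type

# Each step looks fine when compared with its predecessor only...
print(compatible(v1, v2), compatible(v2, v3))
# ...but records written with v1 are still stored, and v3 cannot read them.
print(compatible(v1, v3))
print(conflicts([v1, v2], v3))
```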
• Apache v2 Licensed Open Source
• Includes KijiSchema as well as components for writing
  • MapReduce
  • Hive Adapter
  • Scalding flows for data science
  • REST API supporting on-demand computation for real-time web applications
BentoBox
• Complete development environment for Kiji & HBase
• Single process Hadoop & HBase cluster
• We accept community contributions
• Try it today: www.kiji.org
• User Mailing List
• Developer Mailing List
Questions?
Adam Kunicki	

adam@wibidata.com	

@ramblingpolak on Twitter
