Data changes over time, often requiring carefully planned changes to database tables and application code. KijiSchema integrates best practices for serialization, schema design and evolution, and metadata management common in NoSQL storage solutions. In particular, KijiSchema provides strong guarantees for schema evolution and validates reads and writes issued by application code.
We'll be looking at how you can take advantage of KijiSchema in your HBase applications, especially if you're new to HBase.
The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.
Developers are using Kiji to build:
• Product and content recommendation systems
• Risk analysis and fraud monitoring
• Customer profile and segmentation applications
• Energy usage analytics & reporting
3. How do we store data in HBase?
• HBase provides us with a single value type: byte[]
• In an application it’s necessary to store various data types in a cell: e.g. Java primitives, Java objects…
• The description of data we store in an HBase cell is the schema.
• Write application code or a library to convert our data to/from byte[].
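A minimal sketch of that last bullet, using only Python's standard `struct` module. The wire layout (a 2-byte length prefix, the UTF-8 name, then a 4-byte big-endian age) and the function names are assumptions made for illustration; they are not part of HBase or Kiji.

```python
import struct

def encode_pet(name: str, age: int) -> bytes:
    """Serialize a pet as: 2-byte name length, UTF-8 name, 4-byte signed age."""
    raw_name = name.encode("utf-8")
    return struct.pack(">H", len(raw_name)) + raw_name + struct.pack(">i", age)

def decode_pet(data: bytes) -> tuple:
    """Inverse of encode_pet; only works if the reader knows this exact layout."""
    (name_len,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + name_len].decode("utf-8")
    (age,) = struct.unpack_from(">i", data, 2 + name_len)
    return name, age

cell_value = encode_pet("Ms. Kitty", 3)
print(cell_value)              # opaque bytes, which is all HBase sees
print(decode_pet(cell_value))  # the application has to know how to read them
```

The layout lives only in this code, so every reader of the cell needs the same library at the same version.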
4. What about when we want to get our data back from HBase?
• HBase is unaware of what we put in a cell.
• What’s in the byte[]?
\x0010312985\x00 column=B:B:D, timestamp=1381493621000, value=\x0B\x00\xB0\x9A\xA4\x9B\xB3P\x02\x80\xF6\xC4\xD5\xE6'\x00\x00<<b>
• Already wrote a library to serialize/deserialize this data, so everything’s great, right?
5. Sometime soon…
• The data structure has changed
• Ok, so we update our library with the changes.
• What happens when we try to read back old data?
  • Raise an exception?
  • Write a bunch of if (…) else if (…) code to determine the correct format?
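To make the problem concrete, here is a small sketch in which a hand-rolled decoder is updated for a new field and then pointed at a cell written with the old layout. The layouts and function names are hypothetical, chosen only to illustrate the failure.

```python
import struct

def encode_v1(name: str) -> bytes:
    """Original layout: 2-byte length prefix + UTF-8 name."""
    raw = name.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw

def read_string(data: bytes, offset: int):
    """Read one length-prefixed string; return (value, next_offset)."""
    (length,) = struct.unpack_from(">H", data, offset)
    start = offset + 2
    return data[start:start + length].decode("utf-8"), start + length

def decode_v2(data: bytes) -> dict:
    """Updated layout: name followed by a second string, kind."""
    name, offset = read_string(data, 0)
    kind, _ = read_string(data, offset)  # old cells have no bytes here
    return {"name": name, "kind": kind}

old_cell = encode_v1("Ms. Kitty")  # still sitting in the table
try:
    decode_v2(old_cell)
    outcome = "decoded"
except struct.error:
    outcome = "failed"
print(outcome)
```

Nothing in the bytes says which layout produced them, so the only ways out are the exception or the growing if/else chain from the slide.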
6. Instead, use a serialization library with evolvable records
• Examples:
  • Avro
  • Thrift
  • Protocol Buffers
• Have some notion of compatible changes to help us avoid common pitfalls.
7. A little bit about Avro
• Datum structure defined by schema
• Rules for compatible and incompatible schema changes.
• Backward-compatibility, Forward-compatibility
• Assumes a linear evolution of schema
• Reality: Schema evolution is more complicated.
9. A little (more) about Avro
• Schemas can be defined in JSON or IDL format.
• Specific & Generic API
• Avro Schema Example (IDL):
record Pet {
  string name;
  int age;
  string owner_name;
}
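For comparison, the same record written in Avro's JSON schema format could look like the sketch below. It is parsed here with Python's standard `json` module only to show that it is plain JSON; no Avro library is involved, and the namespace is omitted as on the slide.

```python
import json

# JSON form of the Pet record from the IDL example above.
PET_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Pet",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "owner_name", "type": "string"}
  ]
}
"""

schema = json.loads(PET_SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(field_names)
```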
10. (even more) about Avro
• You don't have to use compiled record classes.
• Can use the Generic API (e.g. GenericDatumReader) to deserialize records with a specified schema.
• Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)
11. Adding New Fields
• Old Schema:
record Pet {
  string name; // Ms. Kitty
}
• New Schema:
record Pet {
  string name;
  string kind = "animal"; // kitten
}
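The effect of the default can be sketched with a toy model of reader/writer resolution. This is not Avro's API: schemas are reduced to a dict of field name to default value, and `resolve` and `NO_DEFAULT` are hypothetical names used only for this illustration.

```python
# Toy model of reader/writer schema resolution for flat records.
# A reader schema maps field name -> default value, or NO_DEFAULT if none is declared.
NO_DEFAULT = object()

def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's schema."""
    result = {}
    for field, default in reader_schema.items():
        if field in record:
            result[field] = record[field]
        elif default is not NO_DEFAULT:
            result[field] = default      # field missing in old data: use the default
        else:
            raise ValueError(f"no value and no default for field {field!r}")
    return result

old_record = {"name": "Ms. Kitty"}                   # written with the old schema
new_reader = {"name": NO_DEFAULT, "kind": "animal"}  # new schema, kind has a default
resolved = resolve(old_record, new_reader)
print(resolved)
```

Old records stay readable because the new field carries a default; without one, the same call would raise.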
12. Remove Fields
• Old Schema:
record Pet {
  string name;
  string kind;
}
• New Schema (kind removed):
record Pet {
  string name;
}
What happens when an old reader reads this new record? Can’t find kind.
13. Remove Fields
• Old Schema:
record Pet {
  string name = "";
  string kind = "animal";
}
• New Schema (kind removed):
record Pet {
  string name = "";
}
When an ‘old’ reader encounters a new record, the default value will be used.
Protip: Always provide default field values so you don’t kill your kittens.
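One way to follow the protip is to check schemas for fields that declare no default before they ship. The sketch below works on Avro JSON schemas with the standard `json` module; the helper name `fields_without_defaults` is an assumption for illustration and handles only a flat record.

```python
import json

def fields_without_defaults(schema_json: str) -> list:
    """Return names of record fields that declare no "default" attribute."""
    schema = json.loads(schema_json)
    return [f["name"] for f in schema["fields"] if "default" not in f]

# The schema from slide 12: removing either field later would break old readers.
risky = '''{"type": "record", "name": "Pet", "fields": [
  {"name": "name", "type": "string"},
  {"name": "kind", "type": "string"}]}'''

# The schema from slide 13: every field can fall back to a default.
safe = '''{"type": "record", "name": "Pet", "fields": [
  {"name": "name", "type": "string", "default": ""},
  {"name": "kind", "type": "string", "default": "animal"}]}'''

print(fields_without_defaults(risky))
print(fields_without_defaults(safe))
```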
14. Type Promotion
• Old Schema:
record PetOwner {
  int ownerId;
  string name;
}
• New Schema:
record PetOwner {
  long ownerId;
  string name;
}
Detailed specification: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
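The numeric promotions in the schema resolution rules are one-directional, which a small lookup table can illustrate. This is a sketch covering only the numeric cases; `can_read` is a hypothetical helper, not an Avro function.

```python
# Numeric promotions from the Avro schema resolution rules:
# writer type -> reader types that are allowed to read it.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}

def can_read(writer_type: str, reader_type: str) -> bool:
    """True if a reader of reader_type can read data written as writer_type."""
    return writer_type == reader_type or reader_type in PROMOTIONS.get(writer_type, set())

int_to_long = can_read("int", "long")   # old int ownerId read by the new long schema
long_to_int = can_read("long", "int")   # the reverse direction is not a promotion
print(int_to_long, long_to_int)
```

Widening ownerId from int to long is safe for new readers of old data, but an old int reader cannot consume values written as long.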
15. Enter KijiSchema
• Human-friendly table layouts
• Uses Avro for serialization
• Supports primitive types and complex records
• Schema is stored as part of the datum
• Provides schema validation & audit trail
16. KijiTables >> Plain HBase tables
• Layout defined using JSON or DDL
• Formatted row keys
• Schemas stored in metadata table
• Schema Validation on read & write
• Basically, an enhanced HBase table
17. Example Table Definition
CREATE TABLE foo WITH DESCRIPTION 'some data'
ROW KEY FORMAT (pet_id LONG)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  COMPRESSED WITH SNAPPY,
  FAMILY info WITH DESCRIPTION 'basic information' (
    metadata CLASS com.mycompany.avro.Pet
  ));
• Visit www.kiji.org for more details
18. Beyond Client-side Validation
• Server-side Schema validation
• Ensure your reader is able to read stored records.
• Ensure you make compatible schema changes
• Ensure you don’t accidentally introduce a new schema
20. Developer Mode
• Don’t need an ALTER statement to write with a new schema
• New schemas are automatically registered on write
• Incompatible writers are rejected at run-time, so still safe
• Convenience when developing
21. Strict Mode
• New schemas must be registered with an ALTER statement.
• Incompatible readers and writers are rejected at registration time.
• Production-safe
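The difference between the two modes can be sketched with a toy registry. This is a hypothetical stand-in written for this illustration, not Kiji's API: `ToyColumn`, its methods, and the placeholder compatibility rule are all assumptions.

```python
class ToyColumn:
    """Hypothetical stand-in for one column's registered schemas (not Kiji's API)."""

    def __init__(self, mode: str, is_compatible):
        self.mode = mode                    # "developer" or "strict"
        self.is_compatible = is_compatible  # callable(new_schema, existing_schema) -> bool
        self.registered = []

    def register(self, schema) -> None:
        """Explicit registration, the role ALTER plays in strict mode."""
        if not all(self.is_compatible(schema, s) for s in self.registered):
            raise ValueError("incompatible schema rejected")
        self.registered.append(schema)

    def write(self, schema) -> None:
        """Accept a write made with the given schema, or raise."""
        if schema in self.registered:
            return
        if self.mode == "strict":
            raise ValueError("unregistered schema; register it first")
        self.register(schema)               # developer mode: register on write

same_type = lambda a, b: a == b             # placeholder compatibility rule

dev = ToyColumn("developer", same_type)
dev.write("int")                            # auto-registered on first write

strict = ToyColumn("strict", same_type)
try:
    strict.write("int")
    strict_result = "accepted"
except ValueError:
    strict_result = "rejected"
print(dev.registered, strict_result)
```

In both modes the compatibility check still runs; strict mode only moves it from the first write to an explicit registration step.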
22. ALTER Examples
• ALTER TABLE t SET VALIDATION = STRICT;
• ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;
In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".
23. Avoid Common Pitfalls w/ KijiSchema Validation
1. Record has string field with default value
2. Field removed (compatible)
3. New field with same name added but different type (compatible from perspective of 2)
4. Incompatible between 1 and 3!
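The four steps can be replayed with a toy pairwise check. This is a simplified model, not Avro's or Kiji's actual resolution logic: schemas are dicts of field name to (type, default), type promotion is ignored, and the `int` type chosen for step 3 is an assumption since the slide does not name one.

```python
NO_DEFAULT = object()

def can_read(reader: dict, writer: dict) -> bool:
    """Toy check: can `reader` decode records written with `writer`?

    Schemas map field name -> (type name, default or NO_DEFAULT).
    Type promotion is deliberately ignored to keep the sketch small.
    """
    for field, (r_type, default) in reader.items():
        if field in writer:
            if writer[field][0] != r_type:
                return False          # same name, different type
        elif default is NO_DEFAULT:
            return False              # missing field and nothing to fall back on
    return True

v1 = {"name": ("string", ""), "kind": ("string", "animal")}  # step 1
v2 = {"name": ("string", "")}                                # step 2: kind removed
v3 = {"name": ("string", ""), "kind": ("int", 0)}            # step 3: kind re-added as int

print(can_read(v2, v1), can_read(v1, v2))   # 1 <-> 2 is compatible both ways
print(can_read(v3, v2), can_read(v2, v3))   # 2 <-> 3 is compatible both ways
print(can_read(v1, v3), can_read(v3, v1))   # 1 <-> 3 cannot read each other
```

Checking only against the previous schema lets step 3 through; checking against every schema that has ever been written to the column catches it.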
24.
• Apache v2 Licensed Open Source
• Includes KijiSchema as well as components for writing
  • MapReduce
  • Hive Adapter
  • Scalding flows for data science
  • REST API supporting on-demand computation for real-time web applications
25. BentoBox
• Complete development environment for Kiji & HBase
• Single process Hadoop & HBase cluster
• We accept community contributions
• Try it today: www.kiji.org
• User Mailing List
• Developer Mailing List