Kite SDK: Working with Datasets
2. What problem is Kite solving?
©2014 Cloudera, Inc. All rights reserved.
• Accessibility
• Hadoop is flexible, but low level
• Should be easy to use, without being an expert
3. Kite SDK
• A set of off-the-shelf tools
• Based on experience and best practices
• Lets you focus on your problem
• Helps you solve new challenges
4. Kite Datasets: Motivation
Focus on using your data, not managing it
• You shouldn’t have to maintain data files
• This is the first thing you need
7. Kite Datasets: Motivation
[Diagram: applications stacked on storage layers. Labels: Application, Database, Data files, Kite Data, HBase. Annotation: "Maintained by Kite".]
8. Kite Datasets: Goals
• Think in terms of data, not files
• Describe your data and Kite does the right thing
• Should work consistently across the platform
• Reliable
9. Kite Datasets: Compatibility
Project      HDFS (Avro)   HDFS (Parquet)   HBase
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *
* depends on common HBase encoding format
10. Kite Datasets: What is it?
• A high-level API for data management
  • Work with records and datasets
  • Not files, directories, or byte arrays
• Standard descriptions for records and storage
  • Schemas describe records
  • Partition strategies describe layout
• Opinionated
13. Kite Datasets: Example
1. Describe your data
   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc
2. Describe your layout
   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json
3. Create a dataset
   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json
14. Kite Datasets: Example
datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
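The layout above follows from the partition strategy: each record's timestamp is turned into year, month, and day values that name the directories. A minimal sketch of that mapping in plain Java, not Kite code. It assumes timestamps are epoch milliseconds read in UTC, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Locale;

// Hypothetical helper, not part of Kite: derives a directory path like the
// ones in the listing from a record timestamp (assumed epoch millis, UTC).
final class PartitionPathSketch {
    static String pathFor(long epochMillis) {
        LocalDate date = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        // Zero-pad month and day to match the directory names in the listing
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 874713600000L is midnight UTC on 1997-09-20
        System.out.println(pathFor(874713600000L)); // year=1997/month=09/day=20
    }
}
```

Every record whose timestamp falls on the same UTC day gets the same path, which is what lets the records be grouped into one directory.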
17. Kite HBase: Background
• Rows identified by keys, managed by HBase
• Columns are organized as cells
• Cells are identified by column family and qualifier
• The catch: everything is a byte array
family           name                  ...
row key          last        first     ...
buzz@pixar.com   Lightyear   Buzz      ...
18. Kite HBase
• Uniform interaction with HBase and HDFS datasets
• Need to make keys from records
• Need configuration to map fields to cells
19. Kite HBase: Partitioning
• Use partition strategy to define unique keys
• Kite builds the key from each record
• Kite translates keys to HBase row id bytes
20. Kite HBase: Partitioning
• Partition strategy produces a storage key
• HDFS partitioning uses a group key
    1403028411014 => (2014, 6, 17)
• HBase partitioning uses a unique key
  • Grouping is done dynamically by HBase
    1403028411014 => (1403028411014)
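The two mappings on this slide can be sketched in plain Java. This is an illustration, not Kite's implementation: it assumes the value is an epoch timestamp in milliseconds read in UTC, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of group keys versus unique keys
final class StorageKeySketch {
    // HDFS-style group key: many records share the same (year, month, day)
    static List<Integer> groupKey(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return Arrays.asList(t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    // HBase-style unique key: the field value itself identifies the row
    static List<Long> uniqueKey(long epochMillis) {
        return Collections.singletonList(epochMillis);
    }

    public static void main(String[] args) {
        System.out.println(groupKey(1403028411014L));  // [2014, 6, 17]
        System.out.println(uniqueKey(1403028411014L)); // [1403028411014]
    }
}
```

The group key deliberately discards information so that records land in the same partition, while the unique key keeps the full value so that each record can be addressed on its own.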
22. Kite HBase: Example partitioning
• Define key format from data
$ ./dataset partition-config --schema user.avsc \
    email:copy
[ {
  "source" : "email", "type" : "identity",
  "name" : "email_copy"
} ]
24. Kite HBase: Example partitioning
$ ./dataset partition-config --schema user.avsc \
    email:hash[16] email:copy
[ {
  "source" : "email", "type" : "hash",
  "buckets" : 16, "name" : "email_hash"
}, {
  "source" : "email", "type" : "identity",
  "name" : "email_copy"
} ]
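One way to read this two-part key: the hash field assigns each email to one of 16 buckets, and the copied email keeps the key unique within its bucket. A common reason for such a prefix in HBase keys is to spread rows across the key space. The sketch below assumes a simple `hashCode`-based bucket function, which is not necessarily the one Kite uses, and all names in it are hypothetical.

```java
// Hypothetical illustration of a hash-bucket prefix followed by the identity value
final class HashedKeySketch {
    static int bucket(String email, int buckets) {
        // floorMod keeps the result in [0, buckets) even for negative hash codes
        return Math.floorMod(email.hashCode(), buckets);
    }

    // Combines the bucket number and the original value, in that order
    static String storageKey(String email, int buckets) {
        return bucket(email, buckets) + "/" + email;
    }

    public static void main(String[] args) {
        System.out.println(storageKey("buzz@pixar.com", 16));
    }
}
```

Because the bucket is computed from the email itself, the same record always maps to the same key, so a lookup by email can rebuild the full key without scanning.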
25. Kite HBase: Partitioning
• Use partition strategy to define unique keys
• Kite builds the key from each record
• Kite translates keys to HBase row id bytes
• Some operations require keys
26. Kite HBase: Field mapping
• Configure the column family and qualifier for a field
{ "email": "buzz@pixar.com",
  "firstName": "Buzz", ... }
family           name                  ...
row key          last        first     ...
buzz@pixar.com   Lightyear   Buzz      ...
27. Kite HBase: Basic column mapping
column
{ "source": "firstName", "type": "column",
  "family": "name", "qualifier": "first" }
28. Kite HBase: Counter mapping
counter (can be incremented)
{ "source": "visits", "type": "counter",
  "family": "counts", "qualifier": "visits" }
29. Kite HBase: Key mapping
key (stored in the row key using identity)
{ "source": "email", "type": "key" }
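Taken together, the column, counter, and key mappings say where each field of a record is stored. A rough sketch of that idea in plain Java, using maps in place of Avro records and HBase cells. The `Mapping` class and `toCells` method are hypothetical, and values are encoded as UTF-8 strings for simplicity, which is an assumption rather than Kite's actual encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Kite code
final class FieldMappingSketch {
    // One mapping entry, mirroring the JSON fields shown on the slides
    static final class Mapping {
        final String source, type, family, qualifier;
        Mapping(String source, String type, String family, String qualifier) {
            this.source = source; this.type = type;
            this.family = family; this.qualifier = qualifier;
        }
    }

    // Returns "family:qualifier" -> bytes for every field that is not part of the key
    static Map<String, byte[]> toCells(Map<String, Object> record, List<Mapping> mappings) {
        Map<String, byte[]> cells = new LinkedHashMap<>();
        for (Mapping m : mappings) {
            if ("key".equals(m.type)) continue; // key fields live in the row key
            String text = String.valueOf(record.get(m.source));
            cells.put(m.family + ":" + m.qualifier, text.getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }

    // Builds the example user from the slides and maps it to cells
    static Map<String, byte[]> demoCells() {
        Map<String, Object> user = new HashMap<>();
        user.put("email", "buzz@pixar.com");
        user.put("firstName", "Buzz");
        user.put("visits", 315L);
        List<Mapping> mappings = Arrays.asList(
                new Mapping("email", "key", null, null),
                new Mapping("firstName", "column", "name", "first"),
                new Mapping("visits", "counter", "counts", "visits"));
        return toCells(user, mappings);
    }

    public static void main(String[] args) {
        System.out.println(demoCells().keySet()); // [name:first, counts:visits]
    }
}
```

The point of the configuration is that this translation is described once as data, so application code never builds cell names or byte arrays by hand.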
32. Kite HBase: Example
{
  "type" : "record",
  "name" : "User",
  "fields" : [ {
    "name" : "email",
    "type" : "string"
  }, ... ]
}
[
  { "source": "email",
    "type": "key" },
  ...
]
family           name                  counts   prefs
row key          last        first     visits   flash
buzz@pixar.com   Lightyear   Buzz      315      true
35. Kite HBase: Example
{
  "type" : "record",
  "name" : "User",
  "fields" : [ {
    "name" : "lastName",
    "type" : "string"
  }, ... ]
}
[
  { "source": "lastName",
    "type": "column",
    "family": "name",
    "qualifier": "last" },
  ...
]
family           name                  counts   prefs
row key          last        first     visits   flash
buzz@pixar.com   Lightyear   Buzz      315      true
36. Kite HBase: Example
{
  "type" : "record",
  "name" : "User",
  "fields" : [ {
    "name" : "visits",
    "type" : "long"
  }, ... ]
}
[
  { "source": "visits",
    "type": "counter",
    "family": "counts",
    "qualifier": "visits" },
  ...
]
family           name                  counts   prefs
row key          last        first     visits   flash
buzz@pixar.com   Lightyear   Buzz      315      true
37. Kite HBase: Interaction
• Working with a dataset does not change when it is stored in HBase
  • Readers / writers are backed by scans
  • CLI tools work:
    dataset csv-import pixar_users.csv users --use-hbase
• Additional methods on RandomAccessDataset
  • get, put, delete, increment
38. Kite HBase: Interaction using keys
RandomAccessDataset<User> users = ...;
// Build the key that identifies the record
Key buzzEmailKey = new Key.Builder()
    .add("email", "buzz@pixar.com")
    .build();
// Read the record by key, modify it, and write it back
User buzz = users.get(buzzEmailKey);
buzz.addPreference("flash", true);
users.put(buzz);
39. Kite HBase: More features
• Versioning and concurrency
  • Additional occVersion type, like a counter
  • Rejects a put if the record has changed
• Key-as-column mapping
  • Stores maps or records in a column family
  • Uses the key or field name as the qualifier
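The occVersion behavior on this slide is a form of optimistic concurrency control. A minimal in-memory sketch of the idea, not Kite's implementation: each stored record carries a version number, and a put succeeds only if the version the caller read still matches the stored one. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store illustrating a version check on put
final class OccSketch {
    static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    synchronized Versioned get(String key) {
        return rows.get(key);
    }

    // Returns false (the put is rejected) if the record changed after
    // the caller read it at expectedVersion; a missing row has version 0.
    synchronized boolean put(String key, String value, long expectedVersion) {
        Versioned current = rows.get(key);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            return false;
        }
        rows.put(key, new Versioned(value, currentVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        OccSketch store = new OccSketch();
        System.out.println(store.put("buzz@pixar.com", "v1", 0)); // true: first write
        System.out.println(store.put("buzz@pixar.com", "v2", 0)); // false: stale version
        System.out.println(store.put("buzz@pixar.com", "v2", 1)); // true: version matches
    }
}
```

A rejected writer would typically re-read the record, reapply its change, and try again, which avoids holding locks across the read and the write.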
40. Kite HBase: Conclusion
• Kite translates between objects and byte arrays
  • Configuration to define key format
  • Configuration to define how fields are stored
• Decreases the code and time required to experiment
  • Key format and column mappings are hard
  • Try out configurations to find the right one