4. A LOT OF OUR
DATA IS IN MONGO
• MongoDB is a fantastic application
database
• uses BSON - like JSON, but has a
binary representation
• MongoDB is schemaless, but has
indexed queries and other
features that are nice for
applications
5. APPLICATION DBS
SUCK FOR ANALYSIS
• well, sometimes. relational
databases are OK
• MongoDB is awful (for this)
• no joins
• scans are painful
• no declarative query language
7. V1:
TSV + IMPALA
• threw together a Hadoop cluster
on the developer boxes
• script dumped models “nightly” to
TSV files in HDFS
• janky script output the schema
from your models
• query from Impala
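The nightly dump step above can be sketched roughly like this — a minimal Python version of that kind of janky script, with a hypothetical hand-maintained field list standing in for the real models (names and fields are invented for illustration):

```python
import csv
import io

# Hand-maintained field list -- the janky "schema from your models" part.
# Field names here are hypothetical.
SCHEMA = ["_id", "amount", "currency"]

def dump_tsv(docs, out):
    """Write dict-shaped documents (e.g. from a Mongo cursor) as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(SCHEMA)
    for doc in docs:
        # Missing fields become empty strings; unknown fields are dropped.
        writer.writerow(doc.get(field, "") for field in SCHEMA)

docs = [{"_id": "ch_1", "amount": 100, "currency": "usd"},
        {"_id": "ch_2", "amount": 250}]
buf = io.StringIO()
dump_tsv(docs, buf)
print(buf.getvalue())
```

Note the silent data loss: any field not in the hand-written list never makes it to HDFS, which is one reason this approach didn’t last.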
8. ASIDE: IMPALA IS
PRETTY COOL
• developed by Cloudera
• absurdly fast queries over HDFS
• SQL is great
• most of our questions are ad-hoc
9. A NICE
EXPERIMENT, BUT...
• schema translation is hard
• SLOW SLOW SLOW
• TSV is not a great format
• script never runs
• not production data
10. V2:
MONGO -> HBASE
• Impala can query HBase, I think?
• @nelhage wrote MoSQL - let’s do
the same thing, but put the data
in HBase!
• translating from one k/v store to
another is easier
15. THEN, QUERY IT
WITH IMPALA...UM
• wait, impala can’t actually query
HBase effectively
• 30-40x slower over the same
data
• limiting factor is HBase scan
speed, I think
16. LOST IN
TRANSLATION
• our schema problem is still there!
• BSON is typed, but HBase is just
strings
• nested hashes still don’t work
• lists???
• what is the canonical schema?
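A sketch of why the k/v translation is lossy — flattening a typed, nested BSON-style document into HBase-style string columns throws away types and forces awkward encodings for hashes and lists (the flattening scheme here is hypothetical, just one plausible way to do it):

```python
def flatten(doc, prefix=""):
    """Flatten a nested document into HBase-style (column, value) string
    pairs. Everything becomes a string, so type information is lost --
    exactly the problem described above."""
    pairs = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.update(flatten(value, column + "."))
        elif isinstance(value, list):
            # Lists have no natural k/v encoding; index suffixes are one hack.
            for i, item in enumerate(value):
                pairs[f"{column}.{i}"] = str(item)
        else:
            pairs[column] = str(value)
    return pairs

doc = {"amount": 100, "card": {"brand": "visa"}, "tags": ["a", "b"]}
print(flatten(doc))
# {'amount': '100', 'card.brand': 'visa', 'tags.0': 'a', 'tags.1': 'b'}
```

After flattening, the integer `100` and the string `"100"` are indistinguishable — there is no canonical schema to recover them from.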
18. V3:
PARQUET + THRIFT
• instead of storing k/v pairs, just
store the raw BSON blobs
• write your MR jobs against HBase
if you want up-to-date data
• also periodically dump out
Parquet files
• use thrift definitions to manage
schema
19. USING THRIFT AS
SCHEMA
• thrift is a nice way to define what
fields we expect to be in the
BSON
• in most cases, we can do the
translation automatically
• decode on the backend, instead
of during replication
• no information loss
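A thrift definition for a collection might look something like this (the struct and field names are hypothetical, not the actual schema from the talk):

```thrift
// Hypothetical definition for a Mongo "charges" collection.
// optional fields tolerate documents written before a field existed.
struct Charge {
  1: optional string _id,
  2: optional i64 amount,
  3: optional string currency,
  4: optional map<string, string> metadata,
}
```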
20. GENERATE THRIFT
DEFINITIONS?
• thrift still isn’t the canonical
schema for our application - that
exists in our ODM
• wrote a quick ruby script to
generate thrift definitions from
our application models
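The idea behind that generator script, sketched here in Python rather than Ruby, with a hypothetical type mapping — walk each model’s declared fields and emit a thrift struct:

```python
# Sketch of generating a thrift struct from an application model's
# field declarations (hypothetical; the talk's version was a quick
# Ruby script walking ODM classes).
TYPE_MAP = {str: "string", int: "i64", float: "double", bool: "bool"}

def to_thrift(name, fields):
    """fields: ordered mapping of field name -> Python type."""
    lines = [f"struct {name} {{"]
    for i, (field, py_type) in enumerate(fields.items(), start=1):
        lines.append(f"  {i}: optional {TYPE_MAP[py_type]} {field},")
    lines.append("}")
    return "\n".join(lines)

print(to_thrift("Charge", {"_id": str, "amount": int, "paid": bool}))
```

One subtlety such a script has to get right: thrift field numbers must stay stable across regenerations, or previously written data stops deserializing.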
22. IMPALA <3
PARQUET
• more glue can automatically
import parquet files into Impala
• Impala and parquet are designed
to work well with each other
• nested structs don’t work yet =(
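The import glue likely boils down to DDL along these lines — Impala can infer column types from an existing Parquet file and then query the whole directory (table and paths here are made up):

```sql
-- Hypothetical table/paths, shown only to illustrate the idea.
CREATE EXTERNAL TABLE charges
  LIKE PARQUET '/warehouse/charges/part-00000.parquet'
  STORED AS PARQUET
  LOCATION '/warehouse/charges/';
```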
23. SCALDING <3
PARQUET
• we use scalding for a lot of
MapReduce stuff
• added ParquetSource to scalding
to make this easy (source and
sink)
24. THIS WORKS FOR
ANY DATA
• use thrift to define an
intermediate or derived data
type, and you get, for free:
• serialization using parquet
• easy MR jobs with scalding
• ad-hoc querying with Impala