1. 2013: year of real-time access to Big Data?
Geoffrey Hendrey
@geoffhendrey
@vertascale
2. Agenda
• Motivation
• Hadoop stack & data formats
• File access times and mechanics
• Key-based indexing systems (HBase)
• MapReduce, Hive/Pig
• MPP approaches & alternatives
3. Motivation
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can't scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, and analysts need:
  – Faster "time to answer" on Big Data
  – Rapid "find, quantify, extract"
• Solve "I don't know what I don't know"
4. Survey of real-time capabilities
• Real-time, in-situ, self-service access is the "Holy Grail" for the business analyst
• A spectrum of real-time capabilities exists on Hadoop
[Diagram: spectrum from Easy to Hard: HDFS, HBase (available in Hadoop), then Drill (proprietary)]
6. Real-time spectrum on Hadoop
Use Case | Support | Real-time
Seek to a particular byte in a distributed file | HDFS | YES
Seek to a particular value in a distributed file, by key (1-dimensional indexing) | HBase | YES
Answer complex questions expressible in code (e.g. matching users to music albums); data science | MapReduce (Hive, Pig) | NO
Ad-hoc query for scattered records given simple constraints (field[4]=="music" && field[9]=="dvd") | MPP architectures | YES
7. Hadoop Underpinned By HDFS
• Hadoop Distributed File System (HDFS)
• Inspired by the Google File System (GFS)
• Underpins every piece of data in "Hadoop"
• The Hadoop FileSystem API is pluggable
• HDFS can be replaced with another suitable distributed filesystem
  – S3
  – KFS (Kosmos)
  – etc.
11. HDFS performance characteristics
• HDFS was designed for high throughput, not low seek latency
• Best-case configurations have shown HDFS performing ~92K random reads/s [http://hadoopblog.blogspot.com/]
• Personal experience: HDFS is very robust. Fault tolerance is "real": I've unplugged machines and never lost data.
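To make the "seek to a particular byte" use case from the spectrum table concrete: HDFS exposes positioned reads (in Java, via FSDataInputStream.seek()). The same access pattern can be sketched on a local file in Python; the fixed-width record layout here is an illustrative assumption, not anything HDFS imposes:

```python
import os
import tempfile

# Write a file of fixed-width records, then fetch one record by computing
# its byte offset and seeking straight to it -- the access pattern HDFS
# supports via FSDataInputStream.seek() (shown here on a local file).
RECORD = 16  # bytes per record (illustrative)

path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"rec{i:05d}".ljust(RECORD).encode())

with open(path, "rb") as f:
    f.seek(417 * RECORD)            # jump directly to record 417
    rec = f.read(RECORD).decode().strip()
print(rec)  # rec00417
```

The point of the slide stands, though: the seek itself is cheap, but HDFS is tuned for streaming throughput, so random-read rates stay modest.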
13. MapFile for real-time access?
– Index file must be loaded by the client (slow)
– Index file must fit in the client's RAM by default
– Scans an average of 50% of the sampling interval
– Large records make scanning intolerable
– Not a viable "real world" solution for random access
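For context on the "50% of the sampling interval" point: a MapFile pairs a sorted data file with a sparse index holding every Nth key. A lookup binary-searches the in-memory index, then scans forward at most one interval, so on average half of it. A toy Python sketch of that mechanism (illustrative only, not Hadoop's actual Java classes):

```python
import bisect

class SparseIndexFile:
    """Toy stand-in for a MapFile: sorted records plus a sparse in-memory index."""

    def __init__(self, records, interval=128):
        # the "data file": records sorted by key
        self.records = sorted(records)
        self.interval = interval
        # the "index file": every `interval`-th key and its position
        self.index = [(self.records[i][0], i)
                      for i in range(0, len(self.records), interval)]
        self.index_keys = [k for k, _ in self.index]

    def get(self, key):
        # binary-search the sparse index for the last indexed key <= key
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        start = self.index[i][1]
        # then scan: up to `interval` records, ~interval/2 on average
        for k, v in self.records[start:start + self.interval]:
            if k == key:
                return v
        return None

mf = SparseIndexFile([(i, i * i) for i in range(10_000)], interval=128)
print(mf.get(4321))  # 18671041
```

The scan cost is tolerable for small records, but with large records each step of that scan is a large read, which is the "intolerable" case above.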
14. Apache HBase
• Clone of Google's Bigtable
• Key-based access mechanism
• Designed to hold billions of rows
• "Tables" stored in HDFS
• Supports MapReduce over tables, into tables
• Requires you to think hard, and commit to a key design
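Why the key design demands commitment: HBase sorts rows lexicographically by byte-encoded row key, so the key layout decides which scans are cheap forever after. A hedged Python sketch of one common composite-key pattern; the table, field names, and reversed-timestamp trick are hypothetical illustrations, not from the slides:

```python
import struct

# Hypothetical time-series table keyed by (user_id, event_time). Reversing
# the timestamp makes a user's newest event sort first, so "latest events
# for user X" becomes a short forward scan.
MAX_TS = 2**63 - 1

def row_key(user_id: str, event_ts: int) -> bytes:
    # fixed-width big-endian encoding keeps byte order == numeric order
    return user_id.encode("utf-8") + b"\x00" + struct.pack(">q", MAX_TS - event_ts)

keys = sorted([row_key("alice", 1000), row_key("alice", 2000), row_key("bob", 500)])
# alice's newest event (ts=2000) sorts before her older one (ts=1000)
print(keys[0] == row_key("alice", 2000))  # True
```

The catch the slide alludes to: this key answers "events for a user, newest first" efficiently, and essentially nothing else; a different question needs a different key, or a full scan.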
16. HBase random read performance
http://hstack.org/hbase-performance-testing/
• 7 servers, each with:
  – 8 cores
  – 32GB DDR3 RAM
  – 24 x 146GB SAS 2.0 10K RPM disks
• HBase table:
  – 3 billion records
  – 6600 regions
  – 128-256 bytes of data per row, spread across 1 to 5 columns
18. MapReduce
• "MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers" (Wikipedia)
• MapReduce is strongly tied to HDFS in Hadoop
• Systems built on HDFS (e.g. HBase) leverage this common foundation for integration with the MR paradigm
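The paradigm in miniature: the classic word count, with the map, shuffle/group, and reduce phases spelled out in plain Python (a conceptual sketch of the model, not the Hadoop API):

```python
from collections import defaultdict
from itertools import chain

# Word count as explicit map / shuffle / reduce phases.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)            # emit intermediate (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)     # group all values by key -- the framework
    for k, v in pairs:             # does this between map and reduce
        groups[k].append(v)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))

docs = ["big data big answers", "big latency"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["big"])  # 3
```

In Hadoop, the map and reduce calls run on many machines and the shuffle moves data over the network, which is exactly where the latencies discussed below come from.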
19. MapReduce and Data Science
• Many complex algorithms can be expressed in the MapReduce paradigm
  – NLP
  – Graph processing
  – Image codecs
• The more complex the algorithm, the more the Map and Reduce phases become complex programs in their own right
• Multiple MR jobs are often cascaded in succession
20. A very bad* diagram
*this diagram makes it appear that data flows through the master node.
22. Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies that are hard to improve
  – Copy
  – Shuffle, sort
  – Iterate
• Execution time depends on both the size of the input data and the number of processors available
• In a nutshell, it's a "batch process" and isn't "real-time"
23. Hive and Pig
• Run on top of MapReduce
• Provide a "table" metaphor familiar to SQL users
• Provide SQL-like (or, in places, identical) syntax
• Store a "schema" in a database, mapping tables to HDFS files
• Translate "queries" into MapReduce jobs
• No more real-time than MapReduce
24. MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
• Examples:
  – Spark: a general-purpose data science framework, sort of like real-time MapReduce for data science
  – Dremel: a columnar approach, geared toward answering SQL-like aggregations and BI-style questions
25. Spark
• Originally designed for iterative machine learning problems at Berkeley
• MapReduce does not do a great job on iterative workloads
• Spark makes more explicit use of memory caches than Hadoop
• Spark can load data from any Hadoop input source
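A sketch of the iterative-workload point: the loop below loads its dataset once and reuses it on every iteration, the access pattern Spark's in-memory caching targets, whereas chained MapReduce jobs would reread the input from HDFS on each pass (plain Python standing in for the idea, not the Spark RDD API; the toy gradient-descent fit is an illustrative assumption):

```python
# Toy iterative job: fit w so that y ~= w * x over a synthetic dataset.
# The dataset is loaded once and reused 50 times -- cheap if it sits in
# memory (Spark's cache), expensive if every pass rereads it from disk.
def load_dataset():
    # stand-in for reading a large HDFS file
    return [(x, 2.0 * x) for x in range(1000)]

data = load_dataset()          # loaded once, like caching an RDD
w = 0.0
for _ in range(50):            # each iteration reuses the in-memory data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 1e-6 * grad
print(round(w, 2))  # 2.0
```

With 50 passes over a terabyte-scale input, the difference between "reuse memory" and "reread disk" dominates total runtime, which is the gap Spark was built to close.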
27. Is Spark real-time?
• If data fits in memory, execution time for most algorithms still depends on:
  – the amount of data to be processed
  – the number of processors
• So, it still "depends"
• ...but Spark is definitely more focused on fast time-to-answer
• Interactive Scala and Java shells
28. Dremel MPP architecture
• MPP architecture for ad-hoc query on nested data
• Apache Drill is an open-source clone of Dremel
• Dremel was originally developed at Google
• Features "in situ" data analysis
• "Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations." (Dremel: Interactive Analysis of Web-Scale Datasets)
29. In Situ Analysis
• Moving Big Data is a nightmare
• In situ: the ability to access data in place
  – in HDFS
  – in Bigtable
30. Uses For Dremel At Google
• Analysis of crawled web documents
• Tracking install data for applications on Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google's distributed build system
• Etc.
31. Why so many uses for Dremel?
• On any Big Data problem or application, the dev team faces these problems:
  – "I don't know what I don't know" about the data
  – Debugging often requires finding and correlating specific needles in the haystack
  – Support and marketing often require segmentation analysis (identifying and characterizing wide swaths of data)
• Every developer/analyst wants:
  – Faster time to answer
  – Fewer trips around the mulberry bush
35. Alternative approaches?
• Both MapReduce and MPP query architectures take a "throw hardware at the problem" approach
• Alternatives?
  – Use MapReduce to build distributed indexes on the data
  – Combine columnar storage and inverted indexes to create columnar inverted indexes
  – Aim for the sweet spot for data scientists and engineers: ad-hoc queries with results returned in seconds on a single processing node
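A columnar inverted index, in miniature: each column keeps a postings map from value to the set of row IDs containing it, so a conjunctive predicate such as field[4]=="music" && field[9]=="dvd" becomes a set intersection rather than a scan. A minimal Python sketch of the idea (class and method names are my own, for illustration):

```python
from collections import defaultdict

# One postings map per column: value -> set of row ids.
# AND-queries over columns reduce to intersecting postings sets.
class ColumnarInvertedIndex:
    def __init__(self, rows):
        # rows: list of equal-length record tuples
        self.postings = defaultdict(lambda: defaultdict(set))
        for rid, row in enumerate(rows):
            for col, value in enumerate(row):
                self.postings[col][value].add(rid)

    def query(self, constraints):
        # constraints: {column_index: required_value}, AND semantics
        result = None
        for col, value in constraints.items():
            rids = self.postings[col].get(value, set())
            result = rids if result is None else result & rids
        return result or set()

rows = [("music", "cd"), ("music", "dvd"), ("books", "dvd")]
idx = ColumnarInvertedIndex(rows)
print(sorted(idx.query({0: "music", 1: "dvd"})))  # [1]
```

Because each lookup touches only the postings for the constrained columns, answer time tracks result size rather than dataset size, which is what makes seconds-on-one-node plausible for this query class.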
36. Contact Info
Email:
geoff@vertascale.com
Twitter:
@geoffhendrey
@vertascale
www:
http://vertascale.com