This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
5. Small vs. Big Data
• Matrix algebra solution of linear regression
  – Number of input records: m; number of features: n
  – In-memory data structure: O(m·n)
  – Computation: O(m·n²)
• Gradient descent
  – Each iteration: sequential read of the input data
  – In memory: O(n)
  – The number of iterations is bounded by a constant.
  – Parallel computation: input processing can be partitioned, and the
    gradients aggregated (see the sketch below).
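To make the contrast concrete, here is a minimal sketch of the gradient-descent approach for linear regression. The record_stream callable, learning rate, and iteration count are illustrative assumptions, not from the original:

import numpy as np

def gradient_descent(record_stream, n, lr=0.01, iterations=100):
    """One sequential pass over the data per iteration; only O(n) state in memory."""
    w = np.zeros(n)
    for _ in range(iterations):          # iteration count bounded by a constant
        grad = np.zeros(n)
        m = 0
        for x, y in record_stream():     # sequential read of the input data
            grad += (w.dot(x) - y) * x   # per-record gradient contribution
            m += 1
        w -= lr * grad / m               # apply the aggregated gradient
    return w

# Illustrative usage on a tiny in-memory "stream": y = 2*x0 + 1*x1
data = [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), 1.0),
        (np.array([1.0, 1.0]), 3.0)]
print(gradient_descent(lambda: iter(data), n=2, lr=0.1, iterations=200))

Because each record's contribution to grad is independent, the inner loop can be partitioned across workers and the partial gradients summed, which is exactly the parallelization noted above.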
6. A Real Life Example
[Diagram: Google Ads pipeline. The ads server receives a query, serves an ad
impression, and records clicks; impressions and clicks are written to logs.
Stream processing of impressions/clicks feeds the billing and budget service.
Daily/hourly batch processing of the logs builds a base stats table, and OLAP
processing over it serves daily stats to the AdWords stats console.]
Base Stats Table (example):
  hour | imps | clicks | costs
  8:00 | 100  | 10     | $20
  9:00 | 120  | 9      | $16
7. Processing Types: Programming Patterns and Example Usages
• Batch processing (offline data processing)
  – Pattern: data is gathered and processed as a group at one time; jobs run
    to completion; data may be out of date.
  – Example usages: ad-hoc/one-time data analysis; periodic reprocessing of
    the complete data set.
• Iterative batch (offline data processing)
  – Pattern: the data stream is chunked; each chunk is batch-processed; the
    effect of each batch is applied incrementally; runs continuously.
  – Example usages: chunked additive processing; iterative machine learning
    algorithms (e.g., stochastic gradient descent).
• Stream processing
  – Pattern: real-time or near-real-time processing of events, one event at a
    time; runs continuously.
  – Example usages: real-time alerts on time-series data.
• OLAP
  – Pattern: online analytic processing for interactive queries; processing
    logic is distributed with the data.
  – Example usages: interactive analysis of log data; business/operational
    stats queries over a base table.
8. Processing Types: Technologies, Core Concepts, and Major Functionality
• Batch processing
  – Hadoop MapReduce
  – Hive: SQL over MapReduce; compiles SQL syntax to MR programs.
  – Google Flume: data-processing DAG over MapReduce; compiles a computation
    DAG to MR.
• Iterative batch
  – Spark: repeated processing of a working set.
• OLAP
  – Google Dremel: OLAP SQL over a columnar DB; OLAP on complex structured
    data.
  – Apache Impala: OLAP SQL over Hadoop data.
• Stream processing
  – Google MillWheel: computation DAG for event processing.
  – Apache Storm: computation DAG for event processing.
9. Offline Data Processing Topology
• DAG (Directed Acyclic Graph)
[Diagram: a computation DAG with multiple data inputs (Data Input1, Data
Input2) flowing through processing nodes to multiple outputs (Output1,
Output2).]
11. MapReduce
• A DAG composed of Mappers and Reducers
• A Mapper operates on each input record:
  – Emits one or more {key, data} pairs.
  – Mapper output is resharded (shuffled) by key.
• A Reducer operates on the list of mapper output records sharing the same
  key.
[Diagram: sharded data input → Mappers → shuffle → merge/sort → Reducers →
sharded data output.]
12. The Problem of “Word Count”
• Find the 100 most frequent English words on the Web.
  – Assume you have already downloaded the web documents.
13. A MapReduce Solution
• MR1
  – Input: arbitrarily sharded downloaded web HTML documents.
  – Mapper:
    • Input record: a document.
    • Parse the document and chunk it into words.
    • Output records: {key=word, data=word_freq_in_input_record}
  – Reducer:
    • For the batch of input records (of the same key), add up word_freq.
    • Output records: {word, freq} // you don’t have to output every record here!
• MR2
  – Input: output from MR1.
  – Mapper:
    • For each input record {word, freq}, output {key=freq, data=word}.
  – Reducer:
    • Input records are sorted by freq; output only the top 100 records by
      freq (see the sketch below).
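A minimal local simulation of these two MR jobs, assuming simple letter-based tokenization; a real deployment would express the same mapper/reducer logic through the Hadoop MR API:

import re
from collections import Counter
from itertools import groupby
from operator import itemgetter

def mr1_mapper(document):
    # Parse the document, chunk into words; emit {key=word, data=freq_in_doc}.
    return Counter(re.findall(r"[a-z']+", document.lower())).items()

def run_word_count(documents, top_n=100):
    # MR1: map each document, then shuffle (group by word) and sum the freqs.
    pairs = sorted(kv for doc in documents for kv in mr1_mapper(doc))
    word_freqs = [(word, sum(f for _, f in grp))
                  for word, grp in groupby(pairs, key=itemgetter(0))]
    # MR2: map {word, freq} -> {key=freq, data=word}; the shuffle sorts by
    # key, so the reducer sees records ordered by freq and keeps the top N.
    by_freq = sorted(((f, w) for w, f in word_freqs), reverse=True)
    return by_freq[:top_n]

print(run_word_count(["the cat and the hat", "the cat sat"], top_n=3))
# [(3, 'the'), (2, 'cat'), (1, 'sat')]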
14. MapReduce
• Appears to be inefficient:
  – So much data read/write and computation for even the simplest problems.
• Yet data read/write is very efficient – sequential IO:
  – Data, including intermediate data, is stored in a distributed file system
    that is optimized for sequential IO.
• The framework implicitly handles sharding, sorting, and merge-sorting.
• Simple programming model of Mapper/Reducer.
• Optimizations:
  – Combiner: a local reducer for each mapper.
15. MapReduce
• Inefficiencies:
  – Processing time is determined by the slowest mapper and reducer.
    • No reducer can start until all mappers finish.
    • One reducer may take a long time to finish if too much data lands in
      one reducer shard.
16. Hive: A data warehouse on Hadoop
Based on the Facebook team’s paper
17. Motivation
• Yahoo worked on Pig to facilitate application deployment on Hadoop.
  – Their need was mainly focused on unstructured data.
• Simultaneously, Facebook started working on deploying warehouse solutions
  on Hadoop, which resulted in Hive.
  – The size of data being collected and analyzed in industry for business
    intelligence (BI) is growing rapidly, making traditional warehousing
    solutions prohibitively expensive.
18. Hadoop MR
• MR is very low-level and requires customers to write custom programs.
• Hive supports queries expressed in a SQL-like language called HiveQL, which
  are compiled into MR jobs executed on Hadoop.
• Hive also allows custom MR scripts to be plugged into queries.
• It also includes a MetaStore that contains schemas and statistics, which
  are useful for data exploration, query optimization, and query compilation.
• At Facebook, the Hive warehouse contains tens of thousands of tables,
  stores over 700 TB, and is used for reporting and ad-hoc analyses by more
  than 200 Facebook users.
20. Data model
• Hive structures data into well-understood database concepts such as tables,
  rows, columns, and partitions.
• It supports primitive types: integers, floats, doubles, and strings.
• Hive also supports:
  – Associative arrays: map<key-type, value-type>
  – Lists: list<element-type>
  – Structs: struct<field-name: field-type, …>
• SerDe: a serialize/deserialize API used to move data in and out of tables.
21. Query Language (HiveQL)
• Subset of SQL
• Metadata queries
• Limited equality and join predicates
• No inserts into existing tables (to preserve the write-once-read-many
  property)
  – An entire table can be overwritten.
22. Wordcount in Hive
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
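The wc_mapper.py and wc_reduce.py scripts are not shown in the original; below is a minimal sketch of what they might look like, assuming Hive streams rows to them as tab-separated lines on stdin and reads (word, cnt) lines back, with rows clustered by word on the reduce side:

import sys

# --- wc_mapper.py: emit {word, 1} for each word in doctext ---
def wc_map(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

# --- wc_reduce.py: rows arrive clustered by word; sum the counts ---
def wc_reduce(stdin=sys.stdin, stdout=sys.stdout):
    current, count = None, 0
    for line in stdin:
        word, cnt = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{count}\n")
            current, count = word, 0
        count += int(cnt)
    if current is not None:
        stdout.write(f"{current}\t{count}\n")   # flush the last word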
23. Session/Timestamp Example
Construct each session as a list of events sorted by timestamp (tstamp):
FROM (
  FROM session_events_table
  SELECT sessionid, tstamp, data
  DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
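session_reducer.sh is likewise not shown; this is a minimal Python sketch of the logic such a reducer might implement, relying on rows arriving distributed by sessionid and sorted by tstamp so that each session forms one contiguous, time-ordered run:

import sys

current, events = None, []
for line in sys.stdin:
    sessionid, tstamp, data = line.rstrip("\n").split("\t")
    if sessionid != current:
        if current is not None:
            print(f"{current}\t{events}")    # emit the assembled session
        current, events = sessionid, []
    events.append((tstamp, data))            # already sorted by tstamp
if current is not None:
    print(f"{current}\t{events}")            # flush the last session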
24. Data Storage
• Tables are logical data units; table metadata associates the data in a
  table with HDFS directories.
• HDFS namespace: table → HDFS directory; partition → HDFS subdirectory;
  buckets → subdirectories within a partition.
• /user/hive/warehouse/test_table is an HDFS directory.
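For example (the partition key ds and its value here are hypothetical), a
daily partition of this table would map to the subdirectory
/user/hive/warehouse/test_table/ds=2018-07-23, with its buckets stored
underneath.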
26. Architecture
• Metastore: stores the system catalog.
• Driver: manages the life cycle of a HiveQL query as it moves through Hive;
  also manages the session handle and session statistics.
• Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce
  tasks.
• Execution engine: executes the tasks in proper dependency order; interacts
  with Hadoop.
• HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other
  applications.
• Client components: CLI, web interface, JDBC/ODBC interface.
• Extensibility interfaces include SerDe, User Defined Functions, and User
  Defined Aggregate Functions.
28. Hive Usage in Facebook
• Hive and Hadoop are used extensively at Facebook for many different kinds
  of operations.
• 700 TB ≈ 2.1 petabytes after 3x replication!
• Think of other application models that can leverage Hadoop MR.
29. Spark & RDD: Cluster Computing w.
Working Set
29
30. RDD
• A resilient distributed dataset (RDD) is a read-only collection
of objects partitioned across a set of machines that can be
rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• Constraints on RDDs:
  – Immutable after generation.
  – Generated by a set of coarse-grained operations.
• An RDD serves as the input/output/intermediate data of a computation DAG.
31. RDD -- construction
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the
driver program, which means dividing it into a number of
slices that will be sent to multiple nodes.
• By transforming an existing RDD.
– A dataset with elements of type A can be transformed into a dataset
with elements of type B using an operation called flatMap, which
passes each element through a user-provided function of type A ⇒
List[B].
– Other transformations can be expressed using flatMap, including
map (pass elements through a function of type A ⇒ B) and filter
(pick elements matching a predicate).
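A minimal PySpark sketch of the three construction routes; the HDFS path is an illustrative assumption (the examples in the original text are in Scala):

from pyspark import SparkContext

sc = SparkContext(appName="RddConstructionSketch")

# 1. From a file in a shared file system such as HDFS.
lines = sc.textFile("hdfs:///path/to/docs.txt")

# 2. By "parallelizing" a driver-side collection into slices.
nums = sc.parallelize([1, 2, 3, 4], numSlices=2)

# 3. By transforming an existing RDD.
words = lines.flatMap(lambda line: line.split())      # A => List[B]
lengths = words.map(len)                              # A => B
long_words = words.filter(lambda w: len(w) > 5)       # pick by predicate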
32. RDD -- persistence
• By default, RDDs are lazy and ephemeral.
– Partitions of a dataset are materialized on demand when they are
used in a parallel operation (e.g., by passing a block of a file through
a map function), and are discarded from memory after use.
• Alter persistence:
– cache action: leaves the dataset lazy, but hints that it should be kept
in memory after the first time it is computed, because it will be
reused.
– save action: evaluates the dataset and writes it to a distributed
filesystem such as HDFS. The saved version is used in future
operations on it.
33. RDD – parallel operations
• reduce: Combines dataset elements using an associative
function to produce a result at the driver program.
• collect: Sends all elements of the dataset to the driver
program. For example, an easy way to update an array in
parallel is to parallelize, map and collect the array.
• foreach: Passes each element through a user provided
function. This is only done for the side effects of the
function (which might be to copy data to another system
or to update a shared variable as explained below).
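A minimal PySpark sketch of these three parallel operations on illustrative data:

from pyspark import SparkContext

sc = SparkContext(appName="ParallelOpsSketch")
nums = sc.parallelize([1, 2, 3, 4])

total = nums.reduce(lambda a, b: a + b)         # associative combine -> 10
doubled = nums.map(lambda x: 2 * x).collect()   # bring all elements to driver
nums.foreach(lambda x: print(x))                # side effects only, on workers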
34. Programming Model
• A Spark driver program defines a computation DAG:
  – Input/output/intermediate data as RDDs.
  – Operations on RDDs: transformations (e.g., filter) and parallel
    operations (e.g., foreach, reduce).
  – Each operation takes a user function supplying the exact logic.
• Scala: a concise JVM language well suited to operations on data sets.
  – Operators and syntax for collection/vector manipulation.
35. Spark Shared Variables
• map, filter, and reduce operations can refer to variables in the scope
  where they are created.
• Restricted types of shared variables:
  – Broadcast variables: if a large read-only piece of data (e.g., a lookup
    table) is used in multiple parallel operations, it is preferable to
    distribute it to the workers only once instead of packaging it with every
    closure.
  – Accumulators: variables that workers can only “add” to using an
    associative operation, and that only the driver can read.
    • Usage: counters as in MapReduce, parallel sums (see the sketch below).
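A minimal PySpark sketch of both shared-variable types; the lookup table and input words are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="SharedVariablesSketch")

# Broadcast variable: ship a read-only lookup table to the workers once.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Accumulator: workers can only add to it; only the driver can read it.
bad_records = sc.accumulator(0)

def score(word):
    if word not in lookup.value:
        bad_records.add(1)        # a counter, as in MapReduce
        return 0
    return lookup.value[word]

total = sc.parallelize(["a", "b", "x", "c"]).map(score).reduce(lambda x, y: x + y)
print(total, bad_records.value)   # 6 1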
36. Programming Examples – Text Search
• Count the lines containing errors in a large log file stored in HDFS.
• Looks just like MapReduce, except the RDD of errors can be cached:
  – val cachedErrs = errs.cache()
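A PySpark rendering of the same example (the original snippet is Scala); the log path and the 'ERROR'/'HDFS' filters are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="TextSearchSketch")

lines = sc.textFile("hdfs:///path/to/log.txt")
errs = lines.filter(lambda line: "ERROR" in line)
cached_errs = errs.cache()          # hint: keep in memory after first compute
count = cached_errs.count()         # materializes and caches the RDD
hdfs_errs = cached_errs.filter(lambda l: "HDFS" in l).count()  # reuses cache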
37. Programming Examples – Logistic Regression
Train a logistic regression model using gradient descent.
• Model: y = logit(W·X)
• Points: a set of pairs (y, X), where X is a feature vector.
• Start with a vector W of random values.
• Each iteration: adjust W based on the accumulated differences between y and
  y_exp over all points.
39. In Each Iteration of Gradient Descent
[Diagram: the driver program sends W to every node; each node computes grad
over its partition of points; the per-node grads flow back and are
aggregated; the driver applies W -= grad.value.]
• At start: W is shared from the driver to all nodes.
• At end: grad is propagated from each node and aggregated; the aggregated
  grad is applied to W.
• W is a broadcast variable.
• grad is an aggregated (accumulator) variable.
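A minimal PySpark sketch of this loop, following the classic Spark logistic-regression example; the file path, iteration count, and input format are illustrative assumptions, and the per-partition gradients are combined here with reduce rather than an accumulator:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="LogisticRegressionSketch")
ITERATIONS = 10

# Assumed line format: label (+1/-1) followed by feature values.
def parse_point(line):
    parts = [float(v) for v in line.split()]
    return parts[0], np.array(parts[1:])          # (y, X)

points = sc.textFile("hdfs:///path/to/points.txt").map(parse_point).cache()
num_features = len(points.first()[1])
w = np.random.rand(num_features)                  # start with random W

for _ in range(ITERATIONS):
    w_bc = sc.broadcast(w)                        # W is a broadcast variable
    # Each partition computes its gradient contributions; contributions are
    # summed at the driver (the "aggregated grad").
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w_bc.value.dot(p[1]))) - 1.0)
                  * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= grad                                     # driver applies the grad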
56. Traditional OLAP Cube
• Cube operations: slice, dice, drill-up/down, rollup.
• Implemented with OLAP queries:
  – E.g.: SELECT dim1, dim2, agg(met1), agg(met2) FROM BaseTable WHERE
    selector(dim3) GROUP BY dim1, dim2
  – SELECT publisher, SUM(click) FROM BaseTable WHERE advertiser=“Google”
    GROUP BY publisher
• Time-series OLAP data:
  – Aggregate metrics over a time epoch: sum, avg, median, count, …
  – E.g.: aggregate data from each point of “every 5 secs” to “every 5 mins”.
57. OLAP Solutions
• Relational OLAP (ROLAP)
  – Uses a relational DB with extensive indexes.
  – Expensive due to large index size, but supports online data updates.
• Multidimensional OLAP (MOLAP)
  – Builds data cubes based on predefined dimensions during the design phase.
  – Not flexible for new analytical requirements.
  – Good for predefined BI reporting applications.
• Full scan
  – Ad-hoc queries require a full scan (without indexes).
  – Relies on disk IO throughput, distributed data storage, and parallel
    query processing.
  – Columnar storage.
  – No support for online data updates.
58. OLAP over “Big Data”
• The base table has a large number of records.
• Requirements on OLAP queries:
  – Low latency
  – Low concurrency
• Horizontal scaling:
  – Partition the base table.
  – No or limited support for joins; the base table is denormalized.
59. Google’s Dremel
“A scalable (over data set), interactive ad-hoc query
system for analysis of read-only nested data.”
-- http://research.google.com/pubs/archive/36632.pdf
60. Google’s Dremel: Data Model
• Data model: Protocol Buffers (nested records with optional and repeated
  fields).
• dom is any value of the basic types: string, integer, float, …
61. Google’s Dremel: Query Model
• Interactive ad-hoc queries, more than just OLAP.
• If repeated nested fields are used in a query, the base table is
  denormalized, e.g.:
  DocId | Name.Url   | Name.Language.Code
  10    | ‘http://a’ | en-us
  10    | ‘http://a’ | en
62. Google’s Dremel: Data Storage
• Columnar storage
  – Separates a record into column values and stores each column’s values on
    a different storage volume.
  – Traditional databases normally store the whole record on one volume.
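A tiny sketch of the two layouts with illustrative records:

rows = [
    {"DocId": 10, "Url": "http://a", "Code": "en-us"},
    {"DocId": 20, "Url": "http://b", "Code": "en"},
]

# Row store: each whole record sits together on one volume.
# Column store: each column's values are stored together, so a query touching
# one column scans only that column's data.
columns = {key: [r[key] for r in rows] for key in rows[0]}
# {'DocId': [10, 20], 'Url': ['http://a', 'http://b'], 'Code': ['en-us', 'en']}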
63. Google’s Dremel: Data Storage
• Columnar store advantages:
  – Traffic minimization: only the column values required by each query are
    scanned and transferred during query execution.
    • For example, the query “SELECT top(title) FROM foo” would access the
      title column values only.
  – Higher compression ratio: one study reports that columnar storage can
    achieve a compression ratio of 1:10, whereas ordinary row-based storage
    compresses at roughly 1:3.
    • This is because each column holds similar values, especially when the
      cardinality of the column (the variation of possible column values) is
      low.
64. Google’s Dremel: Data Storage
• Columnar store disadvantages:
  – Record insertion/update needs to fan out to multiple volumes.
    • Dremel does not support online updates/insertions.
    • Table partitions are generated in batch.
      – Daily log table: partitioned by time (e.g., hourly).
66. Google’s Dremel: Query Execution
• Example query execution, with base table T partitioned into {T_1, …, T_n}:
  – Original query:
    SELECT A, COUNT(B) FROM T GROUP BY A
  – The query is divided and the result sets aggregated through the serving
    tree:
    SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A
  – Query at the bottom (directly on each tablet):
    R_i = SELECT A, COUNT(B) AS c FROM T_i GROUP BY A
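A minimal local simulation of this serving-tree rewrite, with illustrative tablet data: each tablet computes its partial GROUP-BY count R_i, and the root merges the partials by summing:

from collections import Counter

tablets = [                      # T partitioned into T_1 ... T_n
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("c", 5)],
]

# Leaf level: R_i = SELECT A, COUNT(B) AS c FROM T_i GROUP BY A
partials = [Counter(a for a, _ in t) for t in tablets]

# Root: SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A
result = sum(partials, Counter())
print(dict(result))              # {'a': 3, 'b': 1, 'c': 1}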
67. Dremel: Some Applications inside Google
• Ad-hoc analysis of event logs from many services
• Analysis of crawled web documents
• Tracking install data for applications in the Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google’s distributed build system
• Disk I/O statistics for hundreds of thousands of disks
• Resource monitoring for jobs run in Google’s data centers
• Symbols and dependencies in Google’s codebase
70. Apache Impala
• Motivated by Dremel
• More general data model and query support
• Fits into the Apache/Hadoop ecosystem
71. Druid (open source)
• Motivated by Dremel
• Dedicated to OLAP cubes over time-series data
• Base table: multi-dimensional time series data
• Query: select-aggregate-groupby for cube operations
72. Exercise: Design Google Ads’ Data Processing Pipeline
• Ad: {keywords, bid, budget}
  – Keywords: match an ad with an impression (a search query).
  – Bid: the max price an ad pays for an impression.
  – Budget: caps an ad’s total daily spending.
• Google Ads system:
  – 80 TB of log data per day: impression logs, click logs, conversion logs.
  – The AdWords website shows ad statistics to advertisers.
  – An ad is matched with an impression based on keywords.
  – Ads compete for impressions in a real-time auction:
    • Expected revenue for showing an ad = bid × pCTR.
74. Exercise: Design Google Ads’ Data Processing Pipeline
• Problems:
  – Generate hourly/daily/weekly stats: {timeperiod, adId, keyword, imps,
    ctr, cvr}.
  – The website dynamically shows drill-down and aggregate stats.
    • E.g.: “total spending of this ad in the past week”.
    • E.g.: “CTR of this ad on this keyword yesterday”.
  – Bill ad spending at the end of the day.
  – Stop ads when the budget is reached.
  – Predict CTR as a model of {ad, impression} (for use in the auction).