Big Data Processing
雷鸣
lei.m.ming@gmail.com
Agenda
• Overview of Big Data Processing
• Batch Processing: MapReduce, Hive
• Iterative Batch: Spark
• Stream Processing: Apache Storm
• OLAP over Big Data: Dremel, Druid
• Design Exercise and Student Presentations
Small vs. Big Data
• A perspective from Linear Regression
Small vs. Big Data
• Matrix algebra solution of Linear Regression
– Input record count: m; feature count: n
– In-memory data structure: O(m*n)
– Computation: O(m*n^2)
• Gradient Descent
– Each iteration: sequential read of the input data
– In-memory state: O(n)
– Iteration count is bounded by a constant.
– Parallel computation: input processing can be partitioned, and the gradient is aggregated (see the sketch below).
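A minimal NumPy sketch of this parallel pattern (illustrative, not from the slides): each partition contributes a partial gradient of size O(n), and only those partials are aggregated.

```python
import numpy as np

def partial_gradient(w, X_part, y_part):
    """Least-squares gradient contribution from one input partition."""
    return X_part.T @ (X_part @ w - y_part)   # shape (n,), independent of partition size

def gradient_descent(partitions, n, lr=1e-3, iterations=100):
    w = np.zeros(n)                            # O(n) in-memory state
    for _ in range(iterations):                # iteration count bounded by a constant
        # Each call below could run on a separate worker; only the
        # n-dimensional partial gradients need to be aggregated.
        w -= lr * sum(partial_gradient(w, X, y) for X, y in partitions)
    return w

# Toy usage: two "partitions" of a dataset with n = 2 features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 2)), rng.normal(size=100)
print(gradient_descent([(X[:50], y[:50]), (X[50:], y[50:])], n=2))
```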
A Real Life Example
[Diagram: Google Ads pipeline. The ads server handles a query, serves an ad, and logs impressions and clicks. Stream processing of impressions/clicks feeds billing and the budget service. Daily/hourly batch processing builds a base stats table; OLAP processing over the base table and daily stats serves the AdWords stats console.]

Base stats table:

hour | imps | clicks | costs
8:00 | 100 | 10 | $20
9:00 | 120 | 9 | $16
Processing Type | Programming Pattern | Example Usages
Batch Processing (offline) | Data is gathered and processed as a group at one time; jobs run to completion; data might be out of date. | Ad-hoc/one-time data analysis; periodic reprocessing of the complete data set.
Iterative Batch (offline) | The data stream is chunked; each chunk is batch-processed; the effect of each batch is applied incrementally; continuously running. | Chunked, additive processing; iterative machine learning algorithms (e.g., Stochastic Gradient Descent).
Stream Processing | Real-time or near-real-time processing of events, one event at a time; continuously running. | Real-time alerts on time-series data.
OLAP | Online Analytical Processing for interactive queries; processing logic is distributed with the data. | Interactive analysis of log data; business/operational stats queries over a base table.
Processing Type | Technology | Core Concepts | Major Functionality
Batch Processing | Hadoop | MapReduce |
Batch Processing | Hive | SQL over MapReduce | Compiles SQL syntax to MR programs
Batch Processing | Google Flume | Data processing DAG over MapReduce | Compiles a computation DAG to MR
Iterative Batch | Spark | Repeated processing of a working set |
OLAP | Google Dremel | OLAP SQL over a columnar DB | OLAP on complex structured data
OLAP | Apache Impala | OLAP SQL over Hadoop data |
Stream Processing | Google MillWheel | Computation DAG for event processing |
Stream Processing | Apache Storm | Computation DAG for event processing |
Offline Data Processing Topology
• DAG (Directed Acyclic Graph)
[Diagram: a DAG with two data inputs (Data Input1, Data Input2) flowing through processing nodes to two outputs (Output1, Output2).]
MapReduce: Where Modern Big Data Batch Processing Starts
MapReduce
• A DAG composed of Mappers and Reducers
• A Mapper operates on each input record:
– Record → one or more {key, data}
– Mapper output is resharded by key
• A Reducer operates on the list of mapper-output records that share the same key.
[Diagram: sharded data input feeds several Mappers; each Mapper's output is shuffled by key and merge-sorted into Reducers, which write sharded data output.]
The Problem of “Word Count”
• Find the top 100 most frequent English words on the Web.
– Assume the web documents have already been downloaded.
A MapReduce Solution (a local simulation follows below)
• MR1
– Input: arbitrarily sharded, downloaded web HTML documents.
– Mapper:
• Input record: a document.
• Parse the document and chunk it into words.
• Output records: {key=word, data=word_freq_in_input_record}
– Reducer:
• For the batch of input records sharing the same key, add up word_freq.
• Output records: {word, freq} // you don’t have to output every record here!
• MR2
– Input: output from MR1
– Mapper:
• For each input record {word, freq}, output {key=freq, data=word}
– Reducer:
• Input records arrive sorted by freq; output only the top 100 records.
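A minimal local simulation of MR1 in Python (illustrative names, not from the slides; a real deployment would run the same logic as a Hadoop job). The dict-of-lists stands in for the shuffle and merge-sort.

```python
import re
from collections import Counter, defaultdict

def mr1_mapper(document):
    # One input record = one document; emit {key=word, data=freq_in_document}.
    for word, freq in Counter(re.findall(r"[a-z]+", document.lower())).items():
        yield word, freq

def mr1_reducer(word, freqs):
    # All records passed here share the same key; add up per-document freqs.
    yield word, sum(freqs)

docs = ["the cat sat", "the dog sat on the cat"]
shuffled = defaultdict(list)          # stands in for shuffle + merge-sort by key
for doc in docs:
    for word, freq in mr1_mapper(doc):
        shuffled[word].append(freq)

totals = dict(kv for word, freqs in shuffled.items()
              for kv in mr1_reducer(word, freqs))
print(totals)                         # {'the': 3, 'cat': 2, 'sat': 2, ...}
```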
MapReduce
• Appears to be inefficient:
– A lot of data read/write and computation even for the simplest problems.
• Yet data read/write is very efficient – sequential IO:
– All data, including intermediate data, are stored in a distributed file system optimized for sequential IO.
• The model implies shard, sort, and merge-sort steps.
• Simple programming model of Mapper/Reducer.
• Optimizations:
– Combiner: a local reducer for each mapper (see the sketch below).
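A sketch of the combiner idea (hypothetical helper, assuming the word-count mapper above): the reducer's associative logic runs locally on each mapper's output, so far fewer records cross the network during the shuffle.

```python
from collections import defaultdict

def combine(mapper_output):
    # Local reduction over one mapper's (word, freq) stream, before the shuffle.
    local = defaultdict(int)
    for word, freq in mapper_output:
        local[word] += freq           # same associative add-up as the reducer
    return local.items()              # one record per distinct word per mapper
```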
MapReduce
• Inefficiencies:
– Processing time is determined by the slowest mapper and reducer.
• No reducer can start until all mappers finish.
• One reducer may take a long time to finish if its shard receives too much data.
Hive: A data warehouse on Hadoop
Based on Facebook Team’s paper
Motivation
• Yahoo worked on Pig to facilitate application deployment on Hadoop.
– Their need was mainly focused on unstructured data.
• Simultaneously, Facebook started working on deploying warehouse solutions on Hadoop, which resulted in Hive.
– The size of the data being collected and analyzed in industry for business intelligence (BI) is growing rapidly, making traditional warehousing solutions prohibitively expensive.
Hadoop MR
• MR is very low level and requires users to write custom programs.
• Hive supports queries expressed in an SQL-like language called HiveQL, which are compiled into MR jobs executed on Hadoop.
• Hive also allows custom MR scripts to be plugged into queries.
• It also includes a MetaStore containing schemas and statistics that are useful for data exploration, query optimization, and query compilation.
• At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700 TB, and is used for reporting and ad-hoc analyses by 200 Facebook users.
Hive architecture (from the paper)
Data model
• Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions.
• It supports primitive types: integers, floats, doubles, and strings.
• Hive also supports:
– associative arrays: map<key-type, value-type>
– lists: list<element-type>
– structs: struct<field-name: field-type, …>
• SerDe: a serialize/deserialize API used to move data in and out of tables.
Query Language (HiveQL)
• Subset of SQL
• Meta-data queries
• Limited equality and join predicates
• No inserts into existing tables, to preserve the WORM (write-once-read-many) property
– An entire table can be overwritten
Wordcount in Hive
FROM (
MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
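The two scripts are not shown in the deck; hypothetical stand-ins might look like this (Hive streams rows through stdin/stdout, tab-separated):

```python
# wc_mapper.py (hypothetical): read doctext rows, emit (word, cnt) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# wc_reduce.py (hypothetical): rows arrive clustered by word,
# so a running per-word sum suffices.
import sys

current, total = None, 0
for line in sys.stdin:
    word, cnt = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(cnt)
if current is not None:
    print(f"{current}\t{total}")
```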
Session/timestamp example
Construct a session as a list of events sorted by timestamp.
FROM (
FROM session_events_table
SELECT sessionid, tstamp, data
DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
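session_reducer.sh itself is not shown; a hypothetical Python stand-in could exploit the fact that rows arrive distributed by sessionid and sorted by tstamp, so each session's events are adjacent and already time-ordered:

```python
import sys
from itertools import groupby

rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
for sessionid, events in groupby(rows, key=lambda r: r[0]):
    # Emit one record per session: its events in timestamp order.
    ordered = [(tstamp, data) for _, tstamp, data in events]
    print(f"{sessionid}\t{ordered}")
```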
Data Storage
• Tables are logical data units; table metadata associates the data in a table with HDFS directories.
• HDFS namespace: tables (HDFS directory), partitions (HDFS subdirectory), buckets (subdirectories within a partition).
• /user/hive/warehouse/test_table is an HDFS directory.
Hive architecture (from the paper)
Architecture
• Metastore: stores the system catalog.
• Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
• Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Execution engine: executes the tasks in proper dependency order; interacts with Hadoop.
• HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications.
• Client components: CLI, web interface, JDBC/ODBC interface.
• Extensibility interfaces include SerDe, User-Defined Functions, and User-Defined Aggregate Functions.
Sample Query Plan
Hive Usage at Facebook
• Hive and Hadoop are extensively used at Facebook for many different kinds of operations.
• 700 TB ≈ 2.1 petabytes after 3x replication!
• Think of other application models that can leverage Hadoop MR.
Spark & RDD: Cluster Computing with Working Sets
RDD
• A resilient distributed dataset (RDD) is a read-only collection
of objects partitioned across a set of machines that can be
rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage;
instead, a handle to an RDD contains enough information to
compute the RDD starting from data in reliable storage.
• Constraints on RDDs
– Immutable after generation
– Generated with a set of coarse-grained operations
• RDDs serve as the input, output, and intermediate data of a computation DAG.
RDD -- construction
• From a file in a shared file system, such as the Hadoop
Distributed File System (HDFS).
• By “parallelizing” a Scala collection (e.g., an array) in the
driver program, which means dividing it into a number of
slices that will be sent to multiple nodes.
• By transforming an existing RDD.
– A dataset with elements of type A can be transformed into a dataset
with elements of type B using an operation called flatMap, which
passes each element through a user-provided function of type A ⇒
List[B].
– Other transformations can be expressed using flatMap, including
map (pass elements through a function of type A ⇒ B) and filter
(pick elements matching a predicate).
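A short PySpark sketch of the three construction routes (the paper's examples are Scala; the path and names here are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-construction")

# 1. From a file in a shared file system (path is illustrative).
lines = sc.textFile("hdfs://namenode/data/corpus.txt")

# 2. By "parallelizing" a driver-side collection into slices.
numbers = sc.parallelize(range(1000), numSlices=8)

# 3. By transforming an existing RDD.
words   = lines.flatMap(lambda line: line.split())   # A => List[B]
lengths = words.map(len)                             # A => B
long_w  = words.filter(lambda w: len(w) > 5)         # keep matching elements
```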
RDD -- persistence
• By default, RDDs are lazy and ephemeral.
– Partitions of a dataset are materialized on demand when they are
used in a parallel operation (e.g., by passing a block of a file through
a map function), and are discarded from memory after use.
• Alter persistence:
– cache action: leaves the dataset lazy, but hints that it should be kept
in memory after the first time it is computed, because it will be
reused.
– save action: evaluates the dataset and writes it to a distributed
filesystem such as HDFS. The saved version is used in future
operations on it.
RDD – parallel operations
• reduce: Combines dataset elements using an associative
function to produce a result at the driver program.
• collect: Sends all elements of the dataset to the driver
program. For example, an easy way to update an array in
parallel is to parallelize, map and collect the array.
• foreach: Passes each element through a user provided
function. This is only done for the side effects of the
function (which might be to copy data to another system
or to update a shared variable as explained below).
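Continuing the sketch above, the three parallel operations in PySpark:

```python
total_len = lengths.reduce(lambda a, b: a + b)   # associative combine, result at driver
all_long  = long_w.collect()                     # ship every element to the driver
long_w.foreach(print)                            # side effects only, runs on workers
```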
Programming Model
• A Spark driver program defines a computation DAG
– Input/output/intermediate data as RDDs
– Transformations on RDDs: filter, for-each, reduce
– A transformation takes a user-provided function supplying its exact logic.
• Scala: a language well suited to statistical operations on data sets
– Operators and syntax for vector manipulations.
Spark Shared Variables
• map, filter and reduce operations can refer to variables in
the scope where they are created.
• Restricted types of shared variables:
– Broadcast variables: If a large read-only piece of data (e.g., a
lookup table) is used in multiple parallel operations, it is
preferable to distribute it to the workers only once instead of
packaging it with every closure.
– Accumulators: These are variables that workers can only “add”
to using an associative operation, and that only the driver can
read.
• Usage: counters as in MapReduce, parallel sums (see the sketch below)
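A PySpark sketch of both shared-variable types (illustrative data; continues the SparkContext above):

```python
# Broadcast: ship a read-only lookup table to the workers once.
severity = sc.broadcast({"ERROR": 3, "WARN": 2, "INFO": 1})
# Accumulator: workers may only add; only the driver reads the value.
error_count = sc.accumulator(0)

def score(line):
    level = line.split(":")[0]
    if level == "ERROR":
        error_count.add(1)                  # add-only from workers
    return severity.value.get(level, 0)

logs = sc.parallelize(["ERROR: disk", "INFO: ok", "ERROR: net"])
total = logs.map(score).reduce(lambda a, b: a + b)
print(total, error_count.value)             # 7 2 -- the driver reads the total
```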
Programming Examples – Text Search
• Count the lines containing errors in a large log file stored
in HDFS.
• Looks just like MapReduce, except an RDD can be
cached by:
– val cachedErrs = errs.cache()
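The same example in PySpark (the slide's one-liner is Scala; the log path is illustrative):

```python
file = sc.textFile("hdfs://namenode/var/log/app.log")
errs = file.filter(lambda line: "ERROR" in line)
cached_errs = errs.cache()       # keep partitions in memory after the first compute
print(cached_errs.count())       # first action materializes and caches the RDD
print(cached_errs.count())       # second action reuses the in-memory partitions
```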
Programming Examples – Logistic Regression
Train a logistic regression model using gradient descent.
• Model: y = logit(W*X)
• Points: a set of pairs (y, X), where X is a vector.
• Start with a weight vector W with random values.
• Each iteration: adjust W based on the accumulated differences between y and y_exp over all points (a sketch follows below).
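A compact PySpark sketch of this loop (illustrative data and learning rate; the paper's version is Scala):

```python
import numpy as np

# points: an RDD of (y, X) pairs, cached because every iteration rereads it.
points = sc.parallelize(
    [(1.0, np.array([1.0, 2.0])), (0.0, np.array([-1.0, -2.0]))] * 100
).cache()

w = np.random.randn(2)                         # random initial weight vector W
for _ in range(10):                            # fixed number of iterations
    # Each worker computes the gradient over its partition of points;
    # reduce aggregates the per-partition gradients at the driver.
    grad = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-w.dot(p[1]))) - p[0]) * p[1]
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad
print(w)
```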
Performance on Iterative Data Processing with Working Sets
[Diagram: one iteration of gradient descent. The driver program shares W with every node; each node computes grad over its partition of points; the grads are propagated back and aggregated; then W -= grad.value.]
• At the start of an iteration, W is shared from the driver to all nodes.
• At the end, grad is propagated from each node and aggregated; the aggregated grad is applied to W.
• W is a broadcast variable.
• grad is an accumulator variable.
Storm – Stream Processing
Tuple Tree of Tweets
[Diagram: a tuple tree rooted at a tweets spout (with a clock); tweet tuples fan out by geo (geo1, geo2) and feed a stats output for geo1. Tuple IDs such as 1001 and 1002 track the tuples through the tree.]
OLAP over Big Data
Traditional OLAP Cube
• Cube data source
– dimension1, dimension2, …, metric1, metric2, ….
• Cube operations: slice, dice, drill-up/down, roll-up
• Implemented with OLAP queries:
– e.g., select dim1, dim2, agg(met1), agg(met2) from BaseTable where selector(dim3) group by dim1, dim2
– select publisher, sum(click) from BaseTable where advertiser=“Google” group by publisher
• Time-series OLAP data
– Aggregate metrics over a time epoch: sum, avg, median, count, ….
– e.g., aggregate data points from “every 5 secs” up to “every 5 mins”
OLAP Solutions
• Relational OLAP (ROLAP)
– Uses a relational DB with extensive indexes.
– Expensive due to large index size, but supports online data updates.
• Multidimensional OLAP (MOLAP)
– Builds data cubes on predefined dimensions during the design phase.
– Not flexible for new analytical requirements.
– Good for predefined BI reporting applications.
• Full Scan
– Ad-hoc queries require a full scan (no indexes).
– Relies on disk IO throughput, distributed data storage, and parallel query processing.
– Columnar storage.
– Doesn’t support online data updates.
OLAP over “Big Data”
• The base table has a large number of records.
• Requirements on OLAP queries:
– Low latency
– Low concurrency
• Horizontal scaling:
– Partition the base table.
– No or limited support for joins → denormalized base table.
Google’s Dremel
“A scalable (over data set), interactive ad-hoc query
system for analysis of read-only nested data.”
-- http://research.google.com/pubs/archive/36632.pdf
Google’s Dremel: Data Model
• Data model: Protocol Buffers
– dom is any value of the basic types: string, integer, float, …
Google’s Dremel: Query Model
• Interactive ad-hoc queries, more than just OLAP.
• If repeated nested fields are used in a query, the base table is denormalized:

DocId | Name.Url | Name.Language.Code
10 | ‘http://a’ | en-us
10 | ‘http://a’ | en
Google’s Dremel: Data Storage
• Columnar storage
– Separates a record into column values and stores each value on a different storage volume.
– Traditional databases normally store the whole record on one volume.
Google’s Dremel: Data Storage
• Columnar store advantages:
– Traffic minimization: only the column values required by a query are scanned and transferred during query execution.
• For example, a query “SELECT top(title) FROM foo” would access the title column values only.
– Higher compression ratio: one study reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage compresses at roughly 1:3.
• Each column tends to hold similar values, especially when the cardinality of the column (the variation of possible column values) is low.
Google’s Dremel: Data Storage
• Columnar store disadvantages:
– Record insertion/update needs to fan out to multiple volumes.
• Dremel does not support online updates/insertions.
• Table partitions are generated in batch.
– Daily log table: partitioned by time (e.g., hourly).
Google’s Dremel: Query Engine
• Multi-layered serving tree.
• Only leaf nodes load data.
• Root and intermediate nodes aggregate results.
Google’s Dremel: Query Execution
• Example query execution (a toy simulation follows below). The base table T is partitioned into {T_1, …, T_n}.
Original query:
SELECT A, COUNT(B) FROM T GROUP BY A
The query is divided, and result sets are aggregated through the serving tree:
SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A
Query at the bottom (directly on each tablet):
R_i = SELECT A, COUNT(B) AS c FROM T_i GROUP BY A
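A toy Python simulation of the rewrite (not Dremel code): leaves compute COUNT per tablet; the root turns COUNT into SUM over the partial results.

```python
from collections import Counter

tablets = [                       # T partitioned into T_1 .. T_n; rows are (A, B)
    [("x", 1), ("y", 2)],
    [("x", 3), ("x", 4)],
]

# Leaf: R_i = SELECT A, COUNT(B) AS c FROM T_i GROUP BY A
partials = [Counter(a for a, b in t) for t in tablets]

# Root: SELECT A, SUM(c) FROM (R_1 UNION ALL ... R_n) GROUP BY A
result = sum(partials, Counter())
print(dict(result))               # {'x': 3, 'y': 1}
```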
Dremel: Some Applications inside Google
• Ad-hoc analysis of event logs from many services
• Analysis of crawled web documents
• Tracking install data for applications in the Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google’s distributed build system
• Disk I/O statistics for hundreds of thousands of disks
• Resource monitoring for jobs run in Google’s data centers
• Symbols and dependencies in Google’s codebase
Dremel: BigQuery of Google Cloud
Dremel: Query Performance
Apache Impala
• Motivated by Dremel
• More general data model and query support
• Fits into the Apache/Hadoop ecosystem
Druid (open source)
• Motivated by Dremel
• Dedicated to OLAP cube of time series data
• Base table: multi-dimensional time series data
• Query: select-aggregate-groupby for cube operations
Exercise: Design Google Ads’ Data Processing Pipeline
• Ad: {keywords, bid, budget}
– Keywords: match an ad with an impression (a search query)
– Bid: max price an ad pays for an impression
– Budget: caps the total daily spending of an ad
• Google Ads system:
– 80 TB of log data per day: impression logs, click logs, conversion logs
– The AdWords website shows ad statistics to advertisers.
– An ad is matched with an impression based on keywords.
– Ads compete for impressions in a real-time auction:
• Revenue for showing an ad = bid * pCTR
Exercise: Design Google Ads’ Data Processing Pipeline
• Data from logs
– Impressions: {impressionId, adId, keyword, query, CPC}
eg: {“123”, “ad1”, “ipad”, “ipad mini”, $0.5}
– Clicks: {impressionId, click=true}
eg: {“123”}
– Conversions: {impressionId, convLabel}, eg: {“123”, “purchased”}
• Database of Ads
– Ad: {keywords, bid, landingPageUrl}
Eg: {{“ipad”, “pad”, “tablet pc”}, $1.0,
“https://www.apple.com/store/ipads”}
Exercise: Design Google Ads’ Data Processing Pipeline
• Problems:
– Generate hourly/daily/weekly stats: {timeperiod, adId, keyword, imps, ctr, cvr}.
– The website dynamically shows drill-down and aggregate stats.
e.g., “Total spending of this ad in the past week.”
e.g., “CTR of this ad on this keyword yesterday.”
– Bill ad spending at the end of the day.
– Stop ads when the budget is reached.
– Predict CTR as a model of {ad, impression} (for use in the auction).