Big Data Processing
Systems Study
Vasiliki Kalavri, EMJD-DC
3 Dec 2012
MapReduce
Simplified Data Processing on Large
Clusters
OSDI 2004
MapReduce
● Specify a map and a reduce function
● The system takes care of
○ parallelization
○ partitioning
○ scheduling
○ communication
○ fault-tolerance
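A minimal single-process sketch of this division of labour, assuming a word-count job. The user writes only `map_fn` and `reduce_fn`; the hypothetical `run_mapreduce` driver stands in for the system and only does the grouping step, whereas a real cluster would also parallelize, schedule and recover from failures.

```python
from collections import defaultdict

def map_fn(_doc_id, text):
    """User-defined map: emit (word, 1) for every word."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """User-defined reduce: sum the counts of one word."""
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [(1, "big data big clusters"), (2, "data processing")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```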
Hadoop MapReduce 1.0
MapReduce Limitations
● Static Pipeline
● No support for common operations
● Data materialization after every job
● Slow - not fit for interactive analysis
● Complex configuration
YARN (MapReduce v.2)
What is Hadoop/MR NOT good for?
● All the things it wasn't built for
○ Iterative computations
○ Stream processing
○ Incremental computations
○ Interactive Analysis
○ [insert research paper here]
Improving Hadoop performance
● Reduce Network & Disk I/O
● Skewed Datasets
● DB-like optimizations
○ column-oriented storage
○ indexes
Map-Reduce Inspired
Systems
Extending the Programming Model
Map-Reduce Inspired Systems
● Extend the programming model to
support
○ Iterative
○ Stream
applications
Iterative Processing
● Characteristics
○ Datasets already stored
○ Need to reuse a dataset more than once, possibly
multiple times
○ Iterative jobs, e.g. estimates, convergence
● Problems with iterative MR applications
○ manual orchestration of several MR jobs
○ re-loading & re-processing of invariant data
○ no explicit way to define a termination condition
HaLoop
Efficient Iterative Data Processing on
Large Clusters
VLDB 2010
System Overview
Programming Model
● Iterative Programming Model
$R_{i+1} = R_0 \cup (R_i \bowtie L)$
● Extensions to MR
○ loop body
○ termination condition
○ loop-invariant data
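The three extensions can be illustrated with a small single-machine sketch, assuming a reachability query: `r0` is the seed set, `loop_invariant` is an edge list that never changes between iterations, the loop body is the join of the current result with the edges (projected onto the destination), and the termination condition is a fixpoint test. The function name and the `max_iterations` guard are assumptions for illustration, not HaLoop's API.

```python
def iterate_to_fixpoint(r0, loop_invariant, max_iterations=100):
    """Repeat R_next = R0 union (R_i joined with L) until nothing changes."""
    current = set(r0)
    for iteration in range(1, max_iterations + 1):
        # Loop body: join R_i with L on node == src and keep dst.
        joined = {dst for (src, dst) in loop_invariant if src in current}
        nxt = set(r0) | joined
        # Termination condition: the result did not change (fixpoint).
        if nxt == current:
            return current, iteration
        current = nxt
    return current, max_iterations

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]  # loop-invariant data
reachable, iterations = iterate_to_fixpoint({"a"}, edges)
```

In plain MapReduce each pass of this loop would be a separate job that re-reads `edges`; declaring the data loop-invariant is what lets a system cache it.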
Loop-Aware Scheduling
● Inter-Iteration Locality
○ schedule tasks of different iterations which
access the same data on the same machines
Caching and Indexing
● Reducer Input Cache
○ caches and indexes reducer inputs
○ reduces M->R I/O
● Reducer Output Cache
○ stores and indexes most recent local reducer
outputs
○ reduces termination condition computation cost
● Mapper Input Cache
○ avoids non-local data reads in mappers
Stream Processing
● Characteristics
○ Data continuously comes into the system
○ Usually needs to be processed as it arrives
○ Frequent updates
● Problems with stream MR applications
○ runs on a static snapshot of a dataset
○ computations need to finish
Muppet
MapReduce-Style Processing of Fast Data
VLDB 2012
Programming Model
● MapUpdate
○ operates on streams, i.e. sequences of events with
the same id in increasing timestamp order
● Slates
○ in-memory data structures which "summarize"
all events with key k that an Update function has
seen so far
Example Applications
● An application that monitors the
FourSquare-checkin stream to count the
number of checkins per retailer and
displays the count on a Web page
● Detect "hot" topics in Twitter
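The check-in counter can be sketched in the MapUpdate style as follows. This is a single-process illustration under assumptions: events are plain dicts with hypothetical `ts` and `retailer` fields, and `run_map_update` is a made-up driver, not Muppet's API. Each key gets one slate, a small in-memory dict that the update function folds every event into.

```python
from collections import defaultdict

def map_checkin(event):
    """Map: route a check-in event to its retailer key, if it has one."""
    retailer = event.get("retailer")
    if retailer is not None:
        yield retailer, event

def update_count(key, event, slate):
    """Update: fold one event into the slate that summarizes `key`."""
    slate["count"] = slate.get("count", 0) + 1
    slate["last_ts"] = event["ts"]

def run_map_update(stream, map_fn, update_fn):
    """Hypothetical driver: process events in timestamp order."""
    slates = defaultdict(dict)  # one in-memory slate per key
    for event in sorted(stream, key=lambda e: e["ts"]):
        for key, mapped in map_fn(event):
            update_fn(key, mapped, slates[key])
    return dict(slates)

stream = [
    {"ts": 1, "retailer": "A"},
    {"ts": 2, "retailer": "B"},
    {"ts": 3, "retailer": "A"},
    {"ts": 4},  # check-in with no retailer is dropped by map
]
slates = run_map_update(stream, map_checkin, update_count)
```

Unlike a reduce, the update never sees "all values for a key"; it only sees the next event and the slate, which is why the computation does not need to finish.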
System Overview
● Uses Cassandra to persist slate states
Map-Reduce Inspired
Systems
Improving Performance
Map-Reduce Inspired Systems
● Improve performance by
○ reusing data
○ building caches / indexes
○ DBMS-like optimizations
○ reducing I/O
Incoop
MapReduce for Incremental Computations
SoCC 2011
System Overview
Inc-HDFS
● Content-based chunking
● Fingerprint calculation
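A toy sketch of why content-based chunking helps incremental runs: chunk boundaries depend only on nearby bytes, so an insertion near the start of a file leaves later chunks (and their fingerprints) unchanged. The boundary test below (sum of a 4-byte window, masked) is a deliberately simple assumption standing in for a real rolling fingerprint, and SHA-256 is an arbitrary choice of chunk fingerprint.

```python
import hashlib

def content_chunks(data, window=4, mask=0x07):
    """Cut after position i when the trailing window's byte sum has zero low bits."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        marker = sum(data[i - window + 1:i + 1]) & mask
        if marker == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a marker
    return chunks

def fingerprints(chunks):
    """Fingerprint each chunk so unchanged chunks can be recognized."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]

original = b"the quick brown fox jumps over the lazy dog"
edited = b"XY" + original  # small insertion at the front

original_chunks = content_chunks(original)
original_fps = fingerprints(original_chunks)
edited_fps = fingerprints(content_chunks(edited))
unchanged = [fp for fp in original_fps if fp in edited_fps]
```

With fixed-size blocks the same two-byte insertion would shift every block and change every fingerprint.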
Incremental MapReduce
● Incremental Map
○ persistently store intermediate results
○ insert reference to memoization server
○ query memoization server and fetch result if
already computed
● Incremental Reduce
○ persistently store entire task computations
○ store and map sub-computations used in the
Contraction phase
Contraction Phase
● Break up large Reduce tasks into many
applications of the Combine function
● Only a subset of Combiners needs to be
re-executed
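The effect can be sketched with a memoized combine tree. This is an illustration under assumptions: the memo is a plain dict keyed by the combiner's input tuple (a stand-in for a memoization server keyed by fingerprints), `fanout` is arbitrary, and the input is assumed non-empty.

```python
def contract(values, combine, memo, fanout=2):
    """Reduce `values` through a tree of combine calls, reusing memoized nodes."""
    executed = 0
    level = list(values)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            group = tuple(level[i:i + fanout])
            if group not in memo:  # only new sub-computations run
                memo[group] = combine(group)
                executed += 1
            next_level.append(memo[group])
        level = next_level
    return level[0], executed

memo = {}
first_total, first_runs = contract([1, 2, 3, 4, 5, 6, 7, 8], sum, memo)
# One input value changes (8 -> 9): only the combiners on its path re-execute.
second_total, second_runs = contract([1, 2, 3, 4, 5, 6, 7, 9], sum, memo)
```

Here the first run executes 7 combiners and the second only 3, one per tree level above the changed input.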
HAIL
Only Aggressive Elephants are Fast Elephants
VLDB 2012
System Overview
Upload Pipeline
● HDFS upload pipeline is changed so that:
○ the Client creates PAX blocks
○ Datanodes do not flush data or checksums to
disk
○ After all chunks of a block have been received,
the block is sorted in memory and flushed
○ Each DataNode computes its own checksums
Query Pipeline
Transparency is achieved using UDFs:
● HailInputFormat
○ elaborate splitting policy
○ scheduling takes relevant indexes into account
● HailRecordReader
○ Uses user annotation / configuration info to
select records for map phase
○ transforms records from PAX to row format
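A rough sketch of the record reader's two jobs, assuming a PAX block can be modelled as one Python list per column. The helper names and the example schema are hypothetical; the point is only that selection touches a single column and that qualifying positions are then stitched back into row format for the map function.

```python
def rows_to_pax(rows, schema):
    """Store a block column by column: one list of values per attribute."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}

def select_rows(block, schema, column, predicate):
    """Scan one column, then rebuild only the qualifying records as rows."""
    positions = [i for i, value in enumerate(block[column]) if predicate(value)]
    return [tuple(block[name][i] for name in schema) for i in positions]

schema = ("id", "city", "visits")
rows = [(1, "Berlin", 7), (2, "Athens", 3), (3, "Berlin", 9)]
block = rows_to_pax(rows, schema)
berlin_rows = select_rows(block, schema, "city", lambda c: c == "Berlin")
```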
Themis
An I/O Efficient MapReduce
SoCC 2012
How to limit Disk I/O?
● Process records in memory and spill to
disk as rarely as possible
● Relax fault-tolerance guarantees
○ job-level recovery
● Dynamic memory management
○ pluggable policies
● Per-node I/O management
○ organize data in large batches
Memory policies
● Pool-based
○ fixed-sized pre-allocated buffers
● Quota-based
○ controls dataflow between computational stages
using queues
● Constraint-based
○ dynamically adjusts memory allocation based on
requests and available memory
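The pool-based policy is the simplest of the three to sketch. The class below is a hypothetical illustration, not Themis code: all buffers are allocated up front, and a stage that finds the pool empty gets `None` and has to wait or back off, which is what bounds memory use.

```python
from collections import deque

class BufferPool:
    """Fixed number of fixed-size buffers, all allocated at start-up."""

    def __init__(self, num_buffers, buffer_size):
        self._free = deque(bytearray(buffer_size) for _ in range(num_buffers))

    def acquire(self):
        """Hand out a free buffer, or None when the pool is exhausted."""
        return self._free.popleft() if self._free else None

    def release(self, buffer):
        """Return a buffer so another stage can reuse it."""
        self._free.append(buffer)

pool = BufferPool(num_buffers=2, buffer_size=1024)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()      # pool is empty at this point
pool.release(first)
reused = pool.acquire()     # the released buffer is handed out again
```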
System Overview
Data-flow graph consisting of stages:
● Phase Zero extracts information about
distribution of records and keys
● Phase One implements mapping and
shuffling
● Phase Two implements the sorting and
reduce, always keeping results in
memory
ReStore
Reusing Results of MapReduce Jobs
VLDB 2012
System Overview
● Built as an extension to Pig
● When a workflow is submitted, ReStore:
○ re-writes the query to reuse stored results
○ stores outputs of the workflow
○ stores results of sub-jobs
○ decides which outputs to store in HDFS and
which to delete
System Architecture
Example
MANIMAL
Automatic Optimization for MapReduce
Programs
VLDB 2011
Idea
● Apply well-known query optimization
techniques to Map-Reduce jobs
● Static analysis of compiled code
● Apply optimizations only when "safe"
System Architecture
Example Optimizations
● Selection
○ if the map function is a filter, use a B+Tree to
only scan the relevant portion of the input
● Projection
○ eliminate unnecessary fields from input records
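A sketch of the selection optimization, under assumptions: the map function is a pure filter on a hypothetical `rank` field, and a sorted key list searched with `bisect` stands in for the B+Tree. The indexed scan feeds the unchanged map function only the records that can pass the filter, so the output is the same while fewer records are read.

```python
import bisect

def map_filter(record):
    """User map function that is a pure filter on `rank` (threshold is illustrative)."""
    if record["rank"] > 10:
        yield record["url"], record["rank"]

def full_scan(records):
    """Baseline: run the map function over every input record."""
    return [out for r in records for out in map_filter(r)]

def indexed_scan(sorted_records, sorted_keys, threshold=10):
    """Skip straight to the first record whose key exceeds the threshold."""
    start = bisect.bisect_right(sorted_keys, threshold)
    relevant = sorted_records[start:]
    return [out for r in relevant for out in map_filter(r)], len(relevant)

records = [
    {"url": "a", "rank": 3},
    {"url": "b", "rank": 42},
    {"url": "c", "rank": 10},
    {"url": "d", "rank": 17},
]
sorted_records = sorted(records, key=lambda r: r["rank"])
sorted_keys = [r["rank"] for r in sorted_records]

baseline = full_scan(records)
optimized, records_read = indexed_scan(sorted_records, sorted_keys)
```

The optimization is only "safe" because the analysis has established that the map function emits nothing for records at or below the threshold.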
SkewTune
Mitigating Skew in MapReduce Applications
SIGMOD 2012
Common Types of Skew
● Uneven distribution of input data
○ partitioning which does not guarantee even
distribution
○ popular key groups
● Expensive records
○ some portions of the input take longer to process
than others
System Overview
● Per-task progress estimation
● Per-task statistics
● Late skew detection
○ skew mitigation is delayed until a slot is
available
● Only re-partition one task at a time
○ only when half the time remaining is greater than
the re-partitioning overhead
Implementation
Re-partition a map task
● mitigators execute as mappers within a
new MapReduce job
● output is written to HDFS
Re-partition a reduce task
● a mitigator job with an identity map reads its
input from the task tracker
Starfish
A Self-Tuning System for Big Data Analytics
CIDR 2011
System Overview
Job-Level Tuning
● Just-in-Time Optimizer
○ choose efficient execution techniques, e.g. joins
● Profiler
○ learns performance models, job profiles
● Sampler
○ collects statistics about input, intermediate and
output data
○ helps the profiler build approximate models
Workflow-Level Tuning
● Workflow-aware Scheduler
○ exploits data locality at the workflow level instead
of making locally optimal decisions
● What-If Engine
○ answers questions based on simulations of job
executions
Workload-Level Tuning
● Workload Optimizer
○ Data-flow sharing
○ Materialization of intermediate results for reuse
○ Reorganization
● Elastisizer
○ node and network configuration automation
Big-Data Processing
Beyond MapReduce
Dryad
Distributed Data-Parallel Programs from
Sequential Building Blocks
EuroSys 2007
System Overview
Graph Description
Communication
Graph Optimizations
● Schedule vertices close to the input data
● If a computation is associative and
commutative, use an aggregation tree
● Dynamically refine the graph based on
output data sizes
○ vary number of vertices in each stage,
connectivity
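The aggregation-tree idea can be sketched as follows, assuming the partial results are already available as a list and `fanout` is an arbitrary choice. Because the combine function is associative and commutative, partials can be merged in groups, level by level, instead of all being sent to a single vertex; the function name is hypothetical.

```python
import operator
from functools import reduce

def tree_aggregate(partials, combine, fanout=2):
    """Merge partial results in groups of `fanout` until one value remains."""
    level = list(partials)
    rounds = 0
    while len(level) > 1:
        level = [
            reduce(combine, level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
        rounds += 1
    return level[0], rounds

partials = [5, 1, 4, 2, 3]  # e.g. one partial sum per input partition
total, rounds = tree_aggregate(partials, operator.add)
largest, _ = tree_aggregate(partials, max)
```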
SCOPE
Easy and Efficient Parallel Processing of
Massive Data Sets
VLDB 2008
System Overview
SCOPE scripting language
● resembles SQL with C# expressions
● commands are data transformation
operators
● extensible MapReduce-like commands
SCOPE Execution
● The Compiler creates internal parse tree
● The Optimizer creates a parallel
execution plan, i.e. a Cosmos job
● The Job Manager constructs the graph
and schedules execution
Spark
Cluster Computing with Working Sets
HotCloud 2010
RDDs
Read-only collection of objects
● partitioned across machines
● store their "lineage"
● can be re-constructed
● users can control persistence and
partitioning
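To make the lineage idea concrete, here is a single-machine toy, not the Spark API: `ToyRDD` and its methods are hypothetical names. Each dataset remembers how it was derived from its parent, so a lost cached copy can be recomputed instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute, lineage):
        self._compute = compute  # how to rebuild the data from the parents
        self.lineage = lineage   # readable chain of derivation steps
        self._cache = None

    @classmethod
    def from_list(cls, items):
        data = tuple(items)      # read-only source
        return cls(lambda: list(data), ["from_list"])

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()],
                      self.lineage + ["map"])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)],
                      self.lineage + ["filter"])

    def persist(self):
        """User-controlled persistence: keep the computed data in memory."""
        self._cache = self._compute()
        return self

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a worker failure."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        return self._compute()   # re-construct from lineage

even_squares = (ToyRDD.from_list(range(5))
                .map(lambda x: x * x)
                .filter(lambda x: x % 2 == 0)
                .persist())
before = even_squares.collect()
even_squares.lose_cache()
after = even_squares.collect()
```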
Programming Model
● Scala API
● driver program
○ defines RDDs and
actions on them
● workers
○ long-lived processes
○ store and process RDD
partitions in memory
Job Stages
Nephele/PACTs
A Programming Model and Execution
Framework for Web-Scale Analytical
Processing
SoCC 2010
The Stratosphere Stack
System Overview
● Execution plan in the form of a DAG
● Abstracts parallelization and
communication
● Optimizer to choose best execution
strategy
Programming Model
● Input Contracts:
○ give guarantees on how data is organized into
independent subsets
○ Map, Reduce, Match, Cross, CoGroup
● Output Contracts:
○ define properties on the output data
○ Same-Key, Super-Key, Unique-Key
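The two-input contracts differ only in which independent subsets they hand to the user function. The sketch below is an in-memory approximation under assumptions: inputs are lists of `(key, value)` pairs, and the function names and user-function signatures are chosen for illustration rather than taken from the PACT API.

```python
from collections import defaultdict
from itertools import product

def _group(pairs):
    """Group (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def match(left, right, fn):
    """Match: call fn once per pair of records that share a key."""
    l, r = _group(left), _group(right)
    return [fn(k, a, b) for k in sorted(l.keys() & r.keys())
            for a in l[k] for b in r[k]]

def cross(left, right, fn):
    """Cross: call fn on every combination of records (Cartesian product)."""
    return [fn(a, b) for a, b in product(left, right)]

def cogroup(left, right, fn):
    """CoGroup: call fn once per key with all records of both inputs."""
    l, r = _group(left), _group(right)
    return [fn(k, l.get(k, []), r.get(k, []))
            for k in sorted(l.keys() | r.keys())]

left = [("a", 1), ("b", 2)]
right = [("a", 10), ("c", 30)]

matched = match(left, right, lambda k, a, b: (k, a + b))
crossed = cross(left, right, lambda a, b: (a[0], b[0]))
cogrouped = cogroup(left, right, lambda k, ls, rs: (k, len(ls), len(rs)))
```

Because each call to the user function depends only on its own subset, the calls can run in parallel, and an optimizer is free to pick how to build those subsets.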
ASTERIX
Scalable, Semi-structured Data Platform for
Evolving-World Models
Distributed and
Parallel Databases 2011
Evolving World Model
● As-of queries
○ What is the best route to get to the Olympic
Stadium right now?
○ What is the traffic situation like on Saturday
nights close to the city center?
○ How many visitors that visited the City Hall
during the past year also went for dinner in that
nearby restaurant?
Data Model - Query Language
● Semi-structured data model, ADM
○ dataset ~ table: indexed, partitioned, replicated
○ dataverse ~ database
○ DDL: primary key, partitioning key
○ "open" data schemes
● AQL query language
○ declarative, inspired by Jaql and XQuery
○ logical plan -> DAG -> Hyracks Job
System Overview
Dremel
Interactive Analysis of Web-Scale Datasets
VLDB 2010
Columnar Storage
● lossless representation
○ save field types, repetition/definition levels
● fast encoding
○ recursively traverses record and computes levels
● efficient record assembly
○ uses an FSM to reconstruct records
Query Execution
● Language based on SQL
● Tree architecture
○ Root server
■ receives incoming queries
■ reads table metadata
■ routes queries to the next level of the tree
○ Leaf servers
■ communicate with storage layer
Query Dispatcher
● Schedules queries to available slots
● Balances the load
● Assures fault-tolerance
● Specifies what percentage of tablets must be
scanned before returning a result
CIEL
A Universal Execution Engine for Distributed
Data-Flow Computing
NSDI 2011
Dynamic Task Graph
System Architecture
Skywriting Language
● Turing-complete
● Arbitrary data-dependent control flow
○ while loops
○ recursive functions
● Supports invocation of code written in
other languages
References
www.citeulike.org/user/vasiakalavri

Weitere ähnliche Inhalte

Was ist angesagt?

Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkVasia Kalavri
 
Predictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonPredictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonVasia Kalavri
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processingYogi Devendra Vyavahare
 
Managing Multi-DBMS on a Single UI , a Web-based Spatial DB Manager-FOSS4G A...
Managing Multi-DBMS on a Single UI, a Web-based Spatial DB Manager-FOSS4G A...Managing Multi-DBMS on a Single UI, a Web-based Spatial DB Manager-FOSS4G A...
Managing Multi-DBMS on a Single UI , a Web-based Spatial DB Manager-FOSS4G A...BJ Jang
 
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis -  Massimo PeriniDeep Stream Dynamic Graph Analytics with Grapharis -  Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo PeriniFlink Forward
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...newmooxx
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Ziemowit Jankowski
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDatabricks
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Data Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantData Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantParis Carbone
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 

Was ist angesagt? (20)

Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
 
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache FlinkGelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
Gelly-Stream: Single-Pass Graph Streaming Analytics with Apache Flink
 
Predictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with StrymonPredictive Datacenter Analytics with Strymon
Predictive Datacenter Analytics with Strymon
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processing
 
Managing Multi-DBMS on a Single UI , a Web-based Spatial DB Manager-FOSS4G A...
Managing Multi-DBMS on a Single UI, a Web-based Spatial DB Manager-FOSS4G A...Managing Multi-DBMS on a Single UI, a Web-based Spatial DB Manager-FOSS4G A...
Managing Multi-DBMS on a Single UI , a Web-based Spatial DB Manager-FOSS4G A...
 
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis -  Massimo PeriniDeep Stream Dynamic Graph Analytics with Grapharis -  Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Apache flink
Apache flinkApache flink
Apache flink
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Data Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantData Stream Analytics - Why they are important
Data Stream Analytics - Why they are important
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 

Andere mochten auch

Like a Pack of Wolves: Community Structure of Web Trackers
Like a Pack of Wolves: Community Structure of Web TrackersLike a Pack of Wolves: Community Structure of Web Trackers
Like a Pack of Wolves: Community Structure of Web TrackersVasia Kalavri
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight lineVasia Kalavri
 
Graphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraGraphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraVasia Kalavri
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep DiveVasia Kalavri
 
A Skype case study (2011)
A Skype case study (2011)A Skype case study (2011)
A Skype case study (2011)Vasia Kalavri
 
Demystifying Distributed Graph Processing
Demystifying Distributed Graph ProcessingDemystifying Distributed Graph Processing
Demystifying Distributed Graph ProcessingVasia Kalavri
 

Andere mochten auch (8)

Like a Pack of Wolves: Community Structure of Web Trackers
Like a Pack of Wolves: Community Structure of Web TrackersLike a Pack of Wolves: Community Structure of Web Trackers
Like a Pack of Wolves: Community Structure of Web Trackers
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight line
 
Graphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming EraGraphs as Streams: Rethinking Graph Processing in the Streaming Era
Graphs as Streams: Rethinking Graph Processing in the Streaming Era
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
A Skype case study (2011)
A Skype case study (2011)A Skype case study (2011)
A Skype case study (2011)
 
Demystifying Distributed Graph Processing
Demystifying Distributed Graph ProcessingDemystifying Distributed Graph Processing
Demystifying Distributed Graph Processing
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 

Ähnlich wie Big data processing systems research

Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationHao Xu
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedShubham Tagra
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
FlumeJava: Easy, Efficient Data-Parallel Pipelines
FlumeJava: Easy, Efficient Data-Parallel PipelinesFlumeJava: Easy, Efficient Data-Parallel Pipelines
FlumeJava: Easy, Efficient Data-Parallel PipelinesMiro Cupak
 
Data Analytics with DBMS
Data Analytics with DBMSData Analytics with DBMS
Data Analytics with DBMSGLC Networks
 
How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's studentsMohamed Nadjib MAMI
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Community
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
Designing for operability and managability
Designing for operability and managabilityDesigning for operability and managability
Designing for operability and managabilityGaurav Bahrani
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
MapReduce
MapReduceMapReduce
MapReducerobjk
 

Ähnlich wie Big data processing systems research (20)

Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Druid
DruidDruid
Druid
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
FlumeJava: Easy, Efficient Data-Parallel Pipelines
FlumeJava: Easy, Efficient Data-Parallel PipelinesFlumeJava: Easy, Efficient Data-Parallel Pipelines
FlumeJava: Easy, Efficient Data-Parallel Pipelines
 
Data Analytics with DBMS
Data Analytics with DBMSData Analytics with DBMS
Data Analytics with DBMS
 
How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's students
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Designing for operability and managability
Designing for operability and managabilityDesigning for operability and managability
Designing for operability and managability
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
MapReduce
MapReduceMapReduce
MapReduce
 

Big data processing systems research