Ask “what,”
not “how”
Kostas Tzoumas
Data is an important asset
video & audio streams, sensor data, RFID, GPS, user online
behavior, scientific simulations, web archives, ...

Volume
Handle petabytes of data

Velocity
Handle high data arrival rates

Variety
Handle many heterogeneous data sources

Veracity
Handle inherent uncertainty of data
Data
Analysis
Four “I”s for Big Analysis
text mining, interactive and ad hoc analysis, machine
learning, graph analysis, statistical algorithms

Iterative
Model the data, do not just describe it

Incremental
Maintain the model under high arrival rates

Interactive
Step-by-step data exploration on very large data

Integrative
Fluent unified interfaces for different data models
MapReduce and Hadoop

Input:
“Romeo, Romeo, wherefore art thou Romeo?”
“What, art thou hurt?”

Map:
(Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
(What, 1) (art, 1) (thou, 1) (hurt, 1)

Data shuffled over network

Reduce input:
(Romeo, (1,1,1)) (art, (1,1)) (thou, (1,1))
(wherefore, 1) (What, 1) (hurt, 1)

Reduce output:
(Romeo, 3) (art, 2) (thou, 2)
(wherefore, 1) (What, 1) (hurt, 1)

Data written to disk

[Diagram also labels the operators Match and Cross]
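The word-count flow on this slide can be sketched with plain Scala collections. This is a single-machine illustration, not Hadoop code: the names `WordCountSketch`, `mapPhase`, `shuffle`, and `reducePhase` are assumptions introduced here, and the in-memory `groupBy` only stands in for the network shuffle.

```scala
object WordCountSketch {
  // Map phase: emit (word, 1) for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("[^A-Za-z]+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle: bring all values emitted for the same key together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the grouped counts of one key.
  def reducePhase(word: String, ones: Seq[Int]): (String, Int) =
    (word, ones.sum)

  // Full pipeline: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (word, ones) => reducePhase(word, ones) }
}
```

Calling `WordCountSketch.wordCount` on the two quoted lines yields one count per distinct word, with the intermediate `shuffle` result corresponding to the grouped pairs such as (Romeo, (1,1,1)).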
SQL analytics with Hadoop

[Figure: chain of Map and Reduce stages]

Pitfalls:
• Lacking in declarativity
• HDFS-based data exchange
• Sort the only grouping operator
• Hadoop engine tailored to simple aggregations
[Figure labels: SQL, MapReduce, BigSQL, NoMapReduce, BigAnalytics]
Advanced
Analytics
Analytics that model the data to reveal hidden
relationships, not just describe the data.
E.g., machine learning, statistics, graph analysis

Increasingly important from a market perspective.
Very different from SQL analytics: different languages and
access patterns (iterative vs. one-pass programs).
Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
Use case in
all verticals

Media and Communications
Example: Risk management, analytics on
phone call logs, sentiment analysis,
clickstream and call analysis

Manufacturing
Example: Data-driven quality
control and assurance, demand
forecasting, sales and operation
planning, process optimization

Travel and tourism
Example: Improve personalized customer
experience in hotels, estimate no-show in
flights, route planning

Retail
Example: Improve campaign ROI
by optimizing advertising channels,
market basket analysis, fraud
detection, social trend analysis,
product recommendation

Social and e-commerce
Example: Targeted customer experience,
explore new business models, real-time
recommendations, social graph analysis,
game analytics
Big data lives in Hadoop. Hadoop clusters
offer very low effective storage cost, and are
becoming a data vortex, attracting cross-departmental data.
Companies want to perform advanced and
predictive analytics to maximize ROI of their
data assets by modeling the data, not just
describing it.

How do we bring
advanced analytics to
the world of big data?
What,
not how
Recipe for success:
declarativity
User specifies what information to
extract out of the data, not how
the system extracts the information.
This is what relational databases
pioneered in the 70s, resulting in a
vibrant research community and a
billion-dollar industry.

Desiderata for next-gen big
data platforms: Usability

Big data consumers now: systems programming experts
Big data consumers in the future: people with data analysis skills

10 million Excel users
3 million R users
70,000 Hadoop users

“the market faces
certain challenges
such as unavailability
of qualified and
experienced work
professionals, who can
effectively handle the
Hadoop architecture.”
Desiderata for next-gen big
data platforms: Performance
[Bar chart comparing Stratosphere and Hadoop; axis ticks from 0 to 700]

Performance difference from days to minutes enables
real-time decision making and widespread use of data
within the organization.
How to lift
declarativity from
the closed world of
relational algebra to
the open world of
advanced analytics.
Step 1: Specify
// get the customers with their debit
val debits: (String, Double) = sql(
    "SELECT customerId, debit FROM customer_accounts;")
// get the number of warned invoices in the last
// 12 and 6 months
val warnings: (String, Int, Int) = sql(
    """SELECT R12.customerId, R12.cnt, R6.cnt
            FROM (…) R12 LEFT OUTER JOIN (…) R6
              ON (R6.customerId = R12.customerId);""")
// number of contracts a customer has
val numContracts: (String, Int) = sql(
    "SELECT customerId, numContracts FROM customers;")
// join the data into one data point
case class DataPoint(x: Vector, y: Double)
val dataPoints = numContracts
  join warnings
  where {_._1} isEqualTo {_._1}
  join debits
  where {_._1} isEqualTo {_._1}
  map { (x, y, z) => DataPoint(Vector(x._2, y._2, y._3),
                               if (z._2 > X) 1 else 0) }
// run regression with dimensionality 3 for 40 iterations
val weights: Vector = logRegression(3, dataPoints, 40)

Unify data and
programming models in
a declarative abstraction.
SQL for extracting
enterprise data from
databases.
General-purpose
programming for feature
extraction and
normalization.
Statistical libraries for
advanced analysis.
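The slide calls `logRegression(3, dataPoints, 40)` without showing what sits behind it. As a rough illustration only, the sketch below implements logistic regression by batch gradient descent in plain Scala. It is not the Stratosphere library routine: the object name, the `Array[Double]` feature type (in place of the slide's `Vector`), the `predict` helper, and the `learningRate` default are all assumptions made for this example.

```scala
object LogRegressionSketch {
  // Hypothetical stand-in for the slide's DataPoint(x: Vector, y: Double).
  final case class DataPoint(x: Array[Double], y: Double)

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Predicted probability that the label is 1.
  def predict(weights: Array[Double], x: Array[Double]): Double =
    sigmoid(dot(weights, x))

  // Batch gradient descent on the logistic loss.
  // The learning rate is an assumed default, not a value from the slide.
  def logRegression(dim: Int, points: Seq[DataPoint], iterations: Int,
                    learningRate: Double = 0.5): Array[Double] = {
    var weights = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(dim)(0.0)
      for (p <- points) {
        val error = predict(weights, p.x) - p.y
        for (j <- 0 until dim) gradient(j) += error * p.x(j)
      }
      // Step against the averaged gradient.
      weights = weights.zip(gradient).map { case (w, g) => w - learningRate * g / points.size }
    }
    weights
  }
}
```

Each iteration is a full pass over the data points, which is the iterative access pattern the deck contrasts with one-pass SQL analytics.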

First step for
declarative
analytics
Scala: functional and object-oriented JVM language,
excellent basis for domain-specific language
development. Coolest kid on the block ☺
Feels like a scripting language, but is not restricted
to a fixed data model like Pig, Hive, etc.
Scala’s extensible compiler architecture is a good
match for implementing optimizers.
Step 2: Optimize
[Figure: plan diagrams, (a) Complex Plan Diagram and (b) Reduced Plan Diagram; axes labeled "Data characteristics change"]

Each color is a differently written
program that produces the same result but has very
different performance depending on small changes
in the data set and the analysis requirements.

Query optimizers: the
enabling technology for SQL
data warehousing and BI.
Successful industrial
application of artificial
intelligence.

Currently, no other system
can optimize non-relational
data analysis programs.
Use a combination of compiler and
database technology to lift optimization
beyond relational algebra. Derive
properties of user-defined functions via
code analysis and use these to mimic a
relational database optimizer.

[Figure: example data flow over lineitem and supplier with MAP (filter), MAP (project), MATCH (join), REDUCE (aggregate), and output operators, annotated with Read Set and Write Set over fields 0-8, Out-Card Bounds such as [0,1], and physical strategies (Local Forward, Partition, Pipeline, Hybrid-Hash, Sort, Part-Sort, Combine)]

UDF Code Analysis
Static Code Analysis Framework provides:
• Control-Flow, Def-Use, Use-Def lists
Prerequisites:
• Fixed API to access records
Extracted Information:
• Field sets track read and write accesses on records
• Upper and lower output cardinality bounds
Safety:
• All record access instructions are detected
• Supersets of actual Read/Write sets are returned
• Supersets allow fewer but always safe transformations
Details in [HPS+12] and [HKT12]

Example UDFs in three-address code:

filter
    r0 = this
    r1 = @parameter0 // Record
    r2 = @parameter1 // Collector
    $r5 = r1.getField(8)
    $r6 = r0.date_lb
    $i0 = $r5.compareTo($r6)
    if $i0 < 0 goto 1
    $r9 = r1.getField(8)
    $r10 = r0.date_ub
    $i1 = $r9.compareTo($r10)
    if $i1 >= 0 goto 1
    r2.collect(r1)
    1: return

project
    r0 = this
    r1 = @parameter0 // Record
    r2 = @parameter1 // Collector
    $d0 = r1.getField(6)
    r0.extendedprice = $d0
    $d1 = r1.getField(7)
    r0.discount = $d1
    $r7 = r0.revenue // PactRecord
    $d2 = r0.extendedprice
    $d3 = r0.discount
    $d4 = 0 - $d3
    $d5 = $d2 * $d4
    $r7.setValue($d5)
    r1.setNull(6)
    r1.setNull(7)
    r1.setNull(8)
    $r8 = r0.revenue
    r1.setField(4, $r8)
    r2.collect(r1)

aggregate
    r0 = this
    r1 = @parameter0 // Iterator
    r2 = @parameter1 // Collector
    r3 = r1.next()
    d0 = r3.getField(4)
    goto 2
    1: r3 = r1.next()
    $d1 = r3.getField(4)
    d0 = d0 + $d1
    2: $z0 = r1.hasNext()
    if $z0 != 0 goto 1
    r3.setField(4, d0)
    r2.collect(r3)

Data Flow Transformations
Reorder Conditions:
1. No Write-Read / Write-Write conflicts on record fields
• Similar to conflict detection in optimistic concurrency control
2. Preservation of groups for grouping operators
• Groups must remain unchanged or be completely removed
Enumeration Algorithm:
• Descends data flow recursively top-down
• Checks reorder conditions and switches successive operators
Supported Transformations:
• Filter push-down
• Join reordering
• Invariant group transformations
• Non-relational operators are integrated
Details in [HPS+12]

Physical Optimization
Interesting Properties:
• Sorting, Grouping, Partitioning
• Property preservation reasoning with write sets
Execution Plan Selection:
• Chooses execution strategies for 2nd-order functions
• Chooses shipping strategies to distribute data
• Strategies known from parallel databases
Cost-based Plan Selection:
• Exploits UDF annotations for size estimates
• Cost model combines network, disk I/O and CPU costs
Details in [BEH+10]

Parallel Execution
Execution Engine:
• Massively parallel execution of DAG-structured data flows
• Sequential processing tasks
[WK09] Warneke, Kao,
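The first reorder condition on this slide (no Write-Read or Write-Write conflicts on record fields) can be illustrated with a small check over read and write sets. This is a sketch under stated assumptions, not the optimizer's actual code: the `Udf` case class, the `canReorder` function, and the field numbers in the usage are hypothetical, and the second condition (preservation of groups) is not modeled.

```scala
object ReorderSketch {
  // Hypothetical operator summary: the field positions a UDF reads and writes,
  // as a static code analysis might report them (possibly as supersets).
  final case class Udf(name: String, readSet: Set[Int], writeSet: Set[Int])

  // Two successive record-at-a-time operators may be switched only if neither
  // writes a field that the other reads or writes.
  def canReorder(a: Udf, b: Udf): Boolean = {
    val writeRead = (a.writeSet & b.readSet).nonEmpty || (b.writeSet & a.readSet).nonEmpty
    val writeWrite = (a.writeSet & b.writeSet).nonEmpty
    !writeRead && !writeWrite
  }
}
```

Because the analysis returns supersets of the real read and write sets, a check like this can only reject a reordering that would have been legal; it never admits an unsafe one. For instance, a filter that reads field 8 cannot be switched with an operator that nulls field 8, while two filters that only read disjoint fields can.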
Step 3:
Execute

                     MapReduce                      Impala, ...          Stratosphere
Text                 ✔                              ✔                    ✔
Aggregation          ✔                              ✔                    ✔
ETL                  ✔                              ✔                    ✔
SQL                  Hive is too slow               ✔                    ✔
Advanced analytics   Mahout is slow and low level   Madlib is too slow   ✔

[Figure labels: one-pass dataflow, many-pass dataflow, map, reduce]

A fast, massively parallel
database-inspired backend.
Truly scales to disk-resident large data sets.
Built-in support for iterative
programs: predictive and
advanced analytics (machine
learning, graph processing,
stats) are all iterative.
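To make the "many-pass" access pattern concrete, here is a small graph example in plain Scala: connected components computed by repeatedly propagating the smallest vertex id until nothing changes. It is an illustration only and does not use the Stratosphere API; the object and function names are assumptions, and every edge endpoint is assumed to appear in `vertices`.

```scala
object IterationSketch {
  // Repeatedly propagate the smallest vertex id along edges until a fixpoint is reached.
  def connectedComponents(vertices: Set[Int], edges: Set[(Int, Int)]): Map[Int, Int] = {
    val undirected = edges ++ edges.map(_.swap)
    var labels: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // One pass: each vertex takes the minimum label among itself and its neighbors.
      val next = labels.map { case (v, label) =>
        val neighborLabels = undirected.filter(_._1 == v).map(e => labels(e._2))
        v -> (neighborLabels + label).min
      }
      changed = next != labels
      labels = next
    }
    labels
  }
}
```

The number of passes depends on the data (here, on how far a label has to travel), which is why a one-pass dataflow engine has to re-launch a job per pass while an engine with built-in iteration can keep the loop inside a single program.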

Stratosphere is an award-winning open-source platform:
15 man-years of R&D, 150k LOC, 3 million € behind it.

IBM Faculty
Award

HP Open
Innovation
Award

Stratosphere is the only Hadoop-compatible next-generation
big data analytics platform developed in
Europe that you can download and use right now.
Stratosphere
Sky in Java

Sky in Scala

Monitoring tools, e.g., Hue

Visualization and reporting tools, e.g., Datameer

Other HLLs:
SQL, R, ...

Compiler and optimizer

Hadoop
MapReduce,
Impala, ...

Runtime engine
Hadoop storage and cluster management: HDFS, YARN
www.stratosphere.eu/downloads

www.stratosphere.eu/quickstart

val input = TextFile(textInput)
val words = input
  .flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts
  .write(wordsOutput,
         RecordDataSinkFormat())
val plan = new ScalaPlan(Seq(output))

Help us shape the future of
Big Data and the
Stratosphere platform!
We are looking for contributions and pilot customers:
github.com/stratosphere/stratosphere/wiki/Starter-Jobs
Try out Stratosphere and give us feedback
Work with us to implement your use case

Visit www.stratosphere.eu and www.github.com/stratosphere
Contact kostas.tzoumas@tu-berlin.de
Tweet #StratoSummit

Dr. Kostas Tzoumas: Ask "What," not "How" at Stratosphere Summit (Nov. 14-15, 2013)

  • 2. Data is an important asset: video & audio streams, sensor data, RFID, GPS, user online behavior, scientific simulations, web archives, ... Volume: handle petabytes of data. Velocity: handle high data arrival rates. Variety: handle many heterogeneous data sources. Veracity: handle inherent uncertainty of data.
  • 4. Four “I”s for Big Analysis: text mining, interactive and ad hoc analysis, machine learning, graph analysis, statistical algorithms. Iterative: model the data, do not just describe it. Incremental: maintain the model under high arrival rates. Interactive: step-by-step data exploration on very large data. Integrative: fluent unified interfaces for different data models.
  • 5. MapReduce and Hadoop (word-count figure). Input lines: “Romeo, Romeo, wherefore art thou Romeo?” and “What, art thou hurt?”. Map emits one pair per word: (Romeo, 1), (Romeo, 1), (wherefore, 1), (art, 1), (thou, 1), (Romeo, 1) and (What, 1), (art, 1), (thou, 1), (hurt, 1). Data shuffled over network and grouped by key: (Romeo, (1,1,1)), (art, (1,1)), (thou, (1,1)), (wherefore, 1), (What, 1), (hurt, 1). Reduce sums each group: (Romeo, 3), (art, 2), (thou, 2), (wherefore, 1), (What, 1), (hurt, 1). Data written to disk.
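The word-count flow on slide 5 can be sketched on plain Scala collections. This is only an illustration of the map, shuffle and reduce phases: `mapPhase`, `shufflePhase` and `reducePhase` are hypothetical helper names, not part of the Hadoop or Stratosphere APIs, and nothing here is distributed or written to disk.

```scala
// Word count in the MapReduce style on in-memory collections.
// Helper names are made up for illustration; this is not the Hadoop API.

// Map phase: emit one (word, 1) pair per word in a line.
def mapPhase(line: String): Seq[(String, Int)] =
  line.split("[^A-Za-z]+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

// Shuffle phase: bring all values for the same key together.
def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2)) }

// Reduce phase: sum the values of each group.
def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
  grouped.map { case (word, ones) => (word, ones.sum) }

val lines = Seq("Romeo, Romeo, wherefore art thou Romeo?", "What, art thou hurt?")
val counts: Map[String, Int] = reducePhase(shufflePhase(lines.flatMap(mapPhase)))
```

With the two lines from the slide, `counts` holds the same six totals as the figure, for example `Romeo -> 3` and `art -> 2`.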
  • 6. SQL analytics with Hadoop (figure: repeated Map and Reduce stages). Pitfalls: lacking in declarativity; HDFS-based data exchange; sort is the only grouping operator; the Hadoop engine is tailored to simple aggregations.
  • 8. Advanced Analytics: analytics that model the data to reveal hidden relationships, not just describe the data, e.g., machine learning, statistics, graph analysis. Increasingly important from a market perspective. Very different from SQL analytics: different languages and access patterns (iterative vs. one-pass programs). The Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
  • 9. Use cases in all verticals. Media and Communications, example: risk management, analytics on phone call logs, sentiment analysis, clickstream and call analysis. Manufacturing, example: data-driven quality control and assurance, demand forecasting, sales and operation planning, process optimization. Travel and tourism, example: improve personalized customer experience in hotels, estimate no-shows in flights, route planning. Retail, example: improve campaign ROI by optimizing advertising channels, market basket analysis, fraud detection, social trend analysis, product recommendation. Social and e-commerce, example: targeted customer experience, explore new business models, real-time recommendations, social graph analysis, game analytics.
  • 10. Big data lives in Hadoop. Hadoop clusters offer very low effective storage cost and are becoming a data vortex, attracting cross-departmental data. Companies want to perform advanced and predictive analytics to maximize the ROI of their data assets by modeling the data, not just describing it. How do we bring advanced analytics to the world of big data?
  • 11. What, not how. Recipe for success: declarativity. The user specifies what information to extract out of the data, not how the system extracts the information. This is what relational databases pioneered in the 70s, resulting in a vibrant research community and a billion-dollar industry. Big data consumers now: systems programming experts. Big data consumers in the future: people with data analysis skills.
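To make the "what, not how" contrast on slide 11 concrete, here is a small sketch that computes the same per-customer total twice. The `Account` type, the sample rows and both function names are assumptions made for this example; they do not come from the deck.

```scala
// Made-up record type and rows, used only for illustration.
case class Account(customerId: String, debit: Double)

val accounts = Seq(Account("c1", 10.0), Account("c2", 5.0), Account("c1", 2.5))

// "How": the caller spells out the iteration, the intermediate state and its updates.
def totalsHow(rows: Seq[Account]): Map[String, Double] = {
  val totals = scala.collection.mutable.Map.empty[String, Double]
  for (row <- rows) {
    totals(row.customerId) = totals.getOrElse(row.customerId, 0.0) + row.debit
  }
  totals.toMap
}

// "What": the caller names the grouping key and the aggregate,
// and leaves the evaluation strategy to the library.
def totalsWhat(rows: Seq[Account]): Map[String, Double] =
  rows.groupBy(_.customerId).map { case (id, group) => (id, group.map(_.debit).sum) }
```

Both functions return the same map. The second form leaves room for a system to pick the grouping strategy, which is the property the slide argues for.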
  • 12. Desiderata for next-gen big data platforms: Usability. 10 million Excel users, 3 million R users, 70,000 Hadoop users. “the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”
  • 13. Desiderata for next-gen big data platforms: Performance. (Bar chart: Stratosphere vs. Hadoop on an axis from 0 to 700; units and bar values are not recoverable.) A performance difference from days to minutes enables real-time decision making and widespread use of data within the organization.
  • 14. How to lift declarativity from the closed world of relational algebra to the open world of advanced analytics.
  • 15. Step 1: Specify. Unify data and programming models in a declarative abstraction: SQL for extracting enterprise data from databases, general-purpose programming for feature extraction and normalization, statistical libraries for advanced analysis. Code on the slide:
      // get the customers with their debit
      val debits: (String, Double) = sql(
        "SELECT customerId, debit FROM customer_accounts;")
      // get the number of warned invoices in the last
      // 12 and 6 months
      val warnings: (String, Int, Int) = sql(
        "SELECT R12.customerId, R12.cnt, R6.cnt
           FROM (…) R12 LEFT OUTER JOIN (…) R6
             ON (R6.customerId = R12.customerId);")
      // number of contracts a customer has
      val numContracts: (String, Int) = sql(
        "SELECT customerId, numContracts FROM customers;")
      // join the data into one data point
      case class DataPoint(x: Vector, y: Double)
      val dataPoints = numContracts
        join warnings
        where {_._1} isEqualTo {_._1}
        join debits
        where {_._1} isEqualTo {_._1}
        map { (x, y, z) => DataPoint(Vector(x._2, y._2, y._3),
                                     if (z._2 > X) 1 else 0) }
      // run regression with dimensionality 3 for 40 iterations
      val weights: Vector = logRegression(3, dataPoints, 40)
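The program on slide 15 depends on the Stratosphere Scala front end, which the transcript does not show. As a rough, self-contained sketch of the same pipeline (join three inputs on `customerId`, build feature vectors, fit a logistic regression), the version below uses in-memory collections instead. The sample rows, `debitThreshold` (standing in for the slide's unspecified `X`), the learning rate and this gradient-descent `logRegression` are all assumptions; the deck does not say how its `logRegression` is implemented.

```scala
// In-memory stand-ins for the three SQL results on the slide (made-up rows).
val debits: Seq[(String, Double)] =
  Seq(("c1", 900.0), ("c2", 20.0), ("c3", 450.0), ("c4", 5.0))
val warnings: Seq[(String, Int, Int)] =
  Seq(("c1", 6, 4), ("c2", 0, 0), ("c3", 3, 2), ("c4", 1, 0))
val numContracts: Seq[(String, Int)] =
  Seq(("c1", 1), ("c2", 3), ("c3", 2), ("c4", 4))

case class DataPoint(x: Vector[Double], y: Double)

// Hypothetical value for the slide's unspecified threshold X.
val debitThreshold = 100.0

// Equi-join on customerId, then build one labeled data point per customer.
val warningsById = warnings.map(w => w._1 -> w).toMap
val debitsById = debits.toMap
val dataPoints: Seq[DataPoint] = numContracts.flatMap { case (id, contracts) =>
  for {
    w <- warningsById.get(id)
    debit <- debitsById.get(id)
  } yield DataPoint(
    Vector(contracts.toDouble, w._2.toDouble, w._3.toDouble),
    if (debit > debitThreshold) 1.0 else 0.0)
}

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def dot(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (p, q) => p * q }.sum

// Plain batch gradient descent on the logistic loss (an assumed implementation).
def logRegression(dim: Int, points: Seq[DataPoint], iterations: Int,
                  learningRate: Double = 0.1): Vector[Double] = {
  var weights = Vector.fill(dim)(0.0)
  for (_ <- 1 to iterations) {
    val gradient = points
      .map(p => p.x.map(_ * (sigmoid(dot(weights, p.x)) - p.y)))
      .foldLeft(Vector.fill(dim)(0.0))((acc, g) => acc.zip(g).map { case (a, b) => a + b })
    weights = weights.zip(gradient).map { case (w, g) => w - learningRate * g / points.size }
  }
  weights
}

// Same call shape as on the slide: dimensionality 3, 40 iterations.
val weights: Vector[Double] = logRegression(3, dataPoints, 40)
```

Note that `Vector` here is Scala's immutable collection, used as a simple feature vector; the slide's `Vector` is presumably a numeric type from the platform's own library.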
  • 16. First step for declarative analytics. Scala: a functional and object-oriented JVM language, an excellent basis for domain-specific language development. Coolest kid on the block ☺ Feels like a scripting language, but is not restricted to a fixed data model like Pig, Hive, etc. Scala’s extensible compiler architecture is a good match for implementing optimizers.
  • 17. Step 2: Optimize. Query optimizers: the enabling technology for SQL data warehousing and BI, and a successful industrial application of artificial intelligence. Data characteristics change (figure: (a) Complex Plan Diagram, (b) Reduced Plan Diagram): each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements. Currently, no other system can optimize non-relational data analysis programs.
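The point of slide 17, that equivalent programs differ in cost as the data changes, can be illustrated with a toy cost-based choice between two join strategies. Everything here is an assumption made for the example: the `JoinPlan` names, the network-only cost model and the sizes. It is not the Stratosphere optimizer or its cost model.

```scala
// Two ways to run the same distributed equi-join (hypothetical plan names).
sealed trait JoinPlan
case object BroadcastSmaller extends JoinPlan // ship the smaller input to every node
case object RepartitionBoth extends JoinPlan  // hash-partition both inputs on the join key

// Toy cost model: bytes sent over the network only.
def networkCost(plan: JoinPlan, leftBytes: Long, rightBytes: Long, nodes: Int): Long =
  plan match {
    case BroadcastSmaller => math.min(leftBytes, rightBytes) * nodes
    case RepartitionBoth  => leftBytes + rightBytes
  }

// Pick the cheaper plan for the given input sizes and cluster size.
def choosePlan(leftBytes: Long, rightBytes: Long, nodes: Int): JoinPlan =
  Seq[JoinPlan](BroadcastSmaller, RepartitionBoth)
    .minBy(networkCost(_, leftBytes, rightBytes, nodes))
```

For instance, `choosePlan(1000000L, 100L, 10)` selects `BroadcastSmaller`, while `choosePlan(1000000L, 500000L, 10)` selects `RepartitionBoth`: the result of the join is the same, but the preferred plan flips with the input sizes.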
  • 18. Use a combination of compiler and database technology to lift optimization beyond relational algebra. Derive properties of user-defined functions via code analysis and use these to mimic a relational database optimizer. (Poster figure: a data flow over the lineitem and supplier inputs with MAP filter, MAP project, MATCH join, REDUCE aggregate and output operators; three-address code listings of the UDFs annotated with read sets, write sets and output-cardinality bounds such as [0,1]; alternative execution plans annotated with Partition, Local Forward, Pipeline, Hybrid-Hash, Sort, Part-Sort and COMBINE.)
      UDF Code Analysis. Prerequisites: a fixed API to access records; a static code analysis framework provides control-flow, def-use and use-def lists. Extracted information: field sets track read and write accesses on records; upper and lower output cardinality bounds. Safety: all record access instructions are detected; supersets of the actual read/write sets are returned; supersets allow fewer but always safe transformations. Details in [HPS+12].
      Data Flow Transformations. Reorder conditions: 1. No write-read / write-write conflicts on record fields (similar to conflict detection in optimistic concurrency control). 2. Preservation of groups for grouping operators (groups must remain unchanged or be completely removed). Enumeration algorithm: descends the data flow recursively top-down; checks reorder conditions and switches successive operators. Supported transformations: filter push-down; join reordering; invariant group transformations; non-relational operators are integrated. Details in [HPS+12] and [HKT12].
      Physical Optimization. Interesting properties: sorting, grouping, partitioning; property preservation reasoning with write sets. Execution plan selection: chooses execution strategies for 2nd-order functions; chooses shipping strategies to distribute data; strategies known from parallel databases. Cost-based plan selection: exploits UDF annotations for size estimates; the cost model combines network, disk I/O and CPU costs. Details in [BEH+10].
      Parallel Execution. Execution engine: massively parallel execution of DAG-structured data flows; sequential processing tasks. Cited: [WK09] (Warneke, Kao).
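The first reorder condition on slide 18 (no write-read or write-write conflicts on record fields) can be sketched as a check on read and write sets. `UdfProps`, `canReorder` and the three sample operators are hypothetical; the field numbers are loosely modeled on the filter and project listings visible on the poster. The second condition (group preservation) is not covered here.

```scala
// Field indices a UDF reads and writes, as a static analysis might report them.
// Supersets of the true sets are safe: they can only forbid more reorderings.
case class UdfProps(name: String, readSet: Set[Int], writeSet: Set[Int])

// Reorder condition 1: no write-read and no write-write conflicts on record fields.
def canReorder(a: UdfProps, b: UdfProps): Boolean = {
  val writeRead = (a.writeSet & b.readSet).nonEmpty || (b.writeSet & a.readSet).nonEmpty
  val writeWrite = (a.writeSet & b.writeSet).nonEmpty
  !writeRead && !writeWrite
}

// Made-up operators over a 9-field record (fields 0 to 8).
val dateFilter = UdfProps("dateFilter", readSet = Set(8), writeSet = Set.empty)
val revenueProject = UdfProps("revenueProject", readSet = Set(6, 7), writeSet = Set(4, 6, 7, 8))
val keyFilter = UdfProps("keyFilter", readSet = Set(0), writeSet = Set.empty)
```

Under these assumed sets, `dateFilter` cannot be moved past `revenueProject` because the projection overwrites field 8, which the filter reads, whereas `keyFilter` touches only field 0 and can be reordered freely.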
  • 19. Step 3: Execute. A fast, massively parallel database-inspired backend. Truly scales to disk-resident large data sets. Built-in support for iterative programs: predictive and advanced analytics (machine learning, graph processing, stats) are all iterative. The slide contrasts one-pass dataflow with many-pass dataflow and compares three systems:
      Workload              MapReduce                       Impala, ...           Stratosphere
      Text                  ✔                               ✔                     ✔
      Aggregation           ✔                               ✔                     ✔
      ETL                   ✔                               ✔                     ✔
      SQL                   Hive is too slow                ✔                     ✔
      Advanced analytics    Mahout is slow and low level    Madlib is too slow    ✔
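As a small illustration of the many-pass (iterative) programs slide 19 refers to, the sketch below finds connected components by repeating one dataflow pass until the result stops changing. The graph, `step` and `iterate` are made up for this example and run on in-memory collections; they do not use Stratosphere's iteration support.

```scala
// Undirected edges of a small made-up graph.
val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (4, 5))
val vertices: Set[Int] = edges.flatMap { case (a, b) => Seq(a, b) }.toSet

// One pass: every vertex adopts the smallest label among itself and its neighbours.
def step(labels: Map[Int, Int]): Map[Int, Int] = {
  val candidates = edges.flatMap { case (a, b) => Seq(a -> labels(b), b -> labels(a)) }
  candidates.foldLeft(labels) { case (acc, (v, label)) =>
    if (label < acc(v)) acc.updated(v, label) else acc
  }
}

// Repeat passes until nothing changes; returns the final labels and
// the number of passes that changed something.
def iterate(labels: Map[Int, Int], passes: Int = 0): (Map[Int, Int], Int) = {
  val next = step(labels)
  if (next == labels) (labels, passes) else iterate(next, passes + 1)
}

// Start with every vertex labeled by its own id.
val (components, passes) = iterate(vertices.map(v => v -> v).toMap)
```

Vertices that end up with the same label belong to the same component. A one-pass engine would have to launch a separate job per pass and exchange the labels through storage in between, which is the overhead the slide's "many-pass dataflow" column is about.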
  • 20. Stratosphere is an award-winning open-source platform: 15 man-years of R&D, 150k LOC, 3 million € behind it. Awards: IBM Faculty Award, HP Open Innovation Award. Stratosphere is the only Hadoop-compatible next-generation big data analytics platform developed in Europe that you can download and use right now.
  • 21. Stratosphere architecture (stack diagram). Front ends: Sky in Java, Sky in Scala, other HLLs (SQL, R, ...). Tools: monitoring tools, e.g., Hue; visualization and reporting tools, e.g., Datameer. Stratosphere: compiler and optimizer, runtime engine. Also shown: Hadoop MapReduce, Impala, ... Hadoop storage and cluster management: HDFS, YARN.
  • 25. Help us shape the future of Big Data and the Stratosphere platform! We are looking for contributions and pilot customers: github.com/stratosphere/stratosphere/wiki/Starter-Jobs. Try out Stratosphere and give us feedback. Work with us to implement your use case. Visit: www.stratosphere.eu, www.github.com/stratosphere. Contact: kostas.tzoumas@tu-berlin.de. Tweet: #StratoSummit.