Tachyon and Apache Spark


Meet Tachyon: an exciting new project trying to unify in-memory computing and provide a universal data-exchange layer for Hadoop ecosystem projects.


Tachyon and Apache Spark:
heralds of the in-memory computing era.
Roman Shaposhnik
Director of Open Source @Pivotal
(Twitter: @rhatr)

20 minutes to figure out
Hadoop vs. Spark

20 minutes to figure out
Hadoop++ == Spark

Long, long time ago…
Diagram labels: HDFS, MapReduce
Legend: ASF Projects, FLOSS Projects, Pivotal Products

In a blink of an eye
Diagram labels: MLlib, Shark, GraphX, Streaming, HDFS, Crunch, Mahout, Pig, Sqoop, Flume, Coordination and workflow management, Zookeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Spark, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN, Tachyon
Legend: ASF Projects, FLOSS Projects, Pivotal Products

A Spark view?
Diagram labels: HDFS, MLlib, Shark, YARN, GraphX, Streaming, Tachyon, Sqoop, Flume, Hadoop UI, Hue, Coordination and workflow management, Zookeeper, Command Center, GemFire XD, Oozie, SolrCloud, Phoenix, HBase, Spark, SpringXD
Legend: ASF Projects, FLOSS Projects, Pivotal Products

Hadoop’s view
MapReduce
server 1
server N
HDFS

Anatomy of MapReduce
Stages: HDFS, mappers, reducers, HDFS
Input lines: a b c | d a c
Mapper output: a 1, b 1, c 1 | a 1, c 1, a 1
Reducer input: a 1 1 1 | b 1 | c 1 1
Output: a 3, b 1, c 2
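
The map, shuffle and reduce stages on this slide can be sketched on ordinary Scala collections. This is only an illustration of the dataflow, not Hadoop's API: the object and function names are made up for the example, and the two-line input is hypothetical, chosen so that the totals match the a 3, b 1, c 2 output shown above.

```scala
object WordCountSketch {
  // Map phase: each input line is turned into (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: group the emitted counts by word.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the grouped counts for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every line, shuffle by key, then reduce.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    // Hypothetical input (an assumption, see the note above).
    val counts = wordCount(Seq("a b c", "a a c"))
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w $n") }
  }
}
```

In a real MapReduce job both the input and the final counts live in HDFS, and the intermediate pairs are written to disk between stages, which is the cost the next slides point at.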

What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)

This looks familiar…
$ grep -R | awk | sort …

Spark innovations
• Resilient Distributed Datasets (RDDs)
  • Distributed on a cluster
  • Manipulated via parallel operators (map, etc.)
  • Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps

RDDs
warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))
Lineage: HadoopRDD (path = hdfs://) -> FilteredRDD (contains…) -> MappedRDD (split…)
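
The idea that an RDD remembers how it was derived, rather than only holding data, can be sketched without Spark. `ToyRDD` below is a hypothetical stand-in written for this example and is not Spark's class: it records a lineage label for each step and runs nothing until `collect()` is called, so a lost result can be recomputed by replaying the chain. The sample log lines are invented.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it stores how to compute its data, not the data itself.
  final class ToyRDD[A](val lineage: List[String], compute: () => Seq[A]) {
    // Each transformation returns a new ToyRDD with one more lineage entry.
    def filter(p: A => Boolean, label: String): ToyRDD[A] =
      new ToyRDD[A](lineage :+ s"Filtered($label)", () => compute().filter(p))

    def map[B](f: A => B, label: String): ToyRDD[B] =
      new ToyRDD[B](lineage :+ s"Mapped($label)", () => compute().map(f))

    // Nothing runs until collect(); calling it again replays the whole chain.
    def collect(): Seq[A] = compute()
  }

  def fromLines(lines: Seq[String]): ToyRDD[String] =
    new ToyRDD[String](List("Source"), () => lines)

  // Same shape as the slide's pipeline: keep warnings, take the second field.
  def warnings(lines: Seq[String]): ToyRDD[String] =
    fromLines(lines)
      .filter(_.contains("warning"), "contains")
      .map(_.split(' ')(1), "split")

  def main(args: Array[String]): Unit = {
    // Invented sample data for illustration only.
    val w = warnings(Seq("warning disk full", "info all good", "warning net down"))
    println(w.lineage.mkString(" -> "))
    println(w.collect().mkString(", "))
  }
}
```

Spark tracks the lineage per partition, so only the lost partitions need to be recomputed; the toy version replays everything.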

Parallel operators
• map, reduce
• sample, filter
• groupBy, reduceByKey
• join, leftOuterJoin, rightOuterJoin
• union, cross
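
For readers who have not used the keyed operators, here is a rough sketch of what `reduceByKey`, `join` and `leftOuterJoin` compute, written against local Scala sequences of pairs. The helper functions are written for this example and are not Spark's implementations; Spark applies the same semantics to partitioned data across a cluster.

```scala
object OperatorSketch {
  // reduceByKey: merge all values that share a key with a binary function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // join: keep keys present on both sides, pairing every matching value.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  // leftOuterJoin: keep every left record; the right side becomes an Option.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)],
                             right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.filter(_._1 == k).map(_._2)
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Some(w): Option[W])))
    }

  def main(args: Array[String]): Unit = {
    // Invented sample data for illustration only.
    val hits  = Seq(("a", 1), ("b", 1), ("a", 1))
    val names = Seq(("a", "alpha"))
    println(reduceByKey(hits)(_ + _))
    println(join(hits, names))
    println(leftOuterJoin(hits, names))
  }
}
```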

What is really happening?
Diagram labels: MLlib, Shark, GraphX, Streaming, HDFS, Crunch, Mahout, Pig, Sqoop, Flume, Coordination and workflow management, Zookeeper, Command Center, GemFire XD, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Spark, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN, Tachyon
Legend: ASF Projects, FLOSS Projects, Pivotal Products

But HDFS/YARN are safe?
HDFS, Ceph, S3, NAS, etc.
New
HDFS
New
YARN

Tachyon
• In-memory data-exchange layer
• A set of evolving APIs:
  • filesystem
  • caching
  • RDDs
  • Materialized views

It will be called Hadoop
Diagram labels: MLlib, Shark, GraphX, Streaming, HDFS, Crunch, Mahout, Pig, Sqoop, Flume, Coordination and workflow management, Zookeeper, Command Center, GemFire with Tachyon, Oozie, MapReduce, Hive, Tez, Giraph, Hadoop UI, Hue, SolrCloud, Phoenix, HBase, Spark, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR, YARN
Legend: ASF Projects, FLOSS Projects, Pivotal Products

Spark/Tachyon recap
• Is it “Big Data”? (Yes)
• Is it “Hadoop”? (No)
• It’s one of those “in memory” things, right? (Yes)
• JVM, Java, Scala? (All)
• Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)

Tachyon and Apache Spark

1. Tachyon and Apache Spark: heralds of the in-memory computing era. Roman Shaposhnik Director of Open Source @Pivotal (Twitter: @rhatr)

2. Who’s this guy? • Director of Open Source @Pivotal • Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc) • Used to be root@Cloudera • Used to be PHB@Yahoo! (original Hadoop team)

3. Dearly beloved…

4. 20 minutes to figure out Hadoop vs. Spark

5. 20 minutes to figure out Hadoop++ == Spark

6. 20 minutes to figure out Hadoop + Spark

7. But wait! There’s more! Tachyon

8. Long, long time ago… HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce

9. In a blink of an eye MLlib Shark GraphX Streaming HDFS Crunch Mahout Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Spark Impala HAWQ SpringXD MADlib Hamster PivotalR YARN Tachyon

10. A Spark view? HDFS MLlib Shark YARN GraphX Streaming Tachyon Sqoop Flume Hadoop UI Hue Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie SolrCloud Phoenix HBase Spark SpringXD

11. BDAS

12. Long, long time ago…

13. This is 2014

14. What changed?

15. Your datacenter … server 1 server N

16. Hadoop’s view MapReduce server 1 server N HDFS

17. HDFS: decoupled storage … MR HDFS MR

18.

19. Anatomy of MapReduce: HDFS, mappers, reducers, HDFS. Input lines: a b c | d a c. Mapper output: a 1, b 1, c 1 | a 1, c 1, a 1. Reducer input: a 1 1 1 | b 1 | c 1 1. Output: a 3, b 1, c 2.

20. What’s wrong with MR? Source: UC Berkeley Spark project (just the image)

21. This looks familiar… $ grep -R | awk | sort …

22. Spark innovations • Resilient Distributed Datasets (RDDs) • Distributed on a cluster • Manipulated via parallel operators (map, etc.) • Automatically rebuilt on failure • A parallel ecosystem • A solution to iterative and multi-stage apps

23. RDDs warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1)) Lineage: HadoopRDD (path = hdfs://) -> FilteredRDD (contains…) -> MappedRDD (split…)

24. Parallel operators • map, reduce • sample, filter • groupBy, reduceByKey • join, leftOuterJoin, rightOuterJoin • union, cross

25. What is really happening? MLlib Shark GraphX Streaming HDFS Crunch Mahout Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Spark Impala HAWQ SpringXD MADlib Hamster PivotalR YARN Tachyon

26. Maybe it’s not so bad server 1 server N

27. But HDFS/YARN are safe? HDFS, Ceph, S3, NAS, etc. New HDFS New YARN

28. Tachyon • In-memory data-exchange layer • A set of evolving APIs: • filesystem • caching • RDDs • Materialized views

29. Tachyon

30. Spark is best for cloud

31. It will be called Hadoop MLlib Shark GraphX Streaming HDFS Crunch Mahout Pig Sqoop Flume Coordination and workflow management Zookeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire with Tachyon Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Spark Impala HAWQ SpringXD MADlib Hamster PivotalR YARN

32. Spark/Tachyon recap • Is it “Big Data”? (Yes) • Is it “Hadoop”? (No) • It’s one of those “in memory” things, right? (Yes) • JVM, Java, Scala? (All) • Is it real or just another shiny technology with a long, but ultimately small tail? (Yes and ?)

33. A NEW PLATFORM FOR A NEW ERA

34. Questions ?