AMP Camp 5 Intro
Welcome and 
AMPLab Overview 
Michael Franklin 
November 20, 2014 
UC BERKELEY
AMPLab Overview 
Project Launched Jan 2011, 6 Yr Planned Duration 
Personnel: ~65 Students, Postdocs, Faculty and Staff 
Funding: Government/Industry Partnership 
NSF Expedition Award, DARPA XData, DoE, 20+ Companies 
Key Outputs: 
BDAS Open Source Stack & Apps (including Apache Spark) 
Publications: Top Venues in ML, Systems, Databases and Others 
“… the University of California, Berkeley’s AMPLab has already left an indelible mark on the world of information technology, and even the web. But we haven’t yet experienced the full impact of the group, … Not even close.” 
-- Derrick Harris, GigaOm, August 2014
The AMPLab Faculty 
Michael Franklin (Databases) 
Michael Jordan (Machine Learning) 
Ion Stoica (Systems) 
Dave Patterson (Systems) 
Scott Shenker (Networks) 
Alex Bayen (Mobile Sensing) 
David Culler (Systems/Sensing) 
Ken Goldberg (Crowdsourcing) 
Anthony Joseph (Security) 
Randy Katz (Systems) 
Michael Mahoney (ML) 
Ben Recht (Machine Learning) 
Raluca Popa (Systems/Security), joining in Summer 2015
Industrial Engagement 
• Industrial-Strength Open Source Software 
• Used by Sponsors, Start-ups and many others 
• Regular interactions with top industry technologists: twice-yearly 3-day offsite retreats, AMPCamp training, some site visits
AMP: Integrating 3 Key Resources 
Algorithms 
• Machine Learning, Statistical Methods 
• Prediction, Business Intelligence 
Machines 
• Clusters and Clouds 
• Warehouse Scale Computing 
People 
• Crowdsourcing, Human Computation 
• Data Scientists, Analysts
Our View of the Big Data Challenge 
[Figure: Massive, Diverse and Growing Data; tradeoff among Time, Money and Answer Quality] 
Step I: Improve efficiency (e.g., Spark, Tachyon) 
Step II: Enable intelligent tradeoffs (e.g., BlinkDB, SampleClean)
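The second step above, trading answer quality against time, can be illustrated with a small sampling sketch. This is only a toy under stated assumptions, not how BlinkDB or SampleClean are implemented: the function name `approximate_mean`, the uniform sampling, and the normal-approximation error bar are all choices made for this illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is an approximate
    95% confidence half-width from the normal approximation.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean (no finite-population correction).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_err


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.uniform(0, 100) for _ in range(100_000)]
    # Larger samples cost more time but give tighter error bars.
    for n in (100, 1_000, 10_000):
        est, hw = approximate_mean(data, n)
        print(f"sample={n:>6}  mean ~ {est:.2f} +/- {hw:.2f}")
```

A caller picks the sample size to match the time or money it can spend and reads the half-width as the price paid in answer quality.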
The Research Challenge 
Integration + Extreme Elasticity + Tradeoffs + More Sophisticated Analytics 
= Extreme Complexity
Arc of our Research Program 
Early work on Foundations (Yrs 1-2): 
Algorithms – Bag of Little Bootstraps 
Machines – Mesos and Spark 
People – CrowdDB Prototype 
Filling out the Analytics Stack (Yrs 3-4): <you are here> 
Algorithms – ML Pipelines, Async Algorithms, Concurrency Control 
Machines – Tachyon, SQL, Graphs, Streams, R, Performance 
People – Hybrid Human/Machine Data Cleaning/Integration 
Moving Up the Stack/Expanding the Footprint (Yrs 5-6): 
Algorithms – MLlib build out, Declarative ML (MLBase) 
Machines – New Storage/Processing Archs, Data/Model
Big Data Ecosystem Evolution 
General batch processing: MapReduce 
Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
AMPLab Unification Philosophy 
Don’t specialize MapReduce – Generalize it! 
Two additions to Hadoop MR can enable all the models shown earlier! 
1. General Task DAGs 
2. Data Sharing 
For Users: 
Fewer Systems to Use 
Less Data Movement 
[Figure: Spark with Streaming, GraphX, SparkSQL, MLbase, …]
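The two additions named on this slide, general task DAGs and data sharing, can be sketched with a toy in plain Python. This is not Spark's API or implementation: the `Dataset` class, `from_list`, and the `evaluations` counter are hypothetical names invented here to show a lazily evaluated DAG whose shared intermediate result is computed once and reused.

```python
class Dataset:
    """A node in a task DAG: a lazily evaluated, optionally cached list.

    Hypothetical toy class for illustration; not a real library API.
    """

    def __init__(self, compute, parents=()):
        self._compute = compute   # function: list of parent results -> list
        self._parents = parents
        self._cached = None
        self._keep = False
        self.evaluations = 0      # how many times this node was computed

    def cache(self):
        """Keep this dataset in memory after its first evaluation."""
        self._keep = True
        return self

    def map(self, fn):
        return Dataset(lambda ins: [fn(x) for x in ins[0]], (self,))

    def filter(self, pred):
        return Dataset(lambda ins: [x for x in ins[0] if pred(x)], (self,))

    def union(self, other):
        # A node with two parents: this is what makes the plan a DAG.
        return Dataset(lambda ins: ins[0] + ins[1], (self, other))

    def collect(self):
        """Evaluate this node, evaluating parents first (cached ones are reused)."""
        if self._cached is not None:
            return self._cached
        inputs = [p.collect() for p in self._parents]
        result = self._compute(inputs)
        self.evaluations += 1
        if self._keep:
            self._cached = result
        return result


def from_list(items):
    """Create a source dataset from an in-memory iterable."""
    return Dataset(lambda _ins: list(items))


if __name__ == "__main__":
    squares = from_list(range(10)).map(lambda x: x * x).cache()
    evens = squares.filter(lambda x: x % 2 == 0)
    large = squares.filter(lambda x: x > 20)
    print(evens.union(large).collect())
    print("squares computed", squares.evaluations, "time(s)")
```

Without the `cache()` call, every downstream `collect()` would recompute `squares` from the source, which is the repeated data movement the slide argues against.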
Berkeley Data Analytics Stack (open source software) 
[Figure: layered stack diagram, top to bottom] 
In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings 
Access and Interfaces / Processing Engine: SampleClean, MLBase, SparkR, Velox Model Serving, BlinkDB, Spark Streaming, Shark, GraphX, MLlib, Spark 
Storage: Tachyon, HDFS, S3, … 
Resource Virtualization: Mesos, Yarn
Berkeley Data Analytics Stack (open source software) 
[Figure: layered stack diagram, top to bottom; the slide also carries the labels "Apache" (twice) and "Shark"] 
In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings 
Access and Interfaces / Processing Engine: SampleClean, MLBase, SparkR, Velox Model Serving, SparkSQL, BlinkDB, Spark Streaming, GraphX, MLlib, Spark 
Storage: Tachyon, HDFS, S3, … 
Resource Virtualization: Mesos, Yarn
Some Academic Accolades 
Ph.D. + Postdoc alumni 2013/14 have accepted faculty jobs at: Brown, Harvey Mudd, MIT (3), Stanford, UCLA, UT Austin 
Best Paper Awards: BPOE 14, EuroSys 13, ICDE 13, NSDI 12, SIGCOMM 12 
Best Demo: SIGMOD 12, VLDB 11 
CACM “Research Highlight” Selections 2014 and 2015
About AMPCamp 
Today 
• BDAS and Stack Component Overviews 
• Hands On Exercises 
• Use Cases 
• Reception and Networking 
Tomorrow 
• Research and ML Overviews 
• Advanced Hands On Exercises (including genomics) 
History 
AMPCamp I @ Berkeley, August 2012 
AMPCamp II @ Strata NYC, Feb 2013 
AMPCamp III @ Berkeley, August 2013 
AMPCamp IV @ Strata Santa Clara, Feb 2014 
AMPCamp V @ Berkeley, Nov 2014 
Also “Spark Camp”: AMPCamp Spinoff
AMPCamp Made Possible By 
Rachit Agarwal 
Elaine Angelino 
Peter Bailis 
Dan Crankshaw 
Ankur Dave 
Joseph Gonzalez 
Daniel Haas 
Sanjay Krishnan 
Haoyuan Li 
Frank Austin Nothaft 
Xinghao Pan 
Pedro Rodriguez 
Ginger Smith 
Evan Sparks 
Shivaram Venkataraman 
Jiannan Wang 
Zongheng Yang 
Ameet Talwalkar 
Jey Kottalam 
Kattt Atchley 
Carlyn Chinen 
Boban Zarkovich 
Jon Kuroda
To find out more or get involved: 
amplab.berkeley.edu 
franklin@berkeley.edu 
Thanks to NSF CISE Expeditions in Computing, DARPA XData, 
Founding Sponsors: Amazon Web Services, Google, and SAP, 
the Thomas and Stacy Siebel Foundation, 
and all our industrial sponsors and partners.
