SlideShare ist ein Scribd-Unternehmen logo
1 von 39
‹#›© Cloudera, Inc. All rights reserved.
Mirko Kämpf | 2015
Apache Spark:
Next Generation Data
Processing for Hadoop
‹#›© Cloudera, Inc. All rights reserved.
Agenda
• The Data Science Process (DSP)
- Why or when to use Spark
• The role of: Apache Hadoop and Apache Spark
- History & Hadoop Ecosystem
• Apache Spark: Overview and Concepts
• Practical Tips
‹#›© Cloudera, Inc. All rights reserved.
The Data Science Process
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
‹#›© Cloudera, Inc. All rights reserved.
Huge Data Sets in Science
Application of Big-Data-Technology
Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
‹#›© Cloudera, Inc. All rights reserved.
“Spark offers tools for Data Science
and components for Data
Products.”
—How can Apache Spark fit into my world?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
Why not?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe it is already built into Apache
Spark!
Why not?
‹#›© Cloudera, Inc. All rights reserved.
Should I use Apache Spark?
• If all my data fits into Excel-Spreadsheets?
• If I have a special purpose application to work with?
• If my current system is just a bit to slow?
• Just export as CSV / JSON and use a DataFrame to join with other DS.
• Think about additional analysis methods! Maybe it is build into Spark.
• OK, Spark will probably not help to speed up your system, but maybe you can
offload data to Hadoop, which releases some resources.
Why not?
‹#›© Cloudera, Inc. All rights reserved.
“Spark offers fast in memory processing on
huge distributed and even on heterogeneous
datasets.”
—What type of data fits into Spark?
‹#›© Cloudera, Inc. All rights reserved.
History of Spark
Spark is really young, but has a very
active community!
‹#›© Cloudera, Inc. All rights reserved.
Timeline: Spark Adoption
‹#›© Cloudera, Inc. All rights reserved.
Apache Spark:
Overview & Concepts
‹#›© Cloudera, Inc. All rights reserved.
Hadoop Ecosystem incl. Apache Spark
Spark can be an entry point to your Big Data world …
‹#›© Cloudera, Inc. All rights reserved.
“Apache Spark is distributed on top of Hadoop
and brings parallel processing
to powerful workstations.”
—Do I need a Hadoop cluster to work with Apache Spark?
‹#›© Cloudera, Inc. All rights reserved.
Spark vs. MapReduce
‹#›© Cloudera, Inc. All rights reserved.
How to interact with Spark?
‹#›© Cloudera, Inc. All rights reserved.
Spark Components
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
MLLib: GraphX:
Basic statistics
summary statistics, correlations, stratified sampling,
hypothesis testing, random data generation
Classification and regression
linear models (SVMs, logistic / linear regression)
naive Bayes, decision trees
ensembles of trees (Random Forests / Gradient-Boosted Trees)
isotonic regression
Collaborative filtering
alternating least squares (ALS)
Clustering
k-means, Gaussian mixture, power iteration clustering (PIC)
latent Dirichlet allocation (LDA), streaming k-means
Dimensionality reduction
singular value decomposition (SVD)
principal component analysis (PCA)
…
PageRank
Connected Components
Triangle Counting
Pregel API
‹#›© Cloudera, Inc. All rights reserved.
How to use your code in Spark?
A. Interactively, by loading it into the spark-shell.
B. Contribute to existing Spark projects.
C. Create your module and use it in a spark-shell session.
D. Build a data-product which uses Apache Spark.
For simple and reliable usage of Java classes
and complete third-party libraries, we define
a Spark Module as a self-contained artifact
created by Maven. This module can easily
be shared by multiple users via repositories.
http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
‹#›© Cloudera, Inc. All rights reserved.
Apache Spark:
Overview & Concepts
‹#›© Cloudera, Inc. All rights reserved.
Spark Context
‹#›© Cloudera, Inc. All rights reserved.
RDDs and DataFrames
‹#›© Cloudera, Inc. All rights reserved.
Creation of RDDs
‹#›© Cloudera, Inc. All rights reserved.
Datatypes in RDDs
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
Spark in a Cluster
‹#›© Cloudera, Inc. All rights reserved.
Spark in a Cluster
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
DStream: The heart of Spark Streaming
‹#›© Cloudera, Inc. All rights reserved.
“Efficient hardware utilization, caching,
simple APIs, and access to a variety of data
in Hadoop is key to success.”
—What makes Spark so different, compared to core MapReduce?
‹#›© Cloudera, Inc. All rights reserved.
Practical Tips
‹#›© Cloudera, Inc. All rights reserved.
Development Techniques
• Build your tools and analysis procedures in small cycles.
• Test all phases of your work and document carefully.
• Document what you expect! => Requirements management …
• Collect what you get! => Operational logs …
• Reuse well tested components and modularize your analysis scripts.
• Learn „state of the art“ tools and share your work!
‹#›© Cloudera, Inc. All rights reserved.
Data Management
• Think about typical access patterns:
• random access to each record or field?
• access to entire groups of records?
• variable size or fixed size sets?
• „full table scan“
• OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN!
• Select efficient storage formats: Avro, Parquet
• Index your data in SOLR for random access and data exploration
• Indexing can be done by just a few clicks in HUE …
‹#›© Cloudera, Inc. All rights reserved.
Collecting Sensor Data with Spark Streaming …
• Spark Streaming works on fixed time slices only (in current version, 1.5)
• Use the original time stamp?
• Requires additional storage and bandwidth
• Original system clock defines resolution
• Use „Spark-Time“ or a local time reference:
• You may lose information!
• You have a limited resolution, defined by batch size.
‹#›© Cloudera, Inc. All rights reserved.
Thank you !
Enjoy Apache Spark and all your data …

Weitere ähnliche Inhalte

Was ist angesagt?

Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at FacebookTangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at FacebookDatabricks
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Timothy Spann
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiHow Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiSpark Summit
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)Spark Summit
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Databricks
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixJerome Boulon
 

Was ist angesagt? (20)

Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at FacebookTangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-FiHow Apache Spark Is Helping Tame the Wild West of Wi-Fi
How Apache Spark Is Helping Tame the Wild West of Wi-Fi
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
 
Cloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ NetflixCloud Connect 2012, Big Data @ Netflix
Cloud Connect 2012, Big Data @ Netflix
 

Andere mochten auch

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.Konstantinos Kloudas
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesAerospike, Inc.
 

Andere mochten auch (8)

Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 

Ähnlich wie Apache Spark in Scientific Applications

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15MLconf
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 

Ähnlich wie Apache Spark in Scientific Applications (20)

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark 101
Spark 101Spark 101
Spark 101
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Apache spark
Apache sparkApache spark
Apache spark
 

Mehr von Dr. Mirko Kämpf

IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the CloudsDr. Mirko Kämpf
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)Dr. Mirko Kämpf
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentationDr. Mirko Kämpf
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4Dr. Mirko Kämpf
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationDr. Mirko Kämpf
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems ResearchDr. Mirko Kämpf
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"Dr. Mirko Kämpf
 

Mehr von Dr. Mirko Kämpf (9)

IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
 

Kürzlich hochgeladen

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 

Kürzlich hochgeladen (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 

Apache Spark in Scientific Applications

  • 1. ‹#›© Cloudera, Inc. All rights reserved. Mirko Kämpf | 2015 Apache Spark: Next Generation Data Processing for Hadoop
  • 2. ‹#›© Cloudera, Inc. All rights reserved. Agenda • The Data Science Process (DSP) - Why or when to use Spark • The role of: Apache Hadoop and Apache Spark - History & Hadoop Ecosystem • Apache Spark: Overview and Concepts • Practical Tips
  • 3. ‹#›© Cloudera, Inc. All rights reserved. The Data Science Process Application of Big-Data-Technology Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 4. ‹#›© Cloudera, Inc. All rights reserved. Huge Data Sets in Science Application of Big-Data-Technology Images from: http://semanticommunity.info/Data_Science/Doing_Data_Science
  • 5. ‹#›© Cloudera, Inc. All rights reserved. “Spark offers tools for Data Science and components for Data Products.” —How can Apache Spark fit into my world?
  • 6. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow?
  • 7. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. Why not?
  • 8. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. • Think about additional analysis methods! Maybe it is already built into Apache Spark! Why not?
  • 9. ‹#›© Cloudera, Inc. All rights reserved. Should I use Apache Spark? • If all my data fits into Excel-Spreadsheets? • If I have a special purpose application to work with? • If my current system is just a bit to slow? • Just export as CSV / JSON and use a DataFrame to join with other DS. • Think about additional analysis methods! Maybe it is build into Spark. • OK, Spark will probably not help to speed up your system, but maybe you can offload data to Hadoop, which releases some resources. Why not?
  • 10. ‹#›© Cloudera, Inc. All rights reserved. “Spark offers fast in memory processing on huge distributed and even on heterogeneous datasets.” —What type of data fits into Spark?
  • 11. ‹#›© Cloudera, Inc. All rights reserved. History of Spark Spark is really young, but has a very active community!
  • 12. ‹#›© Cloudera, Inc. All rights reserved. Timeline: Spark Adoption
  • 13. ‹#›© Cloudera, Inc. All rights reserved. Apache Spark: Overview & Concepts
  • 14. ‹#›© Cloudera, Inc. All rights reserved. Hadoop Ecosystem incl. Apache Spark Spark can be an entry point to your Big Data world …
  • 15. ‹#›© Cloudera, Inc. All rights reserved. “Apache Spark is distributed on top of Hadoop and brings parallel processing to powerful workstations.” —Do I need a Hadoop cluster to work with Apache Spark?
  • 16. ‹#›© Cloudera, Inc. All rights reserved. Spark vs. MapReduce
  • 17. ‹#›© Cloudera, Inc. All rights reserved. How to interact with Spark?
  • 18. ‹#›© Cloudera, Inc. All rights reserved. Spark Components
  • 19. ‹#›© Cloudera, Inc. All rights reserved.
  • 20. ‹#›© Cloudera, Inc. All rights reserved. MLLib: GraphX: Basic statistics summary statistics, correlations, stratified sampling, hypothesis testing, random data generation Classification and regression linear models (SVMs, logistic / linear regression) naive Bayes, decision trees ensembles of trees (Random Forests / Gradient-Boosted Trees) isotonic regression Collaborative filtering alternating least squares (ALS) Clustering k-means, Gaussian mixture, power iteration clustering (PIC) latent Dirichlet allocation (LDA), streaming k-means Dimensionality reduction singular value decomposition (SVD) principal component analysis (PCA) … PageRank Connected Components Triangle Counting Pregel API
  • 21. ‹#›© Cloudera, Inc. All rights reserved. How to use your code in Spark? A. Interactively, by loading it into the spark-shell. B. Contribute to existing Spark projects. C. Create your module and use it in a spark-shell session. D. Build a data-product which uses Apache Spark. For simple and reliable usage of Java classes and complete third-party libraries, we define a Spark Module as a self-contained artifact created by Maven. This module can easily be shared by multiple users via repositories. http://blog.cloudera.com/blog/2015/03/how-to-build-re-usable-spark-programs-using-spark-shell-and-maven/
  • 22. ‹#›© Cloudera, Inc. All rights reserved. Apache Spark: Overview & Concepts
  • 23. ‹#›© Cloudera, Inc. All rights reserved. Spark Context
  • 24. ‹#›© Cloudera, Inc. All rights reserved. RDDs and DataFrames
  • 25. ‹#›© Cloudera, Inc. All rights reserved. Creation of RDDs
  • 26. ‹#›© Cloudera, Inc. All rights reserved. Datatypes in RDDs
  • 27. ‹#›© Cloudera, Inc. All rights reserved.
  • 28. ‹#›© Cloudera, Inc. All rights reserved.
  • 29. ‹#›© Cloudera, Inc. All rights reserved. Spark in a Cluster
  • 30. ‹#›© Cloudera, Inc. All rights reserved. Spark in a Cluster
  • 31. ‹#›© Cloudera, Inc. All rights reserved.
  • 32. ‹#›© Cloudera, Inc. All rights reserved.
  • 33. ‹#›© Cloudera, Inc. All rights reserved. DStream: The heart of Spark Streaming
  • 34. ‹#›© Cloudera, Inc. All rights reserved. “Efficient hardware utilization, caching, simple APIs, and access to a variety of data in Hadoop is key to success.” —What makes Spark so different, compared to core MapReduce?
  • 35. ‹#›© Cloudera, Inc. All rights reserved. Practical Tips
  • 36. ‹#›© Cloudera, Inc. All rights reserved. Development Techniques • Build your tools and analysis procedures in small cycles. • Test all phases of your work and document carefully. • Document what you expect! => Requirements management … • Collect what you get! => Operational logs … • Reuse well tested components and modularize your analysis scripts. • Learn „state of the art“ tools and share your work!
  • 37. ‹#›© Cloudera, Inc. All rights reserved. Data Management • Think about typical access patterns: • random access to each record or field? • access to entire groups of records? • variable size or fixed size sets? • „full table scan“ • OPTIMIZE FOR YOUR DOMINANT ACCESS PATTERN! • Select efficient storage formats: Avro, Parquet • Index your data in SOLR for random access and data exploration • Indexing can be done by just a few clicks in HUE …
  • 38. ‹#›© Cloudera, Inc. All rights reserved. Collecting Sensor Data with Spark Streaming … • Spark Streaming works on fixed time slices only (in current version, 1.5) • Use the original time stamp? • Requires additional storage and bandwidth • Original system clock defines resolution • Use „Spark-Time“ or a local time reference: • You may lose information! • You have a limited resolution, defined by batch size.
  • 39. ‹#›© Cloudera, Inc. All rights reserved. Thank you ! Enjoy Apache Spark and all your data …